
Quick Debug Command

The debug-k8s command provides an instant overview of cluster health:
debug-k8s
Output (devenv.nix:130-135):
=== Pod status ===
NAMESPACE       NAME                              READY   STATUS    RESTARTS   AGE
observability   grafana-abc123                    1/1     Running   0          5m
observability   prometheus-xyz789                 1/1     Running   0          5m
...

=== Recent events ===
observability   5m   Normal   Scheduled   pod/grafana-abc123   Successfully assigned
observability   5m   Normal   Pulling     pod/grafana-abc123   Pulling image
...
What it shows:
  1. All pods across all namespaces
  2. Last 10 events sorted by timestamp
  3. Quick way to spot crashes, image pull failures, scheduling issues
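The command is roughly equivalent to the following (inferred from the output shown above; the actual definition lives in devenv.nix):

```shell
#!/usr/bin/env sh
# Rough shell equivalent of debug-k8s, inferred from its output;
# see devenv.nix for the real definition.
debug_k8s() {
  echo "=== Pod status ==="
  kubectl get pods --all-namespaces
  echo ""
  echo "=== Recent events ==="
  kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -10
}
```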

Common Issues

Cluster Won’t Start

Symptoms:
  • kubectl commands hang
  • cluster-start times out
  • “connection refused” errors
Diagnosis:
# Check Docker containers
docker ps -a --filter "label=io.x-k8s.kind.cluster=microservice-infra"

# Check container logs
docker logs microservice-infra-control-plane

# Check API server health
curl -k https://localhost:6443/healthz
Solutions:
  1. Restart cluster:
    cluster-stop
    cluster-start
    
  2. Full rebuild:
    cluster-down
    bootstrap --clean
    
  3. Check Docker daemon:
    systemctl status docker
    docker info
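After restarting, polling the health endpoint beats re-running kubectl by hand. A sketch (6443 is the default kube-apiserver port; adjust if your kubeconfig differs):

```shell
#!/usr/bin/env sh
# Poll the API server's /healthz endpoint until it answers "ok".
wait_for_apiserver() {
  attempts=${1:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -ks --max-time 2 https://localhost:6443/healthz | grep -q ok; then
      echo "API server healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "API server still unreachable after $attempts attempts" >&2
  return 1
}
```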
    

Pod Stuck in Pending

Symptoms:
  • Pod shows Pending status
  • Never transitions to Running
Diagnosis:
# Describe pod to see events
kubectl describe pod <pod-name> -n <namespace>

# Check events specifically
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# Check node resources
kubectl top nodes
kubectl describe nodes
Common causes:
  1. Insufficient resources:
    Warning  FailedScheduling  pod/myapp  0/1 nodes available: Insufficient cpu
    
    Solution: Reduce resource requests or add nodes
  2. Image pull failure:
    Warning  Failed  pod/myapp  Failed to pull image "myimage:latest"
    
    Solution: Check image name, load into Kind:
    docker pull myimage:latest
    kind load docker-image myimage:latest --name microservice-infra
    
  3. PVC not bound:
    Warning  FailedMount  pod/myapp  persistentvolumeclaim "data" not found
    
    Solution: Check PVC status:
    kubectl get pvc -n <namespace>
    kubectl describe pvc <pvc-name> -n <namespace>
    

Pod CrashLoopBackOff

Symptoms:
  • Pod status: CrashLoopBackOff
  • Restart count increasing
Diagnosis:
# Check logs from current container
kubectl logs <pod-name> -n <namespace>

# Check logs from previous crash
kubectl logs <pod-name> -n <namespace> --previous

# Follow logs in real-time
kubectl logs <pod-name> -n <namespace> -f

# Check container exit code
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
Common causes:
  1. Application error:
    • Check logs for stack traces
    • Verify configuration (env vars, secrets)
    • Test application locally
  2. Missing dependencies:
    • Database not ready
    • Secret not created
    Solution: Add init containers or readiness probes
  3. Liveness probe failing:
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 60  # Give app time to start
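The exitCode in the terminated state returned by the jsonpath query above maps to common causes; a small helper using the standard 128+signal convention (the mapping is generic, not project-specific):

```shell
#!/usr/bin/env sh
# Translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit: the process ran to completion";;
    1)   echo "application error: check logs for stack traces";;
    137) echo "SIGKILL (128+9): often OOMKilled; check resource limits";;
    139) echo "SIGSEGV (128+11): segmentation fault";;
    143) echo "SIGTERM (128+15): the pod was asked to stop";;
    *)   echo "exit code $1: see application docs";;
  esac
}
```

Example: `explain_exit_code 137` points at memory limits.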
    

Image Pull Errors

Symptoms:
  • ErrImagePull or ImagePullBackOff
  • Pod can’t download container image
Diagnosis:
# Describe pod for full error
kubectl describe pod <pod-name> -n <namespace>

# Check image name
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
Solutions:
  1. Load local image into Kind:
    docker pull <image:tag>
    kind load docker-image <image:tag> --name microservice-infra
    
  2. Fix image name:
    • Check for typos
    • Verify tag exists
    • Ensure registry is accessible
  3. Use custom OTel Collector (if applicable):
    load-otel-collector-image
    

Service Not Reachable

Symptoms:
  • Can’t access service via NodePort or ClusterIP
  • Connection timeout or refused
Diagnosis:
# Check service
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Verify endpoints
kubectl get endpoints <service-name> -n <namespace>

# Test from within cluster
kubectl run curl --image=curlimages/curl -it --rm -- curl http://<service-name>.<namespace>:8080
Common causes:
  1. No endpoints (no pods match selector):
    # Check selector
    kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
    
    # Check pod labels
    kubectl get pods -n <namespace> --show-labels
    
  2. Wrong port:
    • Verify service port matches container port
    • Check NodePort range (default 30000-32767)
  3. Pod not ready:
    kubectl get pods -n <namespace>
    # If 0/1 READY, check readiness probe
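Cause 1 (selector mismatch) can be checked in one step by turning the service's selector into a label query. The jsonpath output is a small JSON object like {"app":"myapp"}; a sketch that converts it (handles simple values only, no colons or commas inside them):

```shell
#!/usr/bin/env sh
# Convert the JSON selector printed by
#   kubectl get svc <name> -o jsonpath='{.spec.selector}'
# (e.g. {"app":"myapp","tier":"web"}) into a -l label selector string.
selector_to_labels() {
  printf '%s' "$1" | sed -e 's/[{}"]//g' -e 's/:/=/g'
}
```

Usage: `kubectl get pods -n <namespace> -l "$(selector_to_labels "$SEL")"` — if this returns nothing, no pods match the service's selector.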
    

DNS Resolution Failing

Symptoms:
  • “Name or service not known”
  • Can’t resolve service names
Diagnosis:
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test DNS from pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
Solutions:
  1. Restart CoreDNS:
    kubectl rollout restart deployment/coredns -n kube-system
    
  2. Verify DNS service:
    kubectl get svc -n kube-system kube-dns
    
  3. Check pod DNS config:
    kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf
    

Persistent Volume Issues

Symptoms:
  • PVC stuck in Pending
  • “no persistent volumes available”
Diagnosis:
# Check PVC status
kubectl get pvc -A

# Describe PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check available PVs
kubectl get pv
Solutions:
  1. For Kind (hostPath):
    • Volumes are automatically provisioned
    • Check storage class:
      kubectl get storageclass
      kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
      
  2. Create manual PV (if needed):
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: manual-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      hostPath:
        path: /data/manual-pv
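A manually created PV is only used once a claim binds to it; a minimal matching PVC sketch (the name and size are illustrative and must line up with what the workload requests):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
    - ReadWriteOnce   # must match an accessMode offered by the PV
  resources:
    requests:
      storage: 10Gi   # must not exceed the PV's capacity
```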
    

Manifest Apply Failures

Symptoms:
  • kubectl apply returns error
  • Resources not created/updated
Diagnosis:
# Dry-run to check validity
kubectl apply --dry-run=client -f manifests/myapp/

# Server-side dry-run
kubectl apply --dry-run=server -f manifests/myapp/

# Check resource validation
kubectl explain <resource>.<field>
Common causes:
  1. CRD not installed:
    error: unable to recognize "file.yaml": no matches for kind "Prometheus"
    
    Solution: Apply CRDs first:
    kubectl apply -f manifests/kube-prometheus-stack/CustomResourceDefinition-*.yaml
    
  2. Field immutable:
    The Service "myapp" is invalid: spec.clusterIP: Invalid value: ""
    
    Solution: Delete and recreate:
    kubectl delete svc myapp -n <namespace>
    kubectl apply -f manifests/myapp/Service-myapp.yaml
    
  3. Server-side apply conflict:
    Apply failed with 1 conflict
    
    Solution: Force conflicts:
    kubectl apply -f file.yaml --server-side --force-conflicts
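Many apply problems can be caught up front with `kubectl diff`. Note its exit-code convention (0 = no differences, 1 = differences found, >1 = error), so a wrapper should not treat 1 as failure. A sketch:

```shell
#!/usr/bin/env sh
# Preview what `kubectl apply` would change, without applying.
preview_apply() {
  kubectl diff -f "$1"
  status=$?
  # Exit 1 just means "there are differences"; only >1 is an error.
  if [ "$status" -gt 1 ]; then
    echo "kubectl diff failed (exit $status)" >&2
    return "$status"
  fi
  return 0
}
```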
    

Advanced Debugging

Interactive Pod Debugging

Create debug pod in same namespace:
# Alpine with network tools
kubectl run debug -it --rm --image=alpine -n <namespace> -- sh

# Inside pod:
apk add curl bind-tools
nslookup myservice
curl http://myservice:8080/health

Exec into Running Pod

# Get shell
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Run single command
kubectl exec <pod-name> -n <namespace> -- env | grep DATABASE

# For multi-container pod
kubectl exec -it <pod-name> -n <namespace> -c <container-name> -- /bin/sh

Port Forward for Local Access

# Forward pod port to localhost
kubectl port-forward pod/<pod-name> 8080:8080 -n <namespace>

# Forward service
kubectl port-forward svc/<service-name> 8080:80 -n <namespace>

# Then access: http://localhost:8080

Copy Files To/From Pod

# Copy from pod
kubectl cp <namespace>/<pod-name>:/path/to/file ./local-file

# Copy to pod
kubectl cp ./local-file <namespace>/<pod-name>:/path/to/file

Analyze Resource Usage

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

# Specific namespace
kubectl top pods -n observability

Watch Resources in Real-Time

# Watch pod status
kubectl get pods -n <namespace> -w

# Watch events
kubectl get events -n <namespace> -w --sort-by=.lastTimestamp

# Watch with custom columns
kubectl get pods -n <namespace> -w -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount

Bootstrap-Specific Issues

Bootstrap Hangs

Diagnosis:
# Check which step is hanging
kubectl get pods -A

# Check events
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Check bootstrap state
ls -la .bootstrap-state/
cat .bootstrap-state/cluster
cat .bootstrap-state/manifest
Solutions:
  1. Kill and restart:
    # Ctrl+C to stop
    bootstrap
    
  2. Clean bootstrap:
    bootstrap --clean
    
  3. Manual cleanup:
    cluster-down
    rm -rf .bootstrap-state/
    docker system prune -f
    bootstrap
    

Warm Cluster Not Detecting Changes

Symptoms:
  • Changed manifests not applied
  • “All good” message but resources outdated
Diagnosis:
# Check stored hashes
cat .bootstrap-state/manifest

# Compute current hash
find manifests-result -type f | sort | xargs cat | shasum -a 256
Solution: Force regeneration
bootstrap --clean
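The stored-vs-current comparison can be scripted; a sketch assuming the paths shown above (`.bootstrap-state/manifest` holding the stored hash, `manifests-result` as the manifest tree):

```shell
#!/usr/bin/env sh
# Compare the stored manifest hash against a freshly computed one.
hashes_differ() {
  # $1: stored hash, $2: current hash; a missing stored hash counts as changed.
  [ -z "$1" ] || [ "$1" != "$2" ]
}

check_manifests() {
  stored=$(cat .bootstrap-state/manifest 2>/dev/null)
  current=$(find manifests-result -type f | sort | xargs cat | shasum -a 256 | awk '{print $1}')
  if hashes_differ "$stored" "$current"; then
    echo "manifests changed since last bootstrap; run: bootstrap --clean"
  else
    echo "manifests unchanged"
  fi
}
```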

Garage Setup Fails

Symptoms:
  • Bootstrap fails at “Running Garage setup”
  • Loki/Tempo can’t connect to storage
Diagnosis:
# Check Garage pods
kubectl get pods -n storage

# Check Garage logs
kubectl logs -n storage -l app.kubernetes.io/name=garage

# Test Garage API
kubectl exec -n storage -it <garage-pod> -- garage status
Solution: Re-run setup
bash scripts/garage-setup.sh

Log Analysis

Grep Logs for Errors

# All pods in namespace
kubectl logs -n <namespace> --all-containers=true | grep -i error

# Specific label
kubectl logs -n <namespace> -l app=myapp --tail=100 | grep -i "exception\|error\|fatal"

Follow Multiple Pods

# All pods matching label
kubectl logs -n <namespace> -l app=myapp -f --max-log-requests=10

Export Logs for Analysis

# Single pod
kubectl logs <pod-name> -n <namespace> > pod.log

# All pods
for pod in $(kubectl get pods -n <namespace> -o name); do
  kubectl logs -n <namespace> "$pod" > "${pod##*/}.log"
done

Getting Help

Collect Diagnostic Info

# Cluster info
kubectl cluster-info dump > cluster-dump.txt

# All resources
kubectl get all -A > all-resources.txt

# Events
kubectl get events -A --sort-by=.lastTimestamp > events.txt

# Node status
kubectl describe nodes > nodes.txt
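The four commands above can be wrapped into one helper that bundles everything into a timestamped tarball for sharing (a sketch; filenames are illustrative):

```shell
#!/usr/bin/env sh
# Collect the standard diagnostics into a single tarball.
collect_diagnostics() {
  dir="diag-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dir"
  kubectl cluster-info dump > "$dir/cluster-dump.txt"
  kubectl get all -A > "$dir/all-resources.txt"
  kubectl get events -A --sort-by=.lastTimestamp > "$dir/events.txt"
  kubectl describe nodes > "$dir/nodes.txt"
  tar czf "$dir.tar.gz" "$dir" && rm -rf "$dir"
  echo "$dir.tar.gz"
}
```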

Check Component Versions

# Kubernetes version
kubectl version

# Kind version
kind version

# Cilium version (if installed)
cilium version

# Helm releases
helm list -A

Useful kubectl Plugins

# Install krew (kubectl plugin manager)
# https://krew.sigs.k8s.io/docs/user-guide/setup/install/

# Install useful plugins
kubectl krew install tree      # View resources as tree
kubectl krew install neat      # Clean up resource output
kubectl krew install tail      # Tail logs from multiple pods
