
Pod Issues

Pod Stuck in Pending

Symptoms:
  • Pod shows Pending status indefinitely
  • Pods not being scheduled
Common Causes & Solutions:
Check cluster capacity:
kubectl describe nodes
kubectl top nodes
Look for:
  • CPU/Memory pressure
  • Allocatable vs requested resources
Solution:
  • Scale up cluster (add nodes)
  • Reduce resource requests
  • Enable cluster autoscaler
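If Pods are Pending because their requests exceed what any node can allocate, lowering the requests in the Deployment often unblocks scheduling. A minimal sketch (hypothetical app name and values; tune to your workload's real usage):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: api:1.0
        resources:
          requests:        # what the scheduler reserves on a node
            cpu: "250m"    # lowered from e.g. "2"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
```

The scheduler only looks at requests, not limits, when placing Pods, so over-sized requests can leave a cluster "full" even while actual utilization is low.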
CrashLoopBackOff

What it means: The container keeps crashing, and Kubernetes keeps restarting it with exponential backoff.

Debugging Steps:
# Check current logs
kubectl logs <pod-name> -n <namespace>

# Check previous container logs (crashed container)
kubectl logs <pod-name> -n <namespace> --previous

# Check all containers in Pod
kubectl logs <pod-name> -c <container-name> --previous

# Get detailed Pod info
kubectl describe pod <pod-name> -n <namespace>
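The exponential backoff means each crash lengthens the wait before the next restart attempt. A rough sketch of the delay schedule (the exact values are a kubelet implementation detail, commonly described as starting at 10s, doubling per crash, and capping at 5 minutes):

```shell
# Approximate CrashLoopBackOff restart delays: start at 10s, double per
# crash, cap at 300s. Illustrative values, not a kubelet API.
backoff_delays() {
  delay=10
  for crash in 1 2 3 4 5 6; do
    echo "crash $crash: wait ${delay}s"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 300 ]; then delay=300; fi
  done
}
backoff_delays
```

This is why a Pod can sit in CrashLoopBackOff for minutes between attempts even when the underlying failure is instant.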
Common Causes:
# Check for:
- Incorrect image tag
- Image doesn't exist in registry
- Wrong registry URL
- Authentication issues with private registry

# Verify:
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
Tip: Use the --previous flag to see logs from the crashed container, not the current restarted one.
Pod Stuck in ImagePullBackOff

What it means: Kubernetes cannot pull the container image from the registry.

Debugging:
kubectl describe pod <pod-name>
Common Causes:
  1. Image doesn’t exist
    • Typo in image name/tag
    • Image was deleted from registry
  2. Authentication required
    # Create docker-registry secret
    kubectl create secret docker-registry regcred \
      --docker-server=<registry> \
      --docker-username=<username> \
      --docker-password=<password> \
      --docker-email=<email>
    
    # Add to Pod spec
    spec:
      imagePullSecrets:
      - name: regcred
    
  3. Network issues
    • Registry is unreachable from cluster
    • Firewall blocking registry access
  4. Rate limiting
    • Docker Hub rate limits (100 pulls/6h for anonymous)
    • Solution: Authenticate or use alternative registry
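Putting causes 1 and 2 together, a minimal Pod sketch for a private registry (assuming the regcred secret created above; registry URL and tag are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app                              # hypothetical name
spec:
  imagePullSecrets:
  - name: regcred                                # the docker-registry secret from step 2
  containers:
  - name: app
    image: registry.example.com/team/app:1.4.2   # full registry URL and an exact tag
```

Pinning an exact tag (rather than latest) also makes typo-in-tag failures easier to spot in kubectl describe output.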
Pod Stuck in Terminating

What it means: The Pod won't delete and stays in Terminating status.

Debugging:
kubectl describe pod <pod-name>
Common Causes:
  1. Finalizers preventing deletion
    # Check finalizers
    kubectl get pod <pod-name> -o yaml | grep finalizers -A 5
    
    # Remove finalizers (use with caution)
    kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'
    
  2. PreStop hook hanging
    • PreStop hook takes too long
    • Increase terminationGracePeriodSeconds
  3. Force delete (last resort)
    kubectl delete pod <pod-name> --grace-period=0 --force
    
Force deletion can leave orphaned resources. Use only when necessary.
Pod OOMKilled

Symptoms: Pod terminated with exit code 137 or status OOMKilled.

Check:
kubectl describe pod <pod-name>
# Look for: "Reason: OOMKilled"

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
Solutions:
  1. Increase memory limits
    resources:
      limits:
        memory: "512Mi"  # Increase this
      requests:
        memory: "256Mi"
    
  2. Fix memory leak in application
    • Profile application memory usage
    • Check for unbounded caches
    • Review database connection pooling
  3. Enable Vertical Pod Autoscaler
    • VPA automatically adjusts resource requests/limits
  4. Monitor memory usage
    kubectl top pod <pod-name>
    

Configuration Issues

ConfigMap/Secret Changes Not Applied

Problem: ConfigMaps/Secrets mounted as environment variables are NOT updated when changed.

Why this happens: Environment variables are injected only when the container starts, so running Pods never see the new values. (ConfigMaps mounted as volume files are eventually refreshed, except when mounted via subPath.)

Solution: Restart the deployment:
kubectl rollout restart deployment <deployment-name>
Or force a rolling update:
kubectl patch deployment <deployment-name> \
  -p '{"spec":{"template":{"metadata":{"annotations":{"date":"'$(date +%s)'"}}}}}'
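A common alternative (popularized by Helm charts) is to put a hash of the config into the Pod template annotations, so any config change automatically triggers a rolling update. A sketch of the Deployment fragment (the annotation key and hash value are conventions, not a Kubernetes API):

```yaml
spec:
  template:
    metadata:
      annotations:
        checksum/config: "<sha256 of the rendered ConfigMap>"  # recompute on each deploy
```

Because the template changed, the Deployment controller rolls the Pods, which then pick up the new environment variables at start.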
Service Not Routing Traffic to Pods

Debugging Steps:
  1. Check Service endpoints
    kubectl get endpoints <service-name>
    
    If no endpoints, Service selector doesn’t match Pod labels.
  2. Verify Pod labels
    # Get Service selector
    kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
    
    # Get Pod labels
    kubectl get pods --show-labels
    
    Labels must match exactly.
  3. Check Pod readiness
    kubectl get pods
    
    Pods must be Running and READY 1/1. If not ready, check readiness probe:
    kubectl describe pod <pod-name> | grep -A 10 "Readiness"
    
  4. Test Service connectivity
    # From within cluster
    kubectl run test --image=busybox -it --rm -- wget -O- <service-name>:<port>
    
    # Port-forward to local machine
    kubectl port-forward svc/<service-name> 8080:80
    curl localhost:8080
    
  5. Check network policies
    kubectl get networkpolicy
    kubectl describe networkpolicy <policy-name>
    

Networking Issues

DNS Resolution Failing

Symptoms:
  • Pods can’t resolve service names
  • nslookup fails inside Pod
Debugging:
  1. Test DNS from Pod
    kubectl run test --image=busybox -it --rm -- nslookup kubernetes.default
    
  2. Check CoreDNS/kube-dns Pods
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    kubectl logs -n kube-system -l k8s-app=kube-dns
    
  3. Verify DNS service
    kubectl get svc -n kube-system kube-dns
    
  4. Check Pod DNS config
    kubectl exec <pod-name> -- cat /etc/resolv.conf
    
    Should contain:
    nameserver 10.96.0.10  # kube-dns service IP
    search default.svc.cluster.local svc.cluster.local cluster.local
    
  5. Common issues:
    • CoreDNS not running
    • Network policy blocking DNS traffic (port 53)
    • Wrong CNI configuration
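If a network policy is the culprit (issue two above), an explicit egress rule for DNS usually fixes it. A sketch (namespace and policy name are hypothetical; note that once Pods are selected by any egress policy, all other egress is denied unless additional rules allow it):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: default     # apply per namespace
spec:
  podSelector: {}        # all Pods in the namespace
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

DNS uses UDP by default but falls back to TCP for large responses, so allow both protocols.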
Ingress Not Working

Checklist:
  1. Ingress Controller installed?
    kubectl get pods -n ingress-nginx
    # or
    kubectl get pods -n kube-system | grep ingress
    
  2. Ingress resource created?
    kubectl get ingress
    kubectl describe ingress <ingress-name>
    
  3. Check Ingress address
    kubectl get ingress <ingress-name>
    # Should show ADDRESS column populated
    
  4. Verify Service and endpoints exist
    kubectl get svc
    kubectl get endpoints
    
  5. Check Ingress Controller logs
    kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
    
  6. Test without Ingress
    kubectl port-forward svc/<service-name> 8080:80
    curl localhost:8080
    
    If this works, issue is with Ingress configuration.
  7. Common issues:
    • DNS not pointing to Ingress load balancer
    • Incorrect host in Ingress rules
    • TLS certificate issues
    • Path not matching (use pathType: Prefix)
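A minimal Ingress that avoids the last two pitfalls, with an explicit host and pathType: Prefix (host, class, and service names are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx        # must match an installed controller
  rules:
  - host: app.example.com        # DNS must point at the controller's LB
    http:
      paths:
      - path: /
        pathType: Prefix         # matches / and everything under it
        backend:
          service:
            name: app
            port:
              number: 80
```

With pathType: Exact, /foo would not match /foo/bar, which is a frequent source of mysterious 404s.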
Cross-Namespace Communication

Problem: A Pod in namespace A can't reach a Service in namespace B.

Solution: Use the Fully Qualified Domain Name (FQDN):
<service-name>.<namespace>.svc.cluster.local
Example:
# From namespace 'dev' to service 'api' in namespace 'prod'
curl http://api.prod.svc.cluster.local:8080
Shorter form:
curl http://api.prod:8080
Check Network Policies:
# List all network policies
kubectl get networkpolicy --all-namespaces

# Check specific policy
kubectl describe networkpolicy <policy-name> -n <namespace>
Network policies might block cross-namespace traffic.
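If a policy is blocking the traffic, allowing ingress from the other namespace looks roughly like this (names are hypothetical; the kubernetes.io/metadata.name label is set automatically on namespaces in recent Kubernetes versions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-dev
  namespace: prod              # applied where the target Service's Pods run
spec:
  podSelector:
    matchLabels:
      app: api                 # the Pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: dev   # allow traffic from namespace 'dev'
```

On older clusters without the automatic label, label the namespace yourself and select on that instead.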

Storage Issues

PVC Stuck in Pending

Causes:
  1. No matching PersistentVolume
    kubectl get pv
    kubectl get pvc
    
    Check:
    • StorageClass matches
    • Access modes compatible
    • Sufficient capacity
  2. StorageClass not found
    kubectl get storageclass
    kubectl describe pvc <pvc-name>
    
  3. Dynamic provisioning not configured
    • Check if CSI driver installed
    • Verify cloud provider credentials
  4. Node affinity mismatch (local volumes)
    • PV has node affinity that doesn’t match any schedulable node
Solution:
# Ensure PVC and PV match:
# PVC:
storageClassName: fast
accessModes:
  - ReadWriteOnce
resources:
  requests:
    storage: 10Gi

# PV:
storageClassName: fast  # Must match
accessModes:
  - ReadWriteOnce  # Must be compatible
capacity:
  storage: 10Gi  # Must be >= PVC request
Volume Mount Failures

Error messages:
  • Unable to attach or mount volumes
  • Multi-Attach error
  • Volume is already exclusively attached
Common Causes:
  1. ReadWriteOnce (RWO) volume already mounted
    • RWO volumes can only be mounted by one node
    • Check if volume is mounted by another Pod
    # Find which Pod uses the PVC
    kubectl get pods -o json | jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") | .metadata.name'
    
    Solution: Delete old Pod first, or use ReadWriteMany (RWX) if supported.
  2. Volume not detached from previous node
    # Check VolumeAttachments
    kubectl get volumeattachment
    
    Solution: Wait for volume to detach, or manually delete VolumeAttachment.
  3. Permission issues
    kubectl logs <pod-name>
    # Look for: "Permission denied"
    
    Solution:
    securityContext:
      fsGroup: 1000  # Match volume owner GID
    

Node Issues

Node NotReady

Check node status:
kubectl get nodes
kubectl describe node <node-name>
Common Causes:
  1. kubelet not running
    # SSH to node
    systemctl status kubelet
    systemctl restart kubelet
    journalctl -u kubelet -f
    
  2. Network plugin issues
    • CNI plugin not installed or misconfigured
    • Check pod network (Calico, Flannel, Weave)
  3. Disk pressure
    # Check disk usage on node
    df -h
    
    # Clean up Docker/containerd
    docker system prune -a
    # or
    crictl rmi --prune
    
  4. Memory/CPU pressure
    # Check resource usage
    top
    free -m
    
  5. Certificate expired
    # Check kubelet logs for certificate errors
    journalctl -u kubelet | grep certificate
    
Pods Being Evicted

Reasons:
  1. DiskPressure
    • Node running out of disk space
    • Check node conditions:
      kubectl describe node <node-name> | grep Conditions -A 10
      
    Solution:
    • Clean up unused images and containers
    • Increase disk size
    • Configure garbage collection
  2. MemoryPressure
    • Node running out of memory
    • Kubelet starts evicting the lowest-priority Pods
    Solution:
    • Set proper resource requests/limits
    • Add more nodes
    • Use memory-efficient applications
  3. Node maintenance
    • Manual cordon and drain
Check eviction events:
kubectl get events --sort-by='.lastTimestamp' | grep Evicted

Debugging Tools & Commands

# Pod logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # Previous container
kubectl logs <pod-name> -c <container-name>  # Specific container
kubectl logs -f <pod-name>  # Follow/stream logs
kubectl logs --tail=100 <pod-name>  # Last 100 lines

# Describe resources (detailed info + events)
kubectl describe pod <pod-name>
kubectl describe node <node-name>
kubectl describe svc <service-name>

# Execute commands in container
kubectl exec <pod-name> -- ls /app
kubectl exec -it <pod-name> -- /bin/sh  # Interactive shell

# Port forwarding (test connectivity)
kubectl port-forward <pod-name> 8080:80
kubectl port-forward svc/<service-name> 8080:80

# Resource usage
kubectl top nodes
kubectl top pods
kubectl top pods --containers  # Per-container metrics

# Get events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector involvedObject.name=<pod-name>

# Get resource YAML/JSON
kubectl get pod <pod-name> -o yaml
kubectl get pod <pod-name> -o json | jq '.status'

# Debug with ephemeral container (K8s 1.23+)
kubectl debug <pod-name> -it --image=busybox
Create debug Pod:
kubectl run debug --image=nicolaka/netshoot -it --rm
Tools included:
  • curl, wget - HTTP requests
  • nslookup, dig - DNS debugging
  • ping, traceroute - Network connectivity
  • netstat, ss - Socket statistics
  • tcpdump - Packet capture
  • iperf - Network performance
Common tests:
# DNS resolution
nslookup kubernetes.default
dig +short <service-name>.<namespace>.svc.cluster.local

# HTTP connectivity
curl -v http://<service-name>:<port>
wget -O- http://<service-name>:<port>

# TCP connectivity
nc -zv <service-name> <port>
telnet <service-name> <port>

# Check routes
ip route

# Network interfaces
ip addr
ifconfig
Metrics Server (required for kubectl top):
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl get deployment metrics-server -n kube-system
Resource usage:
# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces
kubectl top pods --containers  # Per-container

# Sort by CPU
kubectl top pods --sort-by=cpu

# Sort by memory
kubectl top pods --sort-by=memory
Analyze resource requests vs limits:
kubectl get pods -o custom-columns=NAME:.metadata.name,CPU_REQUEST:.spec.containers[*].resources.requests.cpu,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,MEMORY_REQUEST:.spec.containers[*].resources.requests.memory,MEMORY_LIMIT:.spec.containers[*].resources.limits.memory
Check cluster capacity:
kubectl describe nodes | grep -A 5 "Allocated resources"

Common Error Messages

Cause: The liveness probe is failing, so the kubelet keeps restarting the container.

Debug:
kubectl describe pod <pod-name> | grep -A 10 Liveness
kubectl logs <pod-name>
Solutions:
  1. Increase probe timeouts:
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # Increase
      periodSeconds: 10
      timeoutSeconds: 5  # Increase
      failureThreshold: 3  # Increase
    
  2. Fix application health endpoint
    • Ensure /health returns 200 OK
    • Check app logs for errors
  3. Use startup probe for slow-starting apps
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30  # 30 * 10 = 300s max startup time
      periodSeconds: 10
    
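The arithmetic behind the failureThreshold comment above: the startup budget is simply failureThreshold × periodSeconds, which you can sanity-check for your own values:

```shell
# Startup probe budget: the kubelet tolerates failureThreshold consecutive
# failures, one probe every periodSeconds, before killing the container.
failureThreshold=30
periodSeconds=10
max_startup=$(( failureThreshold * periodSeconds ))
echo "max startup time: ${max_startup}s"
```

Size this budget to your slowest observed cold start plus headroom; the liveness probe only takes over once the startup probe has succeeded.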
See Pod stuck in ImagePullBackOff above.

Quick fixes:
# Verify image exists
docker pull <image-name>

# Check imagePullSecrets
kubectl get secret
kubectl describe pod <pod-name> | grep -A 5 "Image"
Cause: Container runtime (Docker/containerd) issue.

Solutions:
  1. Check container runtime on node:
    systemctl status docker
    systemctl status containerd
    
  2. Restart container runtime:
    systemctl restart containerd
    systemctl restart kubelet
    
  3. Check runtime logs:
    journalctl -u containerd -f
    

Best Practices for Troubleshooting

Systematic Approach:
  1. Gather information - logs, events, describe output
  2. Isolate the issue - Pod? Node? Network? Storage?
  3. Check recent changes - deployments, config updates
  4. Test incrementally - eliminate variables one by one
  5. Document findings - help future troubleshooting
Before force-deleting or making destructive changes:
  • Save current state: kubectl get pod <name> -o yaml > backup.yaml
  • Check if issue affects only one Pod or multiple
  • Test in dev/staging environment first
  • Have rollback plan ready