
Pod Issues

Pod Stuck in Pending

Symptoms:
  • Pod shows Pending status indefinitely
  • Pods not being scheduled
Common Causes & Solutions:
Check cluster capacity:
kubectl describe nodes
kubectl top nodes
Look for:
  • CPU/Memory pressure
  • Allocatable vs requested resources
Solution:
  • Scale up cluster (add nodes)
  • Reduce resource requests
  • Enable cluster autoscaler
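If Pods are Pending because their requests exceed what any node can allocate, lowering the requests in the Deployment often unblocks scheduling. A minimal sketch (hypothetical app name and values; tune to your workload's real usage):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: api:1.0
        resources:
          requests:        # what the scheduler reserves on a node
            cpu: "250m"    # lowered from e.g. "2"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
```

The scheduler only looks at requests, not limits, when placing Pods, so over-sized requests can leave a cluster "full" even while actual utilization is low.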
CrashLoopBackOff

What it means: The container keeps crashing, and Kubernetes keeps restarting it with exponential backoff.

Debugging Steps:
# Check current logs
kubectl logs <pod-name> -n <namespace>

# Check previous container logs (crashed container)
kubectl logs <pod-name> -n <namespace> --previous

# Check all containers in Pod
kubectl logs <pod-name> -c <container-name> --previous

# Get detailed Pod info
kubectl describe pod <pod-name> -n <namespace>
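The exponential backoff means each crash lengthens the wait before the next restart attempt. A rough sketch of the delay schedule (the exact values are a kubelet implementation detail, commonly described as starting at 10s, doubling per crash, and capping at 5 minutes):

```shell
# Approximate CrashLoopBackOff restart delays: start at 10s, double per
# crash, cap at 300s. Illustrative values, not a kubelet API.
backoff_delays() {
  delay=10
  for crash in 1 2 3 4 5 6; do
    echo "crash $crash: wait ${delay}s"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 300 ]; then delay=300; fi
  done
}
backoff_delays
```

This is why a Pod can sit in CrashLoopBackOff for minutes between attempts even when the underlying failure is instant.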
Common Causes:
# Check for:
- Incorrect image tag
- Image doesn't exist in registry
- Wrong registry URL
- Authentication issues with private registry

# Verify:
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
Tip: Use the --previous flag to see logs from the crashed container, not the current restarted one.
Pod Stuck in ImagePullBackOff

What it means: Kubernetes cannot pull the container image from the registry.

Debugging:
kubectl describe pod <pod-name>
Common Causes:
  1. Image doesn’t exist
    • Typo in image name/tag
    • Image was deleted from registry
  2. Authentication required
    # Create docker-registry secret
    kubectl create secret docker-registry regcred \
      --docker-server=<registry> \
      --docker-username=<username> \
      --docker-password=<password> \
      --docker-email=<email>
    
    # Add to Pod spec
    spec:
      imagePullSecrets:
      - name: regcred
    
  3. Network issues
    • Registry is unreachable from cluster
    • Firewall blocking registry access
  4. Rate limiting
    • Docker Hub rate limits (100 pulls/6h for anonymous)
    • Solution: Authenticate or use alternative registry
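Putting causes 1 and 2 together, a minimal Pod sketch for a private registry (assuming the regcred secret created above; registry URL and tag are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app                              # hypothetical name
spec:
  imagePullSecrets:
  - name: regcred                                # the docker-registry secret from step 2
  containers:
  - name: app
    image: registry.example.com/team/app:1.4.2   # full registry URL and an exact tag
```

Pinning an exact tag (rather than latest) also makes typo-in-tag failures easier to spot in kubectl describe output.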
Pod Stuck in Terminating

What it means: The Pod won't delete and stays in Terminating status.

Debugging:
kubectl describe pod <pod-name>
Common Causes:
  1. Finalizers preventing deletion
    # Check finalizers
    kubectl get pod <pod-name> -o yaml | grep finalizers -A 5
    
    # Remove finalizers (use with caution)
    kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'
    
  2. PreStop hook hanging
    • PreStop hook takes too long
    • Increase terminationGracePeriodSeconds
  3. Force delete (last resort)
    kubectl delete pod <pod-name> --grace-period=0 --force
    
Force deletion can leave orphaned resources. Use only when necessary.
Pod OOMKilled

Symptoms: Pod terminated with exit code 137 or status OOMKilled.

Check:
kubectl describe pod <pod-name>
# Look for: "Reason: OOMKilled"

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
Solutions:
  1. Increase memory limits
    resources:
      limits:
        memory: "512Mi"  # Increase this
      requests:
        memory: "256Mi"
    
  2. Fix memory leak in application
    • Profile application memory usage
    • Check for unbounded caches
    • Review database connection pooling
  3. Enable Vertical Pod Autoscaler
    • VPA automatically adjusts resource requests/limits
  4. Monitor memory usage
    kubectl top pod <pod-name>
    

Configuration Issues

ConfigMap/Secret Changes Not Applied

Problem: ConfigMaps/Secrets mounted as environment variables are NOT updated when changed.

Why this happens: Environment variables are injected only when the container starts, so running Pods never see the new values. (ConfigMaps mounted as volume files are eventually refreshed, except when mounted via subPath.)

Solution: Restart the deployment:
kubectl rollout restart deployment <deployment-name>
Or force a rolling update:
kubectl patch deployment <deployment-name> \
  -p '{"spec":{"template":{"metadata":{"annotations":{"date":"'$(date +%s)'"}}}}}'
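A common alternative (popularized by Helm charts) is to put a hash of the config into the Pod template annotations, so any config change automatically triggers a rolling update. A sketch of the Deployment fragment (the annotation key and hash value are conventions, not a Kubernetes API):

```yaml
spec:
  template:
    metadata:
      annotations:
        checksum/config: "<sha256 of the rendered ConfigMap>"  # recompute on each deploy
```

Because the template changed, the Deployment controller rolls the Pods, which then pick up the new environment variables at start.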
Service Not Routing Traffic to Pods

Debugging Steps:
  1. Check Service endpoints
    kubectl get endpoints <service-name>
    
    If no endpoints, Service selector doesn’t match Pod labels.
  2. Verify Pod labels
    # Get Service selector
    kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
    
    # Get Pod labels
    kubectl get pods --show-labels
    
    Labels must match exactly.
  3. Check Pod readiness
    kubectl get pods
    
    Pods must be Running and READY 1/1. If not ready, check readiness probe:
    kubectl describe pod <pod-name> | grep -A 10 "Readiness"
    
  4. Test Service connectivity
    # From within cluster
    kubectl run test --image=busybox -it --rm -- wget -O- <service-name>:<port>
    
    # Port-forward to local machine
    kubectl port-forward svc/<service-name> 8080:80
    curl localhost:8080
    
  5. Check network policies
    kubectl get networkpolicy
    kubectl describe networkpolicy <policy-name>
    

Networking Issues

DNS Resolution Failing

Symptoms:
  • Pods can’t resolve service names
  • nslookup fails inside Pod
Debugging:
  1. Test DNS from Pod
    kubectl run test --image=busybox -it --rm -- nslookup kubernetes.default
    
  2. Check CoreDNS/kube-dns Pods
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    kubectl logs -n kube-system -l k8s-app=kube-dns
    
  3. Verify DNS service
    kubectl get svc -n kube-system kube-dns
    
  4. Check Pod DNS config
    kubectl exec <pod-name> -- cat /etc/resolv.conf
    
    Should contain:
    nameserver 10.96.0.10  # kube-dns service IP
    search default.svc.cluster.local svc.cluster.local cluster.local
    
  5. Common issues:
    • CoreDNS not running
    • Network policy blocking DNS traffic (port 53)
    • Wrong CNI configuration
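If a network policy is the culprit (issue two above), an explicit egress rule for DNS usually fixes it. A sketch (namespace and policy name are hypothetical; note that once Pods are selected by any egress policy, all other egress is denied unless additional rules allow it):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: default     # apply per namespace
spec:
  podSelector: {}        # all Pods in the namespace
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

DNS uses UDP by default but falls back to TCP for large responses, so allow both protocols.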
Ingress Not Working

Checklist:
  1. Ingress Controller installed?
    kubectl get pods -n ingress-nginx
    # or
    kubectl get pods -n kube-system | grep ingress
    
  2. Ingress resource created?
    kubectl get ingress
    kubectl describe ingress <ingress-name>
    
  3. Check Ingress address
    kubectl get ingress <ingress-name>
    # Should show ADDRESS column populated
    
  4. Verify Service and endpoints exist
    kubectl get svc
    kubectl get endpoints
    
  5. Check Ingress Controller logs
    kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
    
  6. Test without Ingress
    kubectl port-forward svc/<service-name> 8080:80
    curl localhost:8080
    
    If this works, issue is with Ingress configuration.
  7. Common issues:
    • DNS not pointing to Ingress load balancer
    • Incorrect host in Ingress rules
    • TLS certificate issues
    • Path not matching (use pathType: Prefix)
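A minimal Ingress that avoids the last two pitfalls, with an explicit host and pathType: Prefix (host, class, and service names are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx        # must match an installed controller
  rules:
  - host: app.example.com        # DNS must point at the controller's LB
    http:
      paths:
      - path: /
        pathType: Prefix         # matches / and everything under it
        backend:
          service:
            name: app
            port:
              number: 80
```

With pathType: Exact, /foo would not match /foo/bar, which is a frequent source of mysterious 404s.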
Cross-Namespace Communication

Problem: A Pod in namespace A can't reach a Service in namespace B.

Solution: Use the Fully Qualified Domain Name (FQDN):
<service-name>.<namespace>.svc.cluster.local
Example:
# From namespace 'dev' to service 'api' in namespace 'prod'
curl http://api.prod.svc.cluster.local:8080
Shorter form:
curl http://api.prod:8080
Check Network Policies:
# List all network policies
kubectl get networkpolicy --all-namespaces

# Check specific policy
kubectl describe networkpolicy <policy-name> -n <namespace>
Network policies might block cross-namespace traffic.
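If a policy is blocking the traffic, allowing ingress from the other namespace looks roughly like this (names are hypothetical; the kubernetes.io/metadata.name label is set automatically on namespaces in recent Kubernetes versions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-dev
  namespace: prod              # applied where the target Service's Pods run
spec:
  podSelector:
    matchLabels:
      app: api                 # the Pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: dev   # allow traffic from namespace 'dev'
```

On older clusters without the automatic label, label the namespace yourself and select on that instead.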

Storage Issues

PVC Stuck in Pending

Causes:
  1. No matching PersistentVolume
    kubectl get pv
    kubectl get pvc
    
    Check:
    • StorageClass matches
    • Access modes compatible
    • Sufficient capacity
  2. StorageClass not found
    kubectl get storageclass
    kubectl describe pvc <pvc-name>
    
  3. Dynamic provisioning not configured
    • Check if CSI driver installed
    • Verify cloud provider credentials
  4. Node affinity mismatch (local volumes)
    • PV has node affinity that doesn’t match any schedulable node
Solution:
# Ensure PVC and PV match:
# PVC:
storageClassName: fast
accessModes:
  - ReadWriteOnce
resources:
  requests:
    storage: 10Gi

# PV:
storageClassName: fast  # Must match
accessModes:
  - ReadWriteOnce  # Must be compatible
capacity:
  storage: 10Gi  # Must be >= PVC request
Volume Mount Failures

Error messages:
  • Unable to attach or mount volumes
  • Multi-Attach error
  • Volume is already exclusively attached
Common Causes:
  1. ReadWriteOnce (RWO) volume already mounted
    • RWO volumes can only be mounted by one node
    • Check if volume is mounted by another Pod
    # Find which Pod uses the PVC
    kubectl get pods -o json | jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") | .metadata.name'
    
    Solution: Delete old Pod first, or use ReadWriteMany (RWX) if supported.
  2. Volume not detached from previous node
    # Check VolumeAttachments
    kubectl get volumeattachment
    
    Solution: Wait for volume to detach, or manually delete VolumeAttachment.
  3. Permission issues
    kubectl logs <pod-name>
    # Look for: "Permission denied"
    
    Solution:
    securityContext:
      fsGroup: 1000  # Match volume owner GID
    

Node Issues

Node NotReady

Check node status:
kubectl get nodes
kubectl describe node <node-name>
Common Causes:
  1. kubelet not running
    # SSH to node
    systemctl status kubelet
    systemctl restart kubelet
    journalctl -u kubelet -f
    
  2. Network plugin issues
    • CNI plugin not installed or misconfigured
    • Check pod network (Calico, Flannel, Weave)
  3. Disk pressure
    # Check disk usage on node
    df -h
    
    # Clean up Docker/containerd
    docker system prune -a
    # or
    crictl rmi --prune
    
  4. Memory/CPU pressure
    # Check resource usage
    top
    free -m
    
  5. Certificate expired
    # Check kubelet logs for certificate errors
    journalctl -u kubelet | grep certificate
    
Pods Being Evicted

Reasons:
  1. DiskPressure
    • Node running out of disk space
    • Check node conditions:
      kubectl describe node <node-name> | grep Conditions -A 10
      
    Solution:
    • Clean up unused images and containers
    • Increase disk size
    • Configure garbage collection
  2. MemoryPressure
    • Node running out of memory
    • Kubelet starts evicting the lowest-priority Pods
    Solution:
    • Set proper resource requests/limits
    • Add more nodes
    • Use memory-efficient applications
  3. Node maintenance
    • Manual cordon and drain
Check eviction events:
kubectl get events --sort-by='.lastTimestamp' | grep Evicted

Debugging Tools & Commands

# Pod logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # Previous container
kubectl logs <pod-name> -c <container-name>  # Specific container
kubectl logs -f <pod-name>  # Follow/stream logs
kubectl logs --tail=100 <pod-name>  # Last 100 lines

# Describe resources (detailed info + events)
kubectl describe pod <pod-name>
kubectl describe node <node-name>
kubectl describe svc <service-name>

# Execute commands in container
kubectl exec <pod-name> -- ls /app
kubectl exec -it <pod-name> -- /bin/sh  # Interactive shell

# Port forwarding (test connectivity)
kubectl port-forward <pod-name> 8080:80
kubectl port-forward svc/<service-name> 8080:80

# Resource usage
kubectl top nodes
kubectl top pods
kubectl top pods --containers  # Per-container metrics

# Get events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector involvedObject.name=<pod-name>

# Get resource YAML/JSON
kubectl get pod <pod-name> -o yaml
kubectl get pod <pod-name> -o json | jq '.status'

# Debug with ephemeral container (K8s 1.23+)
kubectl debug <pod-name> -it --image=busybox
Create debug Pod:
kubectl run debug --image=nicolaka/netshoot -it --rm
Tools included:
  • curl, wget - HTTP requests
  • nslookup, dig - DNS debugging
  • ping, traceroute - Network connectivity
  • netstat, ss - Socket statistics
  • tcpdump - Packet capture
  • iperf - Network performance
Common tests:
# DNS resolution
nslookup kubernetes.default
dig +short <service-name>.<namespace>.svc.cluster.local

# HTTP connectivity
curl -v http://<service-name>:<port>
wget -O- http://<service-name>:<port>

# TCP connectivity
nc -zv <service-name> <port>
telnet <service-name> <port>

# Check routes
ip route

# Network interfaces
ip addr
ifconfig
Metrics Server (required for kubectl top):
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl get deployment metrics-server -n kube-system
Resource usage:
# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces
kubectl top pods --containers  # Per-container

# Sort by CPU
kubectl top pods --sort-by=cpu

# Sort by memory
kubectl top pods --sort-by=memory
Analyze resource requests vs limits:
kubectl get pods -o custom-columns=NAME:.metadata.name,CPU_REQUEST:.spec.containers[*].resources.requests.cpu,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,MEMORY_REQUEST:.spec.containers[*].resources.requests.memory,MEMORY_LIMIT:.spec.containers[*].resources.limits.memory
Check cluster capacity:
kubectl describe nodes | grep -A 5 "Allocated resources"

Common Error Messages

Cause: The liveness probe is failing, so the kubelet keeps restarting the container.

Debug:
kubectl describe pod <pod-name> | grep -A 10 Liveness
kubectl logs <pod-name>
Solutions:
  1. Increase probe timeouts:
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # Increase
      periodSeconds: 10
      timeoutSeconds: 5  # Increase
      failureThreshold: 3  # Increase
    
  2. Fix application health endpoint
    • Ensure /health returns 200 OK
    • Check app logs for errors
  3. Use startup probe for slow-starting apps
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30  # 30 * 10 = 300s max startup time
      periodSeconds: 10
    
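The arithmetic behind the failureThreshold comment above: the startup budget is simply failureThreshold × periodSeconds, which you can sanity-check for your own values:

```shell
# Startup probe budget: the kubelet tolerates failureThreshold consecutive
# failures, one probe every periodSeconds, before killing the container.
failureThreshold=30
periodSeconds=10
max_startup=$(( failureThreshold * periodSeconds ))
echo "max startup time: ${max_startup}s"
```

Size this budget to your slowest observed cold start plus headroom; the liveness probe only takes over once the startup probe has succeeded.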
See Pod stuck in ImagePullBackOff above.

Quick fixes:
# Verify image exists
docker pull <image-name>

# Check imagePullSecrets
kubectl get secret
kubectl describe pod <pod-name> | grep -A 5 "Image"
Cause: Container runtime (Docker/containerd) issue.

Solutions:
  1. Check container runtime on node:
    systemctl status docker
    systemctl status containerd
    
  2. Restart container runtime:
    systemctl restart containerd
    systemctl restart kubelet
    
  3. Check runtime logs:
    journalctl -u containerd -f
    

Best Practices for Troubleshooting

Systematic Approach:
  1. Gather information - logs, events, describe output
  2. Isolate the issue - Pod? Node? Network? Storage?
  3. Check recent changes - deployments, config updates
  4. Test incrementally - eliminate variables one by one
  5. Document findings - help future troubleshooting
Before force-deleting or making destructive changes:
  • Save current state: kubectl get pod <name> -o yaml > backup.yaml
  • Check if issue affects only one Pod or multiple
  • Test in dev/staging environment first
  • Have rollback plan ready