kubectl get events --field-selector involvedObject.name=<pod-name>
Solution:
Verify NGC API key is correct:
kubectl get secret <auth-secret> -o jsonpath='{.data.NGC_API_KEY}' | base64 -d
Check image pull secrets are configured:
spec:
  image:
    pullSecrets:
      - nvcr-secret
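As a quick sanity check on the pull secret, the sketch below tests whether a decoded Docker config JSON mentions a given registry. The helper name `has_registry` is hypothetical. The commented usage assumes the secret is named `nvcr-secret` (as in the snippet above), is of type `kubernetes.io/dockerconfigjson`, and targets `nvcr.io`.

```shell
# Hypothetical helper: succeeds if the Docker config JSON on stdin
# contains a quoted entry for the registry given as $1.
has_registry() {
  grep -F -q "\"$1\""
}

# Example use against a live cluster (assumes a dockerconfigjson secret
# named nvcr-secret in the current namespace):
#   kubectl get secret nvcr-secret -o jsonpath='{.data.\.dockerconfigjson}' \
#     | base64 -d | has_registry nvcr.io && echo "nvcr.io credentials present"
```

This is a plain text match, not a JSON parse, so treat a positive result as a hint that the registry is configured rather than proof that the credentials are valid.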
Storage Issues
Problem: PVC not available or storage quota exceeded
kubectl get pvc
kubectl describe pvc <pvc-name>
Solution:
Ensure storage class exists: kubectl get storageclass
Check PVC status and resize if needed
Verify sufficient storage quota
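To spot problem claims quickly, the PVC status check can be narrowed to claims that are not yet bound. The sketch below is a minimal filter, with a hypothetical helper name `unbound_pvcs`. It assumes the default `kubectl get pvc --no-headers` column layout in a single namespace, where the name is column 1 and the status is column 2.

```shell
# Hypothetical helper: read `kubectl get pvc --no-headers` output on stdin
# and print the names of claims whose STATUS column is not "Bound".
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}

# Example use against a live cluster:
#   kubectl get pvc --no-headers | unbound_pvcs
```

Each name it prints can then be passed to `kubectl describe pvc <pvc-name>` to read the provisioning events.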
3
Check Resource Constraints
# Check if pod is unschedulable
kubectl get events --field-selector reason=FailedScheduling
# Check node resources
kubectl top nodes
kubectl describe node <node-name>
kubectl get nimservice <name> -o yaml | grep -A 20 "conditions:"
2
Check Pod Logs
# Get pod logs
kubectl logs <pod-name>
# For multi-container pods
kubectl logs <pod-name> -c <container-name>
# Get previous container logs if pod restarted
kubectl logs <pod-name> --previous
3
Common Failure Scenarios
Model Loading Failures
Symptoms: Pod crashes during startup, logs show model loading errors
kubectl get pods -l app=<nimservice-name>
kubectl describe pod <pod-name> | grep -A 10 "Readiness:"
2
Verify Health Endpoints
# Port forward to check health endpoints
kubectl port-forward <pod-name> 8000:8000
# Check health endpoints
curl http://localhost:8000/v1/health/live
curl http://localhost:8000/v1/health/ready
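Because a model can take a while to load, a single `curl` against the ready endpoint may fail even when nothing is wrong. The sketch below polls a command until it succeeds or a retry budget runs out. The helper name `retry_until` and the attempt and delay values in the usage comment are illustrative assumptions, not settings from the product.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 as soon as the command succeeds, 1 otherwise.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Only sleep if another attempt is coming.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example use while the port-forward is running in another terminal
# (30 attempts, 10 seconds apart):
#   retry_until 30 10 curl -fsS http://localhost:8000/v1/health/ready
```

If the endpoint only becomes ready after many attempts, that duration is a rough indication of how much startup time the probes need to allow.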
3
Adjust Probe Settings
If the service is slow to start, increase probe timeouts:
kubectl get service <nimservice-name>
kubectl describe service <nimservice-name>
2
Check Endpoints
kubectl get endpoints <nimservice-name>
If endpoints are empty, pods aren’t ready or selectors don’t match.
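One way to test the selector hypothesis is to feed the Service's own selector back into a pod query. The sketch below converts a selector printed as JSON into the `key=value,key=value` form that `kubectl get pods -l` accepts. The helper name `selector_to_labels` is hypothetical, and it assumes a kubectl version that prints map values from `-o jsonpath` as JSON.

```shell
# Hypothetical helper: turn a JSON selector such as {"app":"my-nim"}
# (read from stdin) into a label selector string such as app=my-nim.
# Label keys and values cannot contain braces, quotes, or colons,
# so stripping and substituting those characters is sufficient here.
selector_to_labels() {
  sed -e 's/[{}"]//g' -e 's/:/=/g'
}

# Example use against a live cluster:
#   sel="$(kubectl get service <nimservice-name> -o jsonpath='{.spec.selector}' | selector_to_labels)"
#   kubectl get pods -l "$sel"
```

If the pod query returns nothing, the selector does not match any pod labels. If it returns pods that are not `Ready`, the readiness probe is the more likely cause.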
3
Test Service Connectivity
# From within cluster
kubectl run curl --image=curlimages/curl -it --rm -- \
  curl http://<service-name>:8000/v1/health/ready
# Via port-forward
kubectl port-forward svc/<service-name> 8000:8000
curl http://localhost:8000/v1/health/ready
# For Ingress
kubectl get ingress <nimservice-name>
kubectl describe ingress <nimservice-name>
# For HTTPRoute (Gateway API)
kubectl get httproute <nimservice-name>
kubectl describe httproute <nimservice-name>
2
Verify Ingress Controller
# Check ingress controller is running
kubectl get pods -n ingress-nginx
# Check ingress controller logs
kubectl logs -n ingress-nginx <controller-pod>
# Get all NIM resources
kubectl get nimservice,nimcache,nimpipeline -A
# Watch resource changes
kubectl get nimservice -w
# Get detailed status
kubectl get nimservice <name> -o yaml
Pod Diagnostics
# Get pod status
kubectl get pods -l app=<name>
# Describe pod
kubectl describe pod <pod-name>
# Get logs
kubectl logs <pod-name> -f
# Get previous logs
kubectl logs <pod-name> --previous
# Execute commands in pod
kubectl exec <pod-name> -it -- bash
Events
# All events in namespace
kubectl get events --sort-by='.lastTimestamp'
# Events for specific resource
kubectl get events --field-selector involvedObject.name=<name>
# Warning events only
kubectl get events --field-selector type=Warning
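When a namespace has many warning events, grouping them by reason can show which failure dominates. The sketch below tallies reasons from a one-reason-per-line stream. The helper name `count_by_reason` is hypothetical, and the commented usage assumes the `REASON:.reason` custom column is enough for triage.

```shell
# Hypothetical helper: read one event reason per line on stdin and
# print "<count> <reason>" lines, most frequent first.
count_by_reason() {
  awk 'NF { c[$1]++ } END { for (r in c) print c[r], r }' | sort -rn
}

# Example use against a live cluster:
#   kubectl get events --field-selector type=Warning \
#     -o custom-columns=REASON:.reason --no-headers | count_by_reason
```

A high count for a reason such as `FailedScheduling` points back to the resource constraint checks, while repeated image pull reasons point to the registry credentials.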
Resource Usage
# Node resources
kubectl top nodes
# Pod resources
kubectl top pods
# Specific pod
kubectl top pod <pod-name>
# GPU usage
kubectl describe nodes | grep -A 5 nvidia.com/gpu