This guide helps you diagnose and resolve common deployment issues with the NVIDIA NIM Operator.

Understanding Status Conditions

All NIM resources expose status conditions that help diagnose issues.

Status States

State      Description                            Next Action
Ready      Service is fully operational           Monitor performance
NotReady   Service exists but pods aren't ready   Check pod status and logs
Pending    Resources are being created            Wait or check for scheduling issues
Failed     Deployment has failed                  Check events and logs
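These states surface in the resource's `.status.state` field, which makes them easy to script against. A minimal sketch of extracting the state, using a hypothetical sample payload in place of live `kubectl get nimservice <name> -o json` output:

```shell
# Hypothetical status payload; in practice, pipe from:
#   kubectl get nimservice <name> -o json
status='{"status":{"state":"NotReady","conditions":[{"type":"Ready","status":"False"}]}}'

# Pull out the value of the "state" field
printf '%s\n' "$status" | grep -o '"state":"[^"]*"' | cut -d'"' -f4
```

With real cluster access, `kubectl get nimservice <name> -o jsonpath='{.status.state}'` does the same thing directly.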

Common NIMService Issues

Issue: NIMService Stuck in Pending State

Step 1: Check Pod Status

kubectl get pods -l app=<nimservice-name>
kubectl describe pod <pod-name>
Step 2: Common Causes and Solutions

Problem: Not enough GPUs available
# Check GPU availability
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
Solution:
  • Scale down other GPU workloads
  • Add more GPU nodes
  • Reduce GPU requests in your NIMService spec
Problem: Cannot pull NIM container image
kubectl get events --field-selector involvedObject.name=<pod-name>
Solution:
  • Verify NGC API key is correct:
    kubectl get secret <auth-secret> -o jsonpath='{.data.NGC_API_KEY}' | base64 -d
    
  • Check image pull secrets are configured:
    spec:
      image:
        pullSecrets:
          - nvcr-secret
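The decode step above should reproduce your key byte for byte. A quick local sanity check of the round trip (with a hypothetical key), which also illustrates the most common pitfall, a trailing newline introduced when the secret was created with `echo` instead of `printf`:

```shell
# Hypothetical key; never hard-code a real NGC API key in scripts.
key='nvapi-EXAMPLE-0000'

# Kubernetes stores secret data base64-encoded; decoding must be lossless.
encoded=$(printf '%s' "$key" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)

# A mismatch here usually means a stray trailing newline in the stored value.
[ "$decoded" = "$key" ] && echo OK
```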
    
Problem: PVC not available or storage quota exceeded
kubectl get pvc
kubectl describe pvc <pvc-name>
Solution:
  • Ensure storage class exists: kubectl get storageclass
  • Check PVC status and resize if needed
  • Verify sufficient storage quota
Step 3: Check Resource Constraints

# Check if pod is unschedulable
kubectl get events --field-selector reason=FailedScheduling

# Check node resources
kubectl top nodes
kubectl describe node <node-name>

Issue: NIMService in Failed State

Step 1: Inspect Status Conditions

kubectl get nimservice <name> -o yaml | grep -A 20 "conditions:"
Step 2: Check Pod Logs

# Get pod logs
kubectl logs <pod-name>

# For multi-container pods
kubectl logs <pod-name> -c <container-name>

# Get previous container logs if pod restarted
kubectl logs <pod-name> --previous
Step 3: Common Failure Scenarios

Symptoms: Pod crashes during startup, logs show model loading errors
kubectl logs <pod-name> | grep -i "error\|failed\|exception"
Solutions:
  • Verify model cache is ready:
    kubectl get nimcache <cache-name> -o jsonpath='{.status.state}'
    
  • Check NIM_CACHE_PATH is correctly mounted
  • Ensure sufficient shared memory:
    spec:
      storage:
        sharedMemorySizeLimit: 8Gi
    
Symptoms: Pod killed with exit code 137
kubectl describe pod <pod-name> | grep -A 5 "State:"
Solutions:
  • Increase memory limits:
    spec:
      resources:
        limits:
          memory: 32Gi
    
  • Check actual memory usage: kubectl top pod <pod-name>
  • Increase shared memory size limit
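Exit code 137 is not arbitrary: Kubernetes reports 128 plus the number of the signal that killed the container, and signal 9 (SIGKILL) is what the kernel OOM killer sends:

```shell
# exit_code = 128 + signal; 137 decodes to signal 9 (SIGKILL),
# the typical signature of an out-of-memory kill.
exit_code=137
echo $((exit_code - 128))
```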
Symptoms: 401 Unauthorized errors in logs
Solutions:
  • Verify NGC API key secret exists and is valid
  • Check secret is in the same namespace
  • Ensure authSecret field references correct secret name:
    spec:
      authSecret: ngc-api-secret
    

Issue: NIMService Not Ready

Step 1: Check Pod Readiness

kubectl get pods -l app=<nimservice-name>
kubectl describe pod <pod-name> | grep -A 10 "Readiness:"
Step 2: Verify Health Endpoints

# Port forward to check health endpoints
kubectl port-forward <pod-name> 8000:8000

# Check health endpoints
curl http://localhost:8000/v1/health/live
curl http://localhost:8000/v1/health/ready
Step 3: Adjust Probe Settings

If the service is slow to start, increase probe timeouts:
spec:
  startupProbe:
    probe:
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 180
  readinessProbe:
    probe:
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
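As a sanity check on these values: the longest a pod may take to start before the startup probe gives up is roughly initialDelaySeconds plus periodSeconds times failureThreshold:

```shell
# Startup window for the values above: 60 + 10 * 180 seconds
initial=60; period=10; threshold=180
echo $((initial + period * threshold))
```

For large models, size this window to the slowest expected model load rather than the average.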

Common NIMCache Issues

Issue: NIMCache Stuck in InProgress State

Step 1: Check Caching Job Status

# Find the caching job
kubectl get jobs | grep <nimcache-name>

# Check job status
kubectl describe job <job-name>
Step 2: Check Job Pod Logs

# Get pod from job
kubectl get pods -l job-name=<job-name>

# View logs
kubectl logs <pod-name> -f
Step 3: Common Caching Issues

Problem: Cannot download model from NGC
Check:
kubectl logs <pod-name> | grep -i "download\|ngc\|network"
Solutions:
  • Verify NGC credentials
  • Check network connectivity
  • Configure proxy if needed:
    spec:
      proxy:
        httpsProxy: http://proxy.example.com:3128
    
Problem: Not enough disk space for model
Check:
kubectl get pvc <pvc-name>
kubectl describe pvc <pvc-name>
Solution:
  • Increase PVC size:
    spec:
      storage:
        pvc:
          size: 100Gi
    
Problem: Cannot write to cache volume
Check:
kubectl logs <pod-name> | grep -i "permission denied"
Solution:
  • Set correct userID and groupID:
    spec:
      userID: 1000
      groupID: 2000
    

Issue: NIMCache Failed State

Step 1: Check Job Failure Reason

kubectl describe job <job-name> | grep -A 10 "Conditions:"
kubectl get events --field-selector involvedObject.name=<job-name>
Step 2: Review Job Logs

# Get failed pod logs
kubectl logs <pod-name> --previous
Step 3: Delete and Recreate

If the issue is resolved, delete and recreate the NIMCache:
kubectl delete nimcache <name>
kubectl apply -f nimcache.yaml

Multi-Node NIMService Issues

Issue: LeaderWorkerSet Pods Not Starting

Step 1: Check LeaderWorkerSet Status

kubectl get leaderworkerset <lws-name>
kubectl describe leaderworkerset <lws-name>
Step 2: Verify MPI Configuration

# Check MPI ConfigMap
kubectl get configmap <nimservice-name>-mpi-config -o yaml

# Check SSH secrets
kubectl get secret <nimservice-name>-ssh-pk
Step 3: Check Worker Pod Connectivity

# Check if workers can reach leader
kubectl exec <worker-pod> -- ping <leader-pod-hostname>

# Check SSH connectivity
kubectl logs <leader-pod> | grep -i "ssh\|mpi"

Networking Issues

Issue: Service Not Accessible

Step 1: Verify Service Exists

kubectl get service <nimservice-name>
kubectl describe service <nimservice-name>
Step 2: Check Endpoints

kubectl get endpoints <nimservice-name>
If endpoints are empty, pods aren’t ready or selectors don’t match.
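A Service gets endpoints only when its selector is a subset of a ready pod's labels. A rough local sketch of that matching rule, using hypothetical labels (compare the real values of `kubectl get svc <name> -o jsonpath='{.spec.selector}'` against the pod's labels):

```shell
# Hypothetical selector and pod labels for illustration only.
selector='app=my-nim'
pod_labels='app=my-nim,app.kubernetes.io/part-of=nim'

# Wrap both in commas so we match whole key=value pairs, not substrings.
case ",$pod_labels," in
  *",$selector,"*) echo "selector matches" ;;
  *)               echo "selector does NOT match" ;;
esac
```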
Step 3: Test Service Connectivity

# From within cluster
kubectl run curl --image=curlimages/curl -it --rm -- \
  curl http://<service-name>:8000/v1/health/ready

# Via port-forward
kubectl port-forward svc/<service-name> 8000:8000
curl http://localhost:8000/v1/health/ready
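When the service comes up slowly, a short polling loop beats re-running curl by hand. A sketch, with `true` standing in for the health-check command above:

```shell
# Replace `true` with: curl -sf http://localhost:8000/v1/health/ready
attempts=0
until true; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 30 ]; then echo "timed out"; exit 1; fi
  sleep 2
done
echo "ready after $attempts retries"
```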

Issue: Ingress/Route Not Working

Step 1: Check Ingress/HTTPRoute Resource

# For Ingress
kubectl get ingress <nimservice-name>
kubectl describe ingress <nimservice-name>

# For HTTPRoute (Gateway API)
kubectl get httproute <nimservice-name>
kubectl describe httproute <nimservice-name>
Step 2: Verify Ingress Controller

# Check ingress controller is running
kubectl get pods -n ingress-nginx

# Check ingress controller logs
kubectl logs -n ingress-nginx <controller-pod>
Step 3: Test DNS Resolution

nslookup <hostname>
curl -v http://<hostname>/v1/health/ready

Operator Issues

Issue: Operator Not Reconciling Resources

Step 1: Check Operator Logs

kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager -f
Step 2: Verify CRDs Are Installed

kubectl get crd | grep apps.nvidia.com
Expected CRDs:
  • nimservices.apps.nvidia.com
  • nimcaches.apps.nvidia.com
  • nimpipelines.apps.nvidia.com
  • nemocustomizers.apps.nvidia.com
  • And other NeMo CRDs
Step 3: Check Operator Health

kubectl get pods -n nim-operator
kubectl describe pod -n nim-operator <operator-pod>

Debug Commands Cheat Sheet

# Get all NIM resources
kubectl get nimservice,nimcache,nimpipeline -A

# Watch resource changes
kubectl get nimservice -w

# Get detailed status
kubectl get nimservice <name> -o yaml

# Get pod status
kubectl get pods -l app=<name>

# Describe pod
kubectl describe pod <pod-name>

# Get logs
kubectl logs <pod-name> -f

# Get previous logs
kubectl logs <pod-name> --previous

# Execute commands in pod
kubectl exec <pod-name> -it -- bash

# All events in namespace
kubectl get events --sort-by='.lastTimestamp'

# Events for specific resource
kubectl get events --field-selector involvedObject.name=<name>

# Warning events only
kubectl get events --field-selector type=Warning

# Node resources
kubectl top nodes

# Pod resources
kubectl top pods

# Specific pod
kubectl top pod <pod-name>

# GPU usage
kubectl describe nodes | grep -A 5 nvidia.com/gpu

Getting Help

Check Operator Logs

kubectl logs -n nim-operator \
  deployment/k8s-nim-operator-controller-manager -f

Collect Debug Information

# Resource definitions
kubectl get nimservice <name> -o yaml > nimservice.yaml

# Pod logs
kubectl logs <pod-name> > pod.log

# Events
kubectl get events > events.log
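The files above can be bundled into a single archive to attach to a bug report. A minimal sketch using placeholder files; swap in the kubectl output files collected above:

```shell
# Placeholder files stand in for the real nimservice.yaml, pod.log, events.log.
bundle_dir=$(mktemp -d)
for f in nimservice.yaml pod.log events.log; do
  echo "placeholder" > "$bundle_dir/$f"
done

# Pack everything into one archive and list its contents to verify.
tar -czf "$bundle_dir/nim-debug.tgz" -C "$bundle_dir" nimservice.yaml pod.log events.log
tar -tzf "$bundle_dir/nim-debug.tgz"
```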

GitHub Issues and Documentation

Search known issues or report bugs on the NVIDIA k8s-nim-operator GitHub repository, and consult the official NIM Operator documentation. When reporting issues, always include the operator version, Kubernetes version, and relevant logs.

Next Steps

  • Monitoring: set up monitoring and observability
  • Upgrades: learn how to upgrade the operator
