This guide helps you diagnose and resolve common deployment issues with the NVIDIA NIM Operator.

Understanding Status Conditions

All NIM resources expose status conditions that help diagnose issues.

Status States

State      Description                            Next Action
Ready      Service is fully operational           Monitor performance
NotReady   Service exists but pods aren't ready   Check pod status and logs
Pending    Resources are being created            Wait or check for scheduling issues
Failed     Deployment has failed                  Check events and logs
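These states surface in the resource's `.status.state` field, which makes them easy to script against. A minimal sketch of extracting the state, using a hypothetical sample payload in place of live `kubectl get nimservice <name> -o json` output:

```shell
# Hypothetical status payload; in practice, pipe from:
#   kubectl get nimservice <name> -o json
status='{"status":{"state":"NotReady","conditions":[{"type":"Ready","status":"False"}]}}'

# Pull out the value of the "state" field
printf '%s\n' "$status" | grep -o '"state":"[^"]*"' | cut -d'"' -f4
```

With real cluster access, `kubectl get nimservice <name> -o jsonpath='{.status.state}'` does the same thing directly.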

Common NIMService Issues

Issue: NIMService Stuck in Pending State

Step 1: Check Pod Status

kubectl get pods -l app=<nimservice-name>
kubectl describe pod <pod-name>
Step 2: Common Causes and Solutions

Problem: Not enough GPUs available
# Check GPU availability
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
Solution:
  • Scale down other GPU workloads
  • Add more GPU nodes
  • Reduce GPU requests in your NIMService spec
Problem: Cannot pull NIM container image
kubectl get events --field-selector involvedObject.name=<pod-name>
Solution:
  • Verify NGC API key is correct:
    kubectl get secret <auth-secret> -o jsonpath='{.data.NGC_API_KEY}' | base64 -d
    
  • Check image pull secrets are configured:
    spec:
      image:
        pullSecrets:
          - nvcr-secret
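The decode step above should reproduce your key byte for byte. A quick local sanity check of the round trip (with a hypothetical key), which also illustrates the most common pitfall, a trailing newline introduced when the secret was created with `echo` instead of `printf`:

```shell
# Hypothetical key; never hard-code a real NGC API key in scripts.
key='nvapi-EXAMPLE-0000'

# Kubernetes stores secret data base64-encoded; decoding must be lossless.
encoded=$(printf '%s' "$key" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)

# A mismatch here usually means a stray trailing newline in the stored value.
[ "$decoded" = "$key" ] && echo OK
```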
    
Problem: PVC not available or storage quota exceeded
kubectl get pvc
kubectl describe pvc <pvc-name>
Solution:
  • Ensure storage class exists: kubectl get storageclass
  • Check PVC status and resize if needed
  • Verify sufficient storage quota
Step 3: Check Resource Constraints

# Check if pod is unschedulable
kubectl get events --field-selector reason=FailedScheduling

# Check node resources
kubectl top nodes
kubectl describe node <node-name>

Issue: NIMService in Failed State

Step 1: Inspect Status Conditions

kubectl get nimservice <name> -o yaml | grep -A 20 "conditions:"
Step 2: Check Pod Logs

# Get pod logs
kubectl logs <pod-name>

# For multi-container pods
kubectl logs <pod-name> -c <container-name>

# Get previous container logs if pod restarted
kubectl logs <pod-name> --previous
Step 3: Common Failure Scenarios

Symptoms: Pod crashes during startup, logs show model loading errors
kubectl logs <pod-name> | grep -i "error\|failed\|exception"
Solutions:
  • Verify model cache is ready:
    kubectl get nimcache <cache-name> -o jsonpath='{.status.state}'
    
  • Check NIM_CACHE_PATH is correctly mounted
  • Ensure sufficient shared memory:
    spec:
      storage:
        sharedMemorySizeLimit: 8Gi
    
Symptoms: Pod killed with exit code 137
kubectl describe pod <pod-name> | grep -A 5 "State:"
Solutions:
  • Increase memory limits:
    spec:
      resources:
        limits:
          memory: 32Gi
    
  • Check actual memory usage: kubectl top pod <pod-name>
  • Increase shared memory size limit
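Exit code 137 is not arbitrary: Kubernetes reports 128 plus the number of the signal that killed the container, and signal 9 (SIGKILL) is what the kernel OOM killer sends:

```shell
# exit_code = 128 + signal; 137 decodes to signal 9 (SIGKILL),
# the typical signature of an out-of-memory kill.
exit_code=137
echo $((exit_code - 128))
```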
Symptoms: 401 Unauthorized errors in logs
Solutions:
  • Verify NGC API key secret exists and is valid
  • Check secret is in the same namespace
  • Ensure authSecret field references correct secret name:
    spec:
      authSecret: ngc-api-secret
    

Issue: NIMService Not Ready

Step 1: Check Pod Readiness

kubectl get pods -l app=<nimservice-name>
kubectl describe pod <pod-name> | grep -A 10 "Readiness:"
Step 2: Verify Health Endpoints

# Port forward to check health endpoints
kubectl port-forward <pod-name> 8000:8000

# Check health endpoints
curl http://localhost:8000/v1/health/live
curl http://localhost:8000/v1/health/ready
Step 3: Adjust Probe Settings

If the service is slow to start, increase probe timeouts:
spec:
  startupProbe:
    probe:
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 180
  readinessProbe:
    probe:
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
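As a sanity check on these values: the longest a pod may take to start before the startup probe gives up is roughly initialDelaySeconds plus periodSeconds times failureThreshold:

```shell
# Startup window for the values above: 60 + 10 * 180 seconds
initial=60; period=10; threshold=180
echo $((initial + period * threshold))
```

For large models, size this window to the slowest expected model load rather than the average.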

Common NIMCache Issues

Issue: NIMCache Stuck in InProgress State

Step 1: Check Caching Job Status

# Find the caching job
kubectl get jobs | grep <nimcache-name>

# Check job status
kubectl describe job <job-name>
Step 2: Check Job Pod Logs

# Get pod from job
kubectl get pods -l job-name=<job-name>

# View logs
kubectl logs <pod-name> -f
Step 3: Common Caching Issues

Problem: Cannot download model from NGC
Check:
kubectl logs <pod-name> | grep -i "download\|ngc\|network"
Solutions:
  • Verify NGC credentials
  • Check network connectivity
  • Configure proxy if needed:
    spec:
      proxy:
        httpsProxy: http://proxy.example.com:3128
    
Problem: Not enough disk space for model
Check:
kubectl get pvc <pvc-name>
kubectl describe pvc <pvc-name>
Solution:
  • Increase PVC size:
    spec:
      storage:
        pvc:
          size: 100Gi
    
Problem: Cannot write to cache volume
Check:
kubectl logs <pod-name> | grep -i "permission denied"
Solution:
  • Set correct userID and groupID:
    spec:
      userID: 1000
      groupID: 2000
    

Issue: NIMCache Failed State

Step 1: Check Job Failure Reason

kubectl describe job <job-name> | grep -A 10 "Conditions:"
kubectl get events --field-selector involvedObject.name=<job-name>
Step 2: Review Job Logs

# Get failed pod logs
kubectl logs <pod-name> --previous
Step 3: Delete and Recreate

If the issue is resolved, delete and recreate the NIMCache:
kubectl delete nimcache <name>
kubectl apply -f nimcache.yaml

Multi-Node NIMService Issues

Issue: LeaderWorkerSet Pods Not Starting

Step 1: Check LeaderWorkerSet Status

kubectl get leaderworkerset <lws-name>
kubectl describe leaderworkerset <lws-name>
Step 2: Verify MPI Configuration

# Check MPI ConfigMap
kubectl get configmap <nimservice-name>-mpi-config -o yaml

# Check SSH secrets
kubectl get secret <nimservice-name>-ssh-pk
Step 3: Check Worker Pod Connectivity

# Check if workers can reach leader
kubectl exec <worker-pod> -- ping <leader-pod-hostname>

# Check SSH connectivity
kubectl logs <leader-pod> | grep -i "ssh\|mpi"

Networking Issues

Issue: Service Not Accessible

Step 1: Verify Service Exists

kubectl get service <nimservice-name>
kubectl describe service <nimservice-name>
Step 2: Check Endpoints

kubectl get endpoints <nimservice-name>
If endpoints are empty, pods aren’t ready or selectors don’t match.
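A Service gets endpoints only when its selector is a subset of a ready pod's labels. A rough local sketch of that matching rule, using hypothetical labels (compare the real values of `kubectl get svc <name> -o jsonpath='{.spec.selector}'` against the pod's labels):

```shell
# Hypothetical selector and pod labels for illustration only.
selector='app=my-nim'
pod_labels='app=my-nim,app.kubernetes.io/part-of=nim'

# Wrap both in commas so we match whole key=value pairs, not substrings.
case ",$pod_labels," in
  *",$selector,"*) echo "selector matches" ;;
  *)               echo "selector does NOT match" ;;
esac
```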
Step 3: Test Service Connectivity

# From within cluster
kubectl run curl --image=curlimages/curl -it --rm -- \
  curl http://<service-name>:8000/v1/health/ready

# Via port-forward
kubectl port-forward svc/<service-name> 8000:8000
curl http://localhost:8000/v1/health/ready
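When the service comes up slowly, a short polling loop beats re-running curl by hand. A sketch, with `true` standing in for the health-check command above:

```shell
# Replace `true` with: curl -sf http://localhost:8000/v1/health/ready
attempts=0
until true; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 30 ]; then echo "timed out"; exit 1; fi
  sleep 2
done
echo "ready after $attempts retries"
```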

Issue: Ingress/Route Not Working

Step 1: Check Ingress/HTTPRoute Resource

# For Ingress
kubectl get ingress <nimservice-name>
kubectl describe ingress <nimservice-name>

# For HTTPRoute (Gateway API)
kubectl get httproute <nimservice-name>
kubectl describe httproute <nimservice-name>
Step 2: Verify Ingress Controller

# Check ingress controller is running
kubectl get pods -n ingress-nginx

# Check ingress controller logs
kubectl logs -n ingress-nginx <controller-pod>
Step 3: Test DNS Resolution

nslookup <hostname>
curl -v http://<hostname>/v1/health/ready

Operator Issues

Issue: Operator Not Reconciling Resources

Step 1: Check Operator Logs

kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager -f
Step 2: Verify CRDs Are Installed

kubectl get crd | grep apps.nvidia.com
Expected CRDs:
  • nimservices.apps.nvidia.com
  • nimcaches.apps.nvidia.com
  • nimpipelines.apps.nvidia.com
  • nemocustomizers.apps.nvidia.com
  • And other NeMo CRDs
Step 3: Check Operator Health

kubectl get pods -n nim-operator
kubectl describe pod -n nim-operator <operator-pod>

Debug Commands Cheat Sheet

# Get all NIM resources
kubectl get nimservice,nimcache,nimpipeline -A

# Watch resource changes
kubectl get nimservice -w

# Get detailed status
kubectl get nimservice <name> -o yaml

# Get pod status
kubectl get pods -l app=<name>

# Describe pod
kubectl describe pod <pod-name>

# Get logs
kubectl logs <pod-name> -f

# Get previous logs
kubectl logs <pod-name> --previous

# Execute commands in pod
kubectl exec <pod-name> -it -- bash

# All events in namespace
kubectl get events --sort-by='.lastTimestamp'

# Events for specific resource
kubectl get events --field-selector involvedObject.name=<name>

# Warning events only
kubectl get events --field-selector type=Warning

# Node resources
kubectl top nodes

# Pod resources
kubectl top pods

# Specific pod
kubectl top pod <pod-name>

# GPU usage
kubectl describe nodes | grep -A 5 nvidia.com/gpu

Getting Help

Check Operator Logs

kubectl logs -n nim-operator \
  deployment/k8s-nim-operator-controller-manager -f

Collect Debug Information

# Resource definitions
kubectl get nimservice <name> -o yaml > nimservice.yaml

# Pod logs
kubectl logs <pod-name> > pod.log

# Events
kubectl get events > events.log
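The files above can be bundled into a single archive to attach to a bug report. A minimal sketch using placeholder files; swap in the kubectl output files collected above:

```shell
# Placeholder files stand in for the real nimservice.yaml, pod.log, events.log.
bundle_dir=$(mktemp -d)
for f in nimservice.yaml pod.log events.log; do
  echo "placeholder" > "$bundle_dir/$f"
done

# Pack everything into one archive and list its contents to verify.
tar -czf "$bundle_dir/nim-debug.tgz" -C "$bundle_dir" nimservice.yaml pod.log events.log
tar -tzf "$bundle_dir/nim-debug.tgz"
```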

GitHub Issues and Documentation

Search known issues or report bugs on the NVIDIA k8s-nim-operator GitHub repository, and consult the official NIM Operator documentation. When reporting issues, always include the operator version, Kubernetes version, and relevant logs.

Next Steps

  • Monitoring: set up monitoring and observability
  • Upgrades: learn how to upgrade the operator
