This guide covers upgrading the NVIDIA NIM Operator, managing CRD updates, and rolling back if needed.
Overview
The NVIDIA NIM Operator follows semantic versioning. Upgrades may include:
Operator controller image updates
CRD schema changes
New features and capabilities
Bug fixes and security patches
Always review the release notes before upgrading to understand breaking changes and new features.
Pre-Upgrade Checklist
Before upgrading, complete these steps:
Review Release Notes
Check the release notes for:
Breaking changes
New features
Deprecated fields
Required actions
Backup Current State
# Backup all NIM resources
kubectl get nimservice -A -o yaml > nimservices-backup.yaml
kubectl get nimcache -A -o yaml > nimcaches-backup.yaml
kubectl get nimpipeline -A -o yaml > nimpipelines-backup.yaml
# Backup operator deployment
kubectl get deployment -n nim-operator -o yaml > operator-backup.yaml
# Backup CRDs (only the CRDs in the apps.nvidia.com group, as complete objects)
kubectl get crd -o name | grep 'apps\.nvidia\.com$' | xargs kubectl get -o yaml > crds-backup.yaml
Verify Current Version
# Check operator version
kubectl get deployment -n nim-operator k8s-nim-operator-controller-manager \
-o jsonpath='{.spec.template.spec.containers[0].image}'
# Check CRD versions
kubectl get crd nimservices.apps.nvidia.com \
-o jsonpath='{.spec.versions[*].name}'
Check Resource Health
# Verify all resources are healthy
kubectl get nimservice,nimcache,nimpipeline -A
# Check for any failed resources
kubectl get nimservice -A -o jsonpath='{.items[?(@.status.state=="Failed")].metadata.name}'
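The jsonpath query above covers only NIMService and only the Failed state. As a rough sketch, the same check can be extended to all three kinds and to anything that is not yet ready. The function name list_unhealthy is hypothetical, and the sketch assumes that every kind reports its state in .status.state and that a healthy resource reports Ready; adjust the comparison if your release uses different state names.

```shell
# Print kind, namespace/name, and state for every NIM resource that is not Ready.
# Assumption: all three kinds expose .status.state and use "Ready" when healthy.
list_unhealthy() {
  for kind in nimservice nimcache nimpipeline; do
    kubectl get "$kind" -A \
      -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}' |
      awk -F'\t' -v kind="$kind" '$2 != "Ready" { print kind "\t" $1 "\t" $2 }'
  done
}
```

Calling list_unhealthy before the upgrade should print nothing on a healthy cluster.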
Review Cluster Capacity
# Check node resources
kubectl top nodes
# Check operator pod resources
kubectl top pod -n nim-operator
Upgrading with Helm
Helm is the recommended upgrade method.
Standard Upgrade
Update Helm Repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Check Available Versions
helm search repo nvidia/k8s-nim-operator --versions
Review Chart Changes
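# Compare your current overrides with the defaults of the new chart version
# (-o yaml keeps the "USER-SUPPLIED VALUES:" header out of the file)
helm get values k8s-nim-operator -n nim-operator -o yaml > current-values.yaml
helm show values nvidia/k8s-nim-operator --version <new-version> > new-values.yaml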
diff current-values.yaml new-values.yaml
Perform Upgrade
helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
--namespace nim-operator \
--version <new-version> \
--values current-values.yaml \
--wait
The --wait flag makes Helm block until the upgraded resources report ready, or until the timeout expires, before returning.
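On its own, --wait leaves a failed release in place. As a sketch of a stricter variant for Helm 3, the wrapper below adds --atomic, which rolls the release back automatically if the upgrade fails, and --timeout to bound the wait. The function name upgrade_operator and the 10m timeout are illustrative choices, not values taken from the operator documentation.

```shell
# Upgrade the release to the chart version given as $1.
# --atomic implies --wait and rolls back on failure; 10m is an arbitrary example timeout.
upgrade_operator() {
  helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
    --namespace nim-operator \
    --version "$1" \
    --values current-values.yaml \
    --atomic \
    --timeout 10m
}
```

The chart version is passed as the only argument, for example upgrade_operator followed by the version chosen from the helm search output.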
Verify Upgrade
# Check operator version
helm list -n nim-operator
# Verify operator is running
kubectl get pods -n nim-operator
# Check operator logs
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager -f
Upgrade with CRD Management
By default, the Helm chart can automatically upgrade CRDs.
# values.yaml
operator:
  upgradeCRD: true # Enable automatic CRD upgrades
Upgrade with CRD updates:
helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
--namespace nim-operator \
--version <new-version> \
--set operator.upgradeCRD=true \
--wait
CRD upgrades are applied immediately and affect all custom resources. Test in a non-production environment first.
Upgrading with kubectl
For manual installations, use kubectl to upgrade.
Update CRDs First
Download New CRDs
# Download CRD manifests for new version
curl -sL https://github.com/NVIDIA/k8s-nim-operator/releases/download/<version>/crds.yaml \
-o crds-new.yaml
Review CRD Changes
# Compare CRDs
diff crds-backup.yaml crds-new.yaml
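A plain diff against the backup tends to be noisy, because a backup taken from the cluster includes status and server-populated metadata that the release manifest does not. One alternative, sketched here, is kubectl diff, which compares the manifest with the live objects. It exits with 0 when nothing would change, 1 when there are differences, and a higher code on errors. The helper name preview_crd_changes is hypothetical.

```shell
# Show what applying the manifest in $1 would change, then summarize the result.
preview_crd_changes() {
  rc=0
  kubectl diff -f "$1" || rc=$?
  case "$rc" in
    0) echo "no changes" ;;
    1) echo "changes pending" ;;
    *) echo "kubectl diff failed (exit code $rc)" >&2; return "$rc" ;;
  esac
}
```

For this guide the argument would be crds-new.yaml.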
Apply New CRDs
kubectl apply -f crds-new.yaml
CRDs are cluster-scoped resources. You need cluster-admin permissions to update them.
Verify CRD Updates
kubectl get crd nimservices.apps.nvidia.com -o yaml
Update Operator Deployment
Update Deployment Manifest
# Download new operator manifest
curl -sL https://github.com/NVIDIA/k8s-nim-operator/releases/download/<version>/install.yaml \
-o install-new.yaml
Apply Updated Manifest
kubectl apply -f install-new.yaml
Monitor Rollout
kubectl rollout status deployment/k8s-nim-operator-controller-manager -n nim-operator
Managing CRD Versions
Understanding CRD Versioning
The operator uses API versioning (e.g., v1alpha1) for CRDs:
kubectl get crd nimservices.apps.nvidia.com -o jsonpath='{.spec.versions[*].name}'
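Listing only the version names does not show which version is served or used for storage. The small sketch below prints both flags for each version, using the standard served and storage fields of a CustomResourceDefinition. The helper name show_crd_versions is hypothetical.

```shell
# Print one line per API version of the CRD named in $1, with its served and storage flags.
show_crd_versions() {
  kubectl get crd "$1" \
    -o jsonpath='{range .spec.versions[*]}{.name}{"\tserved="}{.served}{"\tstorage="}{.storage}{"\n"}{end}'
}
```

It takes the full CRD name as its argument, for example nimservices.apps.nvidia.com.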
Checking for Deprecated Fields
After upgrading, check for deprecated fields in your resources:
# Check for deprecation warnings
kubectl get nimservice <name> -o yaml | grep -i deprecated
# Review operator logs for warnings
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager | grep -i deprecated
Migrating Resources
If the upgrade introduces breaking changes, migrate your resources:
Identify Affected Resources
# Get all resources
kubectl get nimservice -A -o yaml > all-nimservices.yaml
Update Resource Definitions
Edit the exported YAML to replace deprecated fields as described in the migration guide in the release notes, and save the result as all-nimservices-updated.yaml.
Apply Updated Resources
kubectl apply -f all-nimservices-updated.yaml
Verify Migration
kubectl get nimservice -A
Rollback Procedures
If issues occur during upgrade, you can roll back.
Rollback with Helm
List Helm History
helm history k8s-nim-operator -n nim-operator
Roll Back to Previous Version
helm rollback k8s-nim-operator <revision> -n nim-operator --wait
Or roll back to the previous revision by omitting the revision number:
helm rollback k8s-nim-operator -n nim-operator --wait
Verify Rollback
helm list -n nim-operator
kubectl get pods -n nim-operator
Manual Rollback
Restore CRDs (if needed)
kubectl apply -f crds-backup.yaml
Rolling back CRDs may cause issues if new resources use updated schemas. Test carefully.
Restore Operator Deployment
kubectl apply -f operator-backup.yaml
Monitor Rollback
kubectl rollout status deployment/k8s-nim-operator-controller-manager -n nim-operator
Restore Resources (if needed)
kubectl apply -f nimservices-backup.yaml
kubectl apply -f nimcaches-backup.yaml
kubectl apply -f nimpipelines-backup.yaml
Zero-Downtime Upgrades
For production environments, follow these practices:
Ensure your NIMServices have multiple replicas:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  replicas: 3
This allows rolling updates without service interruption.
Configure Pod Disruption Budgets
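As an illustration of this step, the sketch below renders a policy/v1 PodDisruptionBudget for one NIMService. The helper name render_pdb, the -pdb name suffix, and above all the app label in the selector are assumptions; check the labels on your NIMService pods (kubectl get pods --show-labels) and adjust the selector before applying.

```shell
# Render a PodDisruptionBudget manifest for a NIMService.
# $1 = NIMService name, $2 = namespace, $3 = minAvailable (defaults to 1).
render_pdb() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${1}-pdb
  namespace: ${2}
spec:
  minAvailable: ${3:-1}
  selector:
    matchLabels:
      app: ${1}   # assumption: pods carry an app=<name> label
EOF
}
```

With three replicas, render_pdb my-nim default 2 | kubectl apply -f - would keep at least two pods available during voluntary disruptions such as node drains.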
Always test upgrades in a staging environment:
Deploy identical setup in staging
Perform upgrade in staging
Run validation tests
Only then upgrade production
For large deployments, upgrade in stages:
Upgrade CRDs first
Upgrade the operator
Update resources gradually
Monitor each stage before proceeding
Post-Upgrade Validation
After upgrading, verify everything works correctly.
Check Operator Health
# Operator pods running
kubectl get pods -n nim-operator
# No errors in logs
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager | grep -i error
# Metrics endpoint accessible (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl http://localhost:8080/metrics
Verify Resource Reconciliation
# Check all resources are reconciled
kubectl get nimservice,nimcache,nimpipeline -A
# Check resource status
kubectl get nimservice -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
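Instead of polling by hand, kubectl wait (kubectl v1.23 or later) can block until a jsonpath condition holds. The sketch below assumes that a reconciled NIMService reports Ready in .status.state; the helper name wait_for_nimservices and the default timeout are illustrative.

```shell
# Block until every NIMService in the cluster reports Ready, or until the timeout ($1) expires.
# Assumption: .status.state becomes "Ready" once reconciliation succeeds.
wait_for_nimservices() {
  kubectl wait nimservice --all --all-namespaces \
    --for=jsonpath='{.status.state}'=Ready \
    --timeout="${1:-300s}"
}
```

A non-zero exit status means at least one NIMService did not become ready in time, or that no NIMService resources were found.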
Test Functionality
# Test NIMService endpoint (port-forward blocks, so run the curl commands from a second terminal)
kubectl port-forward svc/<nimservice-name> 8000:8000
curl http://localhost:8000/v1/health/ready
# Test model inference
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "<model>", "prompt": "Hello"}'
Monitor Metrics
# Check operator metrics (run curl from a second terminal while port-forward is active)
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl http://localhost:8080/metrics | grep nimService_status_total
Upgrade Best Practices
Plan Maintenance Windows: Schedule upgrades during low-traffic periods, even if aiming for zero downtime.
Enable Admission Controller: The admission controller validates resources and prevents invalid configurations.
operator:
  admissionController:
    enabled: true
Monitor During Upgrade: Watch metrics and logs during the upgrade process to catch issues early.
Document Changes: Keep a record of upgrade dates, versions, and any issues encountered.
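Record keeping can be as simple as appending one line per upgrade to a file kept alongside your values. A minimal sketch; the function name record_upgrade, the UPGRADE_LOG variable, and the tab-separated layout are all illustrative choices.

```shell
# Append a timestamped (UTC) record to the upgrade log.
# $1 = previous version, $2 = new version, $3 = free-form notes.
record_upgrade() {
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" >> "${UPGRADE_LOG:-upgrade-log.tsv}"
}
```

Each call adds one line, so the file doubles as an upgrade history that can be committed to version control.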
Common Upgrade Issues
CRD Conversion Webhook Failures
Problem: The webhook fails to convert between CRD versions.
Solution:
Ensure cert-manager is installed if using admission controller
Check webhook certificates are valid
Verify webhook service is accessible
Resource Validation Errors
Problem: Existing resources fail validation with the new CRD schema.
Solution:
Review resource definitions for deprecated fields
Update resources according to migration guide
Temporarily disable admission controller if needed (not recommended)
Problem: The operator pod crashes after the upgrade.
Solution:
# Check pod events
kubectl describe pod -n nim-operator <pod-name>
# Check logs (add --previous to see output from the crashed container)
kubectl logs -n nim-operator <pod-name>
# Verify image is accessible
kubectl get deployment -n nim-operator -o yaml
Version Compatibility Matrix
Operator Version    Kubernetes Version    GPU Operator Version
v0.6.x              1.28+                 24.3.0+
v0.5.x              1.27+                 24.3.0+
v0.4.x              1.26+                 23.9.0+
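The matrix can be turned into a quick pre-flight check. The sketch below encodes only the Kubernetes column of the table; the function names are hypothetical, and it assumes a Kubernetes major version of 1.

```shell
# Minimum Kubernetes minor version for an operator release, per the matrix above.
min_k8s_minor() {
  case "$1" in
    v0.6.*) echo 28 ;;
    v0.5.*) echo 27 ;;
    v0.4.*) echo 26 ;;
    *) return 1 ;;
  esac
}

# Succeed when Kubernetes version $2 (e.g. 1.29 or v1.29.3) satisfies operator version $1.
k8s_is_compatible() {
  required=$(min_k8s_minor "$1") || { echo "unknown operator version: $1" >&2; return 2; }
  minor=${2#v}             # strip an optional leading v
  minor=${minor#*.}        # drop the major version
  minor=${minor%%.*}       # drop the patch version
  minor=${minor%%[!0-9]*}  # drop suffixes such as a trailing +
  [ "$minor" -ge "$required" ]
}
```

For example, k8s_is_compatible v0.6.1 1.27 fails, because the table lists 1.28 as the minimum for v0.6.x.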
Next Steps
Best Practices: Optimize your deployments for production
Troubleshooting: Resolve post-upgrade issues