This guide covers upgrading the NVIDIA NIM Operator, managing CRD updates, and rolling back if needed.

Overview

The NVIDIA NIM Operator follows semantic versioning. Upgrades may include:
  • Operator controller image updates
  • CRD schema changes
  • New features and capabilities
  • Bug fixes and security patches
Always review the release notes before upgrading to understand breaking changes and new features.
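Because the operator follows semantic versioning, comparing the installed and target version strings gives a first hint of how large an upgrade is. The sketch below is a minimal illustration, not part of the operator tooling: `bump_kind` is a hypothetical helper name and the version numbers are examples. For 0.x releases, semantic versioning permits breaking changes in any release, so the release notes remain the authority.

```shell
# Classify an upgrade as major, minor, patch, or none from two version strings.
# bump_kind is a hypothetical helper; it only looks at MAJOR.MINOR.PATCH.
bump_kind() {
  old=${1#v}; new=${2#v}                      # tolerate a leading "v"
  old_major=${old%%.*}; old_rest=${old#*.}; old_minor=${old_rest%%.*}
  new_major=${new%%.*}; new_rest=${new#*.}; new_minor=${new_rest%%.*}
  if [ "$old_major" != "$new_major" ]; then
    echo major
  elif [ "$old_minor" != "$new_minor" ]; then
    echo minor
  elif [ "$old" != "$new" ]; then
    echo patch
  else
    echo none
  fi
}

bump_kind v0.5.1 v0.6.0   # prints: minor
```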

Pre-Upgrade Checklist

Before upgrading, complete these steps:
1. Review Release Notes

Check the release notes for:
  • Breaking changes
  • New features
  • Deprecated fields
  • Required actions
2. Back Up Current State

# Back up all NIM resources
kubectl get nimservice -A -o yaml > nimservices-backup.yaml
kubectl get nimcache -A -o yaml > nimcaches-backup.yaml
kubectl get nimpipeline -A -o yaml > nimpipelines-backup.yaml

# Back up the operator deployment
kubectl get deployment -n nim-operator -o yaml > operator-backup.yaml

# Back up the NIM CRDs (select them by name; grepping the YAML output yields an invalid file)
kubectl get crd -o name | grep -F 'apps.nvidia.com' | xargs kubectl get -o yaml > crds-backup.yaml
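The resource backups can also be wrapped in a function so that each upgrade gets its own directory. This is a minimal sketch covering the same three resource kinds as the commands above; `backup_nim_resources` and the directory naming scheme are hypothetical.

```shell
# Export NIM custom resources into one directory per backup run.
# backup_nim_resources is a hypothetical helper; pass a directory or let it
# create a timestamped one in the current working directory.
backup_nim_resources() {
  dir=${1:-nim-backup-$(date +%Y%m%d-%H%M%S)}
  mkdir -p "$dir" || return 1
  for kind in nimservice nimcache nimpipeline; do
    # Stop at the first failed export so a partial backup is not mistaken for a full one
    kubectl get "$kind" -A -o yaml > "$dir/$kind.yaml" || return 1
  done
  echo "$dir"   # print the backup location for later use
}

# Usage (requires cluster access):
#   backup_dir=$(backup_nim_resources)
```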
3. Verify Current Version

# Check operator version
kubectl get deployment -n nim-operator k8s-nim-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check CRD versions
kubectl get crd nimservices.apps.nvidia.com \
  -o jsonpath='{.spec.versions[*].name}'
4. Check Resource Health

# Verify all resources are healthy
kubectl get nimservice,nimcache,nimpipeline -A

# Check for any failed resources
kubectl get nimservice -A -o jsonpath='{.items[?(@.status.state=="Failed")].metadata.name}'
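To make this check usable as a gate in a script, the state listing can be piped through a small filter that fails when any resource is in the Failed state. This is a sketch under assumptions: `failed_resources` is a hypothetical helper, and it expects tab-separated namespace, name, and state columns.

```shell
# Read "namespace<TAB>name<TAB>state" lines on stdin, print the Failed ones as
# namespace/name, and return non-zero if at least one was found.
# failed_resources is a hypothetical helper name.
failed_resources() {
  awk -F '\t' '$3 == "Failed" { print $1 "/" $2; found = 1 }
               END { exit (found ? 1 : 0) }'
}

# Usage (requires cluster access):
#   kubectl get nimservice -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}' \
#     | failed_resources || echo "Resolve failed resources before upgrading"
```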
5. Review Cluster Capacity

# Check node resources (kubectl top requires the Metrics Server)
kubectl top nodes

# Check operator pod resources
kubectl top pod -n nim-operator

Upgrading with Helm

Helm is the recommended upgrade method.

Standard Upgrade

1. Update Helm Repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
2. Check Available Versions

helm search repo nvidia/k8s-nim-operator --versions
3. Review Chart Changes

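# Save your current overrides (-o yaml omits the "USER-SUPPLIED VALUES:" header line)
helm get values k8s-nim-operator -n nim-operator -o yaml > current-values.yaml
# Save the new chart's defaults, then check that your overrides still match its keys
helm show values nvidia/k8s-nim-operator --version <new-version> > new-values.yaml
diff current-values.yaml new-values.yaml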
4. Perform Upgrade

helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
  --namespace nim-operator \
  --version <new-version> \
  --values current-values.yaml \
  --wait
The --wait flag makes Helm block until the upgraded resources report ready or the timeout expires (5 minutes by default, adjustable with --timeout).
5. Verify Upgrade

# Check operator version
helm list -n nim-operator

# Verify operator is running
kubectl get pods -n nim-operator

# Check operator logs
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager -f

Upgrade with CRD Management

The Helm chart can upgrade CRDs automatically as part of helm upgrade. The operator.upgradeCRD value controls this behavior:
# values.yaml
operator:
  upgradeCRD: true  # Enable automatic CRD upgrades
Upgrade with CRD updates:
helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
  --namespace nim-operator \
  --version <new-version> \
  --set operator.upgradeCRD=true \
  --wait
CRD upgrades are applied immediately and affect all custom resources. Test in a non-production environment first.

Upgrading with kubectl

For manual installations, use kubectl to upgrade.

Update CRDs First

1. Download New CRDs

# Download CRD manifests for new version
curl -sL https://github.com/NVIDIA/k8s-nim-operator/releases/download/<version>/crds.yaml \
  -o crds-new.yaml
2. Review CRD Changes

# Compare CRDs (the backup is a cluster export, so server-populated metadata and status fields also show up)
diff crds-backup.yaml crds-new.yaml
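As an alternative to diffing files, `kubectl diff` compares a manifest against the live objects and reports what applying it would change. It exits with 0 when there are no differences, 1 when there are differences, and a higher code on errors. The wrapper below is a sketch that turns those exit codes into a message; `crd_changes_pending` is a hypothetical helper name.

```shell
# Report whether applying a manifest would change anything in the cluster.
# crd_changes_pending is a hypothetical helper built on kubectl diff exit codes.
crd_changes_pending() {
  rc=0
  kubectl diff -f "$1" >/dev/null 2>&1 || rc=$?
  case $rc in
    0) echo "no changes" ;;
    1) echo "changes pending" ;;
    *) echo "diff failed"; return "$rc" ;;
  esac
}

# Usage (requires cluster access):
#   crd_changes_pending crds-new.yaml
# Run "kubectl diff -f crds-new.yaml" directly to read the actual differences.
```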
3. Apply New CRDs

kubectl apply -f crds-new.yaml
CRDs are cluster-scoped resources. You need cluster-admin permissions to update them.
4. Verify CRD Updates

kubectl get crd nimservices.apps.nvidia.com -o yaml

Update Operator Deployment

1. Update Deployment Manifest

# Download new operator manifest
curl -sL https://github.com/NVIDIA/k8s-nim-operator/releases/download/<version>/install.yaml \
  -o install-new.yaml
2. Apply Updated Manifest

kubectl apply -f install-new.yaml
3. Monitor Rollout

kubectl rollout status deployment/k8s-nim-operator-controller-manager -n nim-operator

Managing CRD Versions

Understanding CRD Versioning

The operator's CRDs are versioned by API version (for example, v1alpha1). List the versions a CRD defines:
kubectl get crd nimservices.apps.nvidia.com -o jsonpath='{.spec.versions[*].name}'
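Each entry in `.spec.versions` also carries `served` and `storage` flags, which are standard CustomResourceDefinition fields, and exactly one version is marked as the storage version. The sketch below picks the storage version out of a tab-separated listing; `storage_version` is a hypothetical helper name.

```shell
# Read "version<TAB>served<TAB>storage" lines on stdin and print the version
# whose storage flag is true. storage_version is a hypothetical helper name.
storage_version() {
  awk -F '\t' '$3 == "true" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get crd nimservices.apps.nvidia.com \
#     -o jsonpath='{range .spec.versions[*]}{.name}{"\t"}{.served}{"\t"}{.storage}{"\n"}{end}' \
#     | storage_version
```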

Checking for Deprecated Fields

After upgrading, check for deprecated fields in your resources:
# Check for deprecation warnings
kubectl get nimservice <name> -o yaml | grep -i deprecated

# Review operator logs for warnings
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager | grep -i deprecated

Migrating Resources

If the upgrade introduces breaking changes, migrate your resources:
1. Identify Affected Resources

# Get all resources
kubectl get nimservice -A -o yaml > all-nimservices.yaml
2. Update Resource Definitions

Edit a copy of the exported YAML (for example, all-nimservices-updated.yaml) to replace deprecated fields according to the migration guide in the release notes.
3. Apply Updated Resources

kubectl apply -f all-nimservices-updated.yaml
4. Verify Migration

kubectl get nimservice -A

Rollback Procedures

If issues occur during upgrade, you can roll back.

Rollback with Helm

1. List Helm History

helm history k8s-nim-operator -n nim-operator
2. Roll Back to a Previous Version

helm rollback k8s-nim-operator <revision> -n nim-operator --wait
Or omit the revision to roll back to the previous release:
helm rollback k8s-nim-operator -n nim-operator --wait
3. Verify Rollback

helm list -n nim-operator
kubectl get pods -n nim-operator
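The upgrade and rollback commands can be combined so that a failed upgrade is rolled back immediately. The function below is a sketch that reuses the release name, namespace, and values file from this guide; `upgrade_or_rollback` is a hypothetical helper name. Helm 3 also offers an `--atomic` flag on `helm upgrade` that rolls back automatically on failure.

```shell
# Upgrade the operator and roll back to the previous release if the upgrade fails.
# upgrade_or_rollback is a hypothetical helper; pass the target chart version.
upgrade_or_rollback() {
  version=$1
  if helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
       --namespace nim-operator \
       --version "$version" \
       --values current-values.yaml \
       --wait; then
    echo "upgrade to $version succeeded"
  else
    echo "upgrade to $version failed, rolling back" >&2
    # Without a revision argument, helm rollback targets the previous release
    helm rollback k8s-nim-operator -n nim-operator --wait
    return 1
  fi
}

# Usage (requires cluster access):
#   upgrade_or_rollback <new-version>
```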

Manual Rollback

1. Restore CRDs (if needed)

kubectl apply -f crds-backup.yaml
Rolling back CRDs may cause issues if new resources use updated schemas. Test carefully.
2. Restore Operator Deployment

kubectl apply -f operator-backup.yaml
3. Monitor Rollback

kubectl rollout status deployment/k8s-nim-operator-controller-manager -n nim-operator
4. Restore Resources (if needed)

kubectl apply -f nimservices-backup.yaml
kubectl apply -f nimcaches-backup.yaml
kubectl apply -f nimpipelines-backup.yaml

Zero-Downtime Upgrades

For production environments, follow these practices:
Ensure your NIMServices have multiple replicas:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  replicas: 3
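This allows rolling updates without service interruption.
Define a PodDisruptionBudget so that voluntary disruptions, such as node drains, keep a minimum number of pods available. The selector must match the labels on your NIMService pods: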
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nim-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
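With replicas: 3 and minAvailable: 2, at most one pod can be evicted voluntarily at a time, provided all replicas are healthy. The arithmetic is sketched below; `allowed_disruptions` is a hypothetical helper name. A result of 0 means the budget blocks node drains entirely, which is what happens when minAvailable equals the replica count.

```shell
# Compute how many pods a PodDisruptionBudget lets you evict at once,
# assuming every replica is currently healthy.
# allowed_disruptions is a hypothetical helper name.
allowed_disruptions() {
  replicas=$1; min_available=$2
  allowed=$((replicas - min_available))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # prints: 1
```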
Always test upgrades in a staging environment:
  1. Deploy identical setup in staging
  2. Perform upgrade in staging
  3. Run validation tests
  4. Only then upgrade production
For large deployments, upgrade in stages:
  1. Upgrade CRDs first
  2. Upgrade the operator
  3. Update resources gradually
  4. Monitor each stage before proceeding
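Monitoring each stage before proceeding can be scripted as a retry loop that gates the next step on a health check. This is a generic sketch: `retry_until` is a hypothetical helper name, and the attempt count and delay are example values.

```shell
# Run a command up to N times, pausing between attempts, and succeed as soon
# as the command succeeds. retry_until is a hypothetical helper name.
retry_until() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt remains
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: wait up to 30 x 10 seconds for a port-forwarded NIM endpoint to be ready
#   retry_until 30 10 curl -fsS http://localhost:8000/v1/health/ready
```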

Post-Upgrade Validation

After upgrading, verify everything works correctly.
1. Check Operator Health

# Operator pods running
kubectl get pods -n nim-operator

# No errors in logs
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager | grep -i error

# Metrics endpoint accessible (run the port-forward in a separate terminal; it blocks until interrupted)
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl http://localhost:8080/metrics
2. Verify Resource Reconciliation

# Check all resources are reconciled
kubectl get nimservice,nimcache,nimpipeline -A

# Check resource status
kubectl get nimservice -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
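For clusters with many resources, a per-state count is easier to scan than the full list. The filter below is a sketch that summarizes the name and state output of the command above; `state_summary` is a hypothetical helper name, and resources with no reported state are counted under a placeholder label.

```shell
# Read "name<TAB>state" lines on stdin and print one "state count" line per
# state, sorted by state. state_summary is a hypothetical helper name.
state_summary() {
  awk -F '\t' '{ s = ($2 == "" ? "<none>" : $2); count[s]++ }
               END { for (s in count) print s, count[s] }' | sort
}

# Usage (requires cluster access):
#   kubectl get nimservice -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}' \
#     | state_summary
```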
3. Test Functionality

# Test NIMService endpoint (run the port-forward in a separate terminal; it blocks until interrupted)
kubectl port-forward -n <namespace> svc/<nimservice-name> 8000:8000
curl http://localhost:8000/v1/health/ready

# Test model inference
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "prompt": "Hello"}'
4. Monitor Metrics

# Check operator metrics (reuse the port-forward from step 1 if it is still running)
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl http://localhost:8080/metrics | grep nimService_status_total

Upgrade Best Practices

Plan Maintenance Windows

Schedule upgrades during low-traffic periods, even if aiming for zero downtime.

Enable Admission Controller

The admission controller validates resources and prevents invalid configurations.
operator:
  admissionController:
    enabled: true

Monitor During Upgrade

Watch metrics and logs during the upgrade process to catch issues early.

Document Changes

Keep a record of upgrade dates, versions, and any issues encountered.

Common Upgrade Issues

Problem: Webhook fails to convert between CRD versions
Solution:
  • Ensure cert-manager is installed if using admission controller
  • Check webhook certificates are valid
  • Verify webhook service is accessible
Problem: Existing resources fail validation with new CRD schema
Solution:
  • Review resource definitions for deprecated fields
  • Update resources according to migration guide
  • Temporarily disable admission controller if needed (not recommended)
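Problem: Operator pod crashes after upgrade
Solution:
# Check pod events
kubectl describe pod -n nim-operator <pod-name>

# Check logs (add --previous to see output from the crashed container)
kubectl logs -n nim-operator <pod-name>

# Verify the image reference is correct and pullable
kubectl get deployment -n nim-operator k8s-nim-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'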

Version Compatibility Matrix

Operator Version    Kubernetes Version    GPU Operator Version
v0.6.x              1.28+                 24.3.0+
v0.5.x              1.27+                 24.3.0+
v0.4.x              1.26+                 23.9.0+
Always check the official compatibility matrix for the most up-to-date information.
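A minimum Kubernetes version from the matrix can be checked in a script before upgrading. This is a sketch that compares only the major and minor components; `k8s_at_least` is a hypothetical helper name, and the version strings are examples.

```shell
# Succeed when Kubernetes version $1 (for example "v1.29.3") is at least the
# minimum $2 (for example "1.28+"). Only major and minor are compared.
# k8s_at_least is a hypothetical helper name.
k8s_at_least() {
  have=${1#v}; need=${2#v}
  have_major=${have%%.*}; have_rest=${have#*.}; have_minor=${have_rest%%[!0-9]*}
  need_major=${need%%.*}; need_rest=${need#*.}; need_minor=${need_rest%%[!0-9]*}
  if [ "$have_major" -ne "$need_major" ]; then
    [ "$have_major" -gt "$need_major" ]
  else
    [ "$have_minor" -ge "$need_minor" ]
  fi
}

if k8s_at_least v1.29.3 1.28+; then
  echo "Kubernetes version is new enough"
fi
```

To check a live cluster, pass the server version reported by `kubectl version -o json` (the `serverVersion.gitVersion` field) as the first argument; extracting that field with a tool such as jq is an assumption about your environment.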

Next Steps

Best Practices

Optimize your deployments for production

Troubleshooting

Resolve post-upgrade issues
