Upgrade Procedures

This guide covers upgrading the NVIDIA NIM Operator, managing CRD updates, and rolling back if needed.

Overview

The NVIDIA NIM Operator follows semantic versioning. Upgrades may include:

Operator controller image updates
CRD schema changes
New features and capabilities
Bug fixes and security patches

Always review the release notes before upgrading to understand breaking changes and new features.
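When comparing the installed version with the target version, it can help to classify the jump first. The sketch below is a hypothetical helper (`upgrade_kind` is not part of the operator or Helm) that compares two `MAJOR.MINOR.PATCH` strings. It assumes plain semantic versions without pre-release suffixes.

```shell
# Classify the jump between two semantic versions as major, minor, patch, or none.
# Accepts versions with or without a leading "v" (for example v0.5.1 or 0.6.0).
upgrade_kind() {
  from="${1#v}"
  to="${2#v}"
  # Split MAJOR.MINOR.PATCH on dots; pre-release suffixes are not handled here.
  from_major="${from%%.*}"; from_rest="${from#*.}"
  from_minor="${from_rest%%.*}"; from_patch="${from_rest#*.}"
  to_major="${to%%.*}"; to_rest="${to#*.}"
  to_minor="${to_rest%%.*}"; to_patch="${to_rest#*.}"

  if [ "$from_major" != "$to_major" ]; then
    echo major
  elif [ "$from_minor" != "$to_minor" ]; then
    echo minor
  elif [ "$from_patch" != "$to_patch" ]; then
    echo patch
  else
    echo none
  fi
}
```

Under semantic versioning, releases in the 0.x range may introduce breaking changes in a minor bump, so a result of `minor` still warrants a close read of the release notes.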

Pre-Upgrade Checklist

Before upgrading, complete these steps:

Review Release Notes

Check the release notes for:

Breaking changes
New features
Deprecated fields
Required actions

Backup Current State

# Backup all NIM resources
kubectl get nimservice -A -o yaml > nimservices-backup.yaml
kubectl get nimcache -A -o yaml > nimcaches-backup.yaml
kubectl get nimpipeline -A -o yaml > nimpipelines-backup.yaml

# Backup operator deployment
kubectl get deployment -n nim-operator -o yaml > operator-backup.yaml

# Backup CRDs
kubectl get crd -o name | grep 'apps.nvidia.com' | xargs kubectl get -o yaml > crds-backup.yaml

Verify Current Version

# Check operator version
kubectl get deployment -n nim-operator k8s-nim-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check CRD versions
kubectl get crd nimservices.apps.nvidia.com \
  -o jsonpath='{.spec.versions[*].name}'
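The image reference printed above includes the registry and repository path. A small helper can reduce it to just the tag for recording or comparison. This is a sketch only: `image_tag` is a hypothetical function name, and it assumes the operator container is the first container in the pod spec, as the command above does.

```shell
# Print the tag portion of a container image reference.
# Prints "latest" when no tag is present and the digest when the image is pinned by digest.
image_tag() {
  ref="$1"
  case "$ref" in
    *@*) echo "${ref##*@}"; return ;;
  esac
  name="${ref##*/}"   # drop registry host (and port) plus repository path
  case "$name" in
    *:*) echo "${name##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# Usage against a live cluster (not executed here):
#   image_tag "$(kubectl get deployment -n nim-operator k8s-nim-operator-controller-manager \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
```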

Check Resource Health

# Verify all resources are healthy
kubectl get nimservice,nimcache,nimpipeline -A

# Check for any failed resources
kubectl get nimservice -A -o jsonpath='{.items[?(@.status.state=="Failed")].metadata.name}'
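The filter above prints names only, which is ambiguous when the same name exists in several namespaces. One possible alternative is to emit tab-separated rows and filter them locally. `failed_resources` is a hypothetical helper, and it relies on the `Failed` state value used in the command above.

```shell
# Read "namespace<TAB>name<TAB>state" lines on stdin and print namespace/name
# for every resource whose state is Failed.
failed_resources() {
  awk -F '\t' '$3 == "Failed" { print $1 "/" $2 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get nimservice -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}' \
#     | failed_resources
```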

Review Cluster Capacity

# Check node resources
kubectl top nodes

# Check operator pod resources
kubectl top pod -n nim-operator

Upgrading with Helm

Helm is the recommended upgrade method.

Standard Upgrade

Update Helm Repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Check Available Versions

helm search repo nvidia/k8s-nim-operator --versions

Review Chart Changes

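# Save your current overrides and compare them with the new chart's defaults
# (-o yaml omits the "USER-SUPPLIED VALUES:" header so the file is clean YAML)
helm get values k8s-nim-operator -n nim-operator -o yaml > current-values.yaml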
helm show values nvidia/k8s-nim-operator --version <new-version> > new-values.yaml
diff current-values.yaml new-values.yaml

Perform Upgrade

helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
  --namespace nim-operator \
  --version <new-version> \
  --values current-values.yaml \
  --wait

The --wait flag makes Helm block until the upgraded resources report ready or the timeout (5 minutes by default) expires.
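If the operator image is slow to pull, the default wait may be too short. The following sketch wraps the upgrade command with an explicit `--timeout` and a dry-run switch. The function name `upgrade_operator`, the `DRY_RUN` variable, and the 10m default are illustrative assumptions, not part of the chart.

```shell
# Build (and optionally run) the helm upgrade command with an explicit timeout.
# With DRY_RUN=1 the command is printed instead of executed.
upgrade_operator() {
  version="$1"
  timeout="${2:-10m}"
  set -- helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
    --namespace nim-operator \
    --version "$version" \
    --values current-values.yaml \
    --wait --timeout "$timeout"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage:
#   DRY_RUN=1 upgrade_operator <new-version>        # print the command
#   upgrade_operator <new-version> 15m              # run it with a 15 minute timeout
```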

Verify Upgrade

# Check operator version
helm list -n nim-operator

# Verify operator is running
kubectl get pods -n nim-operator

# Check operator logs
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager -f

Upgrade with CRD Management

The Helm chart can upgrade CRDs automatically when operator.upgradeCRD is enabled:

# values.yaml
operator:
  upgradeCRD: true  # Enable automatic CRD upgrades

Upgrade with CRD updates:

helm upgrade k8s-nim-operator nvidia/k8s-nim-operator \
  --namespace nim-operator \
  --version <new-version> \
  --set operator.upgradeCRD=true \
  --wait

CRD upgrades are applied immediately and affect all custom resources. Test in a non-production environment first.

Upgrading with kubectl

For manual installations, use kubectl to upgrade.

Update CRDs First

Download New CRDs

# Download CRD manifests for new version
curl -sL https://github.com/NVIDIA/k8s-nim-operator/releases/download/<version>/crds.yaml \
  -o crds-new.yaml

Review CRD Changes

# Compare CRDs
diff crds-backup.yaml crds-new.yaml
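A backup taken from the live cluster contains server-populated fields such as status and managed fields, so a plain file diff tends to be noisy. As an alternative, `kubectl diff -f crds-new.yaml` compares the new manifests against the live objects. Its exit status is 0 for no differences, 1 for differences, and greater than 1 for errors. The helper below, with the hypothetical name `crd_diff_summary`, turns that status into a readable message.

```shell
# Translate the exit status of `kubectl diff` into a short summary.
crd_diff_summary() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "changes pending" ;;
    *) echo "diff failed" ;;
  esac
}

# Usage against a live cluster (not executed here):
#   kubectl diff -f crds-new.yaml; crd_diff_summary "$?"
```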

Apply New CRDs

kubectl apply -f crds-new.yaml

CRDs are cluster-scoped resources. You need cluster-admin permissions to update them.

Verify CRD Updates

kubectl get crd nimservices.apps.nvidia.com -o yaml

Update Operator Deployment

Update Deployment Manifest

# Download new operator manifest
curl -sL https://github.com/NVIDIA/k8s-nim-operator/releases/download/<version>/install.yaml \
  -o install-new.yaml

Apply Updated Manifest

kubectl apply -f install-new.yaml

Monitor Rollout

kubectl rollout status deployment/k8s-nim-operator-controller-manager -n nim-operator

Managing CRD Versions

Understanding CRD Versioning

The operator uses API versioning (e.g., v1alpha1) for CRDs:

kubectl get crd nimservices.apps.nvidia.com -o jsonpath='{.spec.versions[*].name}'
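A CRD can list several versions, each marked as served or not, and exactly one of them is the storage version. To see which one is stored, the version list can be printed with its flags and filtered. `storage_version` is a hypothetical helper name used for illustration.

```shell
# Read "name<TAB>served<TAB>storage" lines on stdin and print the storage version.
storage_version() {
  awk -F '\t' '$3 == "true" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get crd nimservices.apps.nvidia.com \
#     -o jsonpath='{range .spec.versions[*]}{.name}{"\t"}{.served}{"\t"}{.storage}{"\n"}{end}' \
#     | storage_version
```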

Checking for Deprecated Fields

After upgrading, check for deprecated fields in your resources:

# Check for deprecation warnings
kubectl get nimservice <name> -o yaml | grep -i deprecated

# Review operator logs for warnings
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager | grep -i deprecated

Migrating Resources

If the upgrade introduces breaking changes, migrate your resources:

Identify Affected Resources

# Get all resources
kubectl get nimservice -A -o yaml > all-nimservices.yaml

Update Resource Definitions

Copy all-nimservices.yaml to all-nimservices-updated.yaml, then edit the copy to replace deprecated fields according to the migration guide in the release notes.

Apply Updated Resources

kubectl apply -f all-nimservices-updated.yaml

Verify Migration

kubectl get nimservice -A

Rollback Procedures

If issues occur during upgrade, you can roll back.

Rollback with Helm

List Helm History

helm history k8s-nim-operator -n nim-operator

Rollback to Previous Version

helm rollback k8s-nim-operator <revision> -n nim-operator --wait

To roll back to the immediately preceding revision, omit the revision number:

helm rollback k8s-nim-operator -n nim-operator --wait

Verify Rollback

helm list -n nim-operator
kubectl get pods -n nim-operator

Manual Rollback

Restore CRDs (if needed)

kubectl apply -f crds-backup.yaml

Rolling back CRDs may cause issues if new resources use updated schemas. Test carefully.

Restore Operator Deployment

kubectl apply -f operator-backup.yaml

Monitor Rollback

kubectl rollout status deployment/k8s-nim-operator-controller-manager -n nim-operator

Restore Resources (if needed)

kubectl apply -f nimservices-backup.yaml
kubectl apply -f nimcaches-backup.yaml
kubectl apply -f nimpipelines-backup.yaml

Zero-Downtime Upgrades

For production environments, follow these practices:

Use Multiple Replicas

Ensure your NIMServices have multiple replicas:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  replicas: 3

This allows rolling updates without service interruption.

Configure Pod Disruption Budgets

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nim-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
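The relationship between replicas and minAvailable determines how many pods can be evicted at once. With 3 replicas and minAvailable: 2, one pod may be disrupted at a time. If minAvailable equals or exceeds the replica count, voluntary evictions such as node drains are blocked. A minimal arithmetic sketch, assuming all replicas are healthy (`allowed_disruptions` is a hypothetical helper):

```shell
# Number of pods that may be voluntarily evicted at once while a
# minAvailable PodDisruptionBudget is still satisfied (never below zero).
allowed_disruptions() {
  replicas="$1"
  min_available="$2"
  allowed=$((replicas - min_available))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

# Usage:
#   allowed_disruptions 3 2
```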

Test in Staging First

Always test upgrades in a staging environment:

Deploy identical setup in staging
Perform upgrade in staging
Run validation tests
Only then upgrade production

Use Gradual Rollout

For large deployments, upgrade in stages:

Upgrade CRDs first
Upgrade the operator
Update resources gradually
Monitor each stage before proceeding

Post-Upgrade Validation

After upgrading, verify everything works correctly.

Check Operator Health

# Operator pods running
kubectl get pods -n nim-operator

# No errors in logs
kubectl logs -n nim-operator deployment/k8s-nim-operator-controller-manager | grep -i error

# Metrics endpoint accessible (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl http://localhost:8080/metrics

Verify Resource Reconciliation

# Check all resources are reconciled
kubectl get nimservice,nimcache,nimpipeline -A

# Check resource status
kubectl get nimservice -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
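The name and state listing above can be turned into a pass/fail check for scripts. The sketch below is a hypothetical helper, `all_in_state`, that takes the expected state as an argument. The value `Ready` in the usage line is an assumption, so substitute whatever healthy state your resources report.

```shell
# Read "name<TAB>state" lines on stdin; succeed only if every line has the expected state.
# Prints each resource that does not match.
all_in_state() {
  awk -F '\t' -v want="$1" '
    $2 != want { print $1 " is " ($2 == "" ? "<unset>" : $2); bad = 1 }
    END { exit bad }
  '
}

# Usage against a live cluster (not executed here):
#   kubectl get nimservice -A \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}' \
#     | all_in_state Ready
```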

Test Functionality

# Test NIMService endpoint (run port-forward in one terminal and curl in another)
kubectl port-forward -n <namespace> svc/<nimservice-name> 8000:8000
curl http://localhost:8000/v1/health/ready

# Test model inference
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "prompt": "Hello"}'
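A service may take a while to become ready after an upgrade, so a single health probe can fail spuriously. One way to handle this is a small retry loop. `retry` here is a hypothetical helper, not a built-in, and the attempt count and delay in the usage line are arbitrary.

```shell
# Run a command up to N times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    n=$((n + 1))
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage (with the port-forward running in another terminal):
#   retry 30 2 curl -fsS http://localhost:8000/v1/health/ready
```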

Monitor Metrics

# Check operator metrics (run port-forward in one terminal and curl in another)
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl -s http://localhost:8080/metrics | grep nimService_status_total

Upgrade Best Practices

Plan Maintenance Windows

Schedule upgrades during low-traffic periods, even if aiming for zero downtime.

Enable Admission Controller

The admission controller validates resources and prevents invalid configurations.

operator:
  admissionController:
    enabled: true

Monitor During Upgrade

Watch metrics and logs during the upgrade process to catch issues early.

Document Changes

Keep a record of upgrade dates, versions, and any issues encountered.
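A plain text log is often enough for this. As a sketch, the hypothetical `log_upgrade` helper below appends one tab-separated record per upgrade. The file name and field layout are assumptions to adapt to your own process.

```shell
# Append one tab-separated record: UTC timestamp, old version, new version, notes.
log_upgrade() {
  log_file="$1"; from="$2"; to="$3"; notes="${4:-}"
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$from" "$to" "$notes" >> "$log_file"
}

# Usage:
#   log_upgrade upgrades.tsv <old-version> <new-version> "no issues"
```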

Common Upgrade Issues

CRD Conversion Webhook Failures

Problem: Webhook fails to convert between CRD versions.

Solution:

Ensure cert-manager is installed if using admission controller
Check webhook certificates are valid
Verify webhook service is accessible

Resource Validation Errors

Problem: Existing resources fail validation with the new CRD schema.

Solution:

Review resource definitions for deprecated fields
Update resources according to migration guide
Temporarily disable admission controller if needed (not recommended)

Operator Fails to Start

Problem: Operator pod crashes after upgrade.

Solution:

# Check pod events
kubectl describe pod -n nim-operator <pod-name>

# Check logs
kubectl logs -n nim-operator <pod-name>

# Inspect the deployment spec, including the image reference (pull failures appear in the pod events above)
kubectl get deployment -n nim-operator -o yaml

Version Compatibility Matrix

Operator Version	Kubernetes Version	GPU Operator Version
v0.6.x	1.28+	24.3.0+
v0.5.x	1.27+	24.3.0+
v0.4.x	1.26+	23.9.0+

Always check the official compatibility matrix for the most up-to-date information.
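The Kubernetes column of the table can be encoded as a pre-flight check. The sketch below hard-codes the three rows shown above, so it goes stale whenever the official matrix changes. `min_k8s_minor` and `k8s_supported` are hypothetical helper names, and the check assumes a Kubernetes major version of 1.

```shell
# Minimum Kubernetes minor version for an operator release line, per the table above.
# Prints nothing and returns 1 for release lines the table does not cover.
min_k8s_minor() {
  case "${1#v}" in
    0.6.*) echo 28 ;;
    0.5.*) echo 27 ;;
    0.4.*) echo 26 ;;
    *) return 1 ;;
  esac
}

# Succeed if the cluster version (for example 1.29 or 1.29.3) meets the minimum.
k8s_supported() {
  min="$(min_k8s_minor "$1")" || return 2
  cluster="${2#v}"
  cluster_minor="${cluster#*.}"
  cluster_minor="${cluster_minor%%.*}"
  [ "$cluster_minor" -ge "$min" ]
}

# Usage:
#   k8s_supported v0.6.0 1.29 && echo "supported"
```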

Next Steps

Best Practices

Troubleshooting