Installation
This guide covers installing the NVIDIA NIM Operator in your Kubernetes cluster.
Prerequisites
Before installing the operator, ensure your cluster meets these requirements:
Kubernetes version
Kubernetes v1.28 or higher is required.
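The version requirement above can be checked from a script. The following is a minimal sketch, not part of the operator tooling: the function name k8s_version_at_least is hypothetical, and it only compares the major and minor parts of a version string.

```shell
# Hypothetical helper (not shipped with the operator): succeeds when a
# Kubernetes version string such as "v1.29.4" or "v1.28.3+k3s1" is at
# least the required MAJOR.MINOR.
k8s_version_at_least() {
  version="${1#v}"            # drop the leading "v", if any
  req_major="$2"
  req_minor="$3"
  major="${version%%.*}"      # text before the first dot
  rest="${version#*.}"        # text after the first dot
  minor="${rest%%.*}"         # text before the next dot
  minor="${minor%%[!0-9]*}"   # strip non-numeric suffixes such as "+"
  [ "$major" -gt "$req_major" ] ||
    { [ "$major" -eq "$req_major" ] && [ "$minor" -ge "$req_minor" ]; }
}
```

One way to feed it the server version, assuming jq is installed, is k8s_version_at_least "$(kubectl version -o json | jq -r '.serverVersion.gitVersion')" 1 28.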
NVIDIA GPU Operator
Install the NVIDIA GPU Operator to provide GPU device plugins and drivers.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
cert-manager (recommended)
The operator uses cert-manager for admission webhook certificates when the admission controller is enabled (the default).
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
Cluster access
Ensure you have cluster-admin privileges to install CRDs and cluster-scoped resources.
The admission controller can be disabled if cert-manager is not available, but it’s recommended for production use to validate resource configurations.
Installation methods
Helm (recommended)
kubectl
Development
Install with Helm
Helm is the recommended installation method as it simplifies upgrades and configuration management.
Add the Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Install the operator
Install with default settings:
helm install nim-operator nvidia/k8s-nim-operator \
-n nim-operator \
--create-namespace
Install with custom configuration
Create a values.yaml file with your configuration:
operator:
  replicas: 1
  image:
    repository: ghcr.io/nvidia/k8s-nim-operator
    tag: main
    pullPolicy: Always
    pullSecrets: []
  # Operator resource limits
  resources:
    limits:
      cpu: "1"
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 128Mi
  # Logging configuration
  log:
    level: info # debug | info | warn | error
    encoder: json # json | console
    stacktraceLevel: error
  # Admission controller settings
  admissionController:
    enabled: true
    tls:
      mode: "cert-manager" # cert-manager | secret
      certManager:
        issuerType: "selfsigned" # selfsigned | clusterissuer | issuer
        issuerName: ""
        dnsNames: []
  # Node scheduling
  nodeSelector: {}
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: In
                values: [""]
# Enable NVIDIA Node Feature Discovery rules
nfd:
  nodeFeatureRules:
    deviceID: false
# Enable Dynamo (optional dependency)
dynamo:
  enabled: false
Install with custom values:
helm install nim-operator nvidia/k8s-nim-operator \
-n nim-operator \
--create-namespace \
-f values.yaml
Upgrade the operator
helm upgrade nim-operator nvidia/k8s-nim-operator \
-n nim-operator \
-f values.yaml
Install with kubectl
For environments where Helm is not available, you can install using kubectl and manifests.
Install CRDs
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
Install operator
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml
This will:
Create the nim-operator namespace
Deploy the operator controller
Configure RBAC permissions
Set up admission webhooks
With a kubectl installation, you must update CRDs manually during upgrades. Helm handles this automatically.
Install for development
For development and testing, you can build and deploy from source:
Clone the repository
git clone https://github.com/NVIDIA/k8s-nim-operator.git
cd k8s-nim-operator
Build the operator image
make build \
IMAGE_NAME=<your-registry>/k8s-nim-operator \
VERSION=<tag> \
-f deployments/container/Makefile
Push to registry
docker push <your-registry>/k8s-nim-operator:<tag>
Install CRDs
Deploy operator
make deploy IMG=<your-registry>/k8s-nim-operator:<tag>
Verify installation
Check that the operator is running:
kubectl get pods -n nim-operator
Expected output:
NAME READY STATUS RESTARTS AGE
nim-operator-controller-manager-7d9f8c5b6d-x9k2p 1/1 Running 0 2m
Verify the CRDs are installed:
kubectl get crds | grep nvidia.com
Expected output:
nimbuilds.apps.nvidia.com
nimcaches.apps.nvidia.com
nimpipelines.apps.nvidia.com
nimservices.apps.nvidia.com
nemocustomizers.apps.nvidia.com
nemodatastores.apps.nvidia.com
nemoentitystores.apps.nvidia.com
nemoevaluators.apps.nvidia.com
nemoguardrails.apps.nvidia.com
Check operator logs:
kubectl logs -n nim-operator -l control-plane=controller-manager
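The CRD check above can be automated rather than read by eye. This is a rough sketch under stated assumptions: the function name missing_nim_crds is hypothetical, and the expected CRD names are taken from the expected output listed above, so they may differ between operator versions.

```shell
# Hypothetical helper: reads CRD names on stdin (one per line, with or
# without the prefix that `kubectl get crds -o name` adds) and reports
# every expected NIM Operator CRD that is absent.
missing_nim_crds() {
  installed="$(cat)"
  missing=0
  for crd in nimbuilds nimcaches nimpipelines nimservices \
             nemocustomizers nemodatastores nemoentitystores \
             nemoevaluators nemoguardrails; do
    if ! printf '%s\n' "$installed" | grep -q "${crd}\.apps\.nvidia\.com"; then
      echo "missing: ${crd}.apps.nvidia.com"
      missing=1
    fi
  done
  # Non-zero exit status when at least one CRD is missing
  return "$missing"
}
```

A typical invocation would be kubectl get crds -o name | missing_nim_crds, which prints nothing and exits with status 0 when all listed CRDs are present.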
Configuration options
Operator arguments
The operator accepts these command-line arguments:
--health-probe-bind-address - Address for health probe server (default: :8081)
--metrics-bind-address - Address for metrics server (default: :8080)
--leader-elect - Enable leader election for HA deployments
Environment variables
Key environment variables:
WATCH_NAMESPACE - Namespace to watch (empty = all namespaces)
OPERATOR_NAMESPACE - Namespace where operator is deployed
OPERATOR_VERSION - Operator version (set automatically)
Logging configuration
Configure logging behavior via Helm values:
operator:
  log:
    development: false
    level: info # debug | info | warn | error | dpanic | panic | fatal
    encoder: json # json | console
    stacktraceLevel: error
Resource limits
Adjust operator resource requirements:
operator:
  resources:
    limits:
      cpu: "1"
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 128Mi
High availability
For production deployments, enable multiple replicas with leader election:
operator:
  replicas: 3
  args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:8080
    - --leader-elect
Admission controller configuration
The admission controller validates and mutates NIM resources before they’re persisted.
Configuration options: cert-manager (recommended), custom ClusterIssuer, user-provided secret, or disabled. With cert-manager:
operator:
  admissionController:
    enabled: true
    tls:
      mode: "cert-manager"
      certManager:
        issuerType: "selfsigned"
        issuerName: ""
        dnsNames: []
Disabling the admission controller removes validation and defaulting for NIM resources, which may lead to misconfigurations.
Security context
The operator runs with these security settings by default:
operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
These settings are compatible with restricted pod security standards and OpenShift SCCs.
OpenShift
The operator works out of the box on OpenShift, with automatic SCC handling:
helm install nim-operator nvidia/k8s-nim-operator \
-n nim-operator \
--create-namespace
NIM deployments automatically use appropriate SCCs:
nonroot - Default for standard deployments
anyuid - When using proxy certificates
hostmount-anyuid - When using hostPath volumes
VMware TKGS
For VMware Tanzu Kubernetes Grid Service:
operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
This is the default configuration and works with TKGS pod security policies.
Air-gapped environments
For air-gapped deployments:
Mirror the operator image to your private registry
Configure image pull secrets:
operator:
  image:
    repository: <your-registry>/k8s-nim-operator
    tag: <version>
    pullSecrets:
      - name: private-registry-secret
Ensure all dependency images are also mirrored (cert-manager, etc.)
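The mirroring step can be sketched as a pull, tag, and push sequence. The function names below are hypothetical, the registry host in the usage note is a placeholder, and the sketch assumes docker is the image tool in use (as in the development instructions) and that the mirrored image keeps its original name and tag.

```shell
# Hypothetical helper: maps a source image reference to its location in a
# private registry, keeping only the final path component and the tag.
mirror_image_ref() {
  source_ref="$1"        # e.g. ghcr.io/nvidia/k8s-nim-operator:main
  target_registry="$2"   # e.g. registry.example.internal (placeholder)
  echo "${target_registry}/${source_ref##*/}"
}

# Hypothetical helper: pulls the source image, retags it for the private
# registry, and pushes it. Stops at the first docker command that fails.
mirror_image() {
  target="$(mirror_image_ref "$1" "$2")"
  docker pull "$1" &&
    docker tag "$1" "$target" &&
    docker push "$target"
}
```

For example, mirror_image ghcr.io/nvidia/k8s-nim-operator:main registry.example.internal pushes registry.example.internal/k8s-nim-operator:main, which matches the repository layout used in the values above.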
Upgrade
Helm upgrade
Upgrade to the latest version:
helm repo update
helm upgrade nim-operator nvidia/k8s-nim-operator \
-n nim-operator \
-f values.yaml
kubectl upgrade
For kubectl installations, upgrade CRDs first:
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml
With a Helm installation, CRDs are upgraded automatically when operator.upgradeCRD is true (the default).
Uninstall
Uninstall with Helm
helm uninstall nim-operator -n nim-operator
Uninstall with kubectl
Delete operator resources:
kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml
Delete CRDs (this will delete all NIM resources):
kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
Deleting CRDs will delete all NIMService, NIMCache, and other NIM resources in your cluster. Ensure you have backups if needed.
Delete namespace
kubectl delete namespace nim-operator
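Because deleting the CRDs also deletes every custom resource of those types, it may be worth counting what still exists before running the CRD deletion step. The sketch below is illustrative only: the function name count_nim_resources is hypothetical, and it checks just the NIMService and NIMCache types named in the warning above.

```shell
# Hypothetical helper: prints how many NIMService and NIMCache objects
# exist across all namespaces. Prints 0 if the CRDs are already gone or
# the cluster cannot be reached.
count_nim_resources() {
  kubectl get nimservices.apps.nvidia.com,nimcaches.apps.nvidia.com \
    --all-namespaces -o name 2>/dev/null | grep -c . || true
}
```

A non-zero count suggests exporting those resources, for example with kubectl get ... -o yaml, before removing the CRDs.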
Troubleshooting
Operator pod not starting
If the operator pod fails to start:
Check image pull permissions
Verify cert-manager is running (if admission controller is enabled)
Check resource availability
kubectl describe pod -n nim-operator -l control-plane=controller-manager
Webhook certificate issues
If you see webhook-related errors:
kubectl get certificates -n nim-operator
kubectl describe certificate -n nim-operator
kubectl logs -n cert-manager -l app=cert-manager
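Instead of inspecting certificates by hand, a script can block until cert-manager reports them as ready. This is a small sketch, assuming the webhook certificates live in the nim-operator namespace and that cert-manager's Certificate resources expose the usual Ready condition; the function name and default timeout are hypothetical.

```shell
# Hypothetical helper: waits until every cert-manager Certificate in the
# namespace reports the Ready condition, or fails after the timeout.
wait_for_webhook_certs() {
  namespace="${1:-nim-operator}"   # namespace to check
  timeout="${2:-120s}"             # how long kubectl should wait
  kubectl wait certificate --all \
    --namespace "$namespace" \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

Calling wait_for_webhook_certs with no arguments uses the defaults; a non-zero exit status means at least one certificate did not become ready in time, or none were found.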
CRD installation fails
If CRD installation fails, ensure you have cluster-admin privileges:
kubectl auth can-i create customresourcedefinitions
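The same check can be extended to other cluster-scoped resources in one pass. The sketch below is an assumption-laden example: the function name preflight_permissions is hypothetical, and the resource list (CRDs plus ClusterRoles and ClusterRoleBindings) is a guess at what the installation creates, not a list confirmed by the operator documentation.

```shell
# Hypothetical preflight: asks the API server whether the current user may
# create each cluster-scoped resource type in the (assumed) list below.
preflight_permissions() {
  failed=0
  for resource in customresourcedefinitions clusterroles clusterrolebindings; do
    # `kubectl auth can-i` prints "yes" or "no" on stdout
    if [ "$(kubectl auth can-i create "$resource")" = "yes" ]; then
      echo "ok: create $resource"
    else
      echo "denied: create $resource"
      failed=1
    fi
  done
  # Non-zero exit status when any permission is missing
  return "$failed"
}
```

Running preflight_permissions before the install gives a per-resource report, so a missing permission shows up before any manifests are applied.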
Operator logs show errors
Increase log verbosity for debugging:
operator:
  log:
    level: debug
    development: true
Next steps
Quick start: Deploy your first NIM microservice
NIMService guide: Learn about NIMService configuration options
NIMCache guide: Understand model caching strategies
Production best practices: Configure for production deployments