Installation

This guide covers installing the NVIDIA NIM Operator in your Kubernetes cluster.

Prerequisites

Before installing the operator, ensure your cluster meets these requirements:
1. Kubernetes version

Kubernetes v1.28 or higher is required.
kubectl version
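The version requirement above can also be checked in a script. The following is a minimal sketch, not part of the operator tooling: the helper name version_at_least is hypothetical, and the usage lines in the comments assume that jq is installed and that kubectl can reach the cluster.

```shell
# Return 0 when a Kubernetes version string such as "v1.29.3" or
# "v1.28.5+k3s1" is at least the given major.minor, non-zero otherwise.
# (Hypothetical helper for illustration.)
version_at_least() (
  version=${1#v}            # drop the leading "v"
  want_major=$2
  want_minor=$3
  major=${version%%.*}      # text before the first dot
  rest=${version#*.}        # text after the first dot
  minor=${rest%%.*}         # text before the next dot
  minor=${minor%%[!0-9]*}   # strip non-numeric suffixes such as "+"
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
)

# Example usage (assumes jq and a reachable cluster):
#   server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   version_at_least "$server" 1 28 || echo "Kubernetes $server is older than v1.28" >&2
```

The comparison is done on the major and minor components only, since the requirement is stated as v1.28 or higher.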

2. NVIDIA GPU Operator

Install the NVIDIA GPU Operator to provide GPU device plugins and drivers.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

3. cert-manager (recommended)

The operator uses cert-manager for admission webhook certificates when the admission controller is enabled (default).
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

4. Cluster access

Ensure you have cluster-admin privileges to install CRDs and cluster-scoped resources.
If cert-manager is not available, you can disable the admission controller. For production use, keep it enabled so that resource configurations are validated.

Installation methods

Verify installation

Check that the operator is running:
kubectl get pods -n nim-operator
Expected output:
NAME                                               READY   STATUS    RESTARTS   AGE
nim-operator-controller-manager-7d9f8c5b6d-x9k2p   1/1     Running   0          2m
Verify the CRDs are installed:
kubectl get crds | grep nvidia.com
Expected output:
nimbuilds.apps.nvidia.com
nimcaches.apps.nvidia.com
nimpipelines.apps.nvidia.com
nimservices.apps.nvidia.com
nemocustomizers.apps.nvidia.com
nemodatastores.apps.nvidia.com
nemoentitystores.apps.nvidia.com
nemoevaluators.apps.nvidia.com
nemoguardrails.apps.nvidia.com
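Comparing the CRD list by eye is error-prone, so the check can be scripted. The sketch below is an illustration only: missing_crds and EXPECTED_CRDS are hypothetical names, and the expected set is copied from the output shown above, which may differ between operator versions.

```shell
# CRD names taken from the expected output above (may vary by operator version).
EXPECTED_CRDS="nimbuilds nimcaches nimpipelines nimservices nemocustomizers nemodatastores nemoentitystores nemoevaluators nemoguardrails"

# Read CRD names on stdin, one per line (for example the output of
# `kubectl get crds -o name`), and print each expected CRD that is absent.
# Exit status is 0 only when nothing is missing.
missing_crds() (
  installed=$(cat)
  status=0
  for name in $EXPECTED_CRDS; do
    crd="$name.apps.nvidia.com"
    if ! printf '%s\n' "$installed" | grep -q -F "$crd"; then
      printf '%s\n' "$crd"
      status=1
    fi
  done
  exit "$status"
)

# Example usage (requires cluster access):
#   kubectl get crds -o name | missing_crds || echo "some CRDs are missing" >&2
```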
Check operator logs:
kubectl logs -n nim-operator -l control-plane=controller-manager

Configuration options

Operator arguments

The operator accepts these command-line arguments:
  • --health-probe-bind-address - Address for health probe server (default: :8081)
  • --metrics-bind-address - Address for metrics server (default: :8080)
  • --leader-elect - Enable leader election for HA deployments

Environment variables

Key environment variables:
  • WATCH_NAMESPACE - Namespace to watch (empty = all namespaces)
  • OPERATOR_NAMESPACE - Namespace where operator is deployed
  • OPERATOR_VERSION - Operator version (set automatically)

Logging configuration

Configure logging behavior via Helm values:
operator:
  log:
    development: false
    level: info  # debug | info | warn | error | dpanic | panic | fatal
    encoder: json  # json | console
    stacktraceLevel: error

Resource limits

Adjust operator resource requirements:
operator:
  resources:
    limits:
      cpu: "1"
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 128Mi

High availability

For production deployments, enable multiple replicas with leader election:
operator:
  replicas: 3
  args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:8080
    - --leader-elect

Admission controller configuration

The admission controller validates and mutates NIM resources before they’re persisted.
operator:
  admissionController:
    enabled: true
    tls:
      mode: "cert-manager"
      certManager:
        issuerType: "selfsigned"
        issuerName: ""
        dnsNames: []
Disabling the admission controller removes validation and defaulting for NIM resources, which may lead to misconfigurations.

Security context

The operator runs with these security settings by default:
operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
These settings are compatible with restricted pod security standards and OpenShift SCCs.

Platform-specific configuration

OpenShift

The operator works out of the box on OpenShift with automatic SCC handling:
helm install nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  --create-namespace
NIM deployments automatically use appropriate SCCs:
  • nonroot - Default for standard deployments
  • anyuid - When using proxy certificates
  • hostmount-anyuid - When using hostPath volumes

VMware TKGS

For VMware Tanzu Kubernetes Grid Service:
operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
This is the default configuration and works with TKGS pod security policies.

Air-gapped environments

For air-gapped deployments:
  1. Mirror the operator image to your private registry
  2. Configure image pull secrets:
operator:
  image:
    repository: <your-registry>/k8s-nim-operator
    tag: <version>
    pullSecrets:
      - name: private-registry-secret
  3. Ensure all dependency images are also mirrored (cert-manager, etc.)
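The mirroring steps above can be sketched as a pair of shell helpers. This is a rough outline under stated assumptions: mirror_ref and mirror_image are hypothetical names, the docker CLI is assumed to be installed and logged in to both registries, and the source image reference is left as a placeholder because this guide does not state it.

```shell
# Build the mirrored image reference by keeping only the final path
# component (image name and tag) of the source reference.
# (Hypothetical helper for illustration.)
mirror_ref() (
  source_ref=$1
  registry=${2%/}                    # tolerate one trailing slash
  printf '%s/%s\n' "$registry" "${source_ref##*/}"
)

# Pull, retag, and push one image. Assumes the docker CLI is available
# and already authenticated against both registries.
mirror_image() (
  target=$(mirror_ref "$1" "$2")
  docker pull "$1" &&
    docker tag "$1" "$target" &&
    docker push "$target"
)

# Example usage (placeholders, not real values):
#   mirror_image "<source-registry>/k8s-nim-operator:<version>" "<your-registry>"
#   kubectl create secret docker-registry private-registry-secret \
#     -n nim-operator \
#     --docker-server="<your-registry>" \
#     --docker-username="<user>" \
#     --docker-password="<password>"
```

The resulting reference has the form <your-registry>/k8s-nim-operator:<version>, which matches the repository and tag values in the Helm snippet above.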

Upgrade

Upgrade with Helm

Upgrade to the latest version:
helm repo update
helm upgrade nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  -f values.yaml

Upgrade with kubectl

For kubectl installations, upgrade CRDs first:
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml
The operator automatically upgrades CRDs when using Helm with operator.upgradeCRD: true (default).

Uninstall

Uninstall with Helm

helm uninstall nim-operator -n nim-operator

Uninstall with kubectl

Delete operator resources:
kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml
Delete CRDs (this will delete all NIM resources):
kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
Deleting CRDs will delete all NIMService, NIMCache, and other NIM resources in your cluster. Ensure you have backups if needed.
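Because deleting the CRDs removes every custom resource of those kinds, it can help to export them first. The following is a minimal sketch, not an official backup procedure: the function names and the default file name nim-backup.yaml are hypothetical, and the list of kinds is an assumption that should be adjusted to the resources actually in use.

```shell
# Resource kinds to export (assumed subset of the CRDs listed in
# "Verify installation"; extend as needed).
NIM_KINDS="nimservices nimcaches nimpipelines nimbuilds"

# Join the kinds into the comma-separated, fully qualified form that
# kubectl accepts, e.g. "nimservices.apps.nvidia.com,nimcaches.apps.nvidia.com".
nim_resource_arg() (
  out=""
  for kind in $NIM_KINDS; do
    out="${out:+$out,}$kind.apps.nvidia.com"
  done
  printf '%s\n' "$out"
)

# Dump every matching resource in all namespaces to a YAML file.
# Requires cluster access; the file name defaults to nim-backup.yaml.
backup_nim_resources() {
  kubectl get "$(nim_resource_arg)" --all-namespaces -o yaml > "${1:-nim-backup.yaml}"
}

# Example usage:
#   backup_nim_resources nim-backup.yaml
```

The exported YAML includes status and server-generated metadata, so it may need cleaning before it can be re-applied to another cluster.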

Delete namespace

kubectl delete namespace nim-operator

Troubleshooting

Operator pod not starting

If the operator pod fails to start:
  1. Check image pull permissions
  2. Verify cert-manager is running (if admission controller is enabled)
  3. Check resource availability
kubectl describe pod -n nim-operator -l control-plane=controller-manager

Webhook certificate issues

If you see webhook-related errors:
kubectl get certificates -n nim-operator
kubectl describe certificate -n nim-operator
kubectl logs -n cert-manager -l app=cert-manager

CRD installation fails

If CRD installation fails, ensure you have cluster-admin privileges:
kubectl auth can-i create customresourcedefinitions
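The same check can be repeated for the other verbs an install or upgrade is likely to need. The sketch below is illustrative: missing_crd_verbs is a hypothetical helper, the verb list is an assumption rather than a documented requirement, and it relies on kubectl auth can-i returning a non-zero exit status when the answer is "no".

```shell
# kubectl binary to use; overridable so the check can be run against a stub.
KUBECTL=${KUBECTL:-kubectl}

# Print each verb on customresourcedefinitions that the current user lacks.
# Exit status is 0 only when every verb is allowed.
# (The verb list is an assumption for illustration.)
missing_crd_verbs() (
  status=0
  for verb in create get update patch delete; do
    if ! "$KUBECTL" auth can-i "$verb" customresourcedefinitions >/dev/null 2>&1; then
      printf '%s\n' "$verb"
      status=1
    fi
  done
  exit "$status"
)

# Example usage (requires cluster access):
#   missing_crd_verbs || echo "insufficient permissions for CRD management" >&2
```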

Operator logs show errors

Increase log verbosity for debugging:
operator:
  log:
    level: debug
    development: true
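With debug logging enabled the output gets noisy, so it can help to filter for the severe entries. The sketch below assumes the json encoder is in use and that each entry is a single-line JSON object with a "level" field; error_lines is a hypothetical helper name.

```shell
# Print only error-level (and more severe) entries from JSON-encoded
# operator logs read on stdin. Assumes one JSON object per line with a
# "level" field. Returns non-zero when no line matches, as grep does.
error_lines() {
  grep -E '"level"[[:space:]]*:[[:space:]]*"(error|dpanic|panic|fatal)"'
}

# Example usage (requires cluster access). When a label selector is used,
# kubectl logs prints only the last 10 lines per pod unless --tail is set:
#   kubectl logs -n nim-operator -l control-plane=controller-manager --tail=500 | error_lines
```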

Next steps

  • Quick start - Deploy your first NIM microservice
  • NIMService guide - Learn about NIMService configuration options
  • NIMCache guide - Understand model caching strategies
  • Production best practices - Configure for production deployments
