This guide provides best practices for deploying, managing, and optimizing NVIDIA NIM services in production environments.

Resource Allocation

GPU Resources

Properly sizing GPU resources is critical for performance and cost optimization.
Different models require different amounts of GPU memory:
Model Size          Minimum GPU Memory   Recommended GPU
7B-8B parameters    16 GB                A10, L4, L40
13B parameters      24 GB                L40, A100 40GB
70B+ parameters     80 GB                A100 80GB, H100
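The minimums above roughly track FP16 weight size, about 2 bytes per parameter. The sketch below makes the arithmetic explicit (the 1.2× overhead factor for KV cache and CUDA buffers is an illustrative assumption of ours, not an NVIDIA figure); it also shows why 70B-class models are typically quantized or sharded across GPUs rather than served from a single card at FP16:

```python
def estimate_gpu_memory_gb(params_billions: float,
                           bytes_per_param: float = 2.0,
                           overhead_factor: float = 1.2) -> float:
    """Rough GPU memory needed to serve a model.

    bytes_per_param=2.0 assumes FP16/BF16 weights; overhead_factor
    is an illustrative allowance for KV cache and CUDA buffers.
    """
    weights_gb = params_billions * 1e9 * bytes_per_param / 1024**3
    return weights_gb * overhead_factor

print(round(estimate_gpu_memory_gb(8), 1))   # ~17.9 GB with overhead (weights alone ~14.9 GB)
print(round(estimate_gpu_memory_gb(70), 1))  # ~156.5 GB: far beyond one 80 GB GPU at FP16
```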
Example configuration:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
  # For specific GPU selection
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
CPU Requirements:
  • Minimum: 8 cores per NIM instance
  • Recommended: 16-32 cores for production workloads
Memory Requirements:
  • Minimum: 16 GB RAM
  • Recommended: 32-64 GB RAM (depends on model size and batch size)
A combined example:
spec:
  resources:
    limits:
      cpu: "32"
      memory: 64Gi
      nvidia.com/gpu: 1
    requests:
      cpu: "16"
      memory: 32Gi
      nvidia.com/gpu: 1
NIMs use shared memory for fast model I/O. Set appropriate limits:
spec:
  storage:
    sharedMemorySizeLimit: 8Gi  # Increase for larger models
Guidelines:
  • 7B-13B models: 4-8 GB
  • 70B+ models: 16-32 GB
  • Adjust based on batch size and concurrent requests

Storage Recommendations

Use fast, persistent storage for model caches:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
spec:
  storage:
    pvc:
      create: true
      storageClass: fast-ssd  # Use SSD-backed storage
      size: 200Gi  # Size depends on model
      volumeAccessMode: ReadWriteMany  # For multi-node access
Storage Size Guidelines:
  • 7B-8B models: 50-100 GB
  • 13B models: 100-150 GB
  • 70B+ models: 200-500 GB

High Availability Configuration

Multiple Replicas

Deploy multiple replicas for high availability:
Step 1: Configure Minimum Replicas

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  replicas: 3  # Minimum 2 for HA, 3+ recommended
Step 2: Set Pod Disruption Budgets

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nim-service-pdb
spec:
  minAvailable: 2  # Keep at least 2 pods available
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
Step 3: Configure Pod Anti-Affinity

Spread pods across nodes and availability zones:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: meta-llama3-8b-instruct
          topologyKey: kubernetes.io/hostname
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: meta-llama3-8b-instruct
          topologyKey: topology.kubernetes.io/zone

Health Probes Tuning

Configure appropriate health checks:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  # Startup probe - for slow model loading
  startupProbe:
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 180  # 30 minutes max startup time
      successThreshold: 1
      timeoutSeconds: 5

  # Readiness probe - for traffic routing
  readinessProbe:
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 5

  # Liveness probe - for pod restarts
  livenessProbe:
    probe:
      httpGet:
        path: /v1/health/live
        port: api
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 5
Probe Configuration Guidelines:
  • startupProbe: Set high failureThreshold for large models (30+ minutes)
  • readinessProbe: Use shorter intervals (10s) for faster traffic routing
  • livenessProbe: Use longer intervals (30s) to avoid unnecessary restarts
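The 30-minute ceiling noted in the startup probe example follows from failureThreshold × periodSeconds (plus the initial delay). A tiny helper, purely illustrative, makes the arithmetic explicit:

```python
def max_startup_seconds(failure_threshold: int, period_seconds: int,
                        initial_delay_seconds: int = 0) -> int:
    """Worst-case wait before a startupProbe gives up on the pod.

    Kubernetes tolerates failure_threshold consecutive failed checks,
    one every period_seconds, after an initial delay.
    """
    return initial_delay_seconds + failure_threshold * period_seconds

# 60s initial delay + 180 failures x 10s = 1860s, i.e. 31 minutes
print(max_startup_seconds(180, 10, 60))
```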

Security Best Practices

Secret Management

Store NGC API keys securely:
# Create secret from literal
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY=<your-key> \
  --dry-run=client -o yaml | kubectl apply -f -

# Or use external secret managers
# Example with External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ngc-api-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: ngc-api-secret
  data:
  - secretKey: NGC_API_KEY
    remoteRef:
      key: secret/nim/ngc
      property: api_key
Use minimal RBAC permissions:
# Service account for NIM workloads
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nim-service-account
---
# Role with minimal permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nim-service-role
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list"]
---
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nim-service-rolebinding
subjects:
- kind: ServiceAccount
  name: nim-service-account
roleRef:
  kind: Role
  name: nim-service-role
  apiGroup: rbac.authorization.k8s.io
Restrict network access to NIM services:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nim-service-netpol
spec:
  podSelector:
    matchLabels:
      app: meta-llama3-8b-instruct
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: application-namespace
    ports:
    - protocol: TCP
      port: 8000
  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Allow NGC access (external HTTPS; a podSelector only matches
  # in-cluster pods, so external egress needs an ipBlock)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443
Run containers with non-root users:
spec:
  userID: 1000  # Non-root user
  groupID: 2000

  # Operator sets appropriate security context:
  # securityContext:
  #   runAsNonRoot: true
  #   runAsUser: 1000
  #   runAsGroup: 2000
  #   fsGroup: 2000

Performance Optimization

Model Caching Strategy

Step 1: Pre-cache Models

Cache models before deploying services:
# Create cache first
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama3-8b-cache
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:latest
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      size: 100Gi
      storageClass: fast-ssd
Wait for caching to complete:
kubectl wait --for=condition=Ready nimcache/llama3-8b-cache --timeout=60m
Step 2: Reuse Caches

Share caches across multiple NIMServices:
# Service 1
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: service-1
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
    readOnly: true
---
# Service 2 - same cache
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: service-2
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
    readOnly: true
Step 3: Cache Multiple Profiles

Cache specific model profiles for different GPUs:
spec:
  source:
    ngc:
      model:
        profiles:
          - "a100-80gb-tp1"
          - "h100-80gb-tp1"

Autoscaling Configuration

Enable HPA based on GPU utilization:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: nvidia.com/gpu
          target:
            type: Utilization
            averageUtilization: 80
      - type: Pods
        pods:
          metric:
            name: nim_requests_per_second
          target:
            type: AverageValue
            averageValue: "100"
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 50
            periodSeconds: 60
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
          - type: Percent
            value: 100
            periodSeconds: 30
          - type: Pods
            value: 2
            periodSeconds: 30
Note: Autoscaling is not supported with multi-node deployments (LeaderWorkerSet).

Ingress and Load Balancing

Use Gateway API for advanced traffic management:
spec:
  expose:
    service:
      type: ClusterIP
    router:
      hostDomainName: example.com
      gateway:
        name: nim-gateway
        namespace: gateway-system
        httpRoutesEnabled: true
        # Optional: Custom backend for endpoint picker
        backendRef:
          name: endpoint-picker-service
          port: 8080
Enable session affinity for stateful workloads:
spec:
  expose:
    service:
      type: ClusterIP
      annotations:
        # For NGINX ingress
        nginx.ingress.kubernetes.io/affinity: "cookie"
        nginx.ingress.kubernetes.io/session-cookie-name: "nim-affinity"
        nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
        nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"

Cost Optimization

GPU Utilization

Right-size GPU Allocation

Monitor GPU utilization and adjust:
# Check GPU utilization
kubectl exec <pod-name> -- nvidia-smi

# Use metrics to track utilization
kubectl top pod <pod-name>
Target: 70-85% GPU utilization for cost efficiency
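Checking whether a fleet sits in that band can be scripted against the per-GPU percentages that `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` prints, one integer per line. The helper below is illustrative, not part of any NVIDIA tooling:

```python
def in_target_band(nvidia_smi_output: str,
                   low: float = 70.0, high: float = 85.0) -> bool:
    """Return True if average GPU utilization falls inside [low, high].

    Expects one integer percentage per line, the format produced by:
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
    """
    values = [float(line) for line in nvidia_smi_output.split() if line]
    avg = sum(values) / len(values)
    return low <= avg <= high

sample = "78\n81\n74\n"          # three GPUs
print(in_target_band(sample))    # True: average is ~77.7%
```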

Use Spot/Preemptible Instances

For non-critical workloads, use spot instances:
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g4dn.xlarge
    karpenter.sh/capacity-type: spot
  tolerations:
  - key: karpenter.sh/capacity-type
    operator: Equal
    value: spot
    effect: NoSchedule

Multi-Instance GPU (MIG)

For smaller models, use MIG to share GPUs:
spec:
  resources:
    limits:
      nvidia.com/mig-1g.10gb: 1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB-MIG-1g.10gb

Schedule Off-Peak Workloads

Use CronJobs for batch inference during off-peak hours to leverage lower cloud costs.
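A sketch of such a CronJob is below. The schedule, image, service name, and request body are all placeholders; it assumes a NIMService is reachable in-cluster at `llama3-8b:8000`:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-inference
spec:
  schedule: "0 2 * * *"       # 02:00 daily, typically off-peak
  concurrencyPolicy: Forbid   # never overlap batch runs
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: batch-client
            image: curlimages/curl:8.7.1   # placeholder client image
            command: ["sh", "-c"]
            args:
            - |
              # POST a batch request to the in-cluster NIM endpoint
              curl -sf http://llama3-8b:8000/v1/completions \
                -H 'Content-Type: application/json' \
                -d '{"model": "meta/llama3-8b-instruct", "prompt": "ping"}'
```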

Resource Limits and Requests

Set appropriate limits and requests:
spec:
  resources:
    limits:
      cpu: "32"        # Hard limit
      memory: 64Gi     # Hard limit
      nvidia.com/gpu: 1
    requests:
      cpu: "16"        # Guaranteed allocation
      memory: 32Gi     # Guaranteed allocation
      nvidia.com/gpu: 1
Guidelines:
  • Set requests = limits for GPU resources (guaranteed QoS)
  • Set requests < limits for CPU/memory (burstable QoS)
  • Monitor actual usage and adjust accordingly

Monitoring and Observability

Enable Comprehensive Monitoring

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  # Enable Prometheus metrics
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        prometheus: kube-prometheus
      interval: 30s
      scrapeTimeout: 10s

  # For NeMo services - enable OpenTelemetry
  otel:
    enabled: true
    exporterOtlpEndpoint: http://otel-collector:4317
    logLevel: INFO
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp

Set Up Alerts

Create Prometheus alerts for critical conditions:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-service-alerts
spec:
  groups:
  - name: nim-service
    interval: 30s
    rules:
    - alert: NIMServiceDown
      expr: nimService_status_total{status="Failed"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "NIMService {{ $labels.name }} is down"

    - alert: NIMCacheFailed
      expr: nimCache_status_total{status="Failed"} > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "NIMCache {{ $labels.name }} failed"

    - alert: LowGPUUtilization
      expr: nvidia_gpu_utilization < 30
      for: 30m
      labels:
        severity: info
      annotations:
        summary: "Low GPU utilization on {{ $labels.pod }}"

Multi-Node Deployments

For large models requiring multiple GPUs:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama3-70b
spec:
  replicas: 2  # Number of replica groups
  multiNode:
    backendType: lws
    parallelism:
      tensor: 4      # GPUs per node
      pipeline: 2    # Nodes per replica
    mpi:
      mpiStartTimeout: 600  # 10 minutes
  
  resources:
    limits:
      nvidia.com/gpu: 4  # Per pod
    requests:
      nvidia.com/gpu: 4
Multi-Node Limitations:
  • Autoscaling is not supported
  • Requires LeaderWorkerSet CRD
  • All nodes must have identical GPU configurations

Production Checklist

Before going to production, verify:
1. Resource Configuration

  • GPU resources properly sized
  • CPU and memory limits set
  • Shared memory configured
  • Storage class selected (SSD recommended)
2. High Availability

  • Multiple replicas configured (3+ recommended)
  • Pod disruption budgets set
  • Pod anti-affinity rules configured
  • Health probes tuned appropriately
3. Security

  • Secrets stored securely
  • RBAC configured with minimal permissions
  • Network policies applied
  • Running as non-root user
4. Monitoring

  • Metrics collection enabled
  • ServiceMonitor configured
  • Alerts set up
  • Logging aggregation configured
5. Performance

  • Model caches pre-created
  • Autoscaling configured (if needed)
  • Ingress/Gateway properly configured
  • Load testing completed

Next Steps

  • Monitoring: set up comprehensive monitoring and observability
  • Troubleshooting: learn how to diagnose and resolve issues
