This guide provides best practices for deploying, managing, and optimizing NVIDIA NIM services in production environments.
## Resource Allocation

### GPU Resources

Properly sizing GPU resources is critical for performance and cost optimization. Different models require different amounts of GPU memory:

| Model Size | Minimum GPU Memory | Recommended GPU |
|---|---|---|
| 7B-8B parameters | 16 GB | A10, L4, L40 |
| 13B parameters | 24 GB | L40, A100 40GB |
| 70B+ parameters | 80 GB | A100 80GB, H100 |
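As a rough rule of thumb (a back-of-envelope assumption, not an official sizing formula), fp16 weights alone need about 2 bytes per parameter, so roughly 2 GB per billion parameters, before KV cache and activation overhead — which is why an 8B model needs at least 16 GB:

```shell
# fp16 weights: ~2 bytes per parameter => ~2 GB per billion parameters
# (rule of thumb only; KV cache and activations add more on top)
echo "8B model: $(( 8 * 2 )) GB of weights at fp16"
# prints: 8B model: 16 GB of weights at fp16
```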
Example configuration:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
  # For specific GPU selection
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
```
### CPU and Memory Allocation

**CPU Requirements:**

- Minimum: 8 cores per NIM instance
- Recommended: 16-32 cores for production workloads

**Memory Requirements:**

- Minimum: 16 GB RAM
- Recommended: 32-64 GB RAM (depends on model size and batch size)

```yaml
spec:
  resources:
    limits:
      cpu: "32"
      memory: 64Gi
      nvidia.com/gpu: 1
    requests:
      cpu: "16"
      memory: 32Gi
      nvidia.com/gpu: 1
```
### Shared Memory Configuration

NIMs use shared memory for fast model I/O. Set appropriate limits:

```yaml
spec:
  storage:
    sharedMemorySizeLimit: 8Gi  # Increase for larger models
```

Guidelines:

- 7B-13B models: 4-8 GB
- 70B+ models: 16-32 GB
- Adjust based on batch size and concurrent requests
## Storage Recommendations

### Model Cache Storage

Use fast, persistent storage for model caches:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
spec:
  storage:
    pvc:
      create: true
      storageClass: fast-ssd           # Use SSD-backed storage
      size: 200Gi                      # Size depends on model
      volumeAccessMode: ReadWriteMany  # For multi-node access
```

**Storage Size Guidelines:**

- 7B-8B models: 50-100 GB
- 13B models: 100-150 GB
- 70B+ models: 200-500 GB

### Read-Only Caches

For production deployments, use read-only model caches:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
      profile: auto
      readOnly: true  # Prevent accidental modifications
```

Benefits:

- Prevents cache corruption
- Enables safe cache sharing across services
- Improves security posture

### Storage Classes

Choose appropriate storage classes:

| Use Case | Recommended Storage Class | Notes |
|---|---|---|
| Model caching | SSD/NVMe (ReadWriteMany) | Fast I/O, shared access |
| Development | Standard HDD | Cost-effective |
| Multi-node | Network-attached storage | Low latency required |

```yaml
# Example StorageClass for NIM caches
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nim-cache-storage
provisioner: ebs.csi.aws.com  # Or your cloud provider's CSI driver
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
```
## High Availability Configuration

### Multiple Replicas

Deploy multiple replicas for high availability.

**Configure minimum replicas:**

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  replicas: 3  # Minimum 2 for HA, 3+ recommended
```

**Set Pod Disruption Budgets:**

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nim-service-pdb
spec:
  minAvailable: 2  # Keep at least 2 pods available
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
```

**Configure pod anti-affinity:**

Spread pods across nodes and availability zones:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: meta-llama3-8b-instruct
            topologyKey: kubernetes.io/hostname
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: meta-llama3-8b-instruct
            topologyKey: topology.kubernetes.io/zone
```
### Health Probes Tuning

Configure appropriate health checks:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  # Startup probe - for slow model loading
  startupProbe:
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 180  # 30 minutes max startup time
      successThreshold: 1
      timeoutSeconds: 5
  # Readiness probe - for traffic routing
  readinessProbe:
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 5
  # Liveness probe - for pod restarts
  livenessProbe:
    probe:
      httpGet:
        path: /v1/health/live
        port: api
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 5
```

**Probe Configuration Guidelines:**

- `startupProbe`: Set a high `failureThreshold` for large models (30+ minutes)
- `readinessProbe`: Use shorter intervals (10s) for faster traffic routing
- `livenessProbe`: Use longer intervals (30s) to avoid unnecessary restarts
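As a quick sanity check when tuning a startup probe, the worst-case startup window is roughly `initialDelaySeconds + failureThreshold × periodSeconds`. With the values from the example above:

```shell
# 60s initial delay + 180 allowed failures x 10s period = 1860s (~31 minutes)
echo "$(( 60 + 180 * 10 )) seconds"
# prints: 1860 seconds
```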
## Security Best Practices

### Secret Management

Store NGC API keys securely:

```shell
# Create secret from literal
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY=<your-key> \
  --dry-run=client -o yaml | kubectl apply -f -
```

Or use an external secret manager, for example with the External Secrets Operator:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ngc-api-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: ngc-api-secret
  data:
    - secretKey: NGC_API_KEY
      remoteRef:
        key: secret/nim/ngc
        property: api_key
```
### RBAC

Use minimal RBAC permissions:

```yaml
# Service account for NIM workloads
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nim-service-account
---
# Role with minimal permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nim-service-role
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list"]
---
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nim-service-rolebinding
subjects:
  - kind: ServiceAccount
    name: nim-service-account
roleRef:
  kind: Role
  name: nim-service-role
  apiGroup: rbac.authorization.k8s.io
```
### Network Policies

Restrict network access to NIM services:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nim-service-netpol
spec:
  podSelector:
    matchLabels:
      app: meta-llama3-8b-instruct
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: application-namespace
      ports:
        - protocol: TCP
          port: 8000
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow NGC access (external HTTPS)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
```
### Non-Root Execution

Run containers with non-root users:

```yaml
spec:
  userID: 1000   # Non-root user
  groupID: 2000
# The operator sets the corresponding security context:
# securityContext:
#   runAsNonRoot: true
#   runAsUser: 1000
#   runAsGroup: 2000
#   fsGroup: 2000
```
## Model Caching Strategy

### Pre-cache Models

Cache models before deploying services:

```yaml
# Create cache first
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama3-8b-cache
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:latest
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      size: 100Gi
      storageClass: fast-ssd
```

Wait for caching to complete:

```shell
kubectl wait --for=condition=Ready nimcache/llama3-8b-cache --timeout=60m
```
### Reuse Caches

Share caches across multiple NIMServices:

```yaml
# Service 1
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: service-1
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
      readOnly: true
---
# Service 2 - same cache
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: service-2
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
      readOnly: true
```
### Cache Multiple Profiles

Cache specific model profiles for different GPUs:

```yaml
spec:
  source:
    ngc:
      model:
        profiles:
          - "a100-80gb-tp1"
          - "h100-80gb-tp1"
```
## Autoscaling Configuration

Enable HPA based on GPU utilization:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: nvidia.com/gpu
            target:
              type: Utilization
              averageUtilization: 80
        - type: Pods
          pods:
            metric:
              name: nim_requests_per_second
            target:
              type: AverageValue
              averageValue: "100"
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 50
              periodSeconds: 60
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Percent
              value: 100
              periodSeconds: 30
            - type: Pods
              value: 2
              periodSeconds: 30
```

Use custom metrics for better autoscaling:

```yaml
metrics:
  - type: Object
    object:
      metric:
        name: requests_per_second
      describedObject:
        apiVersion: v1
        kind: Service
        name: meta-llama3-8b-instruct
      target:
        type: AverageValue
        averageValue: "1000"
```

Note: Autoscaling is not supported with multi-node deployments (LeaderWorkerSet).
## Ingress and Load Balancing

### Gateway API with Load Balancing

Use the Gateway API for advanced traffic management:

```yaml
spec:
  expose:
    service:
      type: ClusterIP
  router:
    hostDomainName: example.com
    gateway:
      name: nim-gateway
      namespace: gateway-system
      httpRoutesEnabled: true
    # Optional: Custom backend for endpoint picker
    backendRef:
      name: endpoint-picker-service
      port: 8080
```

### Ingress with Session Affinity

Enable session affinity for stateful workloads:

```yaml
spec:
  expose:
    service:
      type: ClusterIP
      annotations:
        # For NGINX ingress
        nginx.ingress.kubernetes.io/affinity: "cookie"
        nginx.ingress.kubernetes.io/session-cookie-name: "nim-affinity"
        nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
        nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
```
## Cost Optimization

### GPU Utilization

**Right-size GPU allocation.** Monitor GPU utilization and adjust:

```shell
# Check GPU utilization
kubectl exec <pod-name> -- nvidia-smi

# Use metrics to track utilization
kubectl top pod <pod-name>
```

Target 70-85% GPU utilization for cost efficiency.
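To compare against that target across several GPUs, the per-GPU numbers from `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` can be averaged with a short awk pipeline. The sketch below substitutes fixed sample values for live nvidia-smi output:

```shell
# Average per-GPU utilization; in practice, pipe the nvidia-smi query above
# into awk instead of these fixed sample values
printf '70\n80\n84\n' | awk '{ sum += $1 } END { printf "avg %.1f%%\n", sum / NR }'
# prints: avg 78.0%
```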
**Use spot/preemptible instances.** For non-critical workloads, use spot instances:

```yaml
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g4dn.xlarge
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: karpenter.sh/capacity-type
      operator: Equal
      value: spot
      effect: NoSchedule
```

**Multi-Instance GPU (MIG).** For smaller models, use MIG to share GPUs:

```yaml
spec:
  resources:
    limits:
      nvidia.com/mig-1g.10gb: 1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB-MIG-1g.10gb
```

**Schedule off-peak workloads.** Use CronJobs for batch inference during off-peak hours to leverage lower cloud costs.
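As a sketch of off-peak scheduling (the job name, image, and endpoint below are hypothetical placeholders, not part of the NIM Operator API), a nightly batch-inference client might look like:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-inference   # hypothetical name
spec:
  schedule: "0 2 * * *"           # 02:00 daily, during off-peak pricing
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: batch-client
              image: my-org/batch-inference:latest  # hypothetical image
              args: ["--endpoint", "http://meta-llama3-8b-instruct:8000/v1"]
```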
## Resource Limits and Requests

Set appropriate limits and requests:

```yaml
spec:
  resources:
    limits:
      cpu: "32"       # Hard limit
      memory: 64Gi    # Hard limit
      nvidia.com/gpu: 1
    requests:
      cpu: "16"       # Guaranteed allocation
      memory: 32Gi    # Guaranteed allocation
      nvidia.com/gpu: 1
```

Guidelines:

- Set requests = limits for GPU resources (guaranteed QoS)
- Set requests < limits for CPU/memory (burstable QoS)
- Monitor actual usage and adjust accordingly
## Monitoring and Observability

### Enable Comprehensive Monitoring

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  # Enable Prometheus metrics
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        prometheus: kube-prometheus
      interval: 30s
      scrapeTimeout: 10s
  # For NeMo services - enable OpenTelemetry
  otel:
    enabled: true
    exporterOtlpEndpoint: http://otel-collector:4317
    logLevel: INFO
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp
```
### Set Up Alerts

Create Prometheus alerts for critical conditions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-service-alerts
spec:
  groups:
    - name: nim-service
      interval: 30s
      rules:
        - alert: NIMServiceDown
          expr: nimService_status_total{status="Failed"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "NIMService {{ $labels.name }} is down"
        - alert: NIMCacheFailed
          expr: nimCache_status_total{status="Failed"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "NIMCache {{ $labels.name }} failed"
        - alert: LowGPUUtilization
          expr: nvidia_gpu_utilization < 30
          for: 30m
          labels:
            severity: info
          annotations:
            summary: "Low GPU utilization on {{ $labels.pod }}"
```
## Multi-Node Deployments

For large models requiring multiple GPUs across nodes:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama3-70b
spec:
  replicas: 2  # Number of replica groups
  multiNode:
    backendType: lws
    parallelism:
      tensor: 4    # GPUs per node
      pipeline: 2  # Nodes per replica
    mpi:
      mpiStartTimeout: 600  # 10 minutes
  resources:
    limits:
      nvidia.com/gpu: 4  # Per pod
    requests:
      nvidia.com/gpu: 4
```

**Multi-Node Limitations:**

- Autoscaling is not supported
- Requires the LeaderWorkerSet CRD
- All nodes must have identical GPU configurations
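With this layout, the GPU footprint multiplies out as tensor parallelism (GPUs per node) × pipeline parallelism (nodes per replica) × replicas. For the example above:

```shell
# tensor=4 GPUs/node, pipeline=2 nodes/replica, replicas=2
echo "GPUs per replica group: $(( 4 * 2 ))"   # prints: GPUs per replica group: 8
echo "Total GPUs: $(( 4 * 2 * 2 ))"           # prints: Total GPUs: 16
```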
## Production Checklist

Before going to production, verify:

- [ ] GPU, CPU, and memory requests sized for the model (requests = limits for GPUs)
- [ ] Model cache pre-populated and mounted read-only
- [ ] At least 2 replicas with a PodDisruptionBudget and pod anti-affinity
- [ ] Startup, readiness, and liveness probes tuned for model load time
- [ ] NGC API key stored in a Secret; minimal RBAC and NetworkPolicies applied
- [ ] Prometheus metrics enabled and alerts configured

## Next Steps

- **Monitoring**: Set up comprehensive monitoring and observability
- **Troubleshooting**: Learn how to diagnose and resolve issues