This guide provides best practices for deploying, managing, and optimizing NVIDIA NIM services in production environments.
## Resource Allocation

### GPU Resources

Properly sizing GPU resources is critical for performance and cost optimization. Different models require different amounts of GPU memory:

| Model Size | Minimum GPU Memory | Recommended GPU |
|---|---|---|
| 7B-8B parameters | 16 GB | A10, L4, L40 |
| 13B parameters | 24 GB | L40, A100 40GB |
| 70B+ parameters | 80 GB | A100 80GB, H100 |
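As a rough rule of thumb (a back-of-envelope assumption, not an official sizing formula), fp16 weights alone need about 2 bytes per parameter, so roughly 2 GB per billion parameters, before KV cache and activation overhead — which is why an 8B model needs at least 16 GB:

```shell
# fp16 weights: ~2 bytes per parameter => ~2 GB per billion parameters
# (rule of thumb only; KV cache and activations add more on top)
echo "8B model: $(( 8 * 2 )) GB of weights at fp16"
# prints: 8B model: 16 GB of weights at fp16
```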
Example configuration:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
  # For specific GPU selection
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
```
### CPU and Memory Allocation

**CPU Requirements:**

- Minimum: 8 cores per NIM instance
- Recommended: 16-32 cores for production workloads

**Memory Requirements:**

- Minimum: 16 GB RAM
- Recommended: 32-64 GB RAM (depends on model size and batch size)

```yaml
spec:
  resources:
    limits:
      cpu: "32"
      memory: 64Gi
      nvidia.com/gpu: 1
    requests:
      cpu: "16"
      memory: 32Gi
      nvidia.com/gpu: 1
```
### Shared Memory Configuration

NIMs use shared memory for fast model I/O. Set appropriate limits:

```yaml
spec:
  storage:
    sharedMemorySizeLimit: 8Gi  # Increase for larger models
```

Guidelines:

- 7B-13B models: 4-8 GB
- 70B+ models: 16-32 GB
- Adjust based on batch size and concurrent requests
## Storage Recommendations

### Model Cache Storage

Use fast, persistent storage for model caches:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
spec:
  storage:
    pvc:
      create: true
      storageClass: fast-ssd           # Use SSD-backed storage
      size: 200Gi                      # Size depends on model
      volumeAccessMode: ReadWriteMany  # For multi-node access
```

**Storage Size Guidelines:**

- 7B-8B models: 50-100 GB
- 13B models: 100-150 GB
- 70B+ models: 200-500 GB

### Read-Only Caches

For production deployments, use read-only model caches:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
      profile: auto
      readOnly: true  # Prevent accidental modifications
```

Benefits:

- Prevents cache corruption
- Enables safe cache sharing across services
- Improves security posture

### Storage Classes

Choose appropriate storage classes:

| Use Case | Recommended Storage Class | Notes |
|---|---|---|
| Model caching | SSD/NVMe (ReadWriteMany) | Fast I/O, shared access |
| Development | Standard HDD | Cost-effective |
| Multi-node | Network-attached storage | Low latency required |

```yaml
# Example StorageClass for NIM caches
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nim-cache-storage
provisioner: ebs.csi.aws.com  # Or your cloud provider's CSI driver
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
```
## High Availability Configuration

### Multiple Replicas

Deploy multiple replicas for high availability.

**Configure minimum replicas:**

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  replicas: 3  # Minimum 2 for HA, 3+ recommended
```

**Set Pod Disruption Budgets:**

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nim-service-pdb
spec:
  minAvailable: 2  # Keep at least 2 pods available
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
```

**Configure pod anti-affinity:**

Spread pods across nodes and availability zones:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: meta-llama3-8b-instruct
            topologyKey: kubernetes.io/hostname
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: meta-llama3-8b-instruct
            topologyKey: topology.kubernetes.io/zone
```
### Health Probes Tuning

Configure appropriate health checks:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  # Startup probe - for slow model loading
  startupProbe:
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 180  # 30 minutes max startup time
      successThreshold: 1
      timeoutSeconds: 5
  # Readiness probe - for traffic routing
  readinessProbe:
    probe:
      httpGet:
        path: /v1/health/ready
        port: api
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 5
  # Liveness probe - for pod restarts
  livenessProbe:
    probe:
      httpGet:
        path: /v1/health/live
        port: api
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 5
```

**Probe Configuration Guidelines:**

- `startupProbe`: Set a high `failureThreshold` for large models (30+ minutes)
- `readinessProbe`: Use shorter intervals (10s) for faster traffic routing
- `livenessProbe`: Use longer intervals (30s) to avoid unnecessary restarts
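As a quick sanity check when tuning a startup probe, the worst-case startup window is roughly `initialDelaySeconds + failureThreshold × periodSeconds`. With the values from the example above:

```shell
# 60s initial delay + 180 allowed failures x 10s period = 1860s (~31 minutes)
echo "$(( 60 + 180 * 10 )) seconds"
# prints: 1860 seconds
```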
## Security Best Practices

### Secret Management

Store NGC API keys securely:

```shell
# Create secret from literal
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY=<your-key> \
  --dry-run=client -o yaml | kubectl apply -f -
```

Or use an external secret manager, for example with the External Secrets Operator:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ngc-api-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: ngc-api-secret
  data:
    - secretKey: NGC_API_KEY
      remoteRef:
        key: secret/nim/ngc
        property: api_key
```
### RBAC

Use minimal RBAC permissions:

```yaml
# Service account for NIM workloads
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nim-service-account
---
# Role with minimal permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nim-service-role
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list"]
---
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nim-service-rolebinding
subjects:
  - kind: ServiceAccount
    name: nim-service-account
roleRef:
  kind: Role
  name: nim-service-role
  apiGroup: rbac.authorization.k8s.io
```
### Network Policies

Restrict network access to NIM services:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nim-service-netpol
spec:
  podSelector:
    matchLabels:
      app: meta-llama3-8b-instruct
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: application-namespace
      ports:
        - protocol: TCP
          port: 8000
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow NGC access (external HTTPS)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
```
### Non-Root Execution

Run containers with non-root users:

```yaml
spec:
  userID: 1000   # Non-root user
  groupID: 2000
# The operator sets the corresponding security context:
# securityContext:
#   runAsNonRoot: true
#   runAsUser: 1000
#   runAsGroup: 2000
#   fsGroup: 2000
```
## Model Caching Strategy

### Pre-cache Models

Cache models before deploying services:

```yaml
# Create cache first
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama3-8b-cache
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:latest
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      size: 100Gi
      storageClass: fast-ssd
```

Wait for caching to complete:

```shell
kubectl wait --for=condition=Ready nimcache/llama3-8b-cache --timeout=60m
```
### Reuse Caches

Share caches across multiple NIMServices:

```yaml
# Service 1
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: service-1
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
      readOnly: true
---
# Service 2 - same cache
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: service-2
spec:
  storage:
    nimCache:
      name: llama3-8b-cache
      readOnly: true
```
### Cache Multiple Profiles

Cache specific model profiles for different GPUs:

```yaml
spec:
  source:
    ngc:
      model:
        profiles:
          - "a100-80gb-tp1"
          - "h100-80gb-tp1"
```
## Autoscaling Configuration

Enable HPA based on GPU utilization:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: nvidia.com/gpu
            target:
              type: Utilization
              averageUtilization: 80
        - type: Pods
          pods:
            metric:
              name: nim_requests_per_second
            target:
              type: AverageValue
              averageValue: "100"
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 50
              periodSeconds: 60
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Percent
              value: 100
              periodSeconds: 30
            - type: Pods
              value: 2
              periodSeconds: 30
```

Use custom metrics for better autoscaling:

```yaml
metrics:
  - type: Object
    object:
      metric:
        name: requests_per_second
      describedObject:
        apiVersion: v1
        kind: Service
        name: meta-llama3-8b-instruct
      target:
        type: AverageValue
        averageValue: "1000"
```

Note: Autoscaling is not supported with multi-node deployments (LeaderWorkerSet).
## Ingress and Load Balancing

### Gateway API with Load Balancing

Use the Gateway API for advanced traffic management:

```yaml
spec:
  expose:
    service:
      type: ClusterIP
  router:
    hostDomainName: example.com
    gateway:
      name: nim-gateway
      namespace: gateway-system
      httpRoutesEnabled: true
    # Optional: Custom backend for endpoint picker
    backendRef:
      name: endpoint-picker-service
      port: 8080
```

### Ingress with Session Affinity

Enable session affinity for stateful workloads:

```yaml
spec:
  expose:
    service:
      type: ClusterIP
      annotations:
        # For NGINX ingress
        nginx.ingress.kubernetes.io/affinity: "cookie"
        nginx.ingress.kubernetes.io/session-cookie-name: "nim-affinity"
        nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
        nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
```
## Cost Optimization

### GPU Utilization

**Right-size GPU allocation.** Monitor GPU utilization and adjust:

```shell
# Check GPU utilization
kubectl exec <pod-name> -- nvidia-smi

# Use metrics to track utilization
kubectl top pod <pod-name>
```

Target 70-85% GPU utilization for cost efficiency.
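To compare against that target across several GPUs, the per-GPU numbers from `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` can be averaged with a short awk pipeline. The sketch below substitutes fixed sample values for live nvidia-smi output:

```shell
# Average per-GPU utilization; in practice, pipe the nvidia-smi query above
# into awk instead of these fixed sample values
printf '70\n80\n84\n' | awk '{ sum += $1 } END { printf "avg %.1f%%\n", sum / NR }'
# prints: avg 78.0%
```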
**Use spot/preemptible instances.** For non-critical workloads, use spot instances:

```yaml
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g4dn.xlarge
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: karpenter.sh/capacity-type
      operator: Equal
      value: spot
      effect: NoSchedule
```

**Multi-Instance GPU (MIG).** For smaller models, use MIG to share GPUs:

```yaml
spec:
  resources:
    limits:
      nvidia.com/mig-1g.10gb: 1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB-MIG-1g.10gb
```

**Schedule off-peak workloads.** Use CronJobs for batch inference during off-peak hours to leverage lower cloud costs.
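As a sketch of off-peak scheduling (the job name, image, and endpoint below are hypothetical placeholders, not part of the NIM Operator API), a nightly batch-inference client might look like:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-inference   # hypothetical name
spec:
  schedule: "0 2 * * *"           # 02:00 daily, during off-peak pricing
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: batch-client
              image: my-org/batch-inference:latest  # hypothetical image
              args: ["--endpoint", "http://meta-llama3-8b-instruct:8000/v1"]
```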
## Resource Limits and Requests

Set appropriate limits and requests:

```yaml
spec:
  resources:
    limits:
      cpu: "32"       # Hard limit
      memory: 64Gi    # Hard limit
      nvidia.com/gpu: 1
    requests:
      cpu: "16"       # Guaranteed allocation
      memory: 32Gi    # Guaranteed allocation
      nvidia.com/gpu: 1
```

Guidelines:

- Set requests = limits for GPU resources (guaranteed QoS)
- Set requests < limits for CPU/memory (burstable QoS)
- Monitor actual usage and adjust accordingly
## Monitoring and Observability

### Enable Comprehensive Monitoring

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
spec:
  # Enable Prometheus metrics
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        prometheus: kube-prometheus
      interval: 30s
      scrapeTimeout: 10s
  # For NeMo services - enable OpenTelemetry
  otel:
    enabled: true
    exporterOtlpEndpoint: http://otel-collector:4317
    logLevel: INFO
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp
```
### Set Up Alerts

Create Prometheus alerts for critical conditions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-service-alerts
spec:
  groups:
    - name: nim-service
      interval: 30s
      rules:
        - alert: NIMServiceDown
          expr: nimService_status_total{status="Failed"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "NIMService {{ $labels.name }} is down"
        - alert: NIMCacheFailed
          expr: nimCache_status_total{status="Failed"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "NIMCache {{ $labels.name }} failed"
        - alert: LowGPUUtilization
          expr: nvidia_gpu_utilization < 30
          for: 30m
          labels:
            severity: info
          annotations:
            summary: "Low GPU utilization on {{ $labels.pod }}"
```
## Multi-Node Deployments

For large models requiring multiple GPUs across nodes:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama3-70b
spec:
  replicas: 2  # Number of replica groups
  multiNode:
    backendType: lws
    parallelism:
      tensor: 4    # GPUs per node
      pipeline: 2  # Nodes per replica
    mpi:
      mpiStartTimeout: 600  # 10 minutes
  resources:
    limits:
      nvidia.com/gpu: 4  # Per pod
    requests:
      nvidia.com/gpu: 4
```

**Multi-Node Limitations:**

- Autoscaling is not supported
- Requires the LeaderWorkerSet CRD
- All nodes must have identical GPU configurations
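With this layout, the GPU footprint multiplies out as tensor parallelism (GPUs per node) × pipeline parallelism (nodes per replica) × replicas. For the example above:

```shell
# tensor=4 GPUs/node, pipeline=2 nodes/replica, replicas=2
echo "GPUs per replica group: $(( 4 * 2 ))"   # prints: GPUs per replica group: 8
echo "Total GPUs: $(( 4 * 2 * 2 ))"           # prints: Total GPUs: 16
```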
## Production Checklist

Before going to production, verify:

- [ ] GPU, CPU, and memory requests sized for the model (requests = limits for GPUs)
- [ ] Model cache pre-populated and mounted read-only
- [ ] At least 2 replicas with a PodDisruptionBudget and pod anti-affinity
- [ ] Startup, readiness, and liveness probes tuned for model load time
- [ ] NGC API key stored in a Secret; minimal RBAC and NetworkPolicies applied
- [ ] Prometheus metrics enabled and alerts configured

## Next Steps

- **Monitoring**: Set up comprehensive monitoring and observability
- **Troubleshooting**: Learn how to diagnose and resolve issues