## Overview

The NVIDIA NIM Operator supports automatic scaling of NIM deployments using the Kubernetes Horizontal Pod Autoscaler (HPA). You can scale NIM instances based on metrics such as GPU utilization, request rate, or custom metrics.
## Prerequisites

Before configuring autoscaling, install the following components:

- **Metrics Server**: required for resource-based metrics (CPU and memory).

  ```bash
  kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
  ```

- **Prometheus**: required for custom metrics (GPU utilization, cache usage).
  - Deploy the Prometheus Operator.
  - Enable a ServiceMonitor for your NIMService.

- **Prometheus Adapter**: required to expose custom metrics to the HPA.

  ```bash
  helm install prometheus-adapter prometheus-community/prometheus-adapter
  ```
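Before wiring metrics into an HPA, it can save debugging time to confirm the metrics pipeline is actually serving data. The commands below are a sketch assuming default namespaces and a working kubeconfig; run them against your own cluster:

```bash
# Confirm the Metrics Server APIService is registered and available
kubectl get apiservice v1beta1.metrics.k8s.io

# Resource metrics should return per-pod CPU/memory figures
kubectl top pods -n nim-service

# With prometheus-adapter installed, the custom metrics API should respond
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
```

If any of these fail, fix the metrics pipeline first; an HPA pointed at a missing metric will simply report `<unknown>` and never scale.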
## Basic Autoscaling Configuration

### Enable Autoscaling

Enable the HPA with a minimum and maximum replica count:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-8b
  namespace: nim-service
spec:
  # Remove or omit replicas when autoscaling is enabled.
  # replicas: 1  # Setting this causes a validation error.
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
```

When `scale.enabled: true`, you cannot set `spec.replicas`; the HPA manages the replica count automatically.

Autoscaling is not supported for multi-node deployments: a multi-node NIMService and `scale.enabled: true` are mutually exclusive.
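To reason about what `minReplicas`, `maxReplicas`, and `averageUtilization` will actually do, it helps to know the HPA's core rule: desired replicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds, with a default tolerance band of roughly 10% around the target inside which no scaling occurs. A minimal Python sketch of that rule (the function is illustrative, not part of the operator):

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    # Core HPA rule: desired = ceil(current * currentMetric / targetMetric)
    ratio = current_value / target_value
    # Within the default ~10% tolerance the HPA leaves the count unchanged
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds
    return max(min_replicas, min(max_replicas, desired))
```

With the configuration above (1-5 replicas, 70% CPU target), 2 replicas averaging 105% CPU yield ceil(2 × 105/70) = 3 replicas.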
## HPA Metrics

### Resource Metrics

Scale based on CPU or memory utilization.

**CPU utilization:**

```yaml
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 5
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

**Memory utilization:**

```yaml
scale:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 8
    metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

**Multiple resource metrics:**

```yaml
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
### Custom Metrics (GPU Utilization)

Scale based on GPU cache usage or utilization:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-70b
spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Object
        object:
          metric:
            name: gpu_cache_usage_perc
          describedObject:
            apiVersion: v1
            kind: Service
            name: llama-3-70b
          target:
            type: Value
            value: "0.7"
```

- `metrics.enabled`: Enables Prometheus metrics collection via a ServiceMonitor.
- `metrics.serviceMonitor.additionalLabels`: Labels added to the ServiceMonitor for Prometheus discovery. These must match your Prometheus Operator's `serviceMonitorSelector`.
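For the Object metric above to resolve, the Prometheus Adapter must expose `gpu_cache_usage_perc` through the custom metrics API. The fragment below is a sketch of an adapter rule (a `prometheus-adapter` Helm values fragment); the series labels and aggregation depend on how your Prometheus scrapes the NIM pods, so treat it as a starting point rather than a drop-in config:

```yaml
rules:
  custom:
  - seriesQuery: 'gpu_cache_usage_perc{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "gpu_cache_usage_perc"
      as: ""
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

After applying a rule like this, the metric should appear in the output of `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"`.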
### Pod Metrics

Scale based on per-pod metrics:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: nim_request_rate
    target:
      type: AverageValue
      averageValue: "1000"
```
### External Metrics

Scale based on external metrics (for example, queue depth):

```yaml
metrics:
- type: External
  external:
    metric:
      name: request_queue_length
      selector:
        matchLabels:
          queue_name: nim_inference
    target:
      type: Value
      value: "100"
```
## Scaling Behavior

Configure how the HPA scales up and down:

```yaml
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
        selectPolicy: Max
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 25
          periodSeconds: 60
        - type: Pods
          value: 1
          periodSeconds: 60
        selectPolicy: Min
```
- `behavior.scaleUp.stabilizationWindowSeconds`: Window of recent recommendations considered when scaling up. Prevents flapping by waiting for metrics to stabilize.
- `behavior.scaleDown.stabilizationWindowSeconds`: Window considered when scaling down. Prevents premature scale-down.
- `behavior.scaleUp.policies`: List of scaling policies for scale-up. Each policy can limit the change by percentage or by absolute pod count.
- `behavior.scaleDown.policies`: List of scaling policies for scale-down.
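How policies combine is easiest to see with numbers: each policy caps how many pods may change per `periodSeconds`, and `selectPolicy` picks the most permissive cap (`Max`) or the least permissive (`Min`). A small Python sketch of that selection logic (function names are illustrative, not operator API):

```python
import math

def allowed_change(current, policies, select_policy="Max"):
    # Each policy caps the per-period change; selectPolicy picks among the caps.
    caps = []
    for ptype, value in policies:
        if ptype == "Percent":
            # A Percent policy is relative to the current replica count
            caps.append(math.ceil(current * value / 100))
        elif ptype == "Pods":
            caps.append(value)
    pick = max if select_policy == "Max" else min
    return pick(caps)
```

With the `scaleUp` block above (50% or 2 pods, `selectPolicy: Max`), 4 replicas may grow by 2 per period; at 10 replicas the 50% policy dominates and allows 5. The `scaleDown` block with `selectPolicy: Min` always takes the stricter cap, so at 10 replicas only 1 pod may be removed per period.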
## Common Configurations

### Conservative Scaling

Slow, steady scaling for production stability:

```yaml
scale:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 8
    metrics:
    - type: Object
      object:
        metric:
          name: gpu_cache_usage_perc
        describedObject:
          apiVersion: v1
          kind: Service
          name: llama-3-70b
        target:
          type: Value
          value: "0.8"  # Scale at 80% cache utilization
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 180  # 3 minutes
        policies:
        - type: Pods
          value: 1
          periodSeconds: 120  # Add at most 1 pod every 2 minutes
      scaleDown:
        stabilizationWindowSeconds: 600  # 10 minutes
        policies:
        - type: Pods
          value: 1
          periodSeconds: 180  # Remove at most 1 pod every 3 minutes
```
### Aggressive Scaling

Fast scaling for variable workloads:

```yaml
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 20
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Scale at a lower threshold
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 30
        policies:
        - type: Percent
          value: 100  # Allow doubling the pod count
          periodSeconds: 30
      scaleDown:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 50  # Allow halving the pod count
          periodSeconds: 60
```
### Multi-Metric Scaling

Combine CPU, memory, and custom metrics:

```yaml
scale:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Object
      object:
        metric:
          name: gpu_cache_usage_perc
        describedObject:
          apiVersion: v1
          kind: Service
          name: llama-3-70b
        target:
          type: Value
          value: "0.75"
```

When multiple metrics are specified, the HPA calculates a desired replica count for each metric and uses the highest value.
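The highest-wins rule means any single saturated metric can drive a scale-up even when the others are idle. A short Python sketch of the selection, assuming one `(current, target)` pair per metric (illustrative names, not operator API):

```python
import math

def hpa_desired(current_replicas, metrics, min_replicas, max_replicas):
    # One desired count per metric; the HPA acts on the largest of them,
    # then clamps to the configured replica bounds.
    per_metric = [math.ceil(current_replicas * current / target)
                  for current, target in metrics]
    return max(min_replicas, min(max_replicas, max(per_metric)))
```

For the config above with 2 replicas at 50% CPU (target 70), 90% memory (target 80), and 0.8 cache usage (target 0.75), the per-metric desired counts are 2, 3, and 3, so the HPA scales to 3.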
## Best Practices

### Autoscaling Recommendations

- **Min replicas**: Set to at least 2 for high availability.
- **Max replicas**: Consider cluster capacity and cost constraints.
- **Stabilization windows**: Use longer windows (5-10 minutes) for scale-down to avoid flapping.
- **Metrics choice**: GPU cache usage is typically the most accurate signal for NIM inference workloads.
- **Testing**: Test scaling behavior under load before production deployment.

### Scaling Timeline

- **Scale up**: Be aggressive (30-60 seconds) to handle traffic spikes.
- **Scale down**: Be conservative (5-10 minutes) to avoid premature termination.

### Metric Selection

| Metric Type | Use Case | Pros | Cons |
|---|---|---|---|
| CPU/Memory | General workloads | Simple, built-in | Not GPU-aware |
| GPU Cache | NIM-specific | Accurate for inference | Requires Prometheus |
| Request Rate | Traffic-based | Predictive | Requires custom metrics |
| Queue Depth | Batch processing | Handles bursts | External dependency |
## Troubleshooting

### HPA Not Scaling

Check the HPA status:

```bash
kubectl get hpa -n <namespace>
kubectl describe hpa <nimservice-name> -n <namespace>
```

Common issues:

- Metrics Server not running
- Prometheus Adapter not configured
- ServiceMonitor label mismatch
- Insufficient cluster resources

### Metrics Not Available

Verify that Prometheus is scraping metrics:

```bash
# Check the ServiceMonitor
kubectl get servicemonitor -n <namespace>

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open a browser to http://localhost:9090/targets
```
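If Prometheus has the data but the HPA still reports `<unknown>`, the gap is usually between Prometheus and the custom metrics API. These commands (run against your own cluster; paths assume the standard `v1beta1` custom metrics API) help locate where the chain breaks:

```bash
# Verify the custom metrics API is registered and serving
kubectl get apiservice v1beta1.custom.metrics.k8s.io

# List the metrics the adapter currently exposes;
# gpu_cache_usage_perc should appear if the adapter rule matches
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
```

If the metric is missing from the adapter's list, revisit the Prometheus Adapter rule configuration rather than the NIMService spec.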
### Validation Error: replicas and autoscaling

Error: `spec.replicas cannot be set when spec.scale.enabled is true`

Solution: Remove the `replicas` field when autoscaling is enabled:

```yaml
spec:
  # replicas: 1  # Remove this line
  scale:
    enabled: true
```
### Multi-Node and Autoscaling Conflict

Error: `autoScaling must be nil or disabled when multiNode is set`

Solution: Multi-node deployments require a fixed replica count. Either disable multi-node or disable autoscaling:

```yaml
spec:
  replicas: 1  # Fixed replica count for multi-node
  multiNode:
    parallelism:
      pipeline: 2
      tensor: 8
  scale:
    enabled: false  # Must be disabled
```
## Monitoring Autoscaling

### View Scaling Events

```bash
kubectl get events -n <namespace> --field-selector involvedObject.name=<nimservice-name>
```

### HPA Status

```bash
kubectl get hpa <nimservice-name> -n <namespace> -o yaml
```

### Current Metrics

```bash
kubectl get hpa <nimservice-name> -n <namespace> --watch
```
## Custom Annotations

Add custom annotations to the generated HPA resource:

```yaml
scale:
  enabled: true
  annotations:
    custom.annotation/owner: "team-ml"
    custom.annotation/environment: "production"
  hpa:
    minReplicas: 2
    maxReplicas: 10
```