
Overview

The NVIDIA NIM Operator supports automatic scaling of NIM deployments using Kubernetes Horizontal Pod Autoscaler (HPA). Scale your NIM instances based on metrics like GPU utilization, request rate, or custom metrics.

Prerequisites

Before configuring autoscaling:
  1. Metrics Server: Required for resource-based metrics
    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
    
  2. Prometheus: Required for custom metrics (GPU utilization, cache usage)
    • Deploy Prometheus Operator
    • Enable ServiceMonitor for your NIMService
  3. Prometheus Adapter: Required for custom metrics in HPA
    helm install prometheus-adapter prometheus-community/prometheus-adapter
    

Basic Autoscaling Configuration

Enable Autoscaling

Enable HPA with min/max replica configuration:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-8b
  namespace: nim-service
spec:
  # Remove or omit replicas when autoscaling is enabled
  # replicas: 1  # This will cause validation error
  
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
When scale.enabled is true, you cannot set spec.replicas; the HPA manages the replica count automatically.
Autoscaling is not supported for multi-node deployments: a multi-node NIMService and scale.enabled: true are mutually exclusive.

HPA Metrics

Resource Metrics

Scale based on CPU or memory utilization:
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
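The way a Utilization target translates into a replica count follows the standard HPA formula. A minimal sketch (the numbers are hypothetical, not taken from a real deployment):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """Standard HPA formula: desired = ceil(current * currentUtil / targetUtil)."""
    ratio = current_utilization / target_utilization
    return math.ceil(current_replicas * ratio)

# 4 replicas averaging 90% CPU against a 70% target -> scale out to 6
print(desired_replicas(4, 90, 70))
```

With the 70% target above, the HPA holds steady while average utilization stays near the target and only scales when the ratio moves the ceiling past the current count.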

Custom Metrics (GPU Utilization)

Scale based on GPU cache usage or utilization:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-70b
spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
  
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Object
        object:
          metric:
            name: gpu_cache_usage_perc
          describedObject:
            apiVersion: v1
            kind: Service
            name: llama-3-70b
          target:
            type: Value
            value: "0.7"
metrics.enabled (boolean, default: false)
  Enable Prometheus metrics collection via ServiceMonitor.

metrics.serviceMonitor.additionalLabels (object)
  Labels to add to the ServiceMonitor for Prometheus discovery. Must match your Prometheus Operator's serviceMonitorSelector.
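For the Object metric above to resolve, the Prometheus Adapter must expose gpu_cache_usage_perc through the custom metrics API. A minimal adapter rule might look like the following sketch (the seriesQuery label names are assumptions; adjust them to match the labels your NIM pods actually export):

```yaml
# prometheus-adapter Helm values (sketch)
rules:
  custom:
  - seriesQuery: 'gpu_cache_usage_perc{namespace!="",service!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        service: {resource: "service"}
    name:
      matches: "gpu_cache_usage_perc"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

After installing the adapter with a rule like this, the metric should appear under the custom.metrics.k8s.io API, which is what the HPA queries for Object metrics.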

Pod Metrics

Scale based on per-pod metrics:
metrics:
- type: Pods
  pods:
    metric:
      name: nim_request_rate
    target:
      type: AverageValue
      averageValue: "1000"
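For an AverageValue target on a Pods metric, the HPA divides the summed per-pod metric by the target to get the desired count. A short sketch with hypothetical request rates:

```python
import math

def desired_from_average_value(pod_values, target_average):
    """AverageValue targets: desired = ceil(sum(per-pod metric) / target)."""
    return math.ceil(sum(pod_values) / target_average)

# 3 pods handling 1500, 1200, and 900 req/s against a 1000 req/s target -> 4
print(desired_from_average_value([1500, 1200, 900], 1000))
```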

External Metrics

Scale based on external metrics (e.g., queue depth):
metrics:
- type: External
  external:
    metric:
      name: request_queue_length
      selector:
        matchLabels:
          queue_name: nim_inference
    target:
      type: Value
      value: "100"

Scaling Behavior

Configure how HPA scales up and down:
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
        selectPolicy: Max
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 25
          periodSeconds: 60
        - type: Pods
          value: 1
          periodSeconds: 60
        selectPolicy: Min
behavior.scaleUp.stabilizationWindowSeconds (integer, default: 0)
  Window to consider when scaling up. Prevents flapping by waiting for metrics to stabilize.

behavior.scaleDown.stabilizationWindowSeconds (integer, default: 300)
  Window to consider when scaling down. Prevents premature scale-down.

behavior.scaleUp.policies (array)
  List of scaling policies for scale-up. Each policy can specify a percentage or an absolute pod count.

behavior.scaleDown.policies (array)
  List of scaling policies for scale-down.
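How the two scale-up policies combine under selectPolicy can be illustrated with a small sketch of the standard HPA semantics (the replica counts are hypothetical):

```python
import math

def scale_up_limit(current: int, percent: int, pods: int, select: str) -> int:
    """Replica cap after one period, combining a Percent policy and a
    Pods policy under selectPolicy Max or Min (standard HPA semantics)."""
    by_percent = math.ceil(current * percent / 100)  # pods addable via Percent
    allowed = [current + by_percent, current + pods]
    return max(allowed) if select == "Max" else min(allowed)

# From 10 replicas with the policies above (50% or 2 pods):
print(scale_up_limit(10, 50, 2, "Max"))  # Percent allows +5, Pods allows +2
```

With selectPolicy Max, the more permissive policy wins (here the Percent policy), which is the usual choice for scale-up; selectPolicy Min, as in the scaleDown block above, picks the more conservative limit.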

Common Configurations

Conservative Scaling

Slow, steady scaling for production stability:
scale:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 8
    metrics:
    - type: Object
      object:
        metric:
          name: gpu_cache_usage_perc
        describedObject:
          apiVersion: v1
          kind: Service
          name: llama-3-70b
        target:
          type: Value
          value: "0.8"  # Scale at 80% utilization
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 180  # 3 minutes
        policies:
        - type: Pods
          value: 1
          periodSeconds: 120  # Add 1 pod every 2 minutes
      scaleDown:
        stabilizationWindowSeconds: 600  # 10 minutes
        policies:
        - type: Pods
          value: 1
          periodSeconds: 180  # Remove 1 pod every 3 minutes

Aggressive Scaling

Fast scaling for variable workloads:
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 20
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Scale at lower threshold
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 30
        policies:
        - type: Percent
          value: 100  # Double pods
          periodSeconds: 30
      scaleDown:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 50  # Halve pods
          periodSeconds: 60

Multi-Metric Scaling

Combine CPU, memory, and custom metrics:
scale:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Object
      object:
        metric:
          name: gpu_cache_usage_perc
        describedObject:
          apiVersion: v1
          kind: Service
          name: llama-3-70b
        target:
          type: Value
          value: "0.75"
When multiple metrics are specified, HPA will calculate desired replicas for each metric and use the highest value.
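The highest-wins rule can be sketched as follows: compute a desired count per metric from its current/target ratio, then take the maximum (ratios below are hypothetical):

```python
import math

def desired_across_metrics(current: int, metric_ratios) -> int:
    """With multiple metrics, HPA computes a desired count per metric
    (current * currentValue/targetValue) and uses the highest."""
    return max(math.ceil(current * r) for r in metric_ratios)

# CPU at 60/70, memory at 85/80, GPU cache at 0.6/0.75: memory drives scaling
print(desired_across_metrics(4, [60/70, 85/80, 0.6/0.75]))
```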

Best Practices

Autoscaling Recommendations
  1. Min Replicas: Set to at least 2 for high availability
  2. Max Replicas: Consider cluster capacity and cost constraints
  3. Stabilization Windows: Use longer windows (5-10 min) for scale-down to avoid flapping
  4. Metrics Choice: GPU cache usage is usually the most representative metric for NIM inference workloads, since CPU utilization alone is not GPU-aware
  5. Testing: Test scaling behavior under load before production deployment

Scaling Timeline

  • Scale Up: Be aggressive (30-60 seconds) to handle traffic spikes
  • Scale Down: Be conservative (5-10 minutes) to avoid premature termination

Metric Selection

| Metric Type  | Use Case          | Pros                   | Cons                    |
| ------------ | ----------------- | ---------------------- | ----------------------- |
| CPU/Memory   | General workloads | Simple, built-in       | Not GPU-aware           |
| GPU Cache    | NIM-specific      | Accurate for inference | Requires Prometheus     |
| Request Rate | Traffic-based     | Predictive             | Requires custom metrics |
| Queue Depth  | Batch processing  | Handles bursts         | External dependency     |

Troubleshooting

HPA Not Scaling

Check HPA status:
kubectl get hpa -n <namespace>
kubectl describe hpa <nimservice-name> -n <namespace>
Common issues:
  • Metrics server not running
  • Prometheus adapter not configured
  • ServiceMonitor label mismatch
  • Insufficient cluster resources

Metrics Not Available

Verify Prometheus is scraping metrics:
# Check ServiceMonitor
kubectl get servicemonitor -n <namespace>

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open browser to http://localhost:9090/targets

Validation Error: replicas and autoscaling

Error: spec.replicas cannot be set when spec.scale.enabled is true
Solution: Remove the replicas field when autoscaling is enabled:
spec:
  # replicas: 1  # Remove this line
  scale:
    enabled: true

Multi-Node and Autoscaling Conflict

Error: autoScaling must be nil or disabled when multiNode is set
Solution: Multi-node deployments require a fixed replica count. Either disable multi-node or disable autoscaling:
spec:
  replicas: 1  # Fixed replicas for multi-node
  multiNode:
    parallelism:
      pipeline: 2
      tensor: 8
  scale:
    enabled: false  # Must be disabled

Monitoring Autoscaling

View Scaling Events

kubectl get events -n <namespace> --field-selector involvedObject.name=<nimservice-name>

HPA Status

kubectl get hpa <nimservice-name> -n <namespace> -o yaml

Current Metrics

kubectl get hpa <nimservice-name> -n <namespace> --watch

Custom Annotations

Add custom annotations to the HPA resource:
scale:
  enabled: true
  annotations:
    custom.annotation/owner: "team-ml"
    custom.annotation/environment: "production"
  hpa:
    minReplicas: 2
    maxReplicas: 10
