
Overview

The NVIDIA NIM Operator supports automatic scaling of NIM deployments using Kubernetes Horizontal Pod Autoscaler (HPA). Scale your NIM instances based on metrics like GPU utilization, request rate, or custom metrics.

Prerequisites

Before configuring autoscaling:
  1. Metrics Server: Required for resource-based metrics
    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
    
  2. Prometheus: Required for custom metrics (GPU utilization, cache usage)
    • Deploy Prometheus Operator
    • Enable ServiceMonitor for your NIMService
  3. Prometheus Adapter: Required for custom metrics in HPA
    helm install prometheus-adapter prometheus-community/prometheus-adapter
    

Basic Autoscaling Configuration

Enable Autoscaling

Enable HPA with min/max replica configuration:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-8b
  namespace: nim-service
spec:
  # Remove or omit replicas when autoscaling is enabled
  # replicas: 1  # This will cause validation error
  
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
When scale.enabled is true, you cannot set spec.replicas; the HPA manages the replica count automatically.
Autoscaling is not supported for multi-node deployments: a multi-node NIMService and scale.enabled: true are mutually exclusive.

HPA Metrics

Resource Metrics

Scale based on CPU or memory utilization:
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
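The way a Utilization target translates into a replica count follows the standard HPA formula. A minimal sketch (the numbers are hypothetical, not taken from a real deployment):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """Standard HPA formula: desired = ceil(current * currentUtil / targetUtil)."""
    ratio = current_utilization / target_utilization
    return math.ceil(current_replicas * ratio)

# 4 replicas averaging 90% CPU against a 70% target -> scale out to 6
print(desired_replicas(4, 90, 70))
```

With the 70% target above, the HPA holds steady while average utilization stays near the target and only scales when the ratio moves the ceiling past the current count.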

Custom Metrics (GPU Utilization)

Scale based on GPU cache usage or utilization:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3-70b
spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
  
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Object
        object:
          metric:
            name: gpu_cache_usage_perc
          describedObject:
            apiVersion: v1
            kind: Service
            name: llama-3-70b
          target:
            type: Value
            value: "0.7"
metrics.enabled (boolean, default: false)
  Enable Prometheus metrics collection via ServiceMonitor.

metrics.serviceMonitor.additionalLabels (object)
  Labels to add to the ServiceMonitor for Prometheus discovery. Must match your Prometheus Operator's serviceMonitorSelector.
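For the Object metric above to resolve, the Prometheus Adapter must expose gpu_cache_usage_perc through the custom metrics API. A minimal adapter rule might look like the following sketch (the seriesQuery label names are assumptions; adjust them to match the labels your NIM pods actually export):

```yaml
# prometheus-adapter Helm values (sketch)
rules:
  custom:
  - seriesQuery: 'gpu_cache_usage_perc{namespace!="",service!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        service: {resource: "service"}
    name:
      matches: "gpu_cache_usage_perc"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

After installing the adapter with a rule like this, the metric should appear under the custom.metrics.k8s.io API, which is what the HPA queries for Object metrics.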

Pod Metrics

Scale based on per-pod metrics:
metrics:
- type: Pods
  pods:
    metric:
      name: nim_request_rate
    target:
      type: AverageValue
      averageValue: "1000"
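For an AverageValue target on a Pods metric, the HPA divides the summed per-pod metric by the target to get the desired count. A short sketch with hypothetical request rates:

```python
import math

def desired_from_average_value(pod_values, target_average):
    """AverageValue targets: desired = ceil(sum(per-pod metric) / target)."""
    return math.ceil(sum(pod_values) / target_average)

# 3 pods handling 1500, 1200, and 900 req/s against a 1000 req/s target -> 4
print(desired_from_average_value([1500, 1200, 900], 1000))
```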

External Metrics

Scale based on external metrics (e.g., queue depth):
metrics:
- type: External
  external:
    metric:
      name: request_queue_length
      selector:
        matchLabels:
          queue_name: nim_inference
    target:
      type: Value
      value: "100"

Scaling Behavior

Configure how HPA scales up and down:
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
        selectPolicy: Max
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 25
          periodSeconds: 60
        - type: Pods
          value: 1
          periodSeconds: 60
        selectPolicy: Min
behavior.scaleUp.stabilizationWindowSeconds (integer, default: 0)
  Window to consider when scaling up. Prevents flapping by waiting for metrics to stabilize.

behavior.scaleDown.stabilizationWindowSeconds (integer, default: 300)
  Window to consider when scaling down. Prevents premature scale-down.

behavior.scaleUp.policies (array)
  List of scaling policies for scale-up. Each policy can specify a percentage or an absolute pod count.

behavior.scaleDown.policies (array)
  List of scaling policies for scale-down.
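How the two scale-up policies combine under selectPolicy can be illustrated with a small sketch of the standard HPA semantics (the replica counts are hypothetical):

```python
import math

def scale_up_limit(current: int, percent: int, pods: int, select: str) -> int:
    """Replica cap after one period, combining a Percent policy and a
    Pods policy under selectPolicy Max or Min (standard HPA semantics)."""
    by_percent = math.ceil(current * percent / 100)  # pods addable via Percent
    allowed = [current + by_percent, current + pods]
    return max(allowed) if select == "Max" else min(allowed)

# From 10 replicas with the policies above (50% or 2 pods):
print(scale_up_limit(10, 50, 2, "Max"))  # Percent allows +5, Pods allows +2
```

With selectPolicy Max, the more permissive policy wins (here the Percent policy), which is the usual choice for scale-up; selectPolicy Min, as in the scaleDown block above, picks the more conservative limit.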

Common Configurations

Conservative Scaling

Slow, steady scaling for production stability:
scale:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 8
    metrics:
    - type: Object
      object:
        metric:
          name: gpu_cache_usage_perc
        describedObject:
          apiVersion: v1
          kind: Service
          name: llama-3-70b
        target:
          type: Value
          value: "0.8"  # Scale at 80% utilization
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 180  # 3 minutes
        policies:
        - type: Pods
          value: 1
          periodSeconds: 120  # Add 1 pod every 2 minutes
      scaleDown:
        stabilizationWindowSeconds: 600  # 10 minutes
        policies:
        - type: Pods
          value: 1
          periodSeconds: 180  # Remove 1 pod every 3 minutes

Aggressive Scaling

Fast scaling for variable workloads:
scale:
  enabled: true
  hpa:
    minReplicas: 1
    maxReplicas: 20
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Scale at lower threshold
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 30
        policies:
        - type: Percent
          value: 100  # Double pods
          periodSeconds: 30
      scaleDown:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 50  # Halve pods
          periodSeconds: 60

Multi-Metric Scaling

Combine CPU, memory, and custom metrics:
scale:
  enabled: true
  hpa:
    minReplicas: 2
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Object
      object:
        metric:
          name: gpu_cache_usage_perc
        describedObject:
          apiVersion: v1
          kind: Service
          name: llama-3-70b
        target:
          type: Value
          value: "0.75"
When multiple metrics are specified, HPA will calculate desired replicas for each metric and use the highest value.
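The highest-wins rule can be sketched as follows: compute a desired count per metric from its current/target ratio, then take the maximum (ratios below are hypothetical):

```python
import math

def desired_across_metrics(current: int, metric_ratios) -> int:
    """With multiple metrics, HPA computes a desired count per metric
    (current * currentValue/targetValue) and uses the highest."""
    return max(math.ceil(current * r) for r in metric_ratios)

# CPU at 60/70, memory at 85/80, GPU cache at 0.6/0.75: memory drives scaling
print(desired_across_metrics(4, [60/70, 85/80, 0.6/0.75]))
```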

Best Practices

Autoscaling Recommendations
  1. Min Replicas: Set to at least 2 for high availability
  2. Max Replicas: Consider cluster capacity and cost constraints
  3. Stabilization Windows: Use longer windows (5-10 min) for scale-down to avoid flapping
  4. Metrics Choice: GPU cache usage is usually the most representative metric for NIM inference workloads, since CPU utilization alone is not GPU-aware
  5. Testing: Test scaling behavior under load before production deployment

Scaling Timeline

  • Scale Up: Be aggressive (30-60 seconds) to handle traffic spikes
  • Scale Down: Be conservative (5-10 minutes) to avoid premature termination

Metric Selection

| Metric Type  | Use Case          | Pros                   | Cons                    |
| ------------ | ----------------- | ---------------------- | ----------------------- |
| CPU/Memory   | General workloads | Simple, built-in       | Not GPU-aware           |
| GPU Cache    | NIM-specific      | Accurate for inference | Requires Prometheus     |
| Request Rate | Traffic-based     | Predictive             | Requires custom metrics |
| Queue Depth  | Batch processing  | Handles bursts         | External dependency     |

Troubleshooting

HPA Not Scaling

Check HPA status:
kubectl get hpa -n <namespace>
kubectl describe hpa <nimservice-name> -n <namespace>
Common issues:
  • Metrics server not running
  • Prometheus adapter not configured
  • ServiceMonitor label mismatch
  • Insufficient cluster resources

Metrics Not Available

Verify Prometheus is scraping metrics:
# Check ServiceMonitor
kubectl get servicemonitor -n <namespace>

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open browser to http://localhost:9090/targets

Validation Error: replicas and autoscaling

Error: spec.replicas cannot be set when spec.scale.enabled is true
Solution: Remove the replicas field when autoscaling is enabled:
spec:
  # replicas: 1  # Remove this line
  scale:
    enabled: true

Multi-Node and Autoscaling Conflict

Error: autoScaling must be nil or disabled when multiNode is set
Solution: Multi-node deployments require a fixed replica count. Either disable multi-node or disable autoscaling:
spec:
  replicas: 1  # Fixed replicas for multi-node
  multiNode:
    parallelism:
      pipeline: 2
      tensor: 8
  scale:
    enabled: false  # Must be disabled

Monitoring Autoscaling

View Scaling Events

kubectl get events -n <namespace> --field-selector involvedObject.name=<nimservice-name>

HPA Status

kubectl get hpa <nimservice-name> -n <namespace> -o yaml

Current Metrics

kubectl get hpa <nimservice-name> -n <namespace> --watch

Custom Annotations

Add custom annotations to the HPA resource:
scale:
  enabled: true
  annotations:
    custom.annotation/owner: "team-ml"
    custom.annotation/environment: "production"
  hpa:
    minReplicas: 2
    maxReplicas: 10
