
Overview

Autoscaling automatically adjusts the number of running instances based on demand. This is crucial for ML services with variable traffic patterns.

Types of Autoscaling

Vertical Scaling

Adjust the CPU, memory, or GPU resources of individual pods.
Pros:
  • Simple to implement
  • No application changes needed
  • Good for predictable workloads
Cons:
  • Limited by node capacity
  • Requires pod restart
  • No redundancy benefits
Tools:
  • Kubernetes VPA (Vertical Pod Autoscaler)

Horizontal Scaling

Increase or decrease the number of pod replicas.
Pros:
  • Better fault tolerance
  • No pod restart needed
  • Scales beyond single node
  • Better cost efficiency
Cons:
  • Requires stateless applications
  • More complex setup
  • Slower scale-up time
Tools:
  • Kubernetes HPA
  • KNative Autoscaling
  • KEDA
This guide focuses on horizontal scaling as it’s the most common approach for ML services.

Horizontal Pod Autoscaler (HPA)

Prerequisites

Install Metrics Server

HPA requires metrics-server to collect resource metrics:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
For local clusters (kind, minikube), patch to allow insecure TLS:
kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
Verify metrics are available:
kubectl top nodes
kubectl top pods

Configure Resource Requests

HPA requires resource requests to be set on your deployment:
app-fastapi-resources.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fastapi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-fastapi
  template:
    metadata:
      labels:
        app: app-fastapi
    spec:
      containers:
        - name: app-fastapi
          image: ghcr.io/kyryl-opens-ml/app-fastapi:latest
          env:
          - name: WANDB_API_KEY
            valueFrom:
              secretKeyRef:
                name: wandb
                key: WANDB_API_KEY
          resources:
            limits:
              cpu: 500m
            requests:
              cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: app-fastapi
  labels:
    app: app-fastapi
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: app-fastapi
Apply the deployment:
kubectl apply -f app-fastapi-resources.yaml
Resource Requests: Set requests to typical usage and limits to maximum allowed. HPA uses the requests value as the baseline.
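Because HPA measures utilization against the requests value, the same absolute CPU usage yields different utilization percentages depending on what you request. A minimal sketch of the calculation (the usage numbers are illustrative):

```shell
# HPA computes CPU utilization as actual usage divided by requested CPU.
# With requests.cpu=200m, a pod using 100m runs at 50% utilization.
request_millicores=200
usage_millicores=100
utilization=$(( usage_millicores * 100 / request_millicores ))
echo "utilization: ${utilization}%"
```

Doubling the request to 400m would halve the reported utilization for the same workload, which delays scale-ups.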

Create HPA

Via CLI

Quick way to create an HPA:
kubectl autoscale deployment app-fastapi \
  --cpu-percent=50 \
  --min=1 \
  --max=10
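Under the hood, the controller uses the formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), documented in the Kubernetes HPA reference. A sketch with illustrative numbers:

```shell
# desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
current_replicas=2
current_cpu=80   # average CPU utilization across pods, in percent
target_cpu=50    # the --cpu-percent target
# integer ceiling division: (a + b - 1) / b
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "desired replicas: ${desired}"
```

Here 2 pods at 80% against a 50% target yields ceil(3.2) = 4 replicas, clamped to the min/max bounds.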

Via YAML

More control with declarative configuration:
app-fastapi-hpa.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: app-fastapi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-fastapi
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
Apply:
kubectl create -f app-fastapi-hpa.yaml

Monitor HPA

# View HPA status
kubectl get hpa

# Watch HPA in real-time
kubectl get hpa -w

# Detailed view
kubectl describe hpa app-fastapi
Example output:
NAME          REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS
app-fastapi   Deployment/app-fastapi   23%/50%   1         10        2
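The TARGETS column reads current/target utilization; a scale-up triggers once current exceeds target. A sketch that parses the sample output line above to compute remaining headroom (the values are illustrative):

```shell
# Extract the TARGETS column ("current/target") from a sample `kubectl get hpa` line.
line="app-fastapi   Deployment/app-fastapi   23%/50%   1         10        2"
targets=$(printf '%s\n' "$line" | awk '{print $3}')        # 23%/50%
current=$(printf '%s' "$targets" | cut -d/ -f1 | tr -d '%')  # 23
target=$(printf '%s' "$targets" | cut -d/ -f2 | tr -d '%')   # 50
headroom=$(( target - current ))
echo "headroom before scale-up: ${headroom}%"
```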

Test Autoscaling

Generate load to trigger scaling:
# Port forward to service
kubectl port-forward svc/app-fastapi 8080:8080

# In another terminal, run load test
locust -f locustfile.py \
  --host=http://localhost:8080 \
  --users 100 \
  --spawn-rate 10 \
  --headless \
  --run-time 5m
Watch pods scale:
watch kubectl get pods -l app=app-fastapi

Advanced HPA (v2)

The v2 API supports multiple metrics and custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-fastapi-advanced
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-fastapi
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 4
        periodSeconds: 30
      selectPolicy: Max
Key Features:
  • Multiple metrics (CPU + Memory)
  • Custom scaling behavior
  • Stabilization windows to prevent flapping
  • Control scale up/down rates
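With multiple scaleUp policies, each policy computes an allowed change per period and selectPolicy: Max takes the larger. A sketch of that combination under the config above, with an illustrative replica count:

```shell
# With 4 current replicas, per 30s period:
# the Percent policy (100%) allows adding 4 pods, and the Pods policy allows adding 4.
current=4
percent_allowed=$(( current * 100 / 100 ))   # 100% of current replicas
pods_allowed=4                               # fixed Pods policy value
# selectPolicy: Max picks the more permissive allowance
if [ "$percent_allowed" -gt "$pods_allowed" ]; then
  max_add=$percent_allowed
else
  max_add=$pods_allowed
fi
echo "can scale up to $(( current + max_add )) replicas this period"
```

With stabilizationWindowSeconds: 0 on scaleUp, this allowance applies immediately; scaleDown waits out its 300s window first.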

KNative Autoscaling

KNative provides advanced autoscaling features for KServe inference services:
  • Scale to zero: Reduce costs during idle periods
  • Concurrency-based: Scale based on concurrent requests
  • Faster response: Sub-second scaling decisions
  • RPS-based: Scale based on requests per second
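Concurrency-based scaling divides observed in-flight requests by the per-pod target. A simplified sketch of the Knative autoscaler's core arithmetic, assuming a target of 10 concurrent requests per pod (values illustrative):

```shell
# Simplified: desired pods ~= ceil(observed concurrency / per-pod target)
observed_concurrency=45
target_per_pod=10
# integer ceiling division
desired=$(( (observed_concurrency + target_per_pod - 1) / target_per_pod ))
echo "desired pods: ${desired}"
```

The real autoscaler averages concurrency over a window and switches to a panic mode on sudden bursts, but the steady-state target calculation follows this shape.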

Prerequisites

KNative is installed with KServe. Verify:
kubectl get pods -n knative-serving

KServe with Autoscaling

kserve-inferenceserver-autoscaling.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model-autoscaling
spec:
  predictor:
    scaleTarget: 1
    scaleMetric: concurrency  
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:latest
        imagePullPolicy: Always
        env:
        - name: WANDB_API_KEY
          valueFrom:
            secretKeyRef:
              name: wandb
              key: WANDB_API_KEY
Deploy:
kubectl create -f kserve-inferenceserver-autoscaling.yaml

Autoscaling Parameters

Add annotations to fine-tune autoscaling:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model-autoscaling
  annotations:
    autoscaling.knative.dev/target: "10"
    autoscaling.knative.dev/metric: "concurrency"
    autoscaling.knative.dev/min-scale: "1"
    autoscaling.knative.dev/max-scale: "10"
    autoscaling.knative.dev/scale-down-delay: "30s"
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:latest
Common Annotations:
  • autoscaling.knative.dev/target: target metric value (default: 100)
  • autoscaling.knative.dev/metric: metric type: concurrency, rps, cpu, or memory (default: concurrency)
  • autoscaling.knative.dev/min-scale: minimum replicas; 0 enables scale to zero (default: 0)
  • autoscaling.knative.dev/max-scale: maximum replicas (default: 0, unlimited)
  • autoscaling.knative.dev/scale-down-delay: delay before scaling down (default: 0s)
  • autoscaling.knative.dev/scale-to-zero-pod-retention-period: time an idle pod is kept before termination (default: 0s)
  • autoscaling.knative.dev/window: metric aggregation window (default: 60s)

Test KNative Autoscaling

Generate concurrent requests:
seq 1 1000 | xargs -n1 -P10 -I {} curl -v \
  -H "Host: custom-model-autoscaling.default.example.com" \
  -H "Content-Type: application/json" \
  "http://localhost:8080/v1/models/custom-model:predict" \
  -d @input.json
Monitor scaling:
watch kubectl get pods -l serving.kserve.io/inferenceservice=custom-model-autoscaling

Scale to Zero

For services with intermittent traffic, enable scale to zero:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model-scale-to-zero
  annotations:
    autoscaling.knative.dev/min-scale: "0"
    autoscaling.knative.dev/scale-to-zero-pod-retention-period: "5m"
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:latest
Cold Start Penalty: Scale to zero introduces latency on first request. Balance cost savings vs. user experience.

Scaling Strategies

When to Use HPA

  • Kubernetes-native deployments (not KServe)
  • CPU/Memory-based scaling sufficient
  • Steady traffic patterns
  • Simple setup requirements

When to Use KNative

  • Using KServe for inference
  • Need scale-to-zero capability
  • Bursty traffic patterns
  • Want concurrency-based scaling
  • Need faster scaling decisions

Comparison

  • Scale metric: HPA uses CPU, memory, or custom metrics; KNative uses concurrency, RPS, CPU, or memory
  • Scale to zero: HPA no; KNative yes
  • Scaling speed: HPA 30-60s; KNative sub-second
  • Setup complexity: HPA simple; KNative moderate
  • Kubernetes native: HPA yes; KNative requires installing KNative
  • Best for: HPA general workloads; KNative serverless ML inference

Best Practices

Set Conservative Limits

Start with conservative min/max replica counts and adjust based on observed traffic patterns.

Monitor Costs

Track infrastructure costs, since scaling up increases resource usage.

Use Stabilization Windows

Prevent thrashing by configuring appropriate stabilization periods.

Load Test

Test autoscaling behavior under load before relying on it in production.
Two example starting points for ML inference services:
# Conservative settings (HPA)
minReplicas: 2                        # High availability
maxReplicas: 10
targetCPUUtilizationPercentage: 70    # Leave headroom

# Aggressive cost optimization (KNative)
autoscaling.knative.dev/min-scale: "0"   # Scale to zero
autoscaling.knative.dev/max-scale: "20"
autoscaling.knative.dev/target: "10"     # 10 concurrent requests per pod

Troubleshooting

HPA shows unknown targets

Cause: metrics-server is not running, or resource requests are not set on the deployment.
Solution:
# Check metrics server
kubectl get pods -n kube-system | grep metrics-server

# Verify metrics available
kubectl top pods

# Ensure resources are set in deployment
kubectl describe deployment app-fastapi
Pods not scaling up

Cause: the target threshold has not been reached, or the max replica count is already hit.
Solution:
# Check current metrics
kubectl get hpa

# View events
kubectl describe hpa app-fastapi

# Verify load is reaching pods
kubectl logs -l app=app-fastapi
Pods scaling up and down repeatedly

Cause: metrics are fluctuating around the threshold (flapping).
Solution: add stabilization windows in HPA v2:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
  scaleUp:
    stabilizationWindowSeconds: 60
Service not scaling to zero

Cause: min-scale is not set to 0, or the retention period is too long.
Solution:
annotations:
  autoscaling.knative.dev/min-scale: "0"
  autoscaling.knative.dev/scale-to-zero-pod-retention-period: "30s"

Additional Resources

Kubernetes HPA

Official HPA documentation

HPA Walkthrough

Step-by-step HPA tutorial

KServe Autoscaling

KNative autoscaling for KServe

KEDA

Kubernetes Event-driven Autoscaling

Next Steps

Async Inference

Learn how to decouple inference from API calls using message queues
