Scaling and optimization

Overview

This module covers essential optimization techniques for production ML systems:
  • Load Testing: Benchmark your models with Locust, k6, and Vegeta
  • Autoscaling: Implement horizontal pod autoscaling with HPA and Knative
  • Async Inference: Decouple inference from API calls using message queues
  • Model Optimization: Reduce latency and costs with quantization
Premature optimization is the root of all evil! Always measure before optimizing.

Learning Objectives

By the end of this module, you will be able to:
  1. Conduct comprehensive load testing on ML APIs
  2. Configure horizontal pod autoscaling for ML workloads
  3. Implement async inference patterns with queues
  4. Apply quantization techniques to optimize model performance
  5. Make data-driven decisions about infrastructure scaling

Prerequisites

  • Completed Module 5 (Model Serving)
  • Running Kubernetes cluster (kind or cloud provider)
  • Deployed ML model API
  • Basic understanding of HTTP load testing

Module Structure

Load Testing

Benchmark your APIs with Locust, k6, and Vegeta
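Locust, k6, and Vegeta are the right tools for serious benchmarks, but the loop they automate is simple: fire concurrent requests, record per-request latency, and report throughput. A minimal sketch in Python, with a stub callable standing in for a real HTTP request (you would swap in something like `requests.post(...)` against your deployed endpoint; the function and parameter names here are illustrative, not from any of those tools):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(request_fn, total_requests=200, concurrency=20):
    """Fire total_requests calls at request_fn with `concurrency` workers,
    then report throughput and latency statistics."""
    latencies = []

    def timed_call(_):
        start = time.perf_counter()
        request_fn()  # e.g. an HTTP POST to your /predict endpoint
        latencies.append(time.perf_counter() - start)  # list.append is thread-safe

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - start

    ranked = sorted(latencies)
    return {
        "rps": total_requests / elapsed,
        "mean_s": statistics.mean(latencies),
        "p95_s": ranked[int(0.95 * len(ranked)) - 1],
    }

# Stand-in for a real HTTP call so the sketch runs anywhere
stats = run_load_test(lambda: time.sleep(0.005))
print(stats)
```

The dedicated tools add what this sketch lacks: open-loop arrival rates, ramp-up schedules, distributed workers, and HTML reports.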

Autoscaling

Configure HPA and Knative autoscaling
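The HPA control loop scales on a single ratio. Per the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). A two-line Python rendering makes the behavior easy to predict:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """HPA scaling rule: ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods
print(desired_replicas(4, 90, 60))  # 6

# 3 pods averaging 30% CPU against a 60% target -> scale in to 2 pods
print(desired_replicas(3, 30, 60))  # 2
```

Note the ceiling: HPA rounds up, so it scales out aggressively and scales in conservatively (the real controller also applies tolerance and stabilization windows on top of this formula).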

Async Inference

Implement queue-based inference patterns
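The core pattern is that the API handler enqueues a job and returns an id immediately; a separate worker drains the queue and stores predictions for clients to poll. A minimal in-process sketch using the stdlib `queue.Queue` as a stand-in for a real broker (RabbitMQ, SQS, etc.), with a stub in place of a real model:

```python
import queue
import threading
import uuid

def fake_model(payload):
    """Stand-in for a slow model forward pass."""
    return {"label": "positive", "input": payload}

jobs = queue.Queue()   # filled by the API layer
results = {}           # job_id -> prediction, polled by clients

def worker():
    while True:
        job_id, payload = jobs.get()
        if job_id is None:   # poison pill shuts the worker down
            break
        results[job_id] = fake_model(payload)
        jobs.task_done()

def submit(payload):
    """API handler: enqueue and return a job id instead of blocking on inference."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

threading.Thread(target=worker, daemon=True).start()
job_id = submit("some text")
jobs.join()   # in production the client polls a results endpoint instead
print(results[job_id]["label"])
```

The trade-off in the table below the Key Concepts section applies here: per-request latency gets worse (queue wait plus a second round trip), but throughput improves because workers batch and pace inference independently of request arrival.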

Quantization

Optimize models with quantization techniques
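Serving stacks such as TGI and vLLM apply quantization per-tensor or per-channel on the GPU, but the arithmetic is easy to see in pure Python. A sketch of symmetric int8 post-training quantization, where every float weight is mapped to an integer in [-127, 127] via a single scale factor:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: one scale maps floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step (scale / 2)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err < scale)
```

This is where the "slight accuracy drop" in the trade-off table comes from: each weight moves by up to half a step, and those small perturbations accumulate through the network. In exchange, int8 weights take 4x less memory than float32 and use faster integer kernels.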

Setup

Create Kubernetes Cluster

kind create cluster --name ml-in-production

Monitor with k9s

k9s -A

Configure Secrets

export WANDB_API_KEY='your-key-here'
kubectl create secret generic wandb --from-literal=WANDB_API_KEY=$WANDB_API_KEY

Deploy Services

Deploy the APIs from Module 5:
kubectl create -f ./k8s/app-fastapi.yaml
kubectl create -f ./k8s/app-triton.yaml
kubectl create -f ./k8s/app-streamlit.yaml
kubectl create -f ./k8s/kserve-inferenceserver.yaml

Key Concepts

Performance Metrics

When optimizing ML systems, track these key metrics:
  • Latency: Response time (p50, p95, p99)
  • Throughput: Requests per second (RPS)
  • Resource Utilization: CPU, memory, GPU usage
  • Cost: Infrastructure spend per 1000 requests
  • Accuracy: Model performance after optimization
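Percentile latencies matter more than the mean because a handful of slow requests can hide in an average while still breaking your SLO. A sketch of the nearest-rank method for computing p50/p95/p99 from raw samples (values here are made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p percent of the data."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 11, 220, 14, 13, 16, 18, 13, 500]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

In this sample the mean (83 ms) sits far above p50 (14 ms) and far below p99 (500 ms), which is why load-testing tools report the full percentile spread rather than a single number.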

Optimization Trade-offs

Every optimization involves trade-offs:
Technique        Latency     Throughput  Cost         Accuracy
---------------  ----------  ----------  -----------  --------------
Quantization     ✅ Better   ✅ Better   ✅ Lower     ⚠️ Slight drop
Autoscaling      ➡️ Same     ✅ Better   ⚠️ Variable  ➡️ Same
Async Inference  ⚠️ Worse    ✅ Better   ✅ Lower     ➡️ Same
Batching         ⚠️ Worse    ✅ Better   ✅ Lower     ➡️ Same

Practice Assignments

See the practice exercises for hands-on tasks.

Additional Resources

Horizontal Pod Autoscaling

Official Kubernetes HPA documentation

KServe Autoscaling

Knative autoscaling for model serving

TGI Quantization

Hugging Face Text Generation Inference quantization guide

vLLM Hardware Support

Supported quantization techniques by hardware
