Scaling and optimization

Overview

This module covers essential optimization techniques for production ML systems:
  • Load Testing: Benchmark your models with Locust, k6, and Vegeta
  • Autoscaling: Implement horizontal pod autoscaling with HPA and Knative
  • Async Inference: Decouple inference from API calls using message queues
  • Model Optimization: Reduce latency and costs with quantization
Premature optimization is the root of all evil! Always measure before optimizing.

Learning Objectives

By the end of this module, you will be able to:
  1. Conduct comprehensive load testing on ML APIs
  2. Configure horizontal pod autoscaling for ML workloads
  3. Implement async inference patterns with queues
  4. Apply quantization techniques to optimize model performance
  5. Make data-driven decisions about infrastructure scaling

Prerequisites

  • Completed Module 5 (Model Serving)
  • Running Kubernetes cluster (kind or cloud provider)
  • Deployed ML model API
  • Basic understanding of HTTP load testing

Module Structure

Load Testing

Benchmark your APIs with Locust, k6, and Vegeta
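Locust, k6, and Vegeta are the right tools for serious benchmarks, but the loop they automate is simple: fire concurrent requests, record per-request latency, and report throughput. A minimal sketch in Python, with a stub callable standing in for a real HTTP request (you would swap in something like `requests.post(...)` against your deployed endpoint; the function and parameter names here are illustrative, not from any of those tools):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(request_fn, total_requests=200, concurrency=20):
    """Fire total_requests calls at request_fn with `concurrency` workers,
    then report throughput and latency statistics."""
    latencies = []

    def timed_call(_):
        start = time.perf_counter()
        request_fn()  # e.g. an HTTP POST to your /predict endpoint
        latencies.append(time.perf_counter() - start)  # list.append is thread-safe

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - start

    ranked = sorted(latencies)
    return {
        "rps": total_requests / elapsed,
        "mean_s": statistics.mean(latencies),
        "p95_s": ranked[int(0.95 * len(ranked)) - 1],
    }

# Stand-in for a real HTTP call so the sketch runs anywhere
stats = run_load_test(lambda: time.sleep(0.005))
print(stats)
```

The dedicated tools add what this sketch lacks: open-loop arrival rates, ramp-up schedules, distributed workers, and HTML reports.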

Autoscaling

Configure HPA and Knative autoscaling
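The HPA control loop scales on a single ratio. Per the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). A two-line Python rendering makes the behavior easy to predict:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """HPA scaling rule: ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods
print(desired_replicas(4, 90, 60))  # 6

# 3 pods averaging 30% CPU against a 60% target -> scale in to 2 pods
print(desired_replicas(3, 30, 60))  # 2
```

Note the ceiling: HPA rounds up, so it scales out aggressively and scales in conservatively (the real controller also applies tolerance and stabilization windows on top of this formula).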

Async Inference

Implement queue-based inference patterns
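The core pattern is that the API handler enqueues a job and returns an id immediately; a separate worker drains the queue and stores predictions for clients to poll. A minimal in-process sketch using the stdlib `queue.Queue` as a stand-in for a real broker (RabbitMQ, SQS, etc.), with a stub in place of a real model:

```python
import queue
import threading
import uuid

def fake_model(payload):
    """Stand-in for a slow model forward pass."""
    return {"label": "positive", "input": payload}

jobs = queue.Queue()   # filled by the API layer
results = {}           # job_id -> prediction, polled by clients

def worker():
    while True:
        job_id, payload = jobs.get()
        if job_id is None:   # poison pill shuts the worker down
            break
        results[job_id] = fake_model(payload)
        jobs.task_done()

def submit(payload):
    """API handler: enqueue and return a job id instead of blocking on inference."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

threading.Thread(target=worker, daemon=True).start()
job_id = submit("some text")
jobs.join()   # in production the client polls a results endpoint instead
print(results[job_id]["label"])
```

The trade-off in the table below the Key Concepts section applies here: per-request latency gets worse (queue wait plus a second round trip), but throughput improves because workers batch and pace inference independently of request arrival.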

Quantization

Optimize models with quantization techniques
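Serving stacks such as TGI and vLLM apply quantization per-tensor or per-channel on the GPU, but the arithmetic is easy to see in pure Python. A sketch of symmetric int8 post-training quantization, where every float weight is mapped to an integer in [-127, 127] via a single scale factor:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: one scale maps floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step (scale / 2)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err < scale)
```

This is where the "slight accuracy drop" in the trade-off table comes from: each weight moves by up to half a step, and those small perturbations accumulate through the network. In exchange, int8 weights take 4x less memory than float32 and use faster integer kernels.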

Setup

Create Kubernetes Cluster

kind create cluster --name ml-in-production

Monitor with k9s

k9s -A

Configure Secrets

export WANDB_API_KEY='your-key-here'
kubectl create secret generic wandb --from-literal=WANDB_API_KEY=$WANDB_API_KEY

Deploy Services

Deploy the APIs from Module 5:
kubectl create -f ./k8s/app-fastapi.yaml
kubectl create -f ./k8s/app-triton.yaml
kubectl create -f ./k8s/app-streamlit.yaml
kubectl create -f ./k8s/kserve-inferenceserver.yaml

Key Concepts

Performance Metrics

When optimizing ML systems, track these key metrics:
  • Latency: Response time (p50, p95, p99)
  • Throughput: Requests per second (RPS)
  • Resource Utilization: CPU, memory, GPU usage
  • Cost: Infrastructure spend per 1000 requests
  • Accuracy: Model performance after optimization
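Percentile latencies matter more than the mean because a handful of slow requests can hide in an average while still breaking your SLO. A sketch of the nearest-rank method for computing p50/p95/p99 from raw samples (values here are made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p percent of the data."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 11, 220, 14, 13, 16, 18, 13, 500]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

In this sample the mean (83 ms) sits far above p50 (14 ms) and far below p99 (500 ms), which is why load-testing tools report the full percentile spread rather than a single number.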

Optimization Trade-offs

Every optimization involves trade-offs:
Technique        Latency     Throughput  Cost         Accuracy
---------------  ----------  ----------  -----------  --------------
Quantization     ✅ Better   ✅ Better   ✅ Lower     ⚠️ Slight drop
Autoscaling      ➡️ Same     ✅ Better   ⚠️ Variable  ➡️ Same
Async Inference  ⚠️ Worse    ✅ Better   ✅ Lower     ➡️ Same
Batching         ⚠️ Worse    ✅ Better   ✅ Lower     ➡️ Same

Practice Assignments

See the practice exercises for hands-on tasks.

Additional Resources

Horizontal Pod Autoscaling

Official Kubernetes HPA documentation

KServe Autoscaling

Knative autoscaling for model serving

TGI Quantization

Hugging Face Text Generation Inference quantization guide

vLLM Hardware Support

Supported quantization techniques by hardware
