Why Optimization Matters
“Premature optimization is the root of all evil” — Donald Knuth

That said, once your system is working, optimization becomes critical:

- Cost: GPU inference is expensive (A10G = $1.50/hour)
- Latency: Users expect sub-second responses
- Throughput: More requests per GPU = better economics
- Accessibility: Smaller models run on cheaper hardware
Benchmarking First
Always measure before optimizing:

- Profile inference: Where is time spent? (model forward pass, preprocessing, postprocessing)
- Establish baseline: Current throughput, latency, and cost
- Load test: Use realistic traffic patterns
- Monitor resources: GPU/CPU utilization, memory
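The baseline step boils down to a handful of numbers worth recording before and after each change. A minimal sketch (the function and field names are illustrative, not from any particular tool):

```python
# Reduce raw per-request latencies into a baseline worth tracking over time.
import statistics

def baseline(latencies_ms, duration_s):
    """Summarize a load-test run: median/p95 latency and sustained throughput."""
    ordered = sorted(latencies_ms)
    p95_idx = max(0, int(len(ordered) * 0.95) - 1)  # nearest-rank percentile
    return {
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_idx],
        "throughput_rps": len(ordered) / duration_s,
    }

stats = baseline([120, 95, 130, 180, 110, 640, 105, 98], duration_s=2.0)
print(stats)
```

Comparing p50 against p95 is what surfaces tail-latency problems (like the 640ms outlier above) that an average would hide.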
- Locust: Python-based load testing, great for ML APIs
- K6: Go-based, better metrics and dashboards
- Vegeta: CLI for quick HTTP load tests
- ghz: gRPC load testing (for Triton)
Run load tests in a staging environment that mirrors production. Don’t trust local benchmarks.
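For a quick local smoke test of the measurement loop itself, the standard library is enough; this sketch spins up a throwaway server and times requests against it (use Locust/K6 against staging for real numbers, per the note above):

```python
# Stdlib smoke-test sketch: time GET requests against a throwaway local server.
import http.server
import statistics
import threading
import time
import urllib.request

class OkHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), OkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

latencies_ms = []
for _ in range(50):
    start = time.perf_counter()
    urllib.request.urlopen(url).read()
    latencies_ms.append((time.perf_counter() - start) * 1000)
server.shutdown()

print(f"p50={statistics.median(latencies_ms):.2f}ms")
```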
Quantization
Quantization reduces precision (32-bit → 8-bit or 4-bit), trading slight accuracy for speed and memory.

Module 6 Benchmark (Phi-3.5, 100 concurrent users):

| Method | Median Latency | GPU Memory | Accuracy Impact |
|---|---|---|---|
| FP32 (baseline) | 5600ms | 12 GB | — |
| FP16 | 5000ms | 6 GB | Negligible |
| FP8 | 5000ms | 3 GB | Less than 1% drop |
| 8-bit (LLM.int8) | 13000ms | 3 GB | Less than 2% drop |
| 4-bit NF4 | 8500ms | 2 GB | 2-5% drop |
FP8 and FP16 are almost always safe, with minimal accuracy impact. FP16 is hardware-accelerated on any modern GPU (A100, H100); FP8 requires Hopper-class hardware (H100 or newer).
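The precision trade is easiest to see in a toy round-trip. This pure-Python sketch of symmetric int8 quantization (not any particular library's scheme) maps weights to 8-bit integers plus a scale, then restores them:

```python
# Toy symmetric int8 quantization: store integers + one scale instead of floats.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127  # map the largest weight to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -1.27, 0.51, 0.003]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(w, restored))
print(q, f"max error={err:.4f}")
```

The small weights absorb all the rounding error (0.003 collapses to 0 here), which is why aggressive 4-bit schemes like NF4 spend their codebook where weight values actually cluster.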
Quantization Methods
- EETQ: 8-bit quantization optimized for inference (fast, low memory)
- bitsandbytes: LLM.int8 and 4-bit NF4/FP4 (good accuracy, slower)
- GPTQ: Post-training quantization with a calibration dataset
- AWQ: Activation-aware quantization (better accuracy than GPTQ)
For LLMs, start with FP8 or FP16. Only drop to 4-bit if you’re memory-constrained. Always validate accuracy on your domain-specific eval set.
Horizontal Pod Autoscaling (HPA)
HPA automatically scales replicas based on load:

- Install metrics-server: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
- Set resource requests in your Deployment
- Create an HPA targeting the Deployment
For GPU workloads, use custom metrics (requests per second, queue depth) instead of CPU. GPU utilization isn’t exposed by default metrics-server.
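The "create an HPA" step might look like the following manifest; a sketch assuming a Deployment named model-api and the autoscaling/v2 API (names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

For GPU workloads you would swap the cpu resource metric for a Pods-type custom metric (e.g. requests per second) exposed through an adapter such as prometheus-adapter.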
KNative Autoscaling
KServe uses KNative for scale-to-zero. Scale-to-zero is great for development or low-traffic models. For production APIs, set minScale: 1 to avoid cold starts.

Vertical Pod Autoscaling (VPA)

VPA adjusts resource requests automatically. Use HPA for horizontal scaling (more replicas) and VPA for vertical scaling (bigger pods). Don’t use both on the same metric; they’ll fight each other.
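A minimal VPA manifest sketch, assuming the VPA CRDs and controller are installed in the cluster (the Deployment name is illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: model-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  updatePolicy:
    updateMode: "Auto"   # let VPA apply new requests by evicting and recreating pods
```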
Model Compression
Distillation
Train a smaller “student” model to mimic a larger “teacher”:

- DistilBERT: 40% smaller than BERT, retains 97% accuracy
- DistilWhisper: 6x faster speech recognition
- TinyLlama: 1.1B parameters, trained on 3T tokens
Distillation is powerful but requires significant compute for training. Quantization is faster to apply and often sufficient.
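The core of distillation is making the student match the teacher's softened output distribution. A toy sketch of the distillation loss in plain Python (the temperature and logit values are illustrative; real training adds a weighted hard-label cross-entropy term):

```python
# Distillation loss sketch: KL divergence between temperature-softened
# teacher and student distributions.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    # KL(teacher || student): zero when the student matches the teacher exactly.
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

teacher = [4.0, 1.0, 0.2]
good_student = [3.8, 1.1, 0.3]   # close to the teacher -> small loss
bad_student = [0.2, 4.0, 1.0]    # disagrees with the teacher -> large loss
print(distillation_kl(teacher, good_student), distillation_kl(teacher, bad_student))
```

Raising the temperature spreads probability mass onto the teacher's non-argmax classes, which is exactly the "dark knowledge" the student learns from.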
Pruning
Remove unimportant weights:

- Neural Compressor (Intel): Quantization + pruning + distillation
- SparseML (Neural Magic): Sparsity + quantization for CPUs
- PyTorch native: torch.nn.utils.prune
Pruning is most effective when combined with fine-tuning. Prune → fine-tune → repeat.
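The most common variant is magnitude pruning. This plain-Python sketch illustrates the idea behind `torch.nn.utils.prune.l1_unstructured` on a flat weight list (names and values are illustrative):

```python
# L1 (magnitude) unstructured pruning sketch: zero the smallest-magnitude weights.
def l1_prune(weights, amount):
    """Zero out the `amount` fraction of weights with the smallest magnitude.

    Ties at the threshold may prune slightly more than the requested fraction.
    """
    k = int(len(weights) * amount)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else None
    return [0.0 if k and abs(w) <= threshold else w for w in weights]

w = [0.5, -0.01, 0.3, 0.02, -0.8, 0.001]
pruned = l1_prune(w, amount=0.5)   # half the weights zeroed, large ones survive
print(pruned)
```

The zeros only pay off on hardware or runtimes with sparsity support (hence SparseML's CPU focus); dense GPU kernels mostly ignore them.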
TensorRT for Maximum Performance
TensorRT (NVIDIA) compiles models for optimized inference:

- Kernel fusion (combine ops)
- Precision calibration (INT8 quantization with minimal accuracy loss)
- Layer-specific tuning
- 2-5x faster than PyTorch on same GPU
- Essential for production at scale
TensorRT is complex to set up. Use Triton (which can run TensorRT engines via its TensorRT backend) or a serving engine like vLLM for easier integration.
Batching Strategies
Static Batching
Wait for N requests, then process them as a batch:

- Larger batch = better throughput, worse latency for the first request
- Smaller batch = lower latency, worse GPU utilization
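The tradeoff can be put in numbers. A back-of-envelope sketch, assuming a fixed per-batch step time and steady arrivals (assumptions for illustration, not measured figures):

```python
# Static batching tradeoff: waiting to fill a batch adds latency but
# multiplies throughput, assuming a roughly size-independent batch step time.
def first_request_wait_ms(batch_size, arrival_rate_rps, step_ms):
    # The first request waits for batch_size - 1 more arrivals, then one step.
    fill_ms = (batch_size - 1) / arrival_rate_rps * 1000
    return fill_ms + step_ms

def throughput_rps(batch_size, step_ms):
    return batch_size / (step_ms / 1000)

for bs in (1, 8, 32):
    print(bs,
          first_request_wait_ms(bs, arrival_rate_rps=20, step_ms=50),
          throughput_rps(bs, step_ms=50))
```

At 20 requests/second and a 50ms step, batch size 32 multiplies throughput 32x but makes the first request wait over 1.5 seconds just for the batch to fill.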
Continuous Batching (vLLM)
vLLM dynamically adds and removes requests from the in-flight batch. For LLMs, continuous batching improves throughput by 10-20x compared to sequential processing.
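A toy simulation shows why this helps: with uneven request lengths, a static batch stalls on its longest member, while continuous batching backfills freed slots immediately (step counts stand in for decode forward passes; the numbers are illustrative, not vLLM's scheduler):

```python
# Toy model: each "step" generates one token for every active request.
def static_batch_steps(lengths, batch_size):
    # Each batch runs until its longest request finishes.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    pending, slots, steps = list(lengths), [], 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))        # backfill freed slots immediately
        steps += 1
        slots = [s - 1 for s in slots if s > 1]  # drop finished requests
    return steps

lengths = [10, 1, 1, 1]   # one long request, three short ones
print(static_batch_steps(lengths, 2), continuous_batch_steps(lengths, 2))
```

Here the short requests ride along with the long one instead of forming a second batch, and the gap widens as request lengths get more skewed, which is the typical LLM workload.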
Async Inference for Long Jobs
For tasks longer than 30 seconds:

- Client pushes the job to a queue (SQS, Redis)
- Workers poll the queue and process jobs
- Results are stored in a DB
- Client polls for the result

Benefits:

- Client doesn’t time out
- Workers can be auto-scaled independently
- Failed jobs can be retried
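The flow above can be sketched with stdlib stand-ins: a queue.Queue in place of SQS/Redis, a dict in place of the results DB, and an uppercase call in place of slow inference (all names are illustrative):

```python
# Async inference flow sketch: submit -> queue -> worker -> results store.
import queue
import threading
import uuid

jobs = queue.Queue()   # stands in for SQS / Redis
results = {}           # stands in for the results DB

def worker():
    while True:
        job_id, payload = jobs.get()
        results[job_id] = payload.upper()   # stand-in for slow inference
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(payload):
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id                           # client polls the results store with this id

job_id = submit("hello")
jobs.join()                                 # in reality the client polls instead of joining
print(results[job_id])
```

Because the job id is the only coupling between client and worker, the worker pool can scale (or crash and retry) without the client noticing anything but latency.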
Hands-On Examples
Explore optimization in Module 6:

- Load test FastAPI and Triton with Locust/K6
- Benchmark quantization methods (FP8, 4-bit, 8-bit)
- Set up HPA with metrics-server
- Implement async inference with SQS
- Use KServe autoscaling
Next Steps
- Monitoring: Track optimization impact
- Production Patterns: Combine techniques for real systems