
Why Optimization Matters

“Premature optimization is the root of all evil” — Donald Knuth

That said, once your system is working, optimization becomes critical:
  • Cost: GPU inference is expensive (A10G = $1.50/hour)
  • Latency: Users expect sub-second responses
  • Throughput: More requests per GPU = better economics
  • Accessibility: Smaller models run on cheaper hardware
The goal: Maintain accuracy while reducing cost/latency.

Benchmarking First

Always measure before optimizing:
  1. Profile inference: Where is time spent? (model forward pass, preprocessing, postprocessing)
  2. Establish baseline: Current throughput, latency, and cost
  3. Load test: Use realistic traffic patterns
  4. Monitor resources: GPU/CPU utilization, memory
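As a quick sketch of step 1, PyTorch's built-in profiler shows where forward-pass time goes. A toy `nn.Linear` stands in for your real model here:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-ins for your real model and inputs
model = torch.nn.Linear(512, 512)
inputs = torch.randn(8, 512)

# Profile one forward pass; add ProfilerActivity.CUDA when running on GPU
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(inputs)

# Top operators by total CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

If a surprising share of time lands in preprocessing or postprocessing rather than the model itself, optimize those first.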

Locust

Python-based load testing, great for ML APIs

K6

Go-based load testing with richer metrics and dashboards

Vegeta

CLI for quick HTTP load tests

ghz

gRPC load testing (for Triton)
Run load tests in a staging environment that mirrors production. Don’t trust local benchmarks.
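Locust and K6 are the right tools for real runs, but the core idea fits in a stdlib sketch: fire concurrent requests and report latency percentiles and throughput. The `send` callable is a placeholder for your actual HTTP call:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(send, n=100, workers=10):
    """Run `send` n times across `workers` threads; return latency stats."""
    def timed(_):
        t0 = time.perf_counter()
        send()
        return (time.perf_counter() - t0) * 1000  # latency in ms

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(timed, range(n)))
    wall = time.perf_counter() - start

    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
        "throughput_rps": n / wall,
    }
```

Usage might look like `load_test(lambda: requests.post(url, json=payload))`. Record these numbers as your baseline before changing anything.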

Quantization

Quantization reduces numeric precision (32-bit → 8-bit or 4-bit), trading a small amount of accuracy for speed and memory savings. Module 6 benchmark (Phi-3.5, 100 concurrent users):

| Method | Median Latency | GPU Memory | Accuracy Impact |
|---|---|---|---|
| FP32 (baseline) | 5600 ms | 12 GB | Baseline |
| FP16 | 5000 ms | 6 GB | Negligible |
| FP8 | 5000 ms | 3 GB | <1% drop |
| 8-bit (LLM.int8) | 13000 ms | 3 GB | <2% drop |
| 4-bit NF4 | 8500 ms | 2 GB | 2-5% drop |
FP16 and FP8 are almost always safe. FP16 is hardware-accelerated on all modern GPUs; native FP8 support requires Hopper-class hardware (H100 and newer). Both have minimal accuracy impact.

Quantization Methods

EETQ

8-bit quantization optimized for inference (fast, low memory)

bitsandbytes

LLM.int8 and 4-bit NF4/FP4 (good accuracy, slower)

GPTQ

Post-training quantization with calibration dataset

AWQ

Activation-aware quantization (better accuracy than GPTQ)
Using vLLM with quantization:
vllm serve microsoft/Phi-3-mini-4k-instruct --quantization fp8
Using bitsandbytes in code:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-mini-4k-instruct',
    quantization_config=quant_config
)
For LLMs, start with FP8 or FP16. Only drop to 4-bit if you’re memory-constrained. Always validate accuracy on your domain-specific eval set.
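That validation step can be as simple as comparing accuracy on the same eval set before and after quantization. A minimal sketch, where the `predict` callables and `eval_set` are placeholders for your own evaluation code:

```python
def accuracy(predict, eval_set):
    """predict(x) -> label; eval_set is a list of (input, expected) pairs."""
    correct = sum(predict(x) == y for x, y in eval_set)
    return correct / len(eval_set)

def check_quantization(predict_base, predict_quant, eval_set, max_drop=0.01):
    """Fail loudly if the quantized model loses more than `max_drop` accuracy."""
    base = accuracy(predict_base, eval_set)
    quant = accuracy(predict_quant, eval_set)
    drop = base - quant
    if drop > max_drop:
        raise ValueError(f"quantized model lost {drop:.1%} accuracy")
    return base, quant
```

Wire this into CI so a quantization change that silently degrades your domain-specific metrics can't ship.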

Horizontal Pod Autoscaling (HPA)

HPA automatically scales replicas based on load:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-fastapi
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
K8s monitors CPU and adds/removes pods to maintain 50% average utilization. Setup:
  1. Install metrics-server: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
  2. Set resource requests in your Deployment
  3. Create HPA
For GPU workloads, use custom metrics (requests per second, queue depth) instead of CPU. GPU utilization isn’t exposed by default metrics-server.
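As a sketch of what that can look like, the `metrics` block in the HPA above can target a per-pod request rate instead of CPU. This assumes a custom-metrics provider such as Prometheus Adapter exposes a `requests_per_second` Pods metric; the metric name is illustrative:

```yaml
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second   # exposed by a custom-metrics adapter
      target:
        type: AverageValue
        averageValue: "50"          # add pods to keep ~50 RPS per pod
```

Queue depth works the same way and often tracks GPU saturation better than request rate.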

Knative Autoscaling

KServe uses Knative for scale-to-zero:
metadata:
  annotations:
    autoscaling.knative.dev/minScale: "0"  # Scale to zero
    autoscaling.knative.dev/maxScale: "10"
    autoscaling.knative.dev/target: "100"  # Target concurrency
When idle, Knative terminates pods (saving costs). The first request after idle pays a cold start (~10s); subsequent requests are fast.
Scale-to-zero is great for development or low-traffic models. For production APIs, set minScale: 1 to avoid cold starts.

Vertical Pod Autoscaling (VPA)

VPA adjusts resource requests automatically:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-fastapi-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-fastapi
  updatePolicy:
    updateMode: "Auto"  # Restart pods with new limits
VPA observes usage and recommends (or applies) new resource requests.
Use HPA for horizontal scaling (more replicas) and VPA for vertical scaling (bigger pods). Don’t use both on the same metric—they’ll fight each other.

Model Compression

Distillation

Train a smaller “student” model to mimic a larger “teacher”:
import torch.nn.functional as F

T = 2.0  # distillation temperature: softens both distributions
teacher_logits = teacher_model(inputs).detach()  # no gradient to the teacher
student_logits = student_model(inputs)

# Distillation loss: KL divergence between temperature-softened distributions
loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction='batchmean') * T ** 2
Examples:
  • DistilBERT: 40% smaller than BERT, retains 97% accuracy
  • DistilWhisper: 6x faster speech recognition
  • TinyLlama: 1.1B parameters, trained on 3T tokens
Distillation is powerful but requires significant compute for training. Quantization is faster to apply and often sufficient.

Pruning

Remove unimportant weights:
from torch.nn.utils import prune

# Remove 30% of weights with lowest magnitude
prune.l1_unstructured(model.layer1, name='weight', amount=0.3)
prune.remove(model.layer1, 'weight')  # Make permanent
Structured pruning removes entire filters/channels (better for hardware acceleration). Libraries:
  • Neural Compressor (Intel): Quantization + pruning + distillation
  • SparseML (Neural Magic): Sparsity + quantization for CPUs
  • PyTorch native: torch.nn.utils.prune
Pruning is most effective when combined with fine-tuning. Prune → fine-tune → repeat.
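A self-contained sketch of that loop on a toy model, using PyTorch's native pruning utilities (the fine-tune step is left as a comment; in practice you'd train between rounds):

```python
import torch
from torch import nn
from torch.nn.utils import prune

# Toy model standing in for a real network
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

def sparsity(m):
    """Fraction of Linear weights that are exactly zero."""
    zeros = total = 0
    for mod in m.modules():
        if isinstance(mod, nn.Linear):
            zeros += (mod.weight == 0).sum().item()
            total += mod.weight.numel()
    return zeros / total

# Prune 20% of the remaining weights per round, fine-tuning in between
for _ in range(3):
    for mod in model.modules():
        if isinstance(mod, nn.Linear):
            prune.l1_unstructured(mod, name='weight', amount=0.2)
    # fine_tune(model)  # recover accuracy before the next round

# Make the accumulated masks permanent
for mod in model.modules():
    if isinstance(mod, nn.Linear):
        prune.remove(mod, 'weight')

print(f"sparsity: {sparsity(model):.0%}")  # roughly half the weights are zero
```

Each round prunes a fraction of the still-unpruned weights, so three 20% rounds leave roughly half the weights zeroed rather than 60%.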

TensorRT for Maximum Performance

TensorRT (NVIDIA) compiles models for optimized inference:
import tensorrt as trt
import torch

# Export to ONNX first
torch.onnx.export(model, dummy_input, 'model.onnx')

# Convert to TensorRT (TensorRT 8+ API)
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open('model.onnx', 'rb') as f:
    parser.parse(f.read())

config = builder.create_builder_config()
engine = builder.build_serialized_network(network, config)  # serialized plan
Optimizations:
  • Kernel fusion (combine ops)
  • Precision calibration (INT8 quantization with minimal accuracy loss)
  • Layer-specific tuning
Speedups:
  • 2-5x faster than PyTorch on same GPU
  • Essential for production at scale
TensorRT is complex to set up. Use vLLM or Triton (which use TensorRT under the hood) for easier integration.

Batching Strategies

Static Batching

Wait for N requests, then process as batch:
import time

queue = []
last_batch = time.monotonic()

async def serve():
    global last_batch
    while True:
        queue.append(await get_request())
        if len(queue) >= batch_size or time.monotonic() - last_batch > max_wait:
            outputs = model(collate(queue))  # stack queued inputs into one batch
            send_responses(outputs)
            queue.clear()
            last_batch = time.monotonic()
Trade-offs:
  • Larger batch = better throughput, worse latency for first request
  • Smaller batch = lower latency, worse GPU utilization

Continuous Batching (vLLM)

vLLM dynamically adds/removes requests from batch:
Time 0: [Req1: token1, Req2: token1, Req3: token1]
Time 1: [Req1: token2, Req2: token2, Req3: token2]  # Req3 finishes
Time 2: [Req1: token3, Req2: token3, Req4: token1]  # New request added
This maximizes GPU utilization without increasing latency.
For LLMs, continuous batching improves throughput by 10-20x compared to sequential processing.
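A toy simulation makes the gap concrete (illustrative only; real continuous batching in vLLM also manages KV-cache memory). Each request needs some number of decode steps, and the GPU runs one step per time unit for every active slot:

```python
def static_steps(requests, slots):
    """Static batching: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(requests), slots):
        total += max(requests[i:i + slots])
    return total

def continuous_steps(requests, slots):
    """Continuous batching: finished requests free their slot immediately."""
    pending, active, total = list(requests), [], 0
    while pending or active:
        while pending and len(active) < slots:
            active.append(pending.pop(0))  # admit new work as slots open up
        total += 1
        active = [t - 1 for t in active if t > 1]
    return total

# One long request per batch of four stalls the short ones behind it
reqs = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_steps(reqs, 4), continuous_steps(reqs, 4))
```

With this workload the static scheduler takes 200 steps while the continuous one takes 110, because short requests no longer wait on the longest member of their batch.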

Async Inference for Long Jobs

For tasks > 30s:
  1. Client pushes job to queue (SQS, Redis)
  2. Workers poll queue and process
  3. Results stored in DB
  4. Client polls for result
Benefits:
  • Client doesn’t time out
  • Workers can be auto-scaled independently
  • Failed jobs can be retried
Example with Modal:
import uuid
import modal

stub = modal.Stub('async-inference')

@stub.function(gpu='A10G')
async def process_job(job_id, data):
    result = model.predict(data)
    db.save(job_id, result)

@stub.local_entrypoint()
def submit(data):
    job_id = str(uuid.uuid4())
    process_job.spawn(job_id, data)  # Async call
    return job_id
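On the client side, step 4 is a simple polling loop. A sketch, with `fetch(job_id)` standing in for your DB or Redis lookup:

```python
import time

def wait_for_result(job_id, fetch, timeout=300.0, interval=2.0):
    """Poll `fetch(job_id)` until it returns a result or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch(job_id)
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {timeout}s")
```

Under heavy load, exponential backoff or a push mechanism (webhooks, server-sent events) avoids hammering the results store.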

Hands-On Examples

Explore optimization in Module 6:
  • Load test FastAPI and Triton with Locust/K6
  • Benchmark quantization methods (FP8, 4-bit, 8-bit)
  • Set up HPA with metrics-server
  • Implement async inference with SQS
  • Use KServe autoscaling

Next Steps

Monitoring

Track optimization impact

Production Patterns

Combine techniques for real systems
