Overview

These practice exercises help you apply optimization techniques to real ML systems. Complete the tasks to gain hands-on experience with load testing, autoscaling, async inference, and quantization.

Key Learning Objectives

  • Implement dynamic batching and ensembles
  • Benchmark REST and gRPC performance
  • Apply quantization or pruning
  • Configure horizontal pod autoscaling
  • Implement async inference with queues

H11: Advanced Features & Benchmarking

Tasks

PR1: Dynamic Request Batching

Implement dynamic request batching for your model server.

Options: Triton, Seldon, or KServe

Requirements:
  • Configure batch size and timeout
  • Measure throughput improvement
  • Document configuration
  • Test with varying load patterns
Example (Triton):
{
  "dynamic_batching": {
    "preferred_batch_size": [8, 16],
    "max_queue_delay_microseconds": 100
  }
}
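To verify the throughput gain, point a concurrent load generator at the server before and after enabling batching. The sketch below is a minimal stand-in: `send_request` simulates a fixed-latency inference call and should be replaced with your actual HTTP or gRPC client.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(payload):
    """Stand-in for a real inference call; swap in your HTTP/gRPC client."""
    time.sleep(0.01)  # simulate 10 ms of server-side work
    return {"ok": True}

def measure_throughput(num_requests=200, concurrency=16):
    """Fire num_requests with bounded concurrency; return requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(send_request, range(num_requests)))
    elapsed = time.perf_counter() - start
    return num_requests / elapsed

rps = measure_throughput()
print(f"throughput: {rps:.1f} RPS")
```

Run the same harness at several concurrency levels to cover the varying-load requirement above.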
PR2: Model Ensemble

Create an ensemble of multiple models.

Options: Triton, Seldon, or KServe

Requirements:
  • Combine at least 2 models
  • Implement preprocessing/postprocessing
  • Measure end-to-end latency
  • Document ensemble logic
Example flow:
Input → Preprocessing → Model A → Postprocessing → Output
                     ↘ Model B ↗
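The flow above can be sketched in plain Python before wiring it into a serving framework's ensemble scheduler. Both models here are trivial stubs (mean and max of the features) and the normalization constant 255 is a placeholder; the point is the preprocessing/fan-out/postprocessing shape.

```python
def preprocess(x):
    # hypothetical normalization step
    return [v / 255.0 for v in x]

def model_a(x):
    return sum(x) / len(x)   # stub model: mean of features

def model_b(x):
    return max(x)            # stub model: max of features

def postprocess(scores):
    # combine the two model outputs into a single ensemble score
    return sum(scores) / len(scores)

def ensemble(x):
    x = preprocess(x)
    return postprocess([model_a(x), model_b(x)])

score = ensemble([51, 102, 204])
print(score)
```

In Triton the same topology is declared in an ensemble model's configuration rather than in application code; timing `ensemble()` end to end covers the latency requirement.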
PR3: gRPC Inference

Implement a gRPC inference endpoint.

Options: Triton, Seldon, or KServe

Requirements:
  • Implement gRPC server
  • Create client example
  • Benchmark vs REST
  • Document protocol buffers
Compare:
  • Request/response size
  • Latency differences
  • Throughput improvements
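A gRPC endpoint starts from a protocol buffer definition. The service below is purely illustrative — Triton, Seldon, and KServe each ship their own published `.proto` files with different message shapes, so use those rather than defining your own when targeting a real server.

```protobuf
syntax = "proto3";

package inference;

// Illustrative service only; real servers publish their own .proto files.
service InferenceService {
  rpc Predict (PredictRequest) returns (PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  repeated float inputs = 2;
}

message PredictResponse {
  repeated float outputs = 1;
}
```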
PR4: Model Server Benchmarking

Run comprehensive benchmarks against your model server.

Metrics to report:
  • Latency (p50, p95, p99)
  • Throughput (RPS)
  • Error rate
  • Resource utilization (CPU, memory, GPU)
Tools: Locust, k6, or Vegeta

Workloads:
  • Low load (10 users)
  • Medium load (50 users)
  • High load (100+ users)
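Whatever load tool you use, report percentiles consistently. A minimal nearest-rank implementation over recorded latencies looks like the sketch below; note that with only a handful of samples the tail percentiles (p95, p99) collapse onto the worst observation, so collect many samples per workload.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 11, 90, 14, 13, 250, 16, 12, 14]
report = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
print(report)
```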
PR5: REST vs gRPC Benchmark

Compare REST and gRPC performance.

Requirements:

  • Same model, same hardware
  • Measure both protocols
  • Document differences
  • Provide recommendations
Expected findings:
  • Latency comparison
  • Throughput comparison
  • When to use each protocol
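Part of gRPC's advantage comes from binary protobuf encoding versus JSON text. The sketch below uses stdlib `struct`-packed 32-bit floats as a rough stand-in for a protobuf payload; real protobuf adds small per-field overhead, but the ordering of the comparison holds.

```python
import json
import struct

features = [0.1, 0.2, 0.3, 0.4] * 64   # 256 float features

# REST typically ships JSON text
json_bytes = json.dumps({"inputs": features}).encode("utf-8")

# gRPC ships binary protobuf; packed 32-bit floats approximate that encoding
binary_bytes = struct.pack(f"{len(features)}f", *features)

print(len(json_bytes), len(binary_bytes))
```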
PR6: Component-Level Benchmarking

Profile inference by component.

Break down:
  • Network latency
  • Preprocessing time
  • Model forward pass
  • Postprocessing time
  • Serialization overhead
Tools: Python profilers or custom timing

Identify bottlenecks and optimization opportunities.
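One lightweight way to get the per-stage breakdown is a timing context manager around each pipeline step. The request handler below is a stub (the forward pass is simulated with a sleep); the pattern transfers directly to a real server.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock seconds for a named pipeline stage."""
    start = time.perf_counter()
    yield
    timings[stage] = time.perf_counter() - start

def handle_request(x):
    with timed("preprocess"):
        x = [v / 255.0 for v in x]
    with timed("forward"):
        time.sleep(0.005)  # stand-in for the model forward pass
        y = sum(x)
    with timed("postprocess"):
        result = {"score": y}
    return result

handle_request([10, 20, 30])
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:12s} {seconds * 1000:.2f} ms")
```

Sorting stages by elapsed time surfaces the bottleneck immediately; network and serialization costs are measured the same way at the client.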

Deliverables

GitHub PRs

6 merged pull requests with working code

Performance Report

Update Google Doc with inference performance metrics and findings

H12: Scaling Infrastructure & Model

Tasks

PR1: Horizontal Pod Autoscaling

Configure HPA for your model server deployment.

Requirements:
  • Set resource requests/limits
  • Configure HPA with CPU/memory targets
  • Test scaling under load
  • Document scaling behavior
Validation:
  • Pods scale up under load
  • Pods scale down when idle
  • Scaling happens within reasonable time
  • No service disruption during scaling
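A minimal HPA manifest might look like the sketch below. The deployment name `model-server` and the 70% CPU target are placeholders; the target Deployment's pod spec must declare CPU requests, or the utilization ratio cannot be computed (see the pitfalls section).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server      # placeholder: your deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```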
PR2: Async Inference with Queue

Implement asynchronous inference using a message queue.

Options: Kafka, SQS, RabbitMQ, or Redis

Requirements:
  • Submit job endpoint
  • Status check endpoint
  • Worker(s) processing queue
  • Results storage
  • Error handling
Architecture:
API → Queue → Worker → Model → Results DB
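The architecture can be prototyped in-process before introducing a real broker: here a stdlib `queue.Queue` stands in for Kafka/SQS/RabbitMQ, a dict stands in for the results database, and the model is a stub that sums its inputs. The three functions map onto the submit endpoint, the status endpoint, and the worker.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # stand-in for Kafka/SQS/RabbitMQ
results = {}           # stand-in for a results database

def submit(payload):
    """Submit endpoint: enqueue the job and return an id immediately."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "pending"}
    jobs.put((job_id, payload))
    return job_id

def status(job_id):
    """Status endpoint: poll the results store."""
    return results.get(job_id, {"status": "unknown"})

def worker():
    """Worker: drain the queue, run the model, persist results or errors."""
    while True:
        job_id, payload = jobs.get()
        try:
            results[job_id] = {"status": "done", "output": sum(payload)}
        except Exception as exc:
            results[job_id] = {"status": "failed", "error": str(exc)}
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
job_id = submit([1, 2, 3])
jobs.join()  # in production the client polls status() instead of joining
print(status(job_id))
```

Swapping the queue and results store for managed services changes the plumbing but not this control flow.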
PR3: Model Optimization

Optimize inference speed using quantization, pruning, or distillation.

Choose one or more:
  • Quantization: INT8, INT4, FP16
  • Pruning: Remove unimportant weights
  • Distillation: Train smaller model
Requirements:
  • Apply optimization technique
  • Measure performance improvement
  • Validate accuracy retention
  • Document trade-offs
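The core idea behind INT8 quantization fits in a few lines: scale float weights into the signed 8-bit range, round, and keep the scale for dequantization. The pure-Python sketch below shows symmetric per-tensor quantization; real toolchains (e.g. PyTorch or ONNX Runtime quantization) add calibration, per-channel scales, and fused kernels on top of this idea.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: floats -> int8 values + scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 4), max_err)
```

The maximum reconstruction error is bounded by half the scale, which is the trade-off to measure against accuracy on your validation set.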
PR4: Post-Optimization Benchmarking

Benchmark the optimized model against the baseline.

Compare:
  • Latency (p50, p95, p99)
  • Throughput (RPS)
  • Model size
  • Memory usage
  • Accuracy metrics
  • Cost per 1000 requests
Document:
  • Performance improvements
  • Accuracy impact
  • Cost savings
  • Recommendations
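Cost per 1000 requests falls out of instance price and sustained throughput. The numbers below are purely hypothetical (a $0.90/hr instance, 40 RPS baseline vs 100 RPS after optimization); plug in your measured RPS and your provider's pricing.

```python
def cost_per_1000_requests(instance_cost_per_hour, sustained_rps):
    """Cost of serving 1000 requests at sustained throughput on one instance."""
    requests_per_hour = sustained_rps * 3600
    return instance_cost_per_hour / requests_per_hour * 1000

# hypothetical numbers: $0.90/hr instance, baseline vs optimized throughput
baseline = cost_per_1000_requests(0.90, 40)
optimized = cost_per_1000_requests(0.90, 100)
print(f"baseline ${baseline:.4f} vs optimized ${optimized:.4f} per 1k requests")
```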

Deliverables

GitHub PRs

4 merged pull requests with working implementations

Optimization Report

Update Google Doc with model optimization results and performance comparisons

Evaluation Criteria

H11 Criteria

  • Clean, documented code
  • Proper error handling
  • Production-ready configuration
  • Tests included

H12 Criteria

  • All features working
  • Autoscaling tested
  • Queue implementation complete
  • Optimization applied

Tips for Success

Start Simple

Get basic implementation working before adding complexity

Measure Everything

Collect metrics before and after each optimization

Document As You Go

Write documentation while implementing, not after

Test Realistically

Use production-like data and traffic patterns

Common Pitfalls

Avoid these mistakes:
  1. Not setting resource requests - HPA requires resource requests to work
  2. Testing on laptop - Performance characteristics differ from production
  3. Ignoring accuracy - Always validate model accuracy after optimization
  4. Over-optimizing early - Measure first, optimize bottlenecks second
  5. Forgetting error cases - Test failure scenarios and recovery

Recommended Workflow

  1. Baseline: Establish baseline metrics
  2. Implement: Add the feature or optimization
  3. Measure: Benchmark the new configuration
  4. Compare: Analyze differences
  5. Document: Record findings
  6. Iterate: Refine based on results

Getting Help

Office Hours

Attend office hours for guidance and troubleshooting

Discussion Forum

Ask questions and share learnings with classmates

Documentation

Reference official docs for tools and frameworks

Examples

Check module source code for working examples

Submission

When you’re ready to submit:
  1. Ensure all PRs are merged to your repository
  2. Update the Google Doc with:
    • Performance metrics and benchmarks
    • Optimization results and analysis
    • Architecture decisions and trade-offs
    • Recommendations for production
  3. Tag your submission with module-6-complete
  4. Notify instructor via course platform
Ready to submit? Double-check you have:
  • ✅ All required PRs merged
  • ✅ Performance metrics documented
  • ✅ Optimization results analyzed
  • ✅ Clear recommendations provided
