Overview
These practice exercises help you apply optimization techniques to real ML systems. Complete the tasks to gain hands-on experience with load testing, autoscaling, async inference, and quantization.

Key Learning Objectives
- Implement dynamic batching and ensembles
- Benchmark REST and gRPC performance
- Apply quantization or pruning
- Configure horizontal pod autoscaling
- Implement async inference with queues
H11: Advanced Features & Benchmarking
Learning Resources
Required Reading
Advanced Topics
Triton & Seldon Features
Tasks
PR1: Dynamic Request Batching
Implement dynamic request batching for your model server.

Options: Triton, Seldon, or KServe

Requirements:
- Configure batch size and timeout
- Measure throughput improvement
- Document configuration
- Test with varying load patterns
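To make the batch-size/timeout trade-off concrete, here is a minimal in-process sketch of dynamic batching — the same two knobs that Triton exposes as `preferred_batch_size` and `max_queue_delay_microseconds`. The `predict_fn` here is a stand-in for your real batched model call, not any server's API:

```python
import threading
import queue
import time

class MicroBatcher:
    """Collects single requests into batches, flushing when either
    max_batch_size is reached or max_latency_s elapses."""

    def __init__(self, predict_fn, max_batch_size=8, max_latency_s=0.01):
        self.predict_fn = predict_fn          # stand-in batched model call
        self.max_batch_size = max_batch_size
        self.max_latency_s = max_latency_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        """Enqueue one input; block until its result is ready."""
        done = threading.Event()
        holder = {}
        self.requests.put((x, done, holder))
        done.wait()
        return holder["y"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]      # block for the first item
            deadline = time.monotonic() + self.max_latency_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                      # timeout: flush a partial batch
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            ys = self.predict_fn([x for x, _, _ in batch])  # one batched call
            for (_, done, holder), y in zip(batch, ys):
                holder["y"] = y
                done.set()
```

Raising `max_batch_size` improves throughput at the cost of tail latency bounded by `max_latency_s` — the same trade-off you should see in your server's benchmarks.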
PR2: Model Ensemble
Create an ensemble of multiple models.

Options: Triton, Seldon, or KServe

Requirements:
- Combine at least 2 models
- Implement preprocessing/postprocessing
- Measure end-to-end latency
- Document ensemble logic
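As a shape for the ensemble logic, here is a minimal Python sketch with two stand-in models, shared preprocessing, score averaging, and postprocessing. Real servers wire this up declaratively (a Triton ensemble config or a Seldon inference graph); the model functions below are placeholders:

```python
# Ensemble sketch: preprocess -> model_a & model_b -> average -> postprocess.

def preprocess(raw):
    # e.g. normalize raw pixel values into [0, 1]
    return [v / 255.0 for v in raw]

def model_a(x):
    return [v * 0.9 for v in x]   # stand-in for the first model's scores

def model_b(x):
    return [v * 1.1 for v in x]   # stand-in for the second model's scores

def postprocess(scores):
    # e.g. threshold averaged scores into binary labels
    return [1 if s >= 0.5 else 0 for s in scores]

def ensemble_predict(raw):
    x = preprocess(raw)
    a, b = model_a(x), model_b(x)
    averaged = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return postprocess(averaged)
```

For end-to-end latency, time `ensemble_predict` as a whole rather than the individual stages — that is the number your report should lead with.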
PR3: gRPC Inference
Implement a gRPC inference endpoint.

Options: Triton, Seldon, or KServe

Requirements:
- Implement gRPC server
- Create client example
- Benchmark vs REST
- Document protocol buffers

Compare against REST on:
- Request/response size
- Latency differences
- Throughput improvements
PR4: Model Server Benchmarking
Benchmark your model server comprehensively.

Metrics to report:
- Latency (p50, p95, p99)
- Throughput (RPS)
- Error rate
- Resource utilization (CPU, memory, GPU)

Load levels to test:
- Low load (10 users)
- Medium load (50 users)
- High load (100+ users)
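For the latency percentiles, a small stdlib helper you can run over collected per-request timings (assumes at least a few dozen samples, in milliseconds):

```python
import statistics

def latency_report(samples_ms):
    """Summarize latency samples into p50/p95/p99.
    statistics.quantiles(n=100) returns 99 cut points;
    index k-1 is the k-th percentile."""
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Pair this with throughput (requests completed / wall-clock seconds) and error rate (failed / total requests) measured over the same window, so all three metrics describe the same run.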
PR5: REST vs gRPC Benchmark
Compare REST and gRPC performance.

Requirements:
- Same model, same hardware
- Measure both protocols
- Document differences
- Provide recommendations

Report on:
- Latency comparison
- Throughput comparison
- When to use each protocol
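Much of gRPC's edge comes from its compact binary encoding. A rough stdlib illustration: the same float vector serialized as JSON text (REST-style) vs packed float32 bytes — a stand-in for protobuf, which adds only small tag/length overhead on top of the raw values:

```python
import json
import struct

# The same 256-element float vector, encoded two ways.
vector = [0.123456789 * i for i in range(256)]

json_bytes = json.dumps({"inputs": vector}).encode("utf-8")       # text encoding
binary_bytes = struct.pack(f"{len(vector)}f", *vector)            # 4 bytes/float32

print(len(json_bytes), len(binary_bytes))
```

The binary form is exactly 4 bytes per value; the JSON form is several times larger, which is one reason gRPC tends to win on large tensors while the gap narrows for tiny payloads.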
Deliverables
GitHub PRs
6 merged pull requests with working code
Performance Report
Update the Google Doc with inference performance metrics and findings
H12: Scaling Infrastructure & Model
Learning Resources
Scaling & Autoscaling
Message Queues & Async
Hardware Accelerators
Tasks
PR1: Horizontal Pod Autoscaling
Configure HPA for your model server pod.

Requirements:
- Set resource requests/limits
- Configure HPA with CPU/memory targets
- Test scaling under load
- Document scaling behavior

Acceptance criteria:
- Pods scale up under load
- Pods scale down when idle
- Scaling happens within a reasonable time
- No service disruption during scaling
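A sketch of the HPA manifest, assuming a Deployment named `model-server` (a placeholder for your own) that already declares CPU requests — HPA utilization targets are computed as a percentage of the pod's requests, which is why setting requests/limits comes first:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server          # your model server Deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up above 70% average CPU of requests
```

Apply it, then watch `kubectl get hpa -w` while your load test runs to document the scale-up and scale-down timings.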
PR2: Async Inference with Queue
Implement asynchronous inference using a message queue.

Options: Kafka, SQS, RabbitMQ, or Redis

Requirements:
- Submit job endpoint
- Status check endpoint
- Worker(s) processing queue
- Results storage
- Error handling
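An in-process sketch of the whole flow using Python's stdlib `queue`: the queue stands in for Kafka/SQS/RabbitMQ/Redis, the results dict for a real results store, and the functions for your submit/status endpoints:

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}            # job_id -> {"status": ..., "result"/"error": ...}

def submit_job(payload):
    """Submit endpoint: enqueue a job and return its id immediately."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "pending"}
    jobs.put((job_id, payload))
    return job_id

def job_status(job_id):
    """Status endpoint: read the shared results store."""
    return results.get(job_id, {"status": "unknown"})

def worker(predict_fn):
    """Worker: drain the queue, record results and failures."""
    while True:
        job_id, payload = jobs.get()
        try:
            results[job_id] = {"status": "done", "result": predict_fn(payload)}
        except Exception as e:          # error handling: record, don't crash
            results[job_id] = {"status": "failed", "error": str(e)}
        finally:
            jobs.task_done()

# Stand-in model; in production this is your inference call.
threading.Thread(target=worker, args=(lambda x: x * 2,), daemon=True).start()
```

The same three pieces map onto your real stack: `submit_job` becomes a POST endpoint producing to the broker, `worker` a separate consumer process, and `results` a database or object store keyed by job id.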
PR3: Model Optimization
Optimize inference speed using quantization, pruning, or distillation.

Choose one or more:
- Quantization: INT8, INT4, FP16
- Pruning: Remove unimportant weights
- Distillation: Train a smaller model

Requirements:
- Apply the optimization technique
- Measure performance improvement
- Validate accuracy retention
- Document trade-offs
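As a toy illustration of what INT8 quantization trades away, a pure-Python sketch that maps one weight tensor onto 8-bit integers with a single symmetric scale and measures the round-trip error (real toolchains do this per-channel and with calibration data):

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: map floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.031, 1.0]   # toy weight tensor
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Per-weight error is bounded by scale/2, the half-width of one
# quantization step -- this is the "accuracy retention" trade-off.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Storage drops from 4 bytes to 1 byte per weight; your validation step then checks whether the bounded per-weight error translates into an acceptable accuracy drop on real data.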
Deliverables
GitHub PRs
4 merged pull requests with working implementations
Optimization Report
Update the Google Doc with model optimization results and performance comparisons
Evaluation Criteria
H11 Criteria
- Code Quality
- Performance
- Documentation

Checklist:
- Clean, documented code
- Proper error handling
- Production-ready configuration
- Tests included
H12 Criteria
- Functionality
- Performance
- Analysis

Checklist:
- All features working
- Autoscaling tested
- Queue implementation complete
- Optimization applied
Tips for Success
Start Simple
Get basic implementation working before adding complexity
Measure Everything
Collect metrics before and after each optimization
Document As You Go
Write documentation while implementing, not after
Test Realistically
Use production-like data and traffic patterns
Recommended Workflow
- Baseline: Establish baseline metrics
- Implement: Add feature/optimization
- Measure: Benchmark new configuration
- Compare: Analyze differences
- Document: Record findings
- Iterate: Refine based on results
Getting Help
Office Hours
Attend office hours for guidance and troubleshooting
Discussion Forum
Ask questions and share learnings with classmates
Documentation
Reference official docs for tools and frameworks
Examples
Check module source code for working examples
Submission
When you’re ready to submit:

- Ensure all PRs are merged to your repository
- Update the Google Doc with:
- Performance metrics and benchmarks
- Optimization results and analysis
- Architecture decisions and trade-offs
- Recommendations for production
- Tag your submission with module-6-complete
- Notify the instructor via the course platform
Ready to submit? Double-check you have:
- ✅ All required PRs merged
- ✅ Performance metrics documented
- ✅ Optimization results analyzed
- ✅ Clear recommendations provided