Overview

These practice exercises help you apply optimization techniques to real ML systems. Complete the tasks to gain hands-on experience with load testing, autoscaling, async inference, and quantization.

Key Learning Objectives

  • Implement dynamic batching and ensembles
  • Benchmark REST and gRPC performance
  • Apply quantization or pruning
  • Configure horizontal pod autoscaling
  • Implement async inference with queues

H11: Advanced Features & Benchmarking

Tasks

PR1: Dynamic Request Batching

Implement dynamic request batching for your model server.

Options: Triton, Seldon, or KServe

Requirements:
  • Configure batch size and timeout
  • Measure throughput improvement
  • Document configuration
  • Test with varying load patterns
Example (Triton):
{
  "dynamic_batching": {
    "preferred_batch_size": [8, 16],
    "max_queue_delay_microseconds": 100
  }
}
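To verify the throughput gain, point a concurrent load generator at the server before and after enabling batching. The sketch below is a minimal stand-in: `send_request` simulates a fixed-latency inference call and should be replaced with your actual HTTP or gRPC client.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(payload):
    """Stand-in for a real inference call; swap in your HTTP/gRPC client."""
    time.sleep(0.01)  # simulate 10 ms of server-side work
    return {"ok": True}

def measure_throughput(num_requests=200, concurrency=16):
    """Fire num_requests with bounded concurrency; return requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(send_request, range(num_requests)))
    elapsed = time.perf_counter() - start
    return num_requests / elapsed

rps = measure_throughput()
print(f"throughput: {rps:.1f} RPS")
```

Run the same harness at several concurrency levels to cover the varying-load requirement above.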
PR2: Model Ensemble

Create an ensemble of multiple models.

Options: Triton, Seldon, or KServe

Requirements:
  • Combine at least 2 models
  • Implement preprocessing/postprocessing
  • Measure end-to-end latency
  • Document ensemble logic
Example flow:
Input → Preprocessing → Model A → Postprocessing → Output
                     ↘ Model B ↗
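The flow above can be sketched in plain Python before wiring it into a serving framework's ensemble scheduler. Both models here are trivial stubs (mean and max of the features) and the normalization constant 255 is a placeholder; the point is the preprocessing/fan-out/postprocessing shape.

```python
def preprocess(x):
    # hypothetical normalization step
    return [v / 255.0 for v in x]

def model_a(x):
    return sum(x) / len(x)   # stub model: mean of features

def model_b(x):
    return max(x)            # stub model: max of features

def postprocess(scores):
    # combine the two model outputs into a single ensemble score
    return sum(scores) / len(scores)

def ensemble(x):
    x = preprocess(x)
    return postprocess([model_a(x), model_b(x)])

score = ensemble([51, 102, 204])
print(score)
```

In Triton the same topology is declared in an ensemble model's configuration rather than in application code; timing `ensemble()` end to end covers the latency requirement.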
PR3: gRPC Inference

Implement a gRPC inference endpoint.

Options: Triton, Seldon, or KServe

Requirements:
  • Implement gRPC server
  • Create client example
  • Benchmark vs REST
  • Document protocol buffers
Compare:
  • Request/response size
  • Latency differences
  • Throughput improvements
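A gRPC endpoint starts from a protocol buffer definition. The service below is purely illustrative — Triton, Seldon, and KServe each ship their own published `.proto` files with different message shapes, so use those rather than defining your own when targeting a real server.

```protobuf
syntax = "proto3";

package inference;

// Illustrative service only; real servers publish their own .proto files.
service InferenceService {
  rpc Predict (PredictRequest) returns (PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  repeated float inputs = 2;
}

message PredictResponse {
  repeated float outputs = 1;
}
```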
PR4: Model Server Benchmarking

Run comprehensive benchmarks against your model server.

Metrics to report:
  • Latency (p50, p95, p99)
  • Throughput (RPS)
  • Error rate
  • Resource utilization (CPU, memory, GPU)
Tools: Locust, k6, or Vegeta

Workloads:
  • Low load (10 users)
  • Medium load (50 users)
  • High load (100+ users)
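Whatever load tool you use, report percentiles consistently. A minimal nearest-rank implementation over recorded latencies looks like the sketch below; note that with only a handful of samples the tail percentiles (p95, p99) collapse onto the worst observation, so collect many samples per workload.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 11, 90, 14, 13, 250, 16, 12, 14]
report = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
print(report)
```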
PR5: REST vs gRPC Benchmark

Compare REST and gRPC performance.

Requirements:

  • Same model, same hardware
  • Measure both protocols
  • Document differences
  • Provide recommendations
Expected findings:
  • Latency comparison
  • Throughput comparison
  • When to use each protocol
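Part of gRPC's advantage comes from binary protobuf encoding versus JSON text. The sketch below uses stdlib `struct`-packed 32-bit floats as a rough stand-in for a protobuf payload; real protobuf adds small per-field overhead, but the ordering of the comparison holds.

```python
import json
import struct

features = [0.1, 0.2, 0.3, 0.4] * 64   # 256 float features

# REST typically ships JSON text
json_bytes = json.dumps({"inputs": features}).encode("utf-8")

# gRPC ships binary protobuf; packed 32-bit floats approximate that encoding
binary_bytes = struct.pack(f"{len(features)}f", *features)

print(len(json_bytes), len(binary_bytes))
```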
PR6: Component-Level Benchmarking

Profile inference by component.

Break down:
  • Network latency
  • Preprocessing time
  • Model forward pass
  • Postprocessing time
  • Serialization overhead
Tools: Python profilers or custom timing

Identify bottlenecks and optimization opportunities.
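One lightweight way to get the per-stage breakdown is a timing context manager around each pipeline step. The request handler below is a stub (the forward pass is simulated with a sleep); the pattern transfers directly to a real server.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock seconds for a named pipeline stage."""
    start = time.perf_counter()
    yield
    timings[stage] = time.perf_counter() - start

def handle_request(x):
    with timed("preprocess"):
        x = [v / 255.0 for v in x]
    with timed("forward"):
        time.sleep(0.005)  # stand-in for the model forward pass
        y = sum(x)
    with timed("postprocess"):
        result = {"score": y}
    return result

handle_request([10, 20, 30])
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:12s} {seconds * 1000:.2f} ms")
```

Sorting stages by elapsed time surfaces the bottleneck immediately; network and serialization costs are measured the same way at the client.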

Deliverables

GitHub PRs

6 merged pull requests with working code

Performance Report

Update Google Doc with inference performance metrics and findings

H12: Scaling Infrastructure & Model

Tasks

PR1: Horizontal Pod Autoscaling

Configure HPA for your model server deployment.

Requirements:
  • Set resource requests/limits
  • Configure HPA with CPU/memory targets
  • Test scaling under load
  • Document scaling behavior
Validation:
  • Pods scale up under load
  • Pods scale down when idle
  • Scaling happens within reasonable time
  • No service disruption during scaling
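A minimal HPA manifest might look like the sketch below. The deployment name `model-server` and the 70% CPU target are placeholders; the target Deployment's pod spec must declare CPU requests, or the utilization ratio cannot be computed (see the pitfalls section).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server      # placeholder: your deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```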
PR2: Async Inference with Queue

Implement asynchronous inference using a message queue.

Options: Kafka, SQS, RabbitMQ, or Redis

Requirements:
  • Submit job endpoint
  • Status check endpoint
  • Worker(s) processing queue
  • Results storage
  • Error handling
Architecture:
API → Queue → Worker → Model → Results DB
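The architecture can be prototyped in-process before introducing a real broker: here a stdlib `queue.Queue` stands in for Kafka/SQS/RabbitMQ, a dict stands in for the results database, and the model is a stub that sums its inputs. The three functions map onto the submit endpoint, the status endpoint, and the worker.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # stand-in for Kafka/SQS/RabbitMQ
results = {}           # stand-in for a results database

def submit(payload):
    """Submit endpoint: enqueue the job and return an id immediately."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "pending"}
    jobs.put((job_id, payload))
    return job_id

def status(job_id):
    """Status endpoint: poll the results store."""
    return results.get(job_id, {"status": "unknown"})

def worker():
    """Worker: drain the queue, run the model, persist results or errors."""
    while True:
        job_id, payload = jobs.get()
        try:
            results[job_id] = {"status": "done", "output": sum(payload)}
        except Exception as exc:
            results[job_id] = {"status": "failed", "error": str(exc)}
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
job_id = submit([1, 2, 3])
jobs.join()  # in production the client polls status() instead of joining
print(status(job_id))
```

Swapping the queue and results store for managed services changes the plumbing but not this control flow.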
PR3: Model Optimization

Optimize inference speed using quantization, pruning, or distillation.

Choose one or more:
  • Quantization: INT8, INT4, FP16
  • Pruning: Remove unimportant weights
  • Distillation: Train smaller model
Requirements:
  • Apply optimization technique
  • Measure performance improvement
  • Validate accuracy retention
  • Document trade-offs
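The core idea behind INT8 quantization fits in a few lines: scale float weights into the signed 8-bit range, round, and keep the scale for dequantization. The pure-Python sketch below shows symmetric per-tensor quantization; real toolchains (e.g. PyTorch or ONNX Runtime quantization) add calibration, per-channel scales, and fused kernels on top of this idea.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: floats -> int8 values + scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 4), max_err)
```

The maximum reconstruction error is bounded by half the scale, which is the trade-off to measure against accuracy on your validation set.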
PR4: Post-Optimization Benchmarking

Benchmark the optimized model against the baseline.

Compare:
  • Latency (p50, p95, p99)
  • Throughput (RPS)
  • Model size
  • Memory usage
  • Accuracy metrics
  • Cost per 1000 requests
Document:
  • Performance improvements
  • Accuracy impact
  • Cost savings
  • Recommendations
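Cost per 1000 requests falls out of instance price and sustained throughput. The numbers below are purely hypothetical (a $0.90/hr instance, 40 RPS baseline vs 100 RPS after optimization); plug in your measured RPS and your provider's pricing.

```python
def cost_per_1000_requests(instance_cost_per_hour, sustained_rps):
    """Cost of serving 1000 requests at sustained throughput on one instance."""
    requests_per_hour = sustained_rps * 3600
    return instance_cost_per_hour / requests_per_hour * 1000

# hypothetical numbers: $0.90/hr instance, baseline vs optimized throughput
baseline = cost_per_1000_requests(0.90, 40)
optimized = cost_per_1000_requests(0.90, 100)
print(f"baseline ${baseline:.4f} vs optimized ${optimized:.4f} per 1k requests")
```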

Deliverables

GitHub PRs

4 merged pull requests with working implementations

Optimization Report

Update Google Doc with model optimization results and performance comparisons

Evaluation Criteria

H11 Criteria

  • Clean, documented code
  • Proper error handling
  • Production-ready configuration
  • Tests included

H12 Criteria

  • All features working
  • Autoscaling tested
  • Queue implementation complete
  • Optimization applied

Tips for Success

Start Simple

Get basic implementation working before adding complexity

Measure Everything

Collect metrics before and after each optimization

Document As You Go

Write documentation while implementing, not after

Test Realistically

Use production-like data and traffic patterns

Common Pitfalls

Avoid these mistakes:
  1. Not setting resource requests - HPA requires resource requests to work
  2. Testing on laptop - Performance characteristics differ from production
  3. Ignoring accuracy - Always validate model accuracy after optimization
  4. Over-optimizing early - Measure first, optimize bottlenecks second
  5. Forgetting error cases - Test failure scenarios and recovery

Recommended Workflow

  1. Baseline: Establish baseline metrics
  2. Implement: Add the feature or optimization
  3. Measure: Benchmark the new configuration
  4. Compare: Analyze differences
  5. Document: Record findings
  6. Iterate: Refine based on results

Getting Help

Office Hours

Attend office hours for guidance and troubleshooting

Discussion Forum

Ask questions and share learnings with classmates

Documentation

Reference official docs for tools and frameworks

Examples

Check module source code for working examples

Submission

When you’re ready to submit:
  1. Ensure all PRs are merged to your repository
  2. Update the Google Doc with:
    • Performance metrics and benchmarks
    • Optimization results and analysis
    • Architecture decisions and trade-offs
    • Recommendations for production
  3. Tag your submission with module-6-complete
  4. Notify instructor via course platform
Ready to submit? Double-check you have:
  • ✅ All required PRs merged
  • ✅ Performance metrics documented
  • ✅ Optimization results analyzed
  • ✅ Clear recommendations provided
