Overview
This module includes two homework assignments focused on deploying ML models through various serving approaches:
- H9: API and UI serving with FastAPI, Streamlit, and Gradio
- H10: Inference servers with Seldon, KServe, Triton, Ray, and vLLM
H9: API Serving
Learning Objectives
REST APIs
Build production-ready APIs with FastAPI
Web UIs
Create interactive interfaces with Streamlit/Gradio
Testing
Write comprehensive integration tests
Kubernetes
Deploy services to K8s with proper manifests
Reading List
Core Concepts
API Design
UI Frameworks
Deployment
Tasks
PR1: Streamlit UI
Objective: Create an interactive web UI for your model
Requirements:
- Single prediction interface with text input
- Batch prediction with CSV upload
Testing:
- Unit tests for both interfaces
- CI integration (pytest in GitHub Actions)
serving/ui_app.py
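A minimal sketch of what serving/ui_app.py could look like. The `predict_texts` helper is a hypothetical stand-in for the real model, and the Streamlit import is deferred into `main()` so the prediction logic stays unit-testable without Streamlit installed:

```python
import csv
import io


def predict_texts(texts):
    """Hypothetical stand-in for the real model: label by text length."""
    return ["long" if len(t) > 20 else "short" for t in texts]


def predict_csv(raw_bytes):
    """Parse an uploaded CSV with a 'text' column and predict each row."""
    rows = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
    texts = [row["text"] for row in rows]
    return list(zip(texts, predict_texts(texts)))


def main():
    import streamlit as st  # deferred so predict_* is testable without Streamlit

    st.title("Model UI")

    # Single prediction interface with text input
    text = st.text_input("Enter text")
    if st.button("Predict") and text:
        st.write(predict_texts([text])[0])

    # Batch prediction with CSV upload
    uploaded = st.file_uploader("Upload CSV with a 'text' column", type="csv")
    if uploaded is not None:
        for row_text, label in predict_csv(uploaded.getvalue()):
            st.write(f"{row_text}: {label}")


if __name__ == "__main__":
    main()
```

Deferring the framework import keeps `predict_texts` and `predict_csv` importable from plain pytest, which is what the CI requirement needs.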
PR2: Gradio UI
Objective: Build an alternative UI with Gradio
Requirements:
- Similar functionality to the Streamlit UI
- Component-based interface
- Tests with gr.Interface.test_launch()
- CI integration
PR3: FastAPI Server
Objective: Implement a production-ready REST API
Requirements:
- Pydantic models for validation
- /health_check endpoint
- /predict endpoint with batch support
Testing:
- Comprehensive tests with TestClient
- CI integration
serving/fast_api.py
tests/test_fast_api.py
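A sketch of how serving/fast_api.py could be structured, using an app factory; the length-based `predict_texts` is a hypothetical model stand-in, and the FastAPI/Pydantic imports are deferred into the factory so the core logic stays unit-testable without those packages:

```python
def predict_texts(texts):
    """Hypothetical stand-in for the real model."""
    return ["long" if len(t) > 20 else "short" for t in texts]


def create_app():
    # Imports deferred into the factory so predict_texts stays
    # unit-testable without FastAPI installed.
    from fastapi import FastAPI
    from pydantic import BaseModel

    class PredictRequest(BaseModel):
        texts: list[str]  # batch support: one or many inputs per request

    class PredictResponse(BaseModel):
        labels: list[str]

    app = FastAPI()

    @app.get("/health_check")
    def health_check():
        return {"status": "ok"}

    @app.post("/predict", response_model=PredictResponse)
    def predict(req: PredictRequest):
        return PredictResponse(labels=predict_texts(req.texts))

    return app
```

With FastAPI installed, tests/test_fast_api.py can wrap `create_app()` in `fastapi.testclient.TestClient` to exercise both endpoints in pytest.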
PR4: API Kubernetes Deployment
Objective: Deploy the FastAPI service to Kubernetes
Requirements:
- Deployment manifest with 2+ replicas
- Service manifest (ClusterIP)
- ConfigMaps for configuration
- Secrets for API keys (W&B)
- Resource limits/requests
k8s/app-fastapi.yaml
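A sketch of what k8s/app-fastapi.yaml might contain; the image name, port, and ConfigMap/Secret names are placeholders to adapt to your project:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fastapi
spec:
  replicas: 2                      # 2+ replicas per the requirements
  selector:
    matchLabels: {app: app-fastapi}
  template:
    metadata:
      labels: {app: app-fastapi}
    spec:
      containers:
        - name: api
          image: registry.example.com/app-fastapi:latest  # placeholder
          ports: [{containerPort: 8080}]
          envFrom:
            - configMapRef: {name: app-fastapi-config}    # app configuration
          env:
            - name: WANDB_API_KEY                         # API key from a Secret
              valueFrom: {secretKeyRef: {name: wandb, key: WANDB_API_KEY}}
          resources:
            requests: {cpu: 250m, memory: 512Mi}
            limits: {cpu: "1", memory: 1Gi}
---
apiVersion: v1
kind: Service
metadata:
  name: app-fastapi
spec:
  type: ClusterIP
  selector: {app: app-fastapi}
  ports: [{port: 8080, targetPort: 8080}]
```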
PR5: UI Kubernetes Deployment
Objective: Deploy the Streamlit/Gradio UI to Kubernetes
Requirements:
- Deployment manifest (single replica for session state)
- Service manifest
- Ingress configuration (optional)
- Health checks
k8s/app-streamlit.yaml
Success Criteria
- 5 PRs merged with passing CI
- All tests pass (pytest, API tests, UI tests)
- Deployments run successfully on K8s
- Google doc describes the serving architecture
H10: Inference Servers
Learning Objectives
Production Serving
Deploy with Seldon, KServe, and Triton
Performance
Optimize throughput with batching and GPUs
LLM Serving
Serve LLMs with vLLM and LoRA adapters
Comparison
Evaluate tradeoffs between solutions
Reading List
Inference Servers
Cloud Platforms
LLM Serving
Edge Deployment
Tasks
PR1: Seldon API Deployment
Objective: Deploy a model with Seldon Core
Requirements:
- Implement Seldon protocol wrapper
- Create SeldonDeployment manifest
- Write integration tests
- Document comparison with vanilla K8s deployment
serving/seldon_api.py
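A sketch of the Seldon Core Python-wrapper style for serving/seldon_api.py: Seldon's Python server loads a plain class exposing `predict(self, X, features_names)`, so the class needs no Seldon imports and can be unit-tested directly. The length-based model is a hypothetical stand-in:

```python
class Model:
    """Seldon Core Python-wrapper style model class.

    seldon-core-microservice loads this class and routes prediction
    requests to predict(); because it is a plain class, the integration
    tests can instantiate and call it without Seldon installed.
    """

    def __init__(self):
        # Real code would load weights here (e.g. download from W&B).
        self.threshold = 20  # hypothetical stand-in "model"

    def predict(self, X, features_names=None):
        # One output row per input row, as Seldon expects.
        return [["long" if len(t) > self.threshold else "short"] for t in X]
```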
PR2: KServe API Integration
Objective: Deploy with a KServe InferenceService
Requirements:
- Implement KServe Model class
- Create InferenceService manifest
- Test V1/V2 inference protocol
- Configure autoscaling
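A sketch of the KServe Model class requirement, shown for the V1 protocol (`{"instances": [...]}` in, `{"predictions": [...]}` out). The `predict_texts` stub is hypothetical, and the KServe import is deferred into a factory so the logic stays testable without KServe installed:

```python
def predict_texts(texts):
    """Hypothetical stand-in for the real model."""
    return ["long" if len(t) > 20 else "short" for t in texts]


def build_model(name: str = "text-model"):
    # Deferred import so predict_texts stays testable without KServe.
    import kserve

    class TextModel(kserve.Model):
        def __init__(self, name: str):
            super().__init__(name)
            self.ready = False

        def load(self):
            # Real code would fetch weights here.
            self.ready = True

        def predict(self, payload, headers=None):
            # V1 inference protocol: instances in, predictions out.
            return {"predictions": predict_texts(payload["instances"])}

    model = TextModel(name)
    model.load()
    return model


if __name__ == "__main__":
    import kserve

    kserve.ModelServer().start([build_model()])
```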
PR3: Triton Inference Server
Objective: Deploy with NVIDIA Triton
Requirements:
- Implement PyTriton wrapper
- Configure dynamic batching
- Create model configuration
- Write client tests
- Measure throughput improvements
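Dynamic batching is the main throughput lever here, and the idea can be sketched framework-free before configuring it in Triton: queue incoming requests and flush a batch either when it is full or when a timeout expires. All names below are illustrative:

```python
import time
from queue import Empty, Queue


def batch_worker(requests: Queue, infer_batch, max_batch=8, max_wait_s=0.01):
    """Collect requests into batches, mirroring Triton's dynamic batcher.

    Each queue item is (input, results_list); the worker appends the
    output to results_list so the caller can pick it up. A batch is
    flushed when full or when max_wait_s has elapsed since its first
    request. A None item is a shutdown sentinel.
    """
    while True:
        item = requests.get()
        if item is None:
            return
        batch = [item]
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                nxt = requests.get(timeout=remaining)
            except Empty:
                break
            if nxt is None:
                requests.put(None)  # re-queue sentinel for the outer loop
                break
            batch.append(nxt)
        # One model call for the whole batch: this is where the
        # throughput win over per-request inference comes from.
        outputs = infer_batch([inp for inp, _ in batch])
        for (_, out_list), out in zip(batch, outputs):
            out_list.append(out)
```

In a real server the worker runs in a background thread; measuring throughput with and without this grouping is the same experiment PR3 asks for with Triton's `dynamic_batching` config block.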
PR4: Ray Deployment
Objective: Deploy with Ray Serve
Requirements:
- Create Ray Serve deployment
- Configure replicas and resources
- Implement model batching
- Test auto-scaling behavior
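A sketch of a Ray Serve deployment covering the replica, resource, and batching requirements; the `predict_texts` stub is hypothetical, and the Ray import is deferred into a factory so the logic stays testable without Ray installed:

```python
def predict_texts(texts):
    """Hypothetical stand-in for the real model."""
    return ["long" if len(t) > 20 else "short" for t in texts]


def build_deployment():
    # Deferred import so predict_texts stays testable without Ray.
    from ray import serve

    @serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
    class TextModel:
        @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
        async def handle_batch(self, texts):
            # Ray collects concurrent calls into one list for us.
            return predict_texts(texts)

        async def __call__(self, request):
            payload = await request.json()
            return {"label": await self.handle_batch(payload["text"])}

    return TextModel.bind()
```

With Ray installed, `serve.run(build_deployment())` starts the service; swapping `num_replicas` for an `autoscaling_config` is how the auto-scaling behavior in the last requirement is exercised.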
PR5: LLM Deployment with vLLM (Optional)
Objective: Serve LLMs with vLLM and LoRA adapters
Requirements:
- Deploy vLLM server with base model
- Implement adapter loading client
- Create K8s manifest with GPU support
- Document adapter management workflow
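For the adapter-loading client: when the vLLM server is started with `--enable-lora` and `--lora-modules`, a client selects an adapter by name through the `"model"` field of the OpenAI-compatible API. A stdlib sketch of building such a request (model/adapter names and URL are placeholders):

```python
import json
import urllib.request


def completion_request(
    prompt,
    adapter=None,
    base_model="base-model",
    url="http://localhost:8000/v1/completions",
):
    """Build a request against vLLM's OpenAI-compatible /v1/completions.

    Passing an adapter name in "model" routes the request to that LoRA
    adapter; omitting it falls back to the base model. Names here are
    placeholders.
    """
    body = {"model": adapter or base_model, "prompt": prompt, "max_tokens": 64}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` (or any HTTP client) against a running server completes the client.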
PR6: Modal Deployment (Optional)
Objective: Deploy an LLM on the Modal serverless platform
Requirements:
- Create Modal app definition
- Configure GPU resources
- Implement API endpoint
- Compare cost vs K8s deployment
Google Doc: Comparison Analysis
Objective: Compare serving solutions and justify your choice
Include:
- Feature comparison table
- Performance benchmarks (latency, throughput)
- Cost analysis (infrastructure, maintenance)
- Operational complexity
- Scaling characteristics
- Final recommendation with justification
Evaluation criteria:
- Setup complexity
- Performance (GPU utilization, latency)
- Scalability (autoscaling, multi-model)
- Monitoring and observability
- Ecosystem and community support
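For the latency rows of the comparison table, a small stdlib harness is enough; the percentile here is the simple nearest-rank variant, and `call` stands in for any request function you benchmark:

```python
import time


def benchmark(call, n=100):
    """Time n calls and report p50/p95/p99 latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()

    def pct(p):  # nearest-rank percentile over the sorted samples
        return latencies[min(len(latencies) - 1, int(p / 100.0 * len(latencies)))]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}
```

Running the same harness against each serving solution (with identical payloads and warm-up requests) keeps the benchmark numbers comparable across the table.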
Success Criteria
- 6 PRs merged (4 required + 2 optional)
- All inference servers deploy successfully
- Tests pass for each implementation
- Google doc includes a comprehensive comparison
- Final serving solution chosen with justification
Testing Checklist
API Testing
tests/test_endpoints.py
Kubernetes Testing
Performance Testing
Common Issues
Model loading fails
Symptoms: Container crashes on startup
Solutions:
- Check W&B credentials: kubectl get secret wandb -o yaml
- Verify model path: kubectl exec <pod> -- ls /tmp/model
- Increase memory limits in the deployment
- Check logs: kubectl logs <pod>
Predictions are slow
Symptoms: High latency (>1s for small inputs)
Solutions:
- Enable batching in inference server
- Add GPU resources to deployment
- Use model quantization (INT8)
- Implement model caching
- Check CPU/memory throttling
Port forwarding fails
Symptoms: Cannot connect to the service
Solutions:
- Verify the service exists: kubectl get svc
- Check the pod is running: kubectl get pods
- Use the correct service port (check the manifest)
- Try a different local port: kubectl port-forward svc/app 8081:8080
Submission Guidelines
Code Quality
- All tests pass locally and in CI
- Code follows project style (ruff format)
- No secrets committed to repository
- Dockerfiles build successfully
Documentation
- README explains how to run each service
- Kubernetes manifests have descriptive comments
- Google doc includes architecture diagrams
- API endpoints documented with examples
Resources
Documentation
- FastAPI Documentation
- Streamlit Documentation
- KServe Documentation
- Triton Documentation
- vLLM Documentation
Examples
Next Steps
Module 6: Monitoring
Learn to monitor models in production with metrics and alerts