Overview
This module covers deploying machine learning models through various serving approaches:
- API Serving: Build FastAPI endpoints with proper validation and error handling
- Web UIs: Create interactive Streamlit interfaces for single and batch predictions
- Inference Servers: Deploy production-grade servers like Triton and KServe
- LLM Serving: Serve large language models with vLLM and dynamic LoRA adapter loading
Key Concepts
Model Predictor
All serving implementations use a shared Predictor class that handles model loading and inference:
serving/predictor.py
- Downloads models from W&B registry on first use
- Thread-safe loading with file locks
- Batch prediction support
- Returns probability distributions
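The Predictor pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual serving/predictor.py: the dummy model and the in-process threading.Lock stand in for the real W&B registry download and file lock, and the label names are assumptions.

```python
import threading
from typing import Dict, List


class Predictor:
    """Sketch of a shared predictor: lazy, thread-safe model loading
    plus batch inference returning probability distributions."""

    _lock = threading.Lock()

    def __init__(self, labels=("negative", "positive")):
        self.labels = list(labels)
        self._model = None

    def _load_model(self):
        # Double-checked locking so only one thread loads the model.
        # The real implementation uses a file lock (cross-process safe)
        # and downloads the model from the W&B registry on first use.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = lambda text: [1.0] * len(self.labels)
        return self._model

    def predict(self, texts: List[str]) -> List[Dict[str, float]]:
        """Batch prediction: one probability distribution per input."""
        model = self._load_model()
        results = []
        for text in texts:
            scores = model(text)
            total = sum(scores)
            results.append(
                {label: s / total for label, s in zip(self.labels, scores)}
            )
        return results
```

Because every serving frontend (FastAPI, Streamlit, etc.) calls the same class, model-loading behavior stays consistent across deployment options.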
Serving Options Comparison
| Approach | Use Case | Complexity | Performance |
|---|---|---|---|
| FastAPI | REST APIs, microservices | Low | Good |
| Streamlit | Internal tools, demos | Very Low | Moderate |
| Triton | High-throughput inference | High | Excellent |
| KServe | Cloud-native deployment | High | Excellent |
| vLLM | LLM serving with adapters | Medium | Excellent |
Learning Path
FastAPI Basics
Start with FastAPI to understand REST API fundamentals
- Request/response models with Pydantic
- Health checks and monitoring endpoints
- Testing with TestClient
Interactive UIs
Build Streamlit apps for non-technical users
- Single prediction interfaces
- Batch processing with file uploads
- Result visualization
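The batch-upload flow reduces to a small, UI-agnostic helper that a Streamlit app can call with the bytes from `st.file_uploader`. This sketch uses the stdlib csv module and an injected predict callable; the function and column names are assumptions.

```python
import csv
import io
from typing import Callable, Dict, List


def predict_csv(
    uploaded_bytes: bytes,
    predict: Callable[[List[str]], List[Dict[str, float]]],
    text_column: str = "text",
) -> List[dict]:
    """Parse an uploaded CSV, batch-predict one column, and return
    rows augmented with the top predicted label."""
    reader = csv.DictReader(io.StringIO(uploaded_bytes.decode("utf-8")))
    rows = list(reader)
    if rows and text_column not in rows[0]:
        raise ValueError(f"CSV is missing a '{text_column}' column")
    distributions = predict([row[text_column] for row in rows])
    for row, dist in zip(rows, distributions):
        row["prediction"] = max(dist, key=dist.get)  # highest-probability label
    return rows
```

Keeping the logic out of the Streamlit script makes it directly unit-testable, which matters for the tests-and-CI requirement below.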
Production Inference Servers
Deploy with Triton or KServe for production workloads
- Model configuration and batching
- Kubernetes deployment
- Monitoring and logging
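For Triton, model configuration and batching live in a config.pbtxt next to the model files. A sketch of such a file, with illustrative model name, backend, and tensor shapes:

```protobuf
name: "classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

The dynamic_batching block is what lets Triton group concurrent requests into a single forward pass, trading a small queueing delay for much higher throughput.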
Deployment Strategy
Local Development
Kubernetes Deployment
All serving options include production-ready Kubernetes manifests.
Practice Tasks
H9: API Serving
PR1: Streamlit UI
Build a Streamlit interface with:
- Single prediction input
- Batch CSV upload
- Tests and CI integration
PR2: Gradio UI
Alternative UI framework with:
- Similar functionality to Streamlit
- Component-based interface
- Test coverage
PR3: FastAPI Server
Production API with:
- Pydantic models
- Health check endpoint
- Comprehensive tests
PR4-5: Kubernetes Deployments
Deploy to K8s with:
- Deployment and Service manifests
- ConfigMaps and Secrets
- Rolling update strategy
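A Deployment manifest covering these points might look like the following sketch; the names, image, and port are placeholder assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep serving capacity during rollouts
      maxSurge: 1
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: api
          image: registry.example.com/model-api:latest  # illustrative image
          ports:
            - containerPort: 8000
          envFrom:
            - configMapRef:
                name: model-api-config   # non-secret settings
            - secretRef:
                name: model-api-secrets  # credentials, API keys
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
```

Wiring the readiness probe to the API's health endpoint is what makes the rolling update safe: new pods receive traffic only after they report healthy.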
H10: Inference Servers
PR1-4: Inference Server Integration
Implement serving with:
- Seldon Core API wrapper
- KServe InferenceService
- Triton model configuration
- Ray Serve deployment
PR5-6: LLM Serving (Optional)
Advanced LLM deployment:
- vLLM or TGI integration
- LoRAX multi-adapter serving
- ModalLab serverless deployment
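As a rough sketch of the vLLM path, a LoRA-enabled server can be launched like this; the base model and adapter names/paths are placeholders, and runtime adapter loading must be explicitly enabled.

```shell
# Serve a base model with LoRA support enabled; the adapter name and
# path here are illustrative, not from this module.
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-adapter=/path/to/adapter

# Dynamically load another adapter into the running server.
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "other-adapter", "lora_path": "/path/to/other"}'
```

Requests then select an adapter by passing its name as the `model` field of the OpenAI-compatible completion request.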
Success Criteria
- All serving implementations pass tests
- APIs handle errors gracefully
- Kubernetes deployments are stable
- Documentation includes architecture decisions
Next Steps
FastAPI Serving
Build REST APIs with FastAPI
Streamlit UI
Create interactive web interfaces
Triton Server
Deploy with NVIDIA Triton
KServe
Cloud-native inference with KServe