Overview

This module covers deploying machine learning models through various serving approaches:
  • API Serving: Build FastAPI endpoints with proper validation and error handling
  • Web UIs: Create interactive Streamlit interfaces for single and batch predictions
  • Inference Servers: Deploy production-grade servers like Triton and KServe
  • LLM Serving: Serve large language models with vLLM and dynamic LoRA adapter loading

Key Concepts

Model Predictor

All serving implementations use a shared Predictor class that handles model loading and inference:
serving/predictor.py
import torch
from filelock import FileLock
from pathlib import Path
from torch.nn.functional import softmax
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List

# MODEL_ID, MODEL_PATH, MODEL_LOCK, and load_from_registry are defined
# elsewhere in the serving module.

class Predictor:
    def __init__(self, model_load_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_load_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_load_path)
        self.model.eval()  # inference mode: disables dropout, freezes batch norm

    @torch.no_grad()
    def predict(self, text: List[str]):
        text_encoded = self.tokenizer.batch_encode_plus(
            list(text), return_tensors="pt", padding=True
        )
        bert_outputs = self.model(**text_encoded).logits
        # One softmax probability row per input string.
        return softmax(bert_outputs, dim=-1).numpy()

    @classmethod
    def default_from_model_registry(cls) -> "Predictor":
        # The file lock keeps concurrent workers from downloading the model twice.
        with FileLock(MODEL_LOCK):
            if not (Path(MODEL_PATH) / "model.safetensors").exists():
                load_from_registry(model_name=MODEL_ID, model_path=MODEL_PATH)
        return cls(model_load_path=MODEL_PATH)
Key features:
  • Downloads models from W&B registry on first use
  • Thread-safe loading with file locks
  • Batch prediction support
  • Returns probability distributions
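Since `predict` returns one row of softmax probabilities per input, downstream callers typically take the argmax to get a label. A stdlib-only sketch of that post-processing (the label names and logit values are hypothetical, for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["negative", "positive"]  # assumed label order

def decode(logits):
    """Map one row of logits to (label, confidence)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, confidence = decode([-1.2, 2.3])
```

The same decoding applies row-by-row to the array the real `Predictor.predict` returns.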

Serving Options Comparison

| Approach  | Use Case                  | Complexity | Performance |
|-----------|---------------------------|------------|-------------|
| FastAPI   | REST APIs, microservices  | Low        | Good        |
| Streamlit | Internal tools, demos     | Very Low   | Moderate    |
| Triton    | High-throughput inference | High       | Excellent   |
| KServe    | Cloud-native deployment   | High       | Excellent   |
| vLLM      | LLM serving with adapters | Medium     | Excellent   |

Learning Path

1. FastAPI Basics

Start with FastAPI to understand REST API fundamentals
  • Request/response models with Pydantic
  • Health checks and monitoring endpoints
  • Testing with TestClient
2. Interactive UIs

Build Streamlit apps for non-technical users
  • Single prediction interfaces
  • Batch processing with file uploads
  • Result visualization
3. Production Inference Servers

Deploy with Triton or KServe for production workloads
  • Model configuration and batching
  • Kubernetes deployment
  • Monitoring and logging
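For a standalone Triton deployment, model configuration and batching live in a `config.pbtxt` next to the model. A sketch for the text classifier above (names, dims, and the queue delay are illustrative; the module's `make run_pytriton` target instead declares the equivalent configuration from Python via PyTriton):

```
name: "sentiment_classifier"
backend: "python"
max_batch_size: 32

input [
  { name: "text", data_type: TYPE_STRING, dims: [ 1 ] }
]
output [
  { name: "probs", data_type: TYPE_FP32, dims: [ 2 ] }
]

# Coalesce concurrent requests into server-side batches.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Dynamic batching is where Triton earns its "Excellent" performance rating: it merges requests arriving within the queue delay into one forward pass.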
4. LLM Serving

Serve LLMs with vLLM and dynamic LoRA adapters
  • Runtime adapter loading
  • OpenAI-compatible API
  • Multi-model serving
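Because vLLM exposes an OpenAI-compatible API, a loaded LoRA adapter is selected simply by passing its name in the `model` field of a standard completions request. A sketch of building such a payload (the adapter name and parameter values are hypothetical):

```python
import json

def completion_payload(prompt: str, adapter: str = "my-lora-adapter"):
    """Build an OpenAI-style /v1/completions body; the `model` field
    selects which LoRA adapter the vLLM server applies."""
    return {
        "model": adapter,
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.0,
    }

body = json.dumps(completion_payload("Classify: great movie"))
```

Any OpenAI client library can talk to the server unchanged, which is what makes multi-adapter serving transparent to callers.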

Deployment Strategy

Local Development

# FastAPI
make run_fast_api

# Streamlit
make run_app_streamlit

# Triton
make run_pytriton

Kubernetes Deployment

All serving options include production-ready Kubernetes manifests:
# Create cluster
kind create cluster --name ml-in-production

# Deploy service
kubectl create -f k8s/app-fastapi.yaml
kubectl port-forward svc/app-fastapi 8080:8080

# Test endpoint
curl -X POST -H "Content-Type: application/json" \
  -d @data-samples/samples.json http://0.0.0.0:8080/predict
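The same smoke test can be scripted from Python with only the standard library. A sketch mirroring the curl call (the `{"text": [...]}` payload shape is an assumption based on the Predictor's batch interface):

```python
import json
import urllib.request

def build_predict_request(texts, url="http://0.0.0.0:8080/predict"):
    """Build a POST request against the port-forwarded /predict endpoint."""
    data = json.dumps({"text": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request(["great product"])
# urllib.request.urlopen(req) sends it once the service is port-forwarded.
```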

Practice Tasks

H9: API Serving

Build a Streamlit interface with:
  • Single prediction input
  • Batch CSV upload
  • Tests and CI integration
Or build the same interface with an alternative UI framework:
  • Similar functionality to Streamlit
  • Component-based interface
  • Test coverage
Production API with:
  • Pydantic models
  • Health check endpoint
  • Comprehensive tests
Deploy to K8s with:
  • Deployment and Service manifests
  • ConfigMaps and Secrets
  • Rolling update strategy

H10: Inference Servers

Implement serving with:
  • Seldon Core API wrapper
  • KServe InferenceService
  • Triton model configuration
  • Ray Serve deployment
Advanced LLM deployment:
  • vLLM or TGI integration
  • LoRAX multi-adapter serving
  • ModalLab serverless deployment

Success Criteria

  • All serving implementations pass tests
  • APIs handle errors gracefully
  • Kubernetes deployments are stable
  • Documentation includes architecture decisions

Next Steps

FastAPI Serving

Build REST APIs with FastAPI

Streamlit UI

Create interactive web interfaces

Triton Server

Deploy with NVIDIA Triton

KServe

Cloud-native inference with KServe
