Overview

This module covers deploying machine learning models through various serving approaches:
  • API Serving: Build FastAPI endpoints with proper validation and error handling
  • Web UIs: Create interactive Streamlit interfaces for single and batch predictions
  • Inference Servers: Deploy production-grade servers like Triton and KServe
  • LLM Serving: Serve large language models with vLLM and dynamic LoRA adapter loading

Key Concepts

Model Predictor

All serving implementations use a shared Predictor class that handles model loading and inference:
serving/predictor.py
import torch
from filelock import FileLock
from pathlib import Path
from torch.nn.functional import softmax
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List

# MODEL_ID, MODEL_PATH, MODEL_LOCK, and load_from_registry are defined
# elsewhere in the serving module.

class Predictor:
    def __init__(self, model_load_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_load_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_load_path)
        self.model.eval()  # inference mode: disables dropout, freezes batch norm

    @torch.no_grad()
    def predict(self, text: List[str]):
        text_encoded = self.tokenizer.batch_encode_plus(
            list(text), return_tensors="pt", padding=True
        )
        bert_outputs = self.model(**text_encoded).logits
        # One softmax probability row per input string.
        return softmax(bert_outputs, dim=-1).numpy()

    @classmethod
    def default_from_model_registry(cls) -> "Predictor":
        # The file lock keeps concurrent workers from downloading the model twice.
        with FileLock(MODEL_LOCK):
            if not (Path(MODEL_PATH) / "model.safetensors").exists():
                load_from_registry(model_name=MODEL_ID, model_path=MODEL_PATH)
        return cls(model_load_path=MODEL_PATH)
Key features:
  • Downloads models from W&B registry on first use
  • Thread-safe loading with file locks
  • Batch prediction support
  • Returns probability distributions
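Since `predict` returns one row of softmax probabilities per input, downstream callers typically take the argmax to get a label. A stdlib-only sketch of that post-processing (the label names and logit values are hypothetical, for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["negative", "positive"]  # assumed label order

def decode(logits):
    """Map one row of logits to (label, confidence)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, confidence = decode([-1.2, 2.3])
```

The same decoding applies row-by-row to the array the real `Predictor.predict` returns.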

Serving Options Comparison

| Approach  | Use Case                  | Complexity | Performance |
|-----------|---------------------------|------------|-------------|
| FastAPI   | REST APIs, microservices  | Low        | Good        |
| Streamlit | Internal tools, demos     | Very Low   | Moderate    |
| Triton    | High-throughput inference | High       | Excellent   |
| KServe    | Cloud-native deployment   | High       | Excellent   |
| vLLM      | LLM serving with adapters | Medium     | Excellent   |

Learning Path

1. FastAPI Basics

Start with FastAPI to understand REST API fundamentals
  • Request/response models with Pydantic
  • Health checks and monitoring endpoints
  • Testing with TestClient
2. Interactive UIs

Build Streamlit apps for non-technical users
  • Single prediction interfaces
  • Batch processing with file uploads
  • Result visualization
3. Production Inference Servers

Deploy with Triton or KServe for production workloads
  • Model configuration and batching
  • Kubernetes deployment
  • Monitoring and logging
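For a standalone Triton deployment, model configuration and batching live in a `config.pbtxt` next to the model. A sketch for the text classifier above (names, dims, and the queue delay are illustrative; the module's `make run_pytriton` target instead declares the equivalent configuration from Python via PyTriton):

```
name: "sentiment_classifier"
backend: "python"
max_batch_size: 32

input [
  { name: "text", data_type: TYPE_STRING, dims: [ 1 ] }
]
output [
  { name: "probs", data_type: TYPE_FP32, dims: [ 2 ] }
]

# Coalesce concurrent requests into server-side batches.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Dynamic batching is where Triton earns its "Excellent" performance rating: it merges requests arriving within the queue delay into one forward pass.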
4. LLM Serving

Serve LLMs with vLLM and dynamic LoRA adapters
  • Runtime adapter loading
  • OpenAI-compatible API
  • Multi-model serving
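Because vLLM exposes an OpenAI-compatible API, a loaded LoRA adapter is selected simply by passing its name in the `model` field of a standard completions request. A sketch of building such a payload (the adapter name and parameter values are hypothetical):

```python
import json

def completion_payload(prompt: str, adapter: str = "my-lora-adapter"):
    """Build an OpenAI-style /v1/completions body; the `model` field
    selects which LoRA adapter the vLLM server applies."""
    return {
        "model": adapter,
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.0,
    }

body = json.dumps(completion_payload("Classify: great movie"))
```

Any OpenAI client library can talk to the server unchanged, which is what makes multi-adapter serving transparent to callers.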

Deployment Strategy

Local Development

# FastAPI
make run_fast_api

# Streamlit
make run_app_streamlit

# Triton
make run_pytriton

Kubernetes Deployment

All serving options include production-ready Kubernetes manifests:
# Create cluster
kind create cluster --name ml-in-production

# Deploy service
kubectl create -f k8s/app-fastapi.yaml
kubectl port-forward svc/app-fastapi 8080:8080

# Test endpoint
curl -X POST -H "Content-Type: application/json" \
  -d @data-samples/samples.json http://0.0.0.0:8080/predict
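The same smoke test can be scripted from Python with only the standard library. A sketch mirroring the curl call (the `{"text": [...]}` payload shape is an assumption based on the Predictor's batch interface):

```python
import json
import urllib.request

def build_predict_request(texts, url="http://0.0.0.0:8080/predict"):
    """Build a POST request against the port-forwarded /predict endpoint."""
    data = json.dumps({"text": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request(["great product"])
# urllib.request.urlopen(req) sends it once the service is port-forwarded.
```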

Practice Tasks

H9: API Serving

Build a Streamlit interface with:
  • Single prediction input
  • Batch CSV upload
  • Tests and CI integration
Or build the same interface with an alternative UI framework:
  • Similar functionality to Streamlit
  • Component-based interface
  • Test coverage
Production API with:
  • Pydantic models
  • Health check endpoint
  • Comprehensive tests
Deploy to K8s with:
  • Deployment and Service manifests
  • ConfigMaps and Secrets
  • Rolling update strategy

H10: Inference Servers

Implement serving with:
  • Seldon Core API wrapper
  • KServe InferenceService
  • Triton model configuration
  • Ray Serve deployment
Advanced LLM deployment:
  • vLLM or TGI integration
  • LoRAX multi-adapter serving
  • ModalLab serverless deployment

Success Criteria

  • All serving implementations pass tests
  • APIs handle errors gracefully
  • Kubernetes deployments are stable
  • Documentation includes architecture decisions

Next Steps

FastAPI Serving

Build REST APIs with FastAPI

Streamlit UI

Create interactive web interfaces

Triton Server

Deploy with NVIDIA Triton

KServe

Cloud-native inference with KServe
