
Overview

This module includes two homework assignments focused on deploying ML models through various serving approaches:
  • H9: API and UI serving with FastAPI, Streamlit, and Gradio
  • H10: Inference servers with Seldon, KServe, Triton, Ray, and vLLM

H9: API Serving

Learning Objectives

REST APIs

Build production-ready APIs with FastAPI

Web UIs

Create interactive interfaces with Streamlit/Gradio

Testing

Write comprehensive integration tests

Kubernetes

Deploy services to K8s with proper manifests

Reading List

Tasks


PR1: Streamlit UI

Objective: Create an interactive web UI for your model
Requirements:
  • Single prediction interface with text input
  • Batch prediction with CSV upload
  • Unit tests for both interfaces
  • CI integration (pytest in GitHub Actions)
Reference implementation:
serving/ui_app.py
import streamlit as st

from serving.predictor import Predictor

@st.cache_resource  # cache the model object itself; st.cache_data would try to serialize it
def get_model():
    return Predictor.default_from_model_registry()

def single_pred():
    predictor = get_model()
    input_sent = st.text_input("Type an English sentence")
    if st.button("Run inference"):
        pred = predictor.predict([input_sent])
        st.write("Pred:", pred)
Testing:
from streamlit.testing.v1 import AppTest

def test_single_prediction():
    at = AppTest.from_file("serving/ui_app.py")
    at.run()
    at.text_input[0].set_value("test").run()
    at.button[0].click().run()
    assert "Pred:" in at.text[0].value
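The batch-CSV requirement can be prototyped independently of the UI. Below is a hedged sketch of a possible underlying helper (`batch_predict` and its arguments are hypothetical, not part of the reference implementation; in the app, `csv_bytes` would come from `st.file_uploader(...).getvalue()` and `predict_fn` from the cached Predictor):

```python
import csv
import io

def batch_predict(csv_bytes: bytes, predict_fn, text_column: str = "text"):
    """Run batched predictions over one column of an uploaded CSV.

    csv_bytes   - raw file contents, e.g. from st.file_uploader(...).getvalue()
    predict_fn  - batched inference callable, standing in for Predictor.predict
    text_column - name of the CSV column holding the input sentences
    """
    rows = list(csv.DictReader(io.StringIO(csv_bytes.decode("utf-8"))))
    preds = predict_fn([row[text_column] for row in rows])
    for row, pred in zip(rows, preds):
        row["prediction"] = pred
    return rows
```

Keeping the logic out of the Streamlit callbacks like this also makes the unit-test requirement easier to satisfy, since the helper can be tested without `AppTest`.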

PR2: Gradio UI

Objective: Build an alternative UI with Gradio
Requirements:
  • Similar functionality to Streamlit
  • Component-based interface
  • Tests with gr.Interface.test_launch()
  • CI integration
Example:
import gradio as gr
from serving.predictor import Predictor

predictor = Predictor.default_from_model_registry()

def predict(text):
    return predictor.predict([text])[0].tolist()

interface = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Input text"),
    outputs=gr.Label(label="Predictions")
)

if __name__ == "__main__":
    interface.launch()

PR3: FastAPI Server

Objective: Implement a production-ready REST API
Requirements:
  • Pydantic models for validation
  • /health_check endpoint
  • /predict endpoint with batch support
  • Comprehensive tests with TestClient
  • CI integration
Reference:
serving/fast_api.py
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from serving.predictor import Predictor

class Payload(BaseModel):
    text: List[str]

app = FastAPI()
predictor = Predictor.default_from_model_registry()

@app.get("/health_check")
def health_check() -> str:
    return "ok"

@app.post("/predict")
def predict(payload: Payload):
    prediction = predictor.predict(text=payload.text)
    return {"probs": prediction.tolist()}
Testing:
tests/test_fast_api.py
from fastapi.testclient import TestClient

from serving.fast_api import app

client = TestClient(app)

def test_predict():
    response = client.post("/predict", json={"text": ["test"]})
    assert response.status_code == 200
    assert len(response.json()["probs"][0]) == 2

PR4: API Kubernetes Deployment

Objective: Deploy the FastAPI service to Kubernetes
Requirements:
  • Deployment manifest with 2+ replicas
  • Service manifest (ClusterIP)
  • ConfigMaps for configuration
  • Secrets for API keys (W&B)
  • Resource limits/requests
Example:
k8s/app-fastapi.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fastapi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-fastapi
  template:
    metadata:
      labels:
        app: app-fastapi
    spec:
      containers:
      - name: app-fastapi
        image: your-registry/app-fastapi:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
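The requirements also call for a ClusterIP Service in front of the deployment. A minimal sketch (the filename is hypothetical, and port 8080 is an assumption matching the `kubectl port-forward` example later on this page):

```yaml
# k8s/service-fastapi.yaml (hypothetical filename)
apiVersion: v1
kind: Service
metadata:
  name: app-fastapi
spec:
  type: ClusterIP
  selector:
    app: app-fastapi   # must match the pod labels set by the deployment
  ports:
  - port: 8080         # port exposed inside the cluster
    targetPort: 8080   # container port the API listens on
```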

PR5: UI Kubernetes Deployment

Objective: Deploy the Streamlit/Gradio UI to Kubernetes
Requirements:
  • Deployment manifest (single replica for session state)
  • Service manifest
  • Ingress configuration (optional)
  • Health checks
Example:
k8s/app-streamlit.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-streamlit
spec:
  replicas: 1  # Single replica for session state
  selector:
    matchLabels:
      app: app-streamlit
  template:
    metadata:
      labels:
        app: app-streamlit
    spec:
      containers:
      - name: app-streamlit
        image: your-registry/app-streamlit:latest
        ports:
        - containerPort: 8501  # Streamlit's default port; adjust if you override --server.port
        livenessProbe:
          httpGet:
            path: /_stcore/health
            port: 8501

Google Doc Update

Objective: Document the model serving plan
Include:
  • API design decisions (endpoints, formats)
  • UI/UX considerations
  • Deployment architecture
  • Scaling strategy
  • Monitoring plan
  • Tradeoffs between serving options

Success Criteria

  • 5 PRs merged with passing CI
  • All tests pass (pytest, API tests, UI tests)
  • Deployments run successfully on K8s
  • Google doc includes serving architecture

H10: Inference Servers

Learning Objectives

Production Serving

Deploy with Seldon, KServe, and Triton

Performance

Optimize throughput with batching and GPUs

LLM Serving

Serve LLMs with vLLM and LoRA adapters

Comparison

Evaluate tradeoffs between solutions

Reading List

Tasks


PR1: Seldon API Deployment

Objective: Deploy the model with Seldon Core
Requirements:
  • Implement Seldon protocol wrapper
  • Create SeldonDeployment manifest
  • Write integration tests
  • Document comparison with vanilla K8s deployment
Example:
serving/seldon_api.py
from serving.predictor import Predictor

class SeldonModel:
    def __init__(self):
        self.predictor = Predictor.default_from_model_registry()

    def predict(self, X, features_names=None):
        # X arrives as a numpy array or list under the Seldon protocol
        predictions = self.predictor.predict(X)
        return predictions
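For the manifest requirement, a minimal SeldonDeployment sketch (the filename and image are placeholders; Seldon Core wraps the container into its service graph):

```yaml
# k8s/seldon-model.yaml (hypothetical filename)
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: seldon-model
spec:
  predictors:
  - name: default
    replicas: 1
    graph:
      name: classifier   # must match the container name below
      type: MODEL
    componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: your-registry/seldon-model:latest
```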

PR2: KServe API Integration

Objective: Deploy with a KServe InferenceService
Requirements:
  • Implement KServe Model class
  • Create InferenceService manifest
  • Test V1/V2 inference protocol
  • Configure autoscaling
Reference: See KServe documentation
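As a starting point for the manifest, a minimal custom-container InferenceService sketch (filename and image are placeholders; see the KServe docs for the Model class, V1/V2 protocols, and autoscaling annotations):

```yaml
# k8s/inference-service.yaml (hypothetical filename)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
    - name: kserve-container
      image: your-registry/kserve-model:latest
```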

PR3: Triton Inference Server

Objective: Deploy with NVIDIA Triton
Requirements:
  • Implement PyTriton wrapper
  • Configure dynamic batching
  • Create model configuration
  • Write client tests
  • Measure throughput improvements
Reference: See Triton documentation
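Before wiring up Triton itself, it helps to see what dynamic batching buys you. The sketch below is not Triton code; it is a stdlib illustration of the same idea: hold each incoming request briefly so concurrent requests can share a single batched model call.

```python
import queue
import threading
import time

class MicroBatcher:
    """Group concurrent requests into one batched inference call.

    Mirrors the idea behind Triton's dynamic batching: flush when the
    batch is full or when a short queueing deadline passes.
    """

    def __init__(self, predict_fn, max_batch=8, max_wait_s=0.01):
        self.predict_fn = predict_fn   # batched inference function
        self.max_batch = max_batch     # flush when the batch is full...
        self.max_wait_s = max_wait_s   # ...or when this deadline passes
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, item):
        """Called by each request thread; blocks until its result is ready."""
        done = threading.Event()
        slot = {}
        self.requests.put((item, slot, done))
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until the first request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            # one model call for the whole batch, then fan results back out
            results = self.predict_fn([item for item, _, _ in batch])
            for (_, slot, done), result in zip(batch, results):
                slot["result"] = result
                done.set()
```

Measuring throughput with and without this kind of batching is a good baseline for the "measure throughput improvements" requirement; Triton's `dynamic_batching` config then does the equivalent server-side.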

PR4: Ray Deployment

Objective: Deploy with Ray Serve
Requirements:
  • Create Ray Serve deployment
  • Configure replicas and resources
  • Implement model batching
  • Test auto-scaling behavior
Example:
from ray import serve

from serving.predictor import Predictor

@serve.deployment(num_replicas=2)
class ModelDeployment:
    def __init__(self):
        self.predictor = Predictor.default_from_model_registry()

    async def __call__(self, request):
        text = await request.json()
        predictions = self.predictor.predict(text["instances"])
        return {"predictions": predictions.tolist()}

serve.run(ModelDeployment.bind())  # deploy and start serving

PR5: LLM Deployment with vLLM (Optional)

Objective: Serve LLMs with vLLM and LoRA adapters
Requirements:
  • Deploy vLLM server with base model
  • Implement adapter loading client
  • Create K8s manifest with GPU support
  • Document adapter management workflow
Reference: See vLLM documentation
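A hedged sketch of the workflow (the model name and adapter path are placeholders; the `--enable-lora` and `--lora-modules` flags come from vLLM's OpenAI-compatible server, and the adapter is then selected per request via the `model` field):

```bash
# Launch the OpenAI-compatible server with a LoRA adapter registered
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules my-adapter=/path/to/adapter

# Route a request through the adapter by naming it as the model
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-adapter", "prompt": "Hello", "max_tokens": 32}'
```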

PR6: Modal Deployment (Optional)

Objective: Deploy an LLM on Modal's serverless platform
Requirements:
  • Create Modal app definition
  • Configure GPU resources
  • Implement API endpoint
  • Compare cost vs K8s deployment
Example:
import modal

stub = modal.Stub("llm-inference")  # newer Modal releases rename this to modal.App

@stub.function(
    gpu="A10G",
    image=modal.Image.debian_slim().pip_install("vllm"),
)
def generate(prompt: str) -> str:
    from vllm import LLM

    # Loading the model on every call is slow; for production, keep it warm
    # between requests (e.g. via a Modal class with a container-startup hook).
    llm = LLM("microsoft/Phi-3-mini-4k-instruct")
    outputs = llm.generate([prompt])
    return outputs[0].outputs[0].text

Google Doc: Comparison Analysis

Objective: Compare serving solutions and justify a choice
Include:
  • Feature comparison table
  • Performance benchmarks (latency, throughput)
  • Cost analysis (infrastructure, maintenance)
  • Operational complexity
  • Scaling characteristics
  • Final recommendation with justification
Comparison dimensions:
  • Setup complexity
  • Performance (GPU utilization, latency)
  • Scalability (autoscaling, multi-model)
  • Monitoring and observability
  • Ecosystem and community support

Success Criteria

  • 6 PRs merged (4 required + 2 optional)
  • All inference servers deploy successfully
  • Tests pass for each implementation
  • Google doc includes comprehensive comparison
  • Final serving solution chosen with justification

Testing Checklist

API Testing

tests/test_endpoints.py
from fastapi.testclient import TestClient

from serving.fast_api import app

client = TestClient(app)

def test_health_check():
    """Verify service is running"""
    response = client.get("/health_check")
    assert response.status_code == 200

def test_predict_single():
    """Test single prediction"""
    response = client.post("/predict", json={"text": ["test"]})
    assert response.status_code == 200
    assert "probs" in response.json()

def test_predict_batch():
    """Test batch prediction"""
    response = client.post("/predict", json={"text": ["test1", "test2"]})
    assert len(response.json()["probs"]) == 2

def test_invalid_input():
    """Test error handling"""
    response = client.post("/predict", json={"invalid": "data"})
    assert response.status_code == 422

Kubernetes Testing

# Deployment health
kubectl get deployments
kubectl describe deployment app-fastapi

# Pod status
kubectl get pods -l app=app-fastapi
kubectl logs -l app=app-fastapi

# Service connectivity
kubectl get services
kubectl port-forward svc/app-fastapi 8080:8080
curl http://localhost:8080/health_check

# Resource usage
kubectl top pods -l app=app-fastapi

Performance Testing

import statistics
import time

import requests

def benchmark_latency(endpoint: str, n_requests: int = 100):
    latencies = []
    for _ in range(n_requests):
        start = time.time()
        response = requests.post(endpoint, json={"text": ["test"]})
        latencies.append(time.time() - start)
    
    print(f"Mean latency: {statistics.mean(latencies):.3f}s")
    print(f"P95 latency: {statistics.quantiles(latencies, n=20)[18]:.3f}s")
    print(f"P99 latency: {statistics.quantiles(latencies, n=100)[98]:.3f}s")

Common Issues

Symptoms: Container crashes on startup
Solutions:
  • Check W&B credentials: kubectl get secret wandb -o yaml
  • Verify model path: kubectl exec <pod> -- ls /tmp/model
  • Increase memory limits in deployment
  • Check logs: kubectl logs <pod>
Symptoms: High latency (>1s for small inputs)
Solutions:
  • Enable batching in inference server
  • Add GPU resources to deployment
  • Use model quantization (INT8)
  • Implement model caching
  • Check CPU/memory throttling
Symptoms: Cannot connect to service
Solutions:
  • Verify service exists: kubectl get svc
  • Check pod is running: kubectl get pods
  • Use correct service port: Check manifest
  • Try different local port: kubectl port-forward svc/app 8081:8080

Submission Guidelines


Code Quality

  • All tests pass locally and in CI
  • Code follows project style (ruff format)
  • No secrets committed to repository
  • Dockerfiles build successfully

Documentation

  • README explains how to run each service
  • Kubernetes manifests have descriptive comments
  • Google doc includes architecture diagrams
  • API endpoints documented with examples

Pull Requests

  • Title format: [module-5] <description>
  • PR description explains changes
  • Screenshots of running services
  • Links to deployed endpoints (if applicable)

Resources

Documentation

Examples

Next Steps

Module 6: Monitoring

Learn to monitor models in production with metrics and alerts
