
Overview

FastAPI provides a modern, high-performance framework for building ML model APIs with automatic validation, documentation, and type safety.

Implementation

API Structure

The FastAPI server (serving/fast_api.py) implements two endpoints:
serving/fast_api.py
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
from serving.predictor import Predictor

class Payload(BaseModel):
    text: List[str]

class Prediction(BaseModel):
    probs: List[List[float]]

app = FastAPI()
predictor = Predictor.default_from_model_registry()

@app.get("/health_check")
def health_check() -> str:
    return "ok"

@app.post("/predict", response_model=Prediction)
def predict(payload: Payload) -> Prediction:
    prediction = predictor.predict(text=payload.text)
    return Prediction(probs=prediction.tolist())

Request/Response Models

Example request:

{
  "text": ["good", "bad"]
}

Payload schema:
  • text: List of strings to classify
  • Validated by Pydantic at runtime
  • Automatic error messages for invalid input
Prediction schema:
  • probs: List of probability distributions
  • Each inner list sums to 1.0
  • Length matches number of input texts
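To illustrate the runtime validation these schemas provide, here is a standalone sketch that redefines the request model locally; Pydantic rejects a payload whose text field is not a list of strings before the endpoint body ever runs:

```python
from typing import List

from pydantic import BaseModel, ValidationError

# Local redefinition of the request schema, mirroring serving/fast_api.py
class Payload(BaseModel):
    text: List[str]

# Valid input parses cleanly
payload = Payload(text=["good", "bad"])

# A non-list value is rejected at validation time
try:
    Payload(text="good")
    rejected = False
except ValidationError:
    rejected = True

print(payload.text, rejected)
```

In the running server this same check is what produces the automatic 422 responses described below.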

API Endpoints

Health Check

GET /health_check
Purpose: Kubernetes liveness/readiness probes
Response:
"ok"
Usage:
curl http://localhost:8080/health_check

Predict

POST /predict
Purpose: Classify text sequences
Request body:
{
  "text": ["This is great!", "This is terrible."]
}
Response:
{
  "probs": [
    [0.05, 0.95],
    [0.92, 0.08]
  ]
}
Error handling:
  • 422 Unprocessable Entity: Invalid input format
  • 500 Internal Server Error: Model prediction failure

Testing

Tests use FastAPI’s TestClient for integration testing:
tests/test_fast_api.py
import pytest
from fastapi.testclient import TestClient
from serving.fast_api import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health_check")
    assert response.status_code == 200
    assert response.json() == "ok"

def test_predict():
    response = client.post("/predict", json={"text": ["this is test"]})
    assert response.status_code == 200
    probs = response.json()["probs"][0]
    assert len(probs) == 2
    assert sum(probs) == pytest.approx(1.0)
Test coverage:
  • Health check endpoint
  • Prediction endpoint with validation
  • Probability distribution validation
Run tests:
pytest -ss ./tests

Local Development

Using Make

# Build and run
make run_fast_api
This:
  1. Builds Docker image with app-fastapi target
  2. Runs container on port 8081
  3. Mounts W&B API key from environment

Using Docker Directly

# Build
docker build -f Dockerfile -t app-fastapi:latest --target app-fastapi .

# Run
docker run -it -p 8081:8080 \
  -e WANDB_API_KEY=${WANDB_API_KEY} \
  app-fastapi:latest

Manual Testing

# Test with sample data (the docker run above maps host port 8081)
curl -X POST -H "Content-Type: application/json" \
  -d @data-samples/samples.json \
  http://localhost:8081/predict

# Expected output
{
  "probs": [
    [0.23, 0.77],
    [0.89, 0.11]
  ]
}

Kubernetes Deployment

Manifest Structure

k8s/app-fastapi.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-fastapi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-fastapi
  template:
    metadata:
      labels:
        app: app-fastapi
    spec:
      containers:
        - name: app-fastapi
          image: ghcr.io/kyryl-opens-ml/app-fastapi:latest
          env:
          - name: WANDB_API_KEY
            valueFrom:
              secretKeyRef:
                name: wandb
                key: WANDB_API_KEY
---
apiVersion: v1
kind: Service
metadata:
  name: app-fastapi
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: app-fastapi
Key configuration:
  • Replicas: 2 pods for high availability
  • Image: Pulled from GitHub Container Registry
  • Secrets: W&B API key from Kubernetes secret
  • Service: ClusterIP exposes port 8080
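Since /health_check exists for probes, the Deployment could also declare them explicitly. A hypothetical fragment for the container spec (the timings are assumptions, not from the source manifest):

```yaml
# Hypothetical probe configuration for the app-fastapi container
livenessProbe:
  httpGet:
    path: /health_check
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health_check
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```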

Deployment Steps

1. Create cluster

kind create cluster --name ml-in-production

2. Create secrets

export WANDB_API_KEY='your-key-here'
kubectl create secret generic wandb \
  --from-literal=WANDB_API_KEY=$WANDB_API_KEY

3. Deploy application

kubectl create -f k8s/app-fastapi.yaml

4. Verify deployment

kubectl get pods -l app=app-fastapi
kubectl logs -l app=app-fastapi

5. Port forward

kubectl port-forward --address 0.0.0.0 svc/app-fastapi 8080:8080

Testing in Kubernetes

# Health check
curl http://localhost:8080/health_check

# Prediction
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": ["test input"]}' \
  http://localhost:8080/predict

Production Considerations

Performance

The model loads on startup. For faster cold starts, consider:
  • Model caching in persistent volumes
  • Init containers for model download
  • Warm-up requests after deployment
Optimization strategies:
  • Use uvicorn workers for concurrency
  • Enable model batching for throughput
  • Add Redis for response caching
  • Implement request queuing
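Before reaching for Redis, a process-local cache already demonstrates the response-caching idea. A sketch using only the standard library (the predictor call is stubbed, so the names here are illustrative):

```python
from functools import lru_cache
from typing import List, Tuple

call_count = 0

def predict_stub(texts: Tuple[str, ...]) -> List[List[float]]:
    """Stand-in for the real model call; counts invocations."""
    global call_count
    call_count += 1
    return [[0.5, 0.5] for _ in texts]

@lru_cache(maxsize=1024)
def cached_predict(texts: Tuple[str, ...]) -> tuple:
    # lru_cache needs hashable arguments, so texts is a tuple and the
    # result is frozen into a tuple of tuples
    return tuple(tuple(p) for p in predict_stub(texts))

# First call computes; the second identical call is served from the cache
first = cached_predict(("good", "bad"))
second = cached_predict(("good", "bad"))
```

A Redis-backed variant follows the same shape, with the lookup keyed on a hash of the input texts and shared across worker processes.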

Monitoring

Add observability with middleware:
from fastapi import Request
import time

@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    return response
Metrics to track:
  • Request latency (p50, p95, p99)
  • Throughput (requests/second)
  • Error rate (4xx, 5xx)
  • Model inference time

Error Handling

Enhance error responses:
from fastapi import HTTPException

@app.post("/predict", response_model=Prediction)
def predict(payload: Payload) -> Prediction:
    try:
        prediction = predictor.predict(text=payload.text)
        return Prediction(probs=prediction.tolist())
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Prediction failed: {str(e)}"
        )

API Documentation

FastAPI automatically generates docs:
  • Swagger UI: http://localhost:8080/docs
  • ReDoc: http://localhost:8080/redoc
  • OpenAPI spec: http://localhost:8080/openapi.json

Best Practices

Validation

Use Pydantic models for all inputs/outputs

Versioning

Version APIs with path prefixes (/v1/predict)

Rate Limiting

Add slowapi for request throttling

Authentication

Implement API keys or OAuth for security

Comparison with Alternatives

| Feature        | FastAPI   | Flask    | Django  |
|----------------|-----------|----------|---------|
| Performance    | Excellent | Good     | Moderate|
| Type Safety    | Yes       | No       | Partial |
| Auto Docs      | Yes       | No       | Partial |
| Async Support  | Yes       | Limited  | Yes     |
| Learning Curve | Low       | Very Low | High    |

Next Steps

Streamlit UI

Build interactive web interfaces with Streamlit
