Buy vs Build

The most important decision in ML infrastructure: Should we build or buy?
Building gives you control and customization. Buying gives you speed and support. Neither is universally better.

Decision Framework

Build when:
  • You have unique requirements (custom hardware, proprietary algorithms)
  • Cost at scale justifies engineering investment (>$500K/year cloud spend)
  • You have a strong ML platform team (5+ engineers)
  • Vendor lock-in is unacceptable
Buy when:
  • Time-to-market is critical (startup, new product)
  • Small team (fewer than 5 engineers)
  • Standard use cases (image classification, NER, embeddings)
  • Need SLA and support
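
The criteria above can be turned into a rough scoring sketch. The weights and thresholds below are illustrative assumptions for discussion, not a validated decision model:

```python
# Rough build-vs-buy scorer; weights and thresholds are illustrative only
def build_vs_buy(annual_cloud_spend: float, platform_engineers: int,
                 unique_requirements: bool, lock_in_unacceptable: bool,
                 time_to_market_critical: bool) -> str:
    build_score = 0
    if annual_cloud_spend > 500_000:   # cost at scale justifies investment
        build_score += 1
    if platform_engineers >= 5:        # strong ML platform team
        build_score += 1
    if unique_requirements:            # custom hardware, proprietary algorithms
        build_score += 1
    if lock_in_unacceptable:
        build_score += 1
    if time_to_market_critical:        # speed strongly favors buying
        build_score -= 2
    return 'build' if build_score >= 3 else 'buy'

print(build_vs_buy(1_000_000, 8, True, False, False))  # big spend, large team
print(build_vs_buy(50_000, 2, False, False, True))     # small startup
```

In practice the decision is rarely binary; treat a score near the threshold as a signal to look at the hybrid options on the spectrum below.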

The Build vs Buy Spectrum

Full Build ←―――――――――――――――――――――――――→ Full Buy
  K8s + OSS      Managed K8s       Platform Services     Pure SaaS
(Kubeflow+MLflow) (EKS+SageMaker)   (Vertex AI)         (OpenAI API)
Most companies end up in the middle: managed infrastructure + open-source tools.

Open Source Stack

Pro: No vendor lock-in, full control
Con: Requires deep expertise, ongoing maintenance
Example: Self-hosted K8s, Kubeflow, MLflow, Prometheus

Managed Platform

Pro: Fast setup, integrated services, SLA
Con: Vendor lock-in, less flexibility
Example: AWS SageMaker, Google Vertex AI, Azure ML

Serverless

Pro: Zero ops, pay-per-use, scales to zero
Con: Cold starts, vendor-specific APIs
Example: Modal, Replicate, Banana, Baseten

API Services

Pro: No infrastructure, state-of-the-art models
Con: Expensive at scale, data privacy concerns
Example: OpenAI, Anthropic, Cohere
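
"Expensive at scale" is worth quantifying. The sketch below compares a pay-per-token API against a self-hosted GPU fleet; every price in it (token price, GPU hourly rate, engineering overhead) is an assumption for illustration, not real vendor pricing:

```python
# Break-even sketch: pay-per-token API vs self-hosted GPUs.
# All prices are illustrative assumptions, not real vendor pricing.
def monthly_api_cost(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    return tokens_per_month / 1000 * price_per_1k_tokens

def monthly_selfhost_cost(gpu_count: int, gpu_hourly: float,
                          engineer_overhead_monthly: float) -> float:
    # Assumes 24x7 utilization plus a share of an engineer's time for ops
    return gpu_count * gpu_hourly * 24 * 30 + engineer_overhead_monthly

api = monthly_api_cost(20e9, 0.002)              # 20B tokens at $0.002/1K (assumed)
hosted = monthly_selfhost_cost(4, 2.5, 15_000)   # 4 GPUs + partial engineer time
print(f"API: ${api:,.0f}/mo, self-hosted: ${hosted:,.0f}/mo")
```

At low volume the API side wins easily; the crossover point depends entirely on your utilization and the ops cost you can actually sustain, so rerun the arithmetic with your own numbers.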

Real-World Stack Examples

Startup (Seed Stage)

Goal: Ship fast, minimize ops

Stack:
  • Compute: Modal or Railway (serverless)
  • Data: Postgres (Supabase) + S3
  • Experiments: Weights & Biases
  • Serving: Modal.com or Replicate
  • Monitoring: Sentry + built-in metrics
Why:
  • No Kubernetes complexity
  • Pay-per-use (cost scales with revenue)
  • Focus on product, not infrastructure
Many successful companies stay on this stack for years. Don’t over-engineer early.

Mid-Size Company (Series A-B)

Goal: Control costs, improve reliability

Stack:
  • Infra: Managed Kubernetes (EKS/GKE)
  • Data: S3 + Snowflake/BigQuery
  • Experiments: W&B + MLflow
  • Pipelines: Airflow (Astronomer)
  • Serving: FastAPI on K8s + KServe
  • Monitoring: Prometheus + Grafana + Datadog
Why:
  • Managed services reduce ops burden
  • K8s for flexibility without full self-hosting
  • Mix of open-source and commercial tools

Large Enterprise

Goal: Scale, compliance, multi-tenancy

Stack:
  • Infra: Self-hosted K8s (on-prem or cloud)
  • Data: Data lake (Delta Lake) + feature store (Feast/Tecton)
  • Experiments: MLflow + custom platform
  • Pipelines: Airflow or Kubeflow
  • Serving: Custom inference framework + Triton
  • Monitoring: Prometheus + Grafana + Seldon
Why:
  • Full control for compliance (HIPAA, GDPR)
  • Cost optimization at scale
  • Custom features (multi-tenancy, chargeback)

AWS Example

A production ML system on AWS.

Components:
  • Data: S3 for raw data, Athena for queries, RDS for metadata
  • Processing: SageMaker Processing Jobs (Spark/Pandas at scale)
  • Training: EC2 with GPUs + SageMaker Training Jobs
  • Pipelines: MWAA (Managed Airflow)
  • Serving: SageMaker Multi-Model Endpoints
  • Monitoring: CloudWatch + SageMaker Model Monitor
SageMaker Multi-Model Endpoints:
import json
import boto3

sagemaker = boto3.client('sagemaker')
s3 = boto3.client('s3')

# Create the endpoint (assumes the endpoint config already exists)
sagemaker.create_endpoint(
    EndpointName='multi-model-endpoint',
    EndpointConfigName='multi-model-config'
)

# Upload model archives to the S3 prefix the endpoint reads from
s3.upload_file('model1.tar.gz', 'bucket', 'models/model1.tar.gz')
s3.upload_file('model2.tar.gz', 'bucket', 'models/model2.tar.gz')

# Invoke a specific model; it is loaded into memory on first request
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='multi-model-endpoint',
    TargetModel='model1.tar.gz',  # which model archive to run
    Body=json.dumps({'inputs': [...]})
)
Why Multi-Model Endpoints?
  • Share instance across 100s of models
  • Models loaded on-demand (saves memory)
  • Cost-effective for many small models
SageMaker is great for teams already on AWS. For multi-cloud or open-source preference, use EKS + Kubeflow.

GCP Example (Vertex AI)

Components:
  • Data: GCS + BigQuery
  • Processing: Dataflow (Apache Beam)
  • Training: Vertex AI Training (managed)
  • Pipelines: Vertex AI Pipelines (Kubeflow-based)
  • Serving: Vertex AI Prediction
  • Monitoring: Cloud Monitoring + Vertex AI Model Monitoring
Vertex AI simplifies Kubeflow:
from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

job = aiplatform.CustomContainerTrainingJob(
    display_name='bert-training',
    container_uri='gcr.io/my-project/trainer:v1',
)

model = job.run(
    replica_count=4,
    machine_type='n1-standard-8',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1
)

endpoint = model.deploy(machine_type='n1-standard-4')
Vertex AI handles:
  • Distributed training
  • Model versioning
  • A/B testing
  • Drift detection
Vertex AI is the most “batteries-included” cloud ML platform. Use it if you’re all-in on GCP.

Common Patterns

Hybrid Serving

Problem: Some models need GPUs, others don't. Running all on GPU is wasteful.

Solution:
API Gateway
  ├─> Lightweight models (CPU, FastAPI)
  └─> Heavy models (GPU, Triton)
Route based on model type:
from fastapi import FastAPI
from pydantic import BaseModel
import httpx

app = FastAPI()

class PredictRequest(BaseModel):
    model: str
    inputs: list

@app.post('/predict')
async def predict(request: PredictRequest):
    if request.model in ['bert', 'gpt']:
        # Route heavy models to the GPU cluster
        async with httpx.AsyncClient() as client:
            resp = await client.post('http://gpu-cluster/predict',
                                     json=request.dict())
        return resp.json()
    # Handle lightweight models locally on CPU
    return cpu_model.predict(request.inputs)

Feature Store

Problem: Training uses batch features, serving needs real-time features. Code diverges.

Solution: Centralized feature store (Feast, Tecton)
from feast import FeatureStore

store = FeatureStore(repo_path='.')

# Training: batch features
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=['user_features:age', 'user_features:country']
).to_df()

# Serving: online features
features = store.get_online_features(
    features=['user_features:age', 'user_features:country'],
    entity_rows=[{'user_id': 123}]
).to_dict()
Feature store ensures consistency.
Feature stores are essential for large teams (>20 data scientists) but overkill for small projects. Start simple.

Shadow Mode

Problem: New model needs validation before replacing old one.

Solution: Run both, compare predictions
@app.post('/predict')
def predict(request):
    # Production model
    prod_result = model_v1.predict(request.inputs)
    
    # Shadow model (logged, not returned)
    shadow_result = model_v2.predict(request.inputs)
    log_comparison(prod_result, shadow_result)
    
    return prod_result  # Only return prod
After validating shadow model performs well, swap it in.
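
The `log_comparison` helper in the snippet above is left undefined. One minimal sketch of it, with an assumed record schema, could look like this:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("shadow")

# Minimal sketch of the comparison logger; the record schema is an assumption
def log_comparison(prod_result, shadow_result):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prod": prod_result,
        "shadow": shadow_result,
        "agree": prod_result == shadow_result,
    }
    logger.info(json.dumps(record))
    return record

rec = log_comparison("cat", "dog")
print(rec["agree"])  # the two models disagree on this input
```

In a real system you would ship these records to your metrics store and watch the agreement rate (plus latency of the shadow model) before promoting v2.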

A/B Testing

import hashlib

@app.post('/predict')
def predict(request):
    user_id = request.user_id
    # Stable bucketing: Python's built-in hash() is salted per process,
    # so use a deterministic hash for consistent variant assignment
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    if bucket < 10:  # 10% of traffic to the new model
        result = model_v2.predict(request.inputs)
        variant = 'B'
    else:
        result = model_v1.predict(request.inputs)
        variant = 'A'

    log_prediction(user_id, result, variant)
    return result
Track metrics (accuracy, latency, user engagement) by variant.
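
A sketch of that per-variant aggregation, assuming each logged record carries `variant`, `correct`, and `latency_ms` fields (the schema is an assumption, not a fixed API):

```python
from collections import defaultdict

# Sketch: aggregate logged predictions into per-variant metrics.
# The record shape (variant, correct, latency_ms) is assumed.
def metrics_by_variant(logs):
    agg = defaultdict(lambda: {"n": 0, "correct": 0, "latency_ms": 0.0})
    for rec in logs:
        m = agg[rec["variant"]]
        m["n"] += 1
        m["correct"] += int(rec["correct"])
        m["latency_ms"] += rec["latency_ms"]
    return {
        v: {"accuracy": m["correct"] / m["n"],
            "avg_latency_ms": m["latency_ms"] / m["n"]}
        for v, m in agg.items()
    }

logs = [
    {"variant": "A", "correct": True,  "latency_ms": 20},
    {"variant": "A", "correct": False, "latency_ms": 30},
    {"variant": "B", "correct": True,  "latency_ms": 50},
]
print(metrics_by_variant(logs))
```

With the aggregates in hand, apply a significance test before declaring a winner; small traffic splits take a long time to reach meaningful sample sizes.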

Tech Radar

Module 8 includes a tech radar (inspired by Thoughtworks):
  • Adopt: Proven, safe for production
  • Trial: Worth trying in non-critical projects
  • Assess: Interesting but not ready
  • Hold: Avoid or phase out
Examples:
  • Adopt: FastAPI, Kubernetes, Prometheus, W&B
  • Trial: Dagster, vLLM, Modal
  • Assess: MLflow Model Registry, Seldon Core v2
  • Hold: TensorFlow (prefer PyTorch), Airflow 1.x
Tech radars are opinionated. Build your own based on team experience and requirements.
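
One way to keep your own radar honest is to store it as plain data in version control, so ring changes show up in code review. A minimal sketch, seeded with the examples above (the structure is an assumption, not a standard format):

```python
# A tech radar kept as plain data so it can live in version control.
# Ring names follow the Thoughtworks convention; entries come from this page.
RADAR = {
    "adopt": ["FastAPI", "Kubernetes", "Prometheus", "W&B"],
    "trial": ["Dagster", "vLLM", "Modal"],
    "assess": ["MLflow Model Registry", "Seldon Core v2"],
    "hold": ["TensorFlow", "Airflow 1.x"],
}

def ring_of(tool: str) -> str:
    for ring, tools in RADAR.items():
        if tool in tools:
            return ring
    return "unlisted"

print(ring_of("vLLM"))         # trial
print(ring_of("Airflow 1.x"))  # hold
```

A lookup like `ring_of` can then back a CI check that flags new dependencies sitting in "hold" or "unlisted".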

Hands-On Examples

Explore production patterns in Module 8:
  • Deploy multi-model endpoints on SageMaker
  • Understand buy vs build trade-offs
  • Review real-world architecture examples
  • Build a custom tech radar

Next Steps

Course Overview

Review all concepts

Module 1

Start with containerization
