Buy vs Build
The most important decision in ML infrastructure: should we build or buy? Building gives you control and customization. Buying gives you speed and support. Neither is universally better.
Decision Framework
Build when:
- You have unique requirements (custom hardware, proprietary algorithms)
- Cost at scale justifies engineering investment (>$500K/year cloud spend)
- You have a strong ML platform team (5+ engineers)
- Vendor lock-in is unacceptable
Buy when:
- Time-to-market is critical (startup, new product)
- Small team (fewer than 5 engineers)
- Standard use cases (image classification, NER, embeddings)
- Need SLA and support
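The cost criterion above can be made concrete with a back-of-the-envelope break-even calculation. The numbers below (platform markup, engineer cost) are illustrative assumptions, not vendor quotes:

```python
# Hypothetical break-even sketch: yearly cost of a managed platform vs.
# building on raw compute plus platform-engineer headcount. All numbers
# are illustrative assumptions, not real pricing.

def yearly_cost_buy(cloud_spend: float, platform_markup: float = 0.25) -> float:
    """Managed platforms typically charge a premium over raw compute."""
    return cloud_spend * (1 + platform_markup)

def yearly_cost_build(cloud_spend: float, engineers: int,
                      cost_per_engineer: float = 250_000) -> float:
    """Self-hosting pays raw compute but adds engineering headcount."""
    return cloud_spend + engineers * cost_per_engineer

def should_build(cloud_spend: float, engineers: int) -> bool:
    return yearly_cost_build(cloud_spend, engineers) < yearly_cost_buy(cloud_spend)

# At $500K/year cloud spend, the assumed 25% markup ($125K) doesn't even
# cover one platform engineer, so buying still wins:
print(should_build(500_000, engineers=2))    # False
# At $5M/year, the markup ($1.25M) exceeds a four-person platform team:
print(should_build(5_000_000, engineers=4))  # True
```

Under these assumptions, build only starts to pay off well past the $500K/year threshold the list mentions, and the real break-even depends entirely on your markup and headcount numbers.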
The Build vs Buy Spectrum
Open Source Stack
Pro: No vendor lock-in, full control
Con: Requires deep expertise, ongoing maintenance
Example: Self-hosted K8s, Kubeflow, MLflow, Prometheus
Managed Platform
Pro: Fast setup, integrated services, SLA
Con: Vendor lock-in, less flexibility
Example: AWS SageMaker, Google Vertex AI, Azure ML
Serverless
Pro: Zero ops, pay-per-use, scales to zero
Con: Cold starts, vendor-specific APIs
Example: Modal, Replicate, Banana, Baseten
API Services
Pro: No infrastructure, state-of-the-art models
Con: Expensive at scale, data privacy concerns
Example: OpenAI, Anthropic, Cohere
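"Expensive at scale" is easy to quantify with token math. The price per 1K tokens below is a hypothetical placeholder, not current vendor pricing:

```python
# Illustrative sketch of why API services get expensive at scale.
# The per-token price is a hypothetical placeholder, not real pricing.

def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     price_per_1k_tokens: float) -> float:
    tokens_per_month = requests_per_day * 30 * tokens_per_request
    return tokens_per_month / 1000 * price_per_1k_tokens

# 1M requests/day at 1,000 tokens each, at a hypothetical $0.002 per 1K tokens:
cost = monthly_api_cost(1_000_000, 1_000, 0.002)
print(f"${cost:,.0f}/month")  # $60,000/month
```

At that volume the API bill rivals a dedicated inference cluster, which is exactly when teams start moving down the spectrum toward self-hosting.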
Real-World Stack Examples
Startup (Seed Stage)
Goal: Ship fast, minimize ops
Stack:
- Compute: Modal or Railway (serverless)
- Data: Postgres (Supabase) + S3
- Experiments: Weights & Biases
- Serving: Modal.com or Replicate
- Monitoring: Sentry + built-in metrics
Why:
- No Kubernetes complexity
- Pay-per-use (cost scales with revenue)
- Focus on product, not infrastructure
Many successful companies stay on this stack for years. Don’t over-engineer early.
Mid-Size Company (Series A-B)
Goal: Control costs, improve reliability
Stack:
- Infra: Managed Kubernetes (EKS/GKE)
- Data: S3 + Snowflake/BigQuery
- Experiments: W&B + MLflow
- Pipelines: Airflow (Astronomer)
- Serving: FastAPI on K8s + KServe
- Monitoring: Prometheus + Grafana + Datadog
Why:
- Managed services reduce ops burden
- K8s for flexibility without full self-hosting
- Mix of open-source and commercial tools
Large Enterprise
Goal: Scale, compliance, multi-tenancy
Stack:
- Infra: Self-hosted K8s (on-prem or cloud)
- Data: Data lake (Delta Lake) + feature store (Feast/Tecton)
- Experiments: MLflow + custom platform
- Pipelines: Airflow or Kubeflow
- Serving: Custom inference framework + Triton
- Monitoring: Prometheus + Grafana + Seldon
Why:
- Full control for compliance (HIPAA, GDPR)
- Cost optimization at scale
- Custom features (multi-tenancy, chargeback)
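Chargeback, mentioned above, usually reduces to allocating a shared bill in proportion to usage. A minimal sketch, with hypothetical team names and a flat GPU-hour split as the assumed allocation policy:

```python
# Hypothetical chargeback sketch for a multi-tenant ML platform: allocate a
# shared GPU bill to teams in proportion to GPU-hours consumed. Team names
# and the proportional policy are illustrative assumptions.
from collections import defaultdict

def chargeback(usage_events, total_bill: float) -> dict:
    """usage_events: iterable of (team, gpu_hours) tuples."""
    hours = defaultdict(float)
    for team, gpu_hours in usage_events:
        hours[team] += gpu_hours
    total_hours = sum(hours.values())
    return {team: total_bill * h / total_hours for team, h in hours.items()}

events = [("search", 600.0), ("ads", 300.0), ("search", 100.0)]
print(chargeback(events, total_bill=10_000.0))
# {'search': 7000.0, 'ads': 3000.0}
```

Real systems weight by instance type and reserve capacity, but the proportional core is the same.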
AWS Example
A production ML system on AWS:
Components:
- Data: S3 for raw data, Athena for queries, RDS for metadata
- Processing: SageMaker Processing Jobs (Spark/Pandas at scale)
- Training: EC2 with GPUs + SageMaker Training Jobs
- Pipelines: MWAA (Managed Airflow)
- Serving: SageMaker Multi-Model Endpoints
- Monitoring: CloudWatch + SageMaker Model Monitor
Why multi-model endpoints:
- Share one instance across hundreds of models
- Models loaded on-demand (saves memory)
- Cost-effective for many small models
SageMaker is great for teams already on AWS. For multi-cloud or open-source preference, use EKS + Kubeflow.
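Calling a multi-model endpoint looks like a normal invocation plus a `TargetModel` parameter that names which artifact to load on demand. The endpoint and model names below are hypothetical; only the boto3 call shape is real:

```python
# Sketch of invoking a SageMaker multi-model endpoint: one endpoint serves
# many models, and TargetModel picks which S3 artifact to load on demand.
# Endpoint/model names are hypothetical.
import json

def model_key(model_name: str, version: str) -> str:
    # MME target models are .tar.gz artifacts relative to the endpoint's S3 prefix.
    return f"{model_name}/{version}/model.tar.gz"

def invoke(endpoint_name: str, model_name: str, version: str, payload: dict):
    import boto3  # runtime-only dependency, needs AWS credentials
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel=model_key(model_name, version),
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

# invoke("shared-endpoint", "churn", "v3", {"features": [0.1, 0.9]})
```

The first request for a given `TargetModel` pays a load-from-S3 latency penalty; subsequent requests hit the cached model, which is why MMEs suit many small, infrequently-called models.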
GCP Example (Vertex AI)
Components:
- Data: GCS + BigQuery
- Processing: Dataflow (Apache Beam)
- Training: Vertex AI Training (managed)
- Pipelines: Vertex AI Pipelines (Kubeflow-based)
- Serving: Vertex AI Prediction
- Monitoring: Cloud Monitoring + Vertex AI Model Monitoring
Built-in features:
- Distributed training
- Model versioning
- A/B testing
- Drift detection
Vertex AI is the most “batteries-included” cloud ML platform. Use it if you’re all-in on GCP.
Common Patterns
Hybrid Serving
Problem: Some models need GPUs, others don’t. Running everything on GPU is wasteful.
Solution: Route GPU-bound models (e.g., deep learning) to GPU node pools and lightweight models (e.g., gradient-boosted trees) to cheaper CPU nodes.
Feature Store
Problem: Training uses batch features, serving needs real-time features. Code diverges.
Solution: Centralized feature store (Feast, Tecton).
Feature stores are essential for large teams (20+ data scientists) but overkill for small projects. Start simple.
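The "start simple" version of the feature-store idea needs no infrastructure at all: define each transformation once and call it from both the batch training job and the online serving path, so the two cannot diverge. Function and field names below are hypothetical:

```python
# Sketch of avoiding train/serve skew without a feature store: a single
# feature definition shared by both paths. Names are hypothetical.
from datetime import datetime

def days_since_signup(signup_date: datetime, now: datetime) -> int:
    # Single definition used by BOTH training and serving.
    return (now - signup_date).days

def build_training_row(user: dict, as_of: datetime) -> dict:
    # Batch path: features computed as of a historical timestamp.
    return {"days_since_signup": days_since_signup(user["signup_date"], as_of)}

def build_serving_row(user: dict) -> dict:
    # Online path: same transformation, current timestamp.
    return {"days_since_signup": days_since_signup(user["signup_date"], datetime.utcnow())}

user = {"signup_date": datetime(2024, 1, 1)}
print(build_training_row(user, as_of=datetime(2024, 1, 31)))
# {'days_since_signup': 30}
```

A feature store like Feast formalizes exactly this pattern, adding point-in-time-correct backfills and a low-latency online store once a shared module stops scaling.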
Shadow Mode
Problem: New model needs validation before replacing the old one.
Solution: Run both on live traffic, serve the old model’s predictions, and log the new model’s for offline comparison.
A/B Testing
Problem: Offline metrics don’t guarantee online impact.
Solution: Split live traffic between models and compare business metrics.
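Both patterns, shadow and A/B, come down to a small routing decision in front of two models. A minimal sketch with hypothetical model callables:

```python
# Minimal sketch of shadow mode vs. A/B routing with hypothetical models.
# Shadow: challenger runs on every request but its output is only logged.
# A/B: a fraction of traffic is actually served by the challenger.
import random

shadow_log = []

def serve(request, champion, challenger, mode: str = "shadow", ab_fraction: float = 0.1):
    if mode == "shadow":
        shadow_log.append((request, challenger(request)))  # logged, never served
        return champion(request)
    if mode == "ab" and random.random() < ab_fraction:
        return challenger(request)
    return champion(request)

old_model = lambda x: 0
new_model = lambda x: 1

print(serve("req-1", old_model, new_model, mode="shadow"))  # 0 (champion answer)
print(shadow_log)  # [('req-1', 1)]  challenger prediction captured for analysis
```

Shadow mode is the safer first step (users never see the challenger); A/B testing follows once offline comparison of the logged predictions looks good.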
Tech Radar
Module 8 includes a tech radar (inspired by Thoughtworks):
- Adopt: Proven, safe for production
- Trial: Worth trying in non-critical projects
- Assess: Interesting but not ready
- Hold: Avoid or phase out
Example placements:
- Adopt: FastAPI, Kubernetes, Prometheus, W&B
- Trial: Dagster, vLLM, Modal
- Assess: MLflow Model Registry, Seldon Core v2
- Hold: TensorFlow (prefer PyTorch), Airflow 1.x
Tech radars are opinionated. Build your own based on team experience and requirements.
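One way to "build your own" radar is to keep it as plain data in version control, so placements are reviewed like code. The entries below are illustrative, not recommendations:

```python
# Sketch of a team-owned tech radar as plain data. Placements are
# illustrative examples, not recommendations.
RINGS = ("adopt", "trial", "assess", "hold")

radar = {
    "FastAPI": "adopt",
    "Kubernetes": "adopt",
    "Dagster": "trial",
    "Seldon Core v2": "assess",
    "Airflow 1.x": "hold",
}

def by_ring(radar: dict) -> dict:
    # Validate placements, then group tools by ring for rendering.
    assert set(radar.values()) <= set(RINGS), "unknown ring"
    return {ring: sorted(t for t, r in radar.items() if r == ring) for ring in RINGS}

print(by_ring(radar)["adopt"])  # ['FastAPI', 'Kubernetes']
```

A dict in a reviewed repo is enough to start; rendering it as the familiar quadrant diagram can come later.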
Hands-On Examples
Explore production patterns in Module 8:
- Deploy multi-model endpoints on SageMaker
- Understand buy vs build trade-offs
- Review real-world architecture examples
- Build a custom tech radar
Next Steps
- Course Overview: review all concepts
- Module 1: start with containerization