Why Orchestration Matters
ML systems are more than just training scripts. Production pipelines involve:
- Data ingestion from multiple sources
- Preprocessing and feature engineering
- Training (often with hyperparameter sweeps)
- Evaluation and model comparison
- Deployment to staging/production
- Monitoring and retraining triggers
What is a DAG?
Orchestration tools represent workflows as Directed Acyclic Graphs (DAGs): each task is a node, each dependency is an edge, and the acyclic constraint guarantees every task can be scheduled after its upstream dependencies complete.

Platform Comparison
Airflow
Best for: General-purpose workflows, batch jobs
Pros: Battle-tested, huge ecosystem, Python-based
Cons: UI can be clunky, requires careful resource management
Kubeflow Pipelines
Best for: Kubernetes-native ML workflows
Pros: Strong artifact tracking, integrates with K8s, UI for pipeline visualization
Cons: Complex setup, Kubernetes required
Dagster
Best for: Data pipelines, modern developer experience
Pros: Software-defined assets, type system, great UI, testing support
Cons: Newer (less mature), smaller community
For greenfield ML projects, Dagster offers the best developer experience. For existing systems, Airflow is the safest bet. If you’re all-in on Kubernetes, Kubeflow provides deep integration.
Apache Airflow
Airflow is the most popular orchestration tool, and DAGs are defined as ordinary Python code. Key features:
- KubernetesPodOperator: Run each task as a K8s pod (full isolation)
- Scheduling: Cron-like schedules or event triggers
- Retries: Automatic retry with exponential backoff
- Backfilling: Rerun past intervals easily
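A minimal sketch of such a DAG, assuming the Airflow 2.x TaskFlow API (the task bodies and the bucket path are illustrative, not a real pipeline):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def train_pipeline():
    @task(retries=3)  # automatic retries, configurable per task
    def extract() -> str:
        # Return a storage path (a small string), not the data itself.
        return "s3://bucket/raw/latest.parquet"

    @task
    def train(data_path: str) -> None:
        # Fit a model on the extracted data.
        print(f"training on {data_path}")

    train(extract())  # calling one task with another's output draws the edge

train_pipeline()
```

Scheduling, retries, and backfilling all hang off this definition: the `@daily` schedule creates one run per interval, and rerunning past intervals is a CLI command away.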
For production, use Astronomer (managed Airflow) or MWAA (AWS Managed Workflows for Apache Airflow) to avoid infrastructure headaches.
Passing Data Between Tasks
Airflow has two approaches:
- XComs (small data): Pass JSON-serializable values between tasks
- Artifacts (large data): Write to S3/GCS, pass the path via XCom
Never pass large datasets through XComs. Always use object storage (S3/GCS/MinIO) for intermediate data.
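The artifact pattern can be sketched without any Airflow specifics: the producing task writes the payload to object storage and passes only a small path string downstream; the consumer reads it back. Here a temporary local directory stands in for S3/GCS, and the helper names are illustrative:

```python
import json
import tempfile
from pathlib import Path

def write_artifact(store: Path, name: str, rows: list[dict]) -> str:
    """Producer task: persist the large payload, return only its path."""
    path = store / f"{name}.json"
    path.write_text(json.dumps(rows))
    return str(path)  # small, JSON-serializable -> safe to pass via XCom

def read_artifact(path: str) -> list[dict]:
    """Consumer task: load the payload from the path it received."""
    return json.loads(Path(path).read_text())

store = Path(tempfile.mkdtemp())  # stand-in for s3://bucket/intermediate/
ref = write_artifact(store, "features", [{"x": 1}, {"x": 2}])
print(read_artifact(ref))  # [{'x': 1}, {'x': 2}]
```

Only `ref` (a short string) crosses the task boundary; the payload itself never touches the orchestrator's metadata database.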
Kubeflow Pipelines
Kubeflow is Kubernetes-native and provides strong artifact tracking:
- Artifact tracking: Inputs/outputs are first-class citizens
- Lineage: Track which data produced which model
- UI: Visualize DAG and inspect artifacts
- Vertex AI: Google Cloud offers managed Kubeflow
Kubeflow uses Argo Workflows under the hood for scheduling. Each component runs in its own container.
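A sketch of how artifacts become first-class citizens, assuming the KFP v2 SDK (component bodies and paths are illustrative):

```python
from kfp import dsl

@dsl.component
def preprocess(raw_path: str, features: dsl.Output[dsl.Dataset]):
    # Whatever is written to features.path is recorded as a tracked Dataset.
    with open(features.path, "w") as f:
        f.write(f"features derived from {raw_path}")

@dsl.component
def train(features: dsl.Input[dsl.Dataset], model: dsl.Output[dsl.Model]):
    # The model artifact is linked to the dataset that produced it (lineage).
    with open(model.path, "w") as f:
        f.write("model weights")

@dsl.pipeline(name="train-pipeline")
def train_pipeline(raw_path: str = "gs://bucket/raw.csv"):
    feats = preprocess(raw_path=raw_path)
    train(features=feats.outputs["features"])
```

Because inputs and outputs are typed artifacts rather than opaque strings, the UI can render the lineage graph from data to model without any extra bookkeeping.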
Dagster
Dagster treats data as assets (tables, models, reports) instead of tasks:
- Dependencies are inferred from function signatures
- Type checking catches wiring errors at definition time, before the pipeline runs
- Easier to test (just call the function!)
- Built-in data quality checks
Dagster’s software-defined assets are a paradigm shift. Instead of “run this task”, you think “materialize this asset”. This makes pipelines more declarative and testable.
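A minimal sketch of the asset style (asset names and values are illustrative):

```python
from dagster import asset

@asset
def raw_data() -> list[int]:
    # Upstream asset, e.g. rows pulled from a warehouse.
    return [1, 2, 3]

@asset
def features(raw_data: list[int]) -> list[int]:
    # Dagster infers the dependency on raw_data from this parameter name.
    return [x * 2 for x in raw_data]

# Testing really is "just call the function":
assert features([1, 2]) == [2, 4]
```

There is no explicit edge declaration: asking Dagster to materialize `features` implies materializing `raw_data` first, which is what makes the pipeline declarative.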
Choosing a Platform
| Criteria | Airflow | Kubeflow | Dagster |
|---|---|---|---|
| Maturity | Very high | High | Medium |
| Learning curve | Medium | High | Low |
| K8s required | No | Yes | No |
| Artifact tracking | Manual | Built-in | Built-in |
| Testing | Hard | Medium | Easy |
| Best for | Batch ETL | K8s ML workflows | Data pipelines |
Don’t over-engineer early. Start with a simple Makefile or shell script. Graduate to orchestration when you have >5 tasks or need scheduling.
Integration with Training
All platforms integrate with experiment tracking: tasks can log parameters, metrics, and artifacts to tools like MLflow or Weights & Biases as part of each run.

Modal for Serverless Orchestration
For simpler workflows, Modal offers serverless functions: decorate a Python function and Modal runs it in cloud containers on demand, with no cluster to manage.

Hands-On Examples
Explore orchestration in Module 4:
- Airflow DAGs for training and inference
- Kubeflow Pipelines with artifact tracking
- Dagster asset-based workflows
- Deploying Modal functions from pipelines
Next Steps
- Model Serving: Deploy models from pipelines
- Monitoring: Track pipeline health