# Pipeline Orchestration
This module demonstrates how to build end-to-end training and inference pipelines using three popular orchestration frameworks: Apache Airflow, Kubeflow Pipelines, and Dagster.

## Why Pipeline Orchestration?
Pipeline orchestration tools help you:

- Modularize workflows: Break complex ML workflows into manageable, reusable components
- Schedule and automate: Run training and inference jobs on schedules or triggers
- Track dependencies: Automatically manage task dependencies and execution order
- Monitor execution: Visualize pipeline runs, debug failures, and track metrics
- Scale workloads: Run computationally intensive tasks on Kubernetes or distributed systems
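The dependency tracking described above usually boils down to executing a directed acyclic graph of tasks in topological order. A minimal, framework-free sketch of that idea, using a hypothetical four-step training workflow (task names are illustrative, not any framework's API):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on; the orchestrator
# derives a valid execution order from this graph.
graph = {
    "load_data": set(),
    "train_model": {"load_data"},
    "save_artifacts": {"train_model"},
    "upload_to_registry": {"save_artifacts"},
}

# static_order() yields tasks so that every dependency runs first.
order = list(TopologicalSorter(graph).static_order())
print(order)
# → ['load_data', 'train_model', 'save_artifacts', 'upload_to_registry']
```

Airflow, Kubeflow, and Dagster each layer scheduling, retries, and monitoring on top of this same core idea.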
## Orchestration Frameworks

### Apache Airflow

General-purpose workflow orchestration with the KubernetesPodOperator for containerized tasks

### Kubeflow Pipelines

Kubernetes-native ML pipeline orchestration with built-in artifact tracking

### Dagster

Asset-centric data orchestration with built-in data quality checks
## Pipeline Architecture

Both training and inference pipelines follow a consistent structure.

### Training Pipeline
Key steps:

1. Load Training Data: Download or prepare datasets (e.g., SST-2, SQL context data)
2. Train Model: Fine-tune models with specified configurations
3. Save Artifacts: Store model weights, tokenizers, and configs
4. Upload to Registry: Push trained models to W&B or other registries
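The training steps above can be sketched as a chain of plain Python functions; in Airflow each would become a task, in Dagster an asset, and in Kubeflow a component. All names and return values here are illustrative stand-ins, not a real training implementation:

```python
def load_training_data():
    # Stand-in for downloading a dataset such as SST-2.
    return [("a great movie", 1), ("a dull movie", 0)]

def train_model(dataset):
    # Stand-in for fine-tuning with a specified configuration.
    return {"weights": "model.bin", "examples_seen": len(dataset)}

def save_artifacts(model):
    # Stand-in for writing weights, tokenizer, and config to disk.
    return {"model": model, "tokenizer": "tokenizer.json", "config": "config.json"}

def upload_to_registry(artifacts):
    # Stand-in for pushing artifacts to W&B or another registry.
    n = artifacts["model"]["examples_seen"]
    return f"registry://models/run-001 ({n} examples)"

# An orchestrator would wire these as dependent tasks; here we
# simply run them in sequence.
ref = upload_to_registry(save_artifacts(train_model(load_training_data())))
print(ref)
```

The value of an orchestrator is that each step runs, retries, and logs independently, with the data handoff between steps managed for you.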
### Inference Pipeline
Key steps:

1. Load Inference Data: Prepare input data for predictions
2. Load Trained Model: Fetch model artifacts from registry
3. Run Inference: Generate predictions using loaded model
4. Save Results: Store predictions and evaluation metrics
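The inference steps follow the same pattern. A framework-free sketch with illustrative stand-ins (the "model" is a trivial keyword classifier, not a real fetched artifact):

```python
def load_inference_data():
    # Stand-in for preparing input data.
    return ["a great movie", "a dull movie"]

def load_trained_model():
    # Stand-in for fetching model artifacts from the registry.
    return lambda text: 1 if "great" in text else 0

def run_inference(model, inputs):
    # Generate one prediction per input.
    return [model(x) for x in inputs]

def save_results(preds):
    # Stand-in for storing predictions and evaluation metrics.
    return {"predictions": preds, "positive_rate": sum(preds) / len(preds)}

model = load_trained_model()
results = save_results(run_inference(model, load_inference_data()))
print(results)
```

In a real pipeline, `load_trained_model` would pull a versioned artifact (e.g., from W&B) so that inference runs are reproducible against a known model version.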
## Prerequisites
## Framework Comparison
| Feature | Airflow | Kubeflow | Dagster |
|---|---|---|---|
| Primary Focus | General workflow orchestration | ML-specific pipelines | Data/asset orchestration |
| Kubernetes Native | Via operators | Yes | Via executors |
| Artifact Tracking | External tools | Built-in | Built-in |
| Data Quality Checks | Custom operators | Limited | Asset checks |
| UI/Visualization | Web UI (DAGs) | Web UI (pipelines) | Web UI (assets) |
| Learning Curve | Moderate | Moderate-High | Moderate |
| Best For | Complex scheduling | K8s ML workflows | Data quality focus |
## Learning Objectives

By completing this module, you’ll be able to:

- Deploy and configure Airflow, Kubeflow, and Dagster
- Build training pipelines that load data, train models, and upload artifacts
- Create inference pipelines that fetch models and generate predictions
- Compare orchestration frameworks for your ML use case
- Integrate with W&B for experiment tracking
- Run containerized workloads on Kubernetes