Pipeline Orchestration

This module demonstrates how to build end-to-end training and inference pipelines using three popular orchestration frameworks: Apache Airflow, Kubeflow Pipelines, and Dagster.

Why Pipeline Orchestration?

Pipeline orchestration tools help you:
  • Modularize workflows: Break complex ML workflows into manageable, reusable components
  • Schedule and automate: Run training and inference jobs on schedules or triggers
  • Track dependencies: Automatically manage task dependencies and execution order
  • Monitor execution: Visualize pipeline runs, debug failures, and track metrics
  • Scale workloads: Run computationally intensive tasks on Kubernetes or distributed systems

Orchestration Frameworks

Apache Airflow

General-purpose workflow orchestration; runs containerized tasks via the KubernetesPodOperator

Kubeflow Pipelines

Kubernetes-native ML pipeline orchestration with artifact tracking

Dagster

Asset-centric data orchestration with built-in data quality checks

Pipeline Architecture

Both training and inference pipelines follow a consistent structure:

Training Pipeline

Key Steps:
  1. Load Training Data: Download or prepare datasets (e.g., SST-2, SQL context data)
  2. Train Model: Fine-tune models with specified configurations
  3. Save Artifacts: Store model weights, tokenizers, and configs
  4. Upload to Registry: Push trained models to W&B or other registries
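
To make the four training steps concrete, here is a minimal, framework-agnostic sketch in plain Python. The function names, the stubbed dataset, the trivial "model," and the fake registry URI are all hypothetical stand-ins, not part of Airflow, Kubeflow, Dagster, or W&B; each function corresponds to one pipeline task.

```python
import json
import os

def load_training_data(dataset: str) -> list[dict]:
    # Step 1: in a real pipeline this would download e.g. SST-2;
    # here we return a tiny stub so the sketch is self-contained.
    return [{"text": "great movie", "label": 1},
            {"text": "bad plot", "label": 0}]

def train_model(examples: list[dict], epochs: int = 1) -> dict:
    # Step 2: stand-in for fine-tuning. The "model" is just a
    # label-frequency table plus the training config.
    counts: dict = {}
    for ex in examples:
        counts[ex["label"]] = counts.get(ex["label"], 0) + 1
    return {"label_counts": counts, "epochs": epochs}

def save_artifacts(model: dict, out_dir: str = "artifacts") -> str:
    # Step 3: persist model weights/config; here a single JSON file.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "model.json")
    with open(path, "w") as f:
        json.dump(model, f)
    return path

def upload_to_registry(path: str, project: str) -> str:
    # Step 4: placeholder for a registry push (e.g. a W&B artifact
    # upload); returns a fake URI instead of calling any real API.
    return f"registry://{project}/{path}"

artifact_path = save_artifacts(train_model(load_training_data("sst2")))
model_uri = upload_to_registry(artifact_path, "my-project")
```

Each orchestrator wraps functions like these differently (Airflow tasks, Kubeflow components, Dagster assets), but the data flow between steps stays the same.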

Inference Pipeline

Key Steps:
  1. Load Inference Data: Prepare input data for predictions
  2. Load Trained Model: Fetch model artifacts from registry
  3. Run Inference: Generate predictions using loaded model
  4. Save Results: Store predictions and evaluation metrics
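
The inference steps can be sketched the same way. Again, this is an illustrative stand-in, not a real framework API: the model file format, the "majority label" predictor, and the file paths are all hypothetical.

```python
import json

def load_inference_data() -> list[str]:
    # Step 1: prepare inputs; stubbed inline for a self-contained sketch.
    return ["a gripping, well-acted film", "tedious and overlong"]

def load_trained_model(path: str) -> dict:
    # Step 2: stand-in for fetching a model artifact from a registry.
    with open(path) as f:
        return json.load(f)

def run_inference(model: dict, inputs: list[str]) -> list[dict]:
    # Step 3: trivial stand-in for model.predict — always returns
    # the most frequent training label.
    majority = max(model["label_counts"], key=model["label_counts"].get)
    return [{"input": x, "prediction": majority} for x in inputs]

def save_results(preds: list[dict], out_path: str = "predictions.json") -> str:
    # Step 4: store predictions for downstream evaluation.
    with open(out_path, "w") as f:
        json.dump(preds, f)
    return out_path

# Fake "registry download": write a model file, then run the pipeline.
with open("model.json", "w") as f:
    json.dump({"label_counts": {"1": 3, "0": 1}}, f)

preds = run_inference(load_trained_model("model.json"), load_inference_data())
results_path = save_results(preds)
```

Note that the inference pipeline only reads artifacts the training pipeline produced; the orchestrator's job is to express that dependency explicitly.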

Prerequisites

1. Create Kubernetes Cluster

Create a local Kind cluster for running orchestrated workloads:
kind create cluster --name ml-in-production

2. Set Environment Variables

Configure W&B credentials for model tracking:
export WANDB_PROJECT=your-project-name
export WANDB_API_KEY=your-api-key
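
Since a missing credential typically only surfaces once a pipeline task tries to upload an artifact, a small preflight check can fail fast instead. This is a suggested helper, not part of the W&B client; only the two variable names above are taken from the setup.

```python
import os

def check_wandb_env(env=os.environ) -> list[str]:
    """Return the names of required W&B variables that are unset or empty."""
    required = ["WANDB_PROJECT", "WANDB_API_KEY"]
    return [name for name in required if not env.get(name)]

missing = check_wandb_env()
if missing:
    print("Set these before running pipelines: " + ", ".join(missing))
```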

3. Monitor with k9s (Optional)

Use k9s for real-time cluster monitoring:
k9s -A

Framework Comparison

Feature             | Airflow                        | Kubeflow              | Dagster
--------------------|--------------------------------|-----------------------|-------------------------
Primary Focus       | General workflow orchestration | ML-specific pipelines | Data/asset orchestration
Kubernetes Native   | Via operators                  | Yes                   | Via executors
Artifact Tracking   | External tools                 | Built-in              | Built-in
Data Quality Checks | Custom operators               | Limited               | Asset checks
UI/Visualization    | Web UI (DAGs)                  | Web UI (pipelines)    | Web UI (assets)
Learning Curve      | Moderate                       | Moderate-High         | Moderate
Best For            | Complex scheduling             | K8s ML workflows      | Data quality focus

Learning Objectives

By completing this module, you’ll be able to:
  • Deploy and configure Airflow, Kubeflow, and Dagster
  • Build training pipelines that load data, train models, and upload artifacts
  • Create inference pipelines that fetch models and generate predictions
  • Compare orchestration frameworks for your ML use case
  • Integrate with W&B for experiment tracking
  • Run containerized workloads on Kubernetes

Module Resources

Additional Reading