What are Pipelines?

Pipelines in DVC are directed acyclic graphs (DAGs) of data processing stages. Each stage represents a command that transforms inputs into outputs, with dependencies automatically tracked. Pipelines enable reproducible, automated workflows from raw data to final results.
Key Concept: Pipelines are defined in dvc.yaml files, with stages connected through dependencies. DVC automatically determines execution order and only runs stages when their dependencies change.

Why Pipelines Matter

  • Reproducibility: Codify the entire workflow from raw data to results
  • Automation: Run only what changed with dvc repro
  • Visibility: Visualize and understand complex workflows with dvc dag
  • Collaboration: Share workflows as code, not documentation
  • Version control: Track how processing logic evolves alongside data

Pipeline Structure

Pipelines are defined in dvc.yaml files. Here’s a typical machine learning pipeline:
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - data/raw.csv
      - prepare.py
    outs:
      - data/prepared.csv
  
  train:
    cmd: python train.py
    deps:
      - data/prepared.csv
      - train.py
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
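
To make the prepare stage concrete, here is one hypothetical shape for prepare.py. The file names match the dvc.yaml above, but the script itself is an illustration, not part of DVC — DVC only cares that the command reads the declared deps and writes the declared outs:

```python
# prepare.py -- hypothetical implementation of the `prepare` stage above.
# The transformation itself is up to you; DVC hashes the input and output
# files around it.
from pathlib import Path


def prepare(src: str, dst: str) -> int:
    """Copy non-empty lines from src to dst; return the row count."""
    rows = [line for line in Path(src).read_text().splitlines() if line.strip()]
    Path(dst).write_text("\n".join(rows) + "\n")
    return len(rows)


if __name__ == "__main__":
    prepare("data/raw.csv", "data/prepared.csv")
```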

Stage Anatomy

Each stage in dvc.yaml has several components, implemented in dvc/stage/__init__.py:

Command (cmd)

The command to execute. Can be any shell command:
cmd: python train.py --epochs 10
From dvc/stage/__init__.py:130-154:
class Stage(params.StageParams):
    def __init__(
        self,
        repo,
        path=None,
        cmd=None,  # The command to run
        wdir=os.curdir,
        deps=None,
        outs=None,
        # ...
    ):
        self.cmd = cmd
        self.wdir = wdir  # Working directory
        self.outs: list[Output] = outs
        self.deps: list[Dependency] = deps

Dependencies (deps)

Files or directories the stage needs. If any dependency changes, the stage is considered outdated:
deps:
  - data/input.csv
  - src/process.py
  - config.json
Dependencies can also be:
  • Remote files: URLs or cloud storage paths
  • Outputs from other stages: Automatic pipeline chaining
  • External repo files: From other DVC/Git repositories
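For example, a single stage can mix a local file, another stage's output, and a remote location as dependencies (the URL and file names here are placeholders):

```yaml
stages:
  enrich:
    cmd: python enrich.py
    deps:
      - src/enrich.py                   # local file
      - data/prepared.csv               # output of the prepare stage
      - https://example.com/lookup.csv  # remote (external) dependency
```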

Parameters (params)

Values from parameter files (like params.yaml) used as dependencies:
params:
  - train.learning_rate
  - train.batch_size
  - model.architecture
Parameter dependencies are a special kind of dependency; from dvc/stage/__init__.py:197-200:
@property
def params(self) -> list["ParamsDependency"]:
    from dvc.dependency import ParamsDependency
    return [dep for dep in self.deps if isinstance(dep, ParamsDependency)]
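These dotted keys are looked up in params.yaml by default; a file satisfying the list above might look like this (values illustrative):

```yaml
# params.yaml
train:
  learning_rate: 0.001
  batch_size: 32
model:
  architecture: resnet18
```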

Outputs (outs)

Files or directories the stage produces. Automatically tracked with DVC:
outs:
  - data/processed.csv
  - models/
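Outputs also accept per-output options. For example, cache: false keeps a small output in Git instead of the DVC cache, and persist: true preserves the file across runs instead of deleting it before re-execution (file names here are illustrative):

```yaml
outs:
  - data/processed.csv
  - checkpoints/:
      persist: true
  - report.txt:
      cache: false
```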

Metrics

Numerical outputs for comparing experiments:
metrics:
  - metrics.json:
      cache: false
Metrics typically set cache: false so these small, frequently changing text files are stored in Git rather than the DVC cache.

Plots

Data files for visualization:
plots:
  - plots/training.csv:
      x: epoch
      y: loss
  - plots/confusion_matrix.json:
      template: confusion

Pipeline Execution Flow

When you run dvc repro, DVC follows this process:

1. Dependency Resolution

DVC builds a dependency graph by analyzing all stages. From dvc/stage/utils.py, the system checks for circular dependencies:
def check_circular_dependency(stage):
    """Ensure no path is both a dependency and an output (abridged sketch)."""
    from dvc.exceptions import CircularDependencyError

    circular_dependencies = {d.fs_path for d in stage.deps} & {
        o.fs_path for o in stage.outs
    }
    if circular_dependencies:
        raise CircularDependencyError(str(stage), circular_dependencies)

2. Stage Status Check

For each stage, DVC checks if it needs to run. A stage is outdated if:
  • Any dependency has changed (checksum differs)
  • Any parameter has changed
  • The command has changed
  • Outputs are missing
  • Stage is marked as always_changed
From dvc/stage/__init__.py:239-263:
@property
def is_data_source(self) -> bool:
    """Whether the DVC file was created with `dvc add` or `dvc import`"""
    return self.cmd is None

@property
def is_callback(self) -> bool:
    """
    A callback stage is always considered as changed,
    so it runs on every `dvc repro` call.
    """
    return self.cmd and not any((self.deps, self.outs))
Callback stages (commands with no dependencies or outputs) always run, which makes them useful for notifications or logging.
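
The status checks above can be sketched as a pure function over a lockfile entry. The function name `stage_is_outdated` and the dict shape are hypothetical, not DVC's API, and the command and parameter checks are omitted for brevity:

```python
import hashlib
from pathlib import Path


def file_md5(path: str) -> str:
    """Checksum a file's contents, as DVC does for change detection."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_is_outdated(lock_entry: dict) -> bool:
    """Return True if the stage must re-run (deps/outs checks only)."""
    # Missing outputs always force a re-run.
    if any(not Path(out["path"]).exists() for out in lock_entry["outs"]):
        return True
    # A dependency whose checksum differs from the recorded one
    # marks the stage outdated.
    return any(
        file_md5(dep["path"]) != dep["md5"] for dep in lock_entry["deps"]
    )
```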

3. Execution Order

Stages run in topological order based on dependencies. If stage B depends on stage A’s output, A runs first:
stages:
  stage_a:
    cmd: python a.py
    outs:
      - a.txt
  
  stage_b:
    cmd: python b.py
    deps:
      - a.txt  # Depends on stage_a's output
    outs:
      - b.txt
Execution order: stage_a → stage_b
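
This ordering is a standard topological sort over the stage graph. Python's graphlib illustrates the idea on the two stages above (the graph is built by hand here rather than parsed from dvc.yaml):

```python
from graphlib import TopologicalSorter

# Map each stage to the set of stages whose outputs it consumes.
graph = {
    "stage_b": {"stage_a"},  # stage_b reads a.txt, produced by stage_a
    "stage_a": set(),
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # stage_a is scheduled before stage_b
```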

4. Stage Execution

When a stage runs, DVC:
  1. Changes to the stage’s working directory
  2. Executes the command
  3. Hashes all outputs and updates the lockfile
  4. Caches outputs (unless cache: false)
From dvc/stage/run.py:run_stage:
def run_stage(stage, **kwargs):
    """Run a stage command and capture outputs (abridged sketch)."""
    import subprocess

    # 1. Execute the command in a subprocess from the stage's working dir.
    subprocess.run(stage.cmd, shell=True, cwd=stage.wdir, check=True)
    # 2. The real implementation then hashes each output, stores it in
    #    the cache, and records the new hashes in dvc.lock.

The Lockfile (dvc.lock)

After running, DVC generates dvc.lock with exact versions of all dependencies and outputs:
schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
    - path: data/prepared.csv
      md5: a1b2c3d4e5f6
      size: 1048576
    - path: train.py
      md5: 1a2b3c4d5e6f
      size: 4096
    params:
      params.yaml:
        train.learning_rate: 0.001
        train.epochs: 10
    outs:
    - path: model.pkl
      md5: 9f8e7d6c5b4a
      size: 2097152
The lockfile is implemented in dvc/dvcfile.py:394-476. It serves as:
  • Version snapshot: Records exact state of all inputs/outputs
  • Reproducibility guarantee: Ensures same inputs produce same outputs
  • Change detection: DVC compares current state to lockfile
Best Practice: Commit dvc.lock to Git. It enables reproducibility and prevents unnecessary re-runs.

Pipeline Features

Frozen Stages

Prevent stages from running even if outdated:
stages:
  expensive_stage:
    frozen: true
    cmd: python expensive_process.py
    deps:
      - huge_dataset/
    outs:
      - results/

Stage Descriptions

Document what each stage does:
stages:
  train:
    desc: |
      Train the neural network model using prepared data.
      Uses hyperparameters from params.yaml.
    cmd: python train.py

Working Directory

Run commands from specific directories:
stages:
  preprocess:
    wdir: src/preprocessing
    cmd: python clean.py
    deps:
      - ../../data/raw.csv
    outs:
      - ../../data/clean.csv

Pipeline Visualization

Visualize your pipeline as an ASCII DAG:
dvc dag
         +----------+
         | data.dvc |
         +----------+
              *
              *
              *
         +---------+
         | prepare |
         +---------+
              *
              *
              *
          +-------+
          | train |
          +-------+
              *
              *
              *
        +----------+
        | evaluate |
        +----------+
The DAG implementation uses dvc/dagascii.py to render the graph.

Multiple Pipelines

You can have multiple dvc.yaml files in different directories:
project/
├── dvc.yaml              # Main pipeline
├── data/
│   └── dvc.yaml          # Data preparation pipeline
└── models/
    └── dvc.yaml          # Model training pipeline
Each dvc.yaml creates an independent pipeline with its own dvc.lock.

Advanced: Foreach Stages

Run the same stage with different parameters:
stages:
  train:
    foreach:
      - model_a
      - model_b
      - model_c
    do:
      cmd: python train.py ${item}
      deps:
        - train.py
      outs:
        - models/${item}.pkl
This creates three stages: train@model_a, train@model_b, train@model_c.
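foreach also accepts a mapping, so each generated stage can carry its own values: ${item.<key>} references a value and ${key} the mapping key. The hyperparameters here are illustrative:

```yaml
stages:
  train:
    foreach:
      model_a:
        lr: 0.001
      model_b:
        lr: 0.01
    do:
      cmd: python train.py --lr ${item.lr}
      deps:
        - train.py
      outs:
        - models/${key}.pkl
```

This creates train@model_a and train@model_b, each with its own learning rate.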

Templating with Parameters

Reference parameters directly in dvc.yaml:
# params.yaml
model_path: models/model.pkl
epochs: 50

# dvc.yaml
stages:
  train:
    cmd: python train.py --epochs ${epochs}
    outs:
      - ${model_path}
Templating makes pipelines more flexible and reusable across different configurations.

Comparing Pipelines to Scripts

| Aspect          | Shell Script           | DVC Pipeline             |
|-----------------|------------------------|--------------------------|
| Execution       | Always runs everything | Only runs what changed   |
| Dependencies    | Manual tracking        | Automatic detection      |
| Reproducibility | Document-based         | Code-based with versions |
| Visualization   | None                   | dvc dag                  |
| Parallelization | Manual                 | Automatic with -j flag   |

Next Steps

Experiments

Run pipeline variations with different parameters

Data Versioning

Understand how pipeline outputs are tracked