
Overview

Pipelines in DVC define the steps (stages) of your data processing and ML workflows. Each stage specifies its dependencies, command, and outputs, allowing DVC to automatically detect when recomputation is needed.
DVC pipelines are defined in dvc.yaml files and track stage execution in dvc.lock files.

Creating Your First Pipeline

1. Add your first stage

Use dvc stage add to create a stage in your pipeline:
dvc stage add -n prepare \
  -d data/raw/dataset.csv \
  -o data/prepared/train.csv \
  -o data/prepared/test.csv \
  python scripts/prepare.py
This creates a stage named prepare that:
  • Depends on (-d): data/raw/dataset.csv
  • Outputs (-o): data/prepared/train.csv and data/prepared/test.csv
  • Runs: python scripts/prepare.py
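The script itself is outside DVC's scope; DVC only runs the command and tracks its inputs and outputs. As a rough illustration, the core of a prepare script might be a deterministic train/test split like this (the function name and split logic are hypothetical, stdlib only):

```python
import random

def split_rows(rows, test_frac=0.2, seed=42):
    """Deterministically shuffle rows and split them into train/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

train, test = split_rows(range(100))
# 80 training rows, 20 test rows
```

Fixing the seed keeps the split reproducible, so re-running the stage on unchanged data produces identical outputs.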
2. Add dependent stages

Create a stage that depends on previous outputs:
dvc stage add -n train \
  -d scripts/train.py \
  -d data/prepared/train.csv \
  -p train.epochs,train.lr \
  -o models/model.pkl \
  -m metrics/train.json \
  python scripts/train.py
This stage:
  • Depends on the training script and prepared data
  • Uses parameters (-p) from params.yaml
  • Outputs a model file and metrics
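Conceptually, a stage's contract is: read the declared dependencies and parameters, write the declared outputs and metrics. A heavily simplified, hypothetical train.py honoring that contract (the "model" is a placeholder dict, and the parameters are inlined here rather than read from params.yaml):

```python
import json
import os
import pickle

# Hypothetical stand-ins; a real script would read params.yaml
# and fit an actual estimator on data/prepared/train.csv.
params = {"epochs": 10, "lr": 0.001}
model = {"weights": [0.0] * 4, "epochs": params["epochs"]}

os.makedirs("models", exist_ok=True)
os.makedirs("metrics", exist_ok=True)

with open("models/model.pkl", "wb") as f:      # declared with -o
    pickle.dump(model, f)
with open("metrics/train.json", "w") as f:     # declared with -m
    json.dump({"loss": 0.05, "epochs": params["epochs"]}, f)
```

As long as the script writes exactly the paths declared in the stage, DVC can cache, version, and compare them.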
3. Run your pipeline

Execute the entire pipeline:
dvc repro
DVC automatically:
  • Determines the correct execution order
  • Skips stages that haven’t changed
  • Runs only what’s necessary
Run a specific stage (and its upstream dependencies) with dvc repro train, or force re-execution with dvc repro -f.
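The execution order comes from a topological sort of the stage dependency graph. A tiny conceptual model of the example pipeline, using the standard library's graphlib (the graph literal here is illustrative, not DVC's internal representation):

```python
from graphlib import TopologicalSorter

# Stage -> stages whose outputs it consumes.
graph = {
    "prepare": set(),
    "train": {"prepare"},
    "evaluate": {"train"},
}

# static_order() yields each stage only after all of its
# dependencies, which is the order dvc repro would run them in.
order = list(TopologicalSorter(graph).static_order())
# prepare runs before train, which runs before evaluate
```

In the real tool, stages whose dependency hashes match dvc.lock are skipped rather than executed.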

Understanding dvc.yaml

When you add stages, DVC creates a dvc.yaml file:
dvc.yaml
stages:
  prepare:
    cmd: python scripts/prepare.py
    deps:
      - data/raw/dataset.csv
    outs:
      - data/prepared/train.csv
      - data/prepared/test.csv

  train:
    cmd: python scripts/train.py
    deps:
      - scripts/train.py
      - data/prepared/train.csv
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics/train.json:
          cache: false

  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - scripts/evaluate.py
      - data/prepared/test.csv
      - models/model.pkl
    metrics:
      - metrics/test.json:
          cache: false

Stage Components

Dependencies (-d, --deps)

Files or directories that the stage needs:
dvc stage add -n train \
  -d data/train.csv \
  -d scripts/train.py \
  python scripts/train.py
When a dependency changes, DVC knows the stage needs to be re-run.
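DVC decides staleness by hashing dependency contents (MD5 by default) and comparing the result against what dvc.lock recorded. Conceptually, the check looks like this (a sketch, not DVC's actual implementation):

```python
import hashlib

def file_md5(path):
    """Hash a file's contents in chunks, roughly as DVC does."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

If the current hash of a dependency differs from the one stored in dvc.lock, the stage is considered changed and will be re-run.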

Parameters (-p, --params)

Values from params.yaml that affect stage execution:
params.yaml
train:
  epochs: 10
  lr: 0.001
  batch_size: 32

model:
  layers: [128, 64, 32]
  dropout: 0.2
dvc stage add -n train \
  -p train.epochs,train.lr,train.batch_size \
  python scripts/train.py

Outputs (-o, --outs)

Files or directories created by the stage:
dvc stage add -n train \
  -o models/model.pkl \
  python scripts/train.py
Regular outputs are cached by DVC (recommended for models, data files).

Metrics (-m, --metrics)

JSON, YAML, or CSV files containing metrics:
dvc stage add -n evaluate \
  -d models/model.pkl \
  -m metrics/scores.json \
  python scripts/evaluate.py
Metrics are special outputs that DVC tracks for comparison. With -m they are cached like other outputs; use -M (--metrics-no-cache) if you want the metrics file committed to Git instead of the DVC cache, which is what cache: false in dvc.yaml reflects.
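A metrics file is just structured data written by your script. The scores and filename below are hypothetical; a real evaluate.py would compute them from predictions:

```python
import json
import os

# Hypothetical scores computed by an evaluation script.
scores = {"accuracy": 0.93, "precision": 0.91, "recall": 0.89}

os.makedirs("metrics", exist_ok=True)
with open("metrics/scores.json", "w") as f:
    json.dump(scores, f, indent=2)
```

Once tracked as a metric, the values can be compared across commits with dvc metrics show and dvc metrics diff.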

Plots (--plots)

Data files for visualizations:
dvc stage add -n evaluate \
  --plots plots/confusion_matrix.csv \
  python scripts/evaluate.py
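A plot file is plain tabular data produced by your script. A hypothetical snippet writing a confusion-matrix CSV with the standard library (the column names and rows are illustrative):

```python
import csv
import os

# Hypothetical predictions; the column names become the axes
# when rendered with a confusion-matrix plot template.
rows = [
    ("actual", "predicted"),
    ("cat", "cat"),
    ("cat", "dog"),
    ("dog", "dog"),
]

os.makedirs("plots", exist_ok=True)
with open("plots/confusion_matrix.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Render it with dvc plots show plots/confusion_matrix.csv.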

Advanced Stage Options

Working Directory (-w, --wdir)

Run the command in a specific directory:
dvc stage add -n train \
  -w src/models \
  -d ../../data/train.csv \
  python train.py

Always Changed (--always-changed)

Force a stage to run every time:
dvc stage add -n download \
  --always-changed \
  -o data/external/dataset.csv \
  python scripts/download.py
Use --always-changed sparingly. It bypasses DVC’s caching and dependency tracking.

Description (--desc)

Add human-readable descriptions to stages:
dvc stage add -n train \
  --desc "Train XGBoost model with hyperparameter tuning" \
  python scripts/train.py

Force Overwrite (-f, --force)

Overwrite an existing stage:
dvc stage add -n train -f \
  -d data/train.csv \
  python scripts/new_train.py

Managing Pipelines

Use dvc repro to re-run any stages whose dependencies, parameters, or commands have changed since the last recorded run in dvc.lock.

Pipeline Visualization

View your pipeline structure:
$ dvc dag

         +-------------+
         | data.dvc    |
         +-------------+
                *
                *
                *
          +---------+
          | prepare |
          +---------+
           **        **
         **            **
        *                *
+-------+                +----------+
| train |                | validate |
+-------+                +----------+
        **            **
          **        **
            *      *
         +----------+
         | evaluate |
         +----------+

Complete Example

Here’s a full ML pipeline:
1. Data preparation

dvc stage add -n prepare \
  -d data/raw/dataset.csv \
  -d scripts/prepare.py \
  -o data/prepared/train.csv \
  -o data/prepared/test.csv \
  python scripts/prepare.py
2. Feature engineering

dvc stage add -n featurize \
  -d scripts/featurize.py \
  -d data/prepared/train.csv \
  -d data/prepared/test.csv \
  -o data/features/train.pkl \
  -o data/features/test.pkl \
  python scripts/featurize.py
3. Model training

dvc stage add -n train \
  -d scripts/train.py \
  -d data/features/train.pkl \
  -p train.epochs,train.lr,model \
  -o models/model.pkl \
  -m metrics/train.json \
  python scripts/train.py
4. Model evaluation

dvc stage add -n evaluate \
  -d scripts/evaluate.py \
  -d data/features/test.pkl \
  -d models/model.pkl \
  -m metrics/test.json \
  --plots plots/roc_curve.csv \
  --plots plots/confusion_matrix.csv \
  python scripts/evaluate.py

Best Practices

Small, focused stages

Break pipelines into logical steps. Each stage should do one thing well.

Declare all dependencies

Include scripts, data files, and config files as dependencies for accurate tracking.

Use parameters

Store hyperparameters in params.yaml for easy experimentation.

Version control dvc.yaml

Commit dvc.yaml and dvc.lock to Git to share pipelines with your team.

Next Steps

Running Experiments

Run multiple pipeline variations with different parameters

Remote Storage

Store pipeline outputs and intermediate results remotely
