
Overview

Pipelines in DVC define the steps (stages) of your data processing and ML workflows. Each stage specifies its dependencies, command, and outputs, allowing DVC to automatically detect when recomputation is needed.
DVC pipelines are defined in dvc.yaml files and track stage execution in dvc.lock files.

Creating Your First Pipeline

1. Add your first stage

Use dvc stage add to create a stage in your pipeline:
dvc stage add -n prepare \
  -d data/raw/dataset.csv \
  -o data/prepared/train.csv \
  -o data/prepared/test.csv \
  python scripts/prepare.py
This creates a stage named prepare that:
  • Depends on (-d): data/raw/dataset.csv
  • Outputs (-o): data/prepared/train.csv and data/prepared/test.csv
  • Runs: python scripts/prepare.py
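The script itself is outside DVC's scope; DVC only runs the command and tracks its inputs and outputs. As a rough illustration, the core of a prepare script might be a deterministic train/test split like this (the function name and split logic are hypothetical, stdlib only):

```python
import random

def split_rows(rows, test_frac=0.2, seed=42):
    """Deterministically shuffle rows and split them into train/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

train, test = split_rows(range(100))
# 80 training rows, 20 test rows
```

Fixing the seed keeps the split reproducible, so re-running the stage on unchanged data produces identical outputs.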
2. Add dependent stages

Create a stage that depends on previous outputs:
dvc stage add -n train \
  -d scripts/train.py \
  -d data/prepared/train.csv \
  -p train.epochs,train.lr \
  -o models/model.pkl \
  -m metrics/train.json \
  python scripts/train.py
This stage:
  • Depends on the training script and prepared data
  • Uses parameters (-p) from params.yaml
  • Outputs a model file and metrics
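Conceptually, a stage's contract is: read the declared dependencies and parameters, write the declared outputs and metrics. A heavily simplified, hypothetical train.py honoring that contract (the "model" is a placeholder dict, and the parameters are inlined here rather than read from params.yaml):

```python
import json
import os
import pickle

# Hypothetical stand-ins; a real script would read params.yaml
# and fit an actual estimator on data/prepared/train.csv.
params = {"epochs": 10, "lr": 0.001}
model = {"weights": [0.0] * 4, "epochs": params["epochs"]}

os.makedirs("models", exist_ok=True)
os.makedirs("metrics", exist_ok=True)

with open("models/model.pkl", "wb") as f:      # declared with -o
    pickle.dump(model, f)
with open("metrics/train.json", "w") as f:     # declared with -m
    json.dump({"loss": 0.05, "epochs": params["epochs"]}, f)
```

As long as the script writes exactly the paths declared in the stage, DVC can cache, version, and compare them.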
3. Run your pipeline

Execute the entire pipeline:
dvc repro
DVC automatically:
  • Determines the correct execution order
  • Skips stages that haven’t changed
  • Runs only what’s necessary
Run a specific stage (and its upstream dependencies) with dvc repro train, or force re-execution with dvc repro -f.
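The execution order comes from a topological sort of the stage dependency graph. A tiny conceptual model of the example pipeline, using the standard library's graphlib (the graph literal here is illustrative, not DVC's internal representation):

```python
from graphlib import TopologicalSorter

# Stage -> stages whose outputs it consumes.
graph = {
    "prepare": set(),
    "train": {"prepare"},
    "evaluate": {"train"},
}

# static_order() yields each stage only after all of its
# dependencies, which is the order dvc repro would run them in.
order = list(TopologicalSorter(graph).static_order())
# prepare runs before train, which runs before evaluate
```

In the real tool, stages whose dependency hashes match dvc.lock are skipped rather than executed.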

Understanding dvc.yaml

When you add stages, DVC creates a dvc.yaml file:
dvc.yaml
stages:
  prepare:
    cmd: python scripts/prepare.py
    deps:
      - data/raw/dataset.csv
    outs:
      - data/prepared/train.csv
      - data/prepared/test.csv

  train:
    cmd: python scripts/train.py
    deps:
      - scripts/train.py
      - data/prepared/train.csv
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics/train.json:
          cache: false

  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - scripts/evaluate.py
      - data/prepared/test.csv
      - models/model.pkl
    metrics:
      - metrics/test.json:
          cache: false

Stage Components

Dependencies (-d, --deps)

Files or directories that the stage needs:
dvc stage add -n train \
  -d data/train.csv \
  -d scripts/train.py \
  python scripts/train.py
When a dependency changes, DVC knows the stage needs to be re-run.
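DVC decides staleness by hashing dependency contents (MD5 by default) and comparing the result against what dvc.lock recorded. Conceptually, the check looks like this (a sketch, not DVC's actual implementation):

```python
import hashlib

def file_md5(path):
    """Hash a file's contents in chunks, roughly as DVC does."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

If the current hash of a dependency differs from the one stored in dvc.lock, the stage is considered changed and will be re-run.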

Parameters (-p, --params)

Values from params.yaml that affect stage execution:
params.yaml
train:
  epochs: 10
  lr: 0.001
  batch_size: 32

model:
  layers: [128, 64, 32]
  dropout: 0.2
dvc stage add -n train \
  -p train.epochs,train.lr,train.batch_size \
  python scripts/train.py

Outputs (-o, --outs)

Files or directories created by the stage:
dvc stage add -n train \
  -o models/model.pkl \
  python scripts/train.py
Regular outputs are cached by DVC (recommended for models, data files).

Metrics (-m, --metrics)

JSON, YAML, or CSV files containing metrics:
dvc stage add -n evaluate \
  -d models/model.pkl \
  -m metrics/scores.json \
  python scripts/evaluate.py
Metrics are special outputs that DVC tracks for comparison. With -m they are cached like other outputs; use -M (--metrics-no-cache) if you want the metrics file committed to Git instead of the DVC cache, which is what cache: false in dvc.yaml reflects.
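A metrics file is just structured data written by your script. The scores and filename below are hypothetical; a real evaluate.py would compute them from predictions:

```python
import json
import os

# Hypothetical scores computed by an evaluation script.
scores = {"accuracy": 0.93, "precision": 0.91, "recall": 0.89}

os.makedirs("metrics", exist_ok=True)
with open("metrics/scores.json", "w") as f:
    json.dump(scores, f, indent=2)
```

Once tracked as a metric, the values can be compared across commits with dvc metrics show and dvc metrics diff.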

Plots (--plots)

Data files for visualizations:
dvc stage add -n evaluate \
  --plots plots/confusion_matrix.csv \
  python scripts/evaluate.py
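A plot file is plain tabular data produced by your script. A hypothetical snippet writing a confusion-matrix CSV with the standard library (the column names and rows are illustrative):

```python
import csv
import os

# Hypothetical predictions; the column names become the axes
# when rendered with a confusion-matrix plot template.
rows = [
    ("actual", "predicted"),
    ("cat", "cat"),
    ("cat", "dog"),
    ("dog", "dog"),
]

os.makedirs("plots", exist_ok=True)
with open("plots/confusion_matrix.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Render it with dvc plots show plots/confusion_matrix.csv.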

Advanced Stage Options

Working Directory (-w, --wdir)

Run the command in a specific directory:
dvc stage add -n train \
  -w src/models \
  -d ../../data/train.csv \
  python train.py

Always Changed (--always-changed)

Force a stage to run every time:
dvc stage add -n download \
  --always-changed \
  -o data/external/dataset.csv \
  python scripts/download.py
Use --always-changed sparingly. It bypasses DVC’s caching and dependency tracking.

Description (--desc)

Add human-readable descriptions to stages:
dvc stage add -n train \
  --desc "Train XGBoost model with hyperparameter tuning" \
  python scripts/train.py

Force Overwrite (-f, --force)

Overwrite an existing stage:
dvc stage add -n train -f \
  -d data/train.csv \
  python scripts/new_train.py

Managing Pipelines

Use dvc repro to re-run any stages whose dependencies, parameters, or commands have changed since the last recorded run in dvc.lock.

Pipeline Visualization

View your pipeline structure:
$ dvc dag

         +-------------+
         | data.dvc    |
         +-------------+
                *
                *
                *
          +---------+
          | prepare |
          +---------+
           **        **
         **            **
        *                *
+-------+                +----------+
| train |                | validate |
+-------+                +----------+
        **            **
          **        **
            *      *
         +----------+
         | evaluate |
         +----------+

Complete Example

Here’s a full ML pipeline:
1. Data preparation

dvc stage add -n prepare \
  -d data/raw/dataset.csv \
  -d scripts/prepare.py \
  -o data/prepared/train.csv \
  -o data/prepared/test.csv \
  python scripts/prepare.py
2. Feature engineering

dvc stage add -n featurize \
  -d scripts/featurize.py \
  -d data/prepared/train.csv \
  -d data/prepared/test.csv \
  -o data/features/train.pkl \
  -o data/features/test.pkl \
  python scripts/featurize.py
3. Model training

dvc stage add -n train \
  -d scripts/train.py \
  -d data/features/train.pkl \
  -p train.epochs,train.lr,model \
  -o models/model.pkl \
  -m metrics/train.json \
  python scripts/train.py
4. Model evaluation

dvc stage add -n evaluate \
  -d scripts/evaluate.py \
  -d data/features/test.pkl \
  -d models/model.pkl \
  -m metrics/test.json \
  --plots plots/roc_curve.csv \
  --plots plots/confusion_matrix.csv \
  python scripts/evaluate.py

Best Practices

Small, focused stages

Break pipelines into logical steps. Each stage should do one thing well.

Declare all dependencies

Include scripts, data files, and config files as dependencies for accurate tracking.

Use parameters

Store hyperparameters in params.yaml for easy experimentation.

Version control dvc.yaml

Commit dvc.yaml and dvc.lock to Git to share pipelines with your team.

Next Steps

Running Experiments

Run multiple pipeline variations with different parameters

Remote Storage

Store pipeline outputs and intermediate results remotely
