Overview
Pipelines in DVC define the steps (stages) of your data processing and ML workflows. Each stage specifies its dependencies, command, and outputs, allowing DVC to automatically detect when recomputation is needed.
DVC pipelines are defined in dvc.yaml files, and stage execution is tracked in dvc.lock files.
Creating Your First Pipeline
Add your first stage
Use dvc stage add to create a stage in your pipeline. The example here creates a stage named prepare that:
- Depends on (-d): data/raw/dataset.csv
- Outputs (-o): data/prepared/train.csv and data/prepared/test.csv
- Runs: python scripts/prepare.py
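The stage described above can be sketched as the following command. It assumes an initialized DVC repository and that scripts/prepare.py exists; the paths are the ones listed above.

```shell
# Define the "prepare" stage in dvc.yaml. This records the stage; it does not run it.
dvc stage add -n prepare \
  -d data/raw/dataset.csv \
  -o data/prepared/train.csv \
  -o data/prepared/test.csv \
  python scripts/prepare.py
```

Running dvc repro afterwards executes the stage and writes dvc.lock.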
Add dependent stages
Create a stage that depends on previous outputs. The example stage here:
- Depends on the training script and prepared data
- Uses parameters (-p) from params.yaml
- Outputs a model file and metrics
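A minimal sketch of such a stage follows. The script path, parameter keys, model file, and metrics file are hypothetical names chosen for illustration; only data/prepared/train.csv comes from the prepare stage above.

```shell
# Hypothetical "train" stage. scripts/train.py, train.epochs, train.lr,
# models/model.pkl, and metrics/train.json are illustrative assumptions.
# -M registers the metrics file without storing it in the DVC cache.
dvc stage add -n train \
  -d scripts/train.py \
  -d data/prepared/train.csv \
  -p train.epochs,train.lr \
  -o models/model.pkl \
  -M metrics/train.json \
  python scripts/train.py
```

Because the stage lists data/prepared/train.csv as a dependency, DVC places it after prepare in the pipeline.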
Understanding dvc.yaml
When you add stages, DVC creates a dvc.yaml file:
dvc.yaml
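As a rough illustration, the prepare stage described earlier corresponds to a dvc.yaml along these lines. The sketch writes the file by hand to show its structure; the exact formatting DVC emits may differ slightly.

```shell
# Hand-written equivalent of the "prepare" stage entry (a sketch, not DVC's literal output).
cat > dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: python scripts/prepare.py
    deps:
      - data/raw/dataset.csv
    outs:
      - data/prepared/train.csv
      - data/prepared/test.csv
EOF
```

Each stage name is a key under stages, with cmd, deps, and outs mirroring the flags passed on the command line.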
Stage Components
Dependencies (-d, --deps)
Files or directories that the stage needs:
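A minimal sketch, assuming a hypothetical featurize stage whose script and input directory are both declared with -d:

```shell
# Hypothetical stage; scripts/featurize.py and data/features are illustrative names.
# Both the script and the prepared-data directory are declared as dependencies.
dvc stage add -n featurize \
  -d scripts/featurize.py \
  -d data/prepared \
  -o data/features \
  python scripts/featurize.py
```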
When a dependency changes, DVC knows the stage needs to be re-run.
Parameters (-p, --params)
Values from params.yaml that affect stage execution:
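As a sketch, the -p flag takes a comma-separated list of keys read from params.yaml. The stage name and the keys train.epochs and train.lr are assumptions made for this example.

```shell
# Hypothetical "tune" stage that tracks two parameter keys from params.yaml.
dvc stage add -n tune \
  -p train.epochs,train.lr \
  -d scripts/train.py \
  -o models/tuned.pkl \
  python scripts/train.py
```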
params.yaml
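A parameters file for such a stage might look like the following. The keys and values are illustrative assumptions, written here with a heredoc so the sketch is self-contained.

```shell
# Illustrative params.yaml; the keys and values are assumptions for this sketch.
cat > params.yaml <<'EOF'
train:
  epochs: 10
  lr: 0.001
EOF
```

A stage that tracks train.epochs or train.lr is considered changed when those values are edited.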
Outputs (-o, --outs)
Files or directories created by the stage:
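- Cached outputs (-o, --outs): stored in the DVC cache and tracked by DVC
- Non-cached outputs (-O, --outs-no-cache): declared as stage outputs but not stored in the cache, so they can be tracked with Git
- Persistent outputs (--outs-persist): not removed before the stage is reproduced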
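The three output types can be sketched with one hypothetical stage each. All stage names, scripts, and paths below are illustrative assumptions.

```shell
# -o: cached output, stored in the DVC cache.
dvc stage add -n build_features -d data/prepared -o data/features \
  python scripts/features.py

# -O: non-cached output, left for Git to track.
dvc stage add -n summarize -d data/features -O reports/summary.md \
  python scripts/summarize.py

# --outs-persist: output is not removed before the stage is reproduced.
dvc stage add -n append_log -d data/features --outs-persist logs/history.csv \
  python scripts/append_log.py
```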
Metrics (-m, --metrics)
JSON, YAML, or TOML files containing metrics:
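A minimal sketch, assuming a hypothetical evaluate stage that writes its scores to metrics/eval.json:

```shell
# Hypothetical "evaluate" stage; the script, model, and metrics paths are illustrative.
dvc stage add -n evaluate \
  -d scripts/evaluate.py \
  -d models/model.pkl \
  -m metrics/eval.json \
  python scripts/evaluate.py

# Show the recorded metric values once the stage has run.
dvc metrics show
```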
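Metrics are special outputs that DVC tracks for comparison. With -m they are cached like other outputs; use -M (--metrics-no-cache) to keep a metrics file out of the cache and track it with Git instead.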
Plots (--plots)
Data files for visualizations:
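As a sketch, a hypothetical report stage could register a CSV file for plotting like this. The stage name and paths are assumptions.

```shell
# Hypothetical "report" stage whose output is a CSV file intended for plotting.
dvc stage add -n report \
  -d scripts/report.py \
  -d models/model.pkl \
  --plots plots/confusion.csv \
  python scripts/report.py

# Render the plot files once the stage has run.
dvc plots show
```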
Advanced Stage Options
Working Directory (-w, --wdir)
Run the command in a specific directory:
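A minimal sketch, assuming a hypothetical layout with the training script in src/. With -w, dependency and output paths are given relative to the working directory.

```shell
# Hypothetical stage run from src/; train.py and ../models/model.pkl are
# resolved relative to that directory.
dvc stage add -n train_in_src \
  -w src \
  -d train.py \
  -o ../models/model.pkl \
  python train.py
```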
Always Changed (--always-changed)
Force a stage to run every time:
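This can suit stages that read from a source DVC cannot track, such as a remote API. A sketch with a hypothetical fetch stage:

```shell
# Hypothetical "fetch" stage; DVC treats it as changed on every dvc repro.
dvc stage add -n fetch \
  --always-changed \
  -o data/raw/latest.json \
  python scripts/fetch.py
```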
Description (--desc)
Add human-readable descriptions to stages:
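A short sketch with a hypothetical clean stage; the description text and paths are illustrative:

```shell
# Hypothetical "clean" stage; the description is stored with the stage in dvc.yaml.
dvc stage add -n clean \
  --desc "Remove duplicate rows and normalize column names" \
  -d data/raw/dataset.csv \
  -o data/clean/dataset.csv \
  python scripts/clean.py
```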
Force Overwrite (-f, --force)
Overwrite an existing stage:
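As a sketch, the prepare stage from earlier could be redefined to also list its script as a dependency. Without -f, adding a stage whose name already exists is rejected.

```shell
# Redefine the existing "prepare" stage; -f replaces the entry already in dvc.yaml.
dvc stage add -f -n prepare \
  -d scripts/prepare.py \
  -d data/raw/dataset.csv \
  -o data/prepared/train.csv \
  -o data/prepared/test.csv \
  python scripts/prepare.py
```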
Managing Pipelines
Pipeline Visualization
View your pipeline structure with dvc dag, which prints the stage graph in the terminal.
Complete Example
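As a hedged sketch of a complete pipeline, the dvc.yaml below combines the prepare stage described earlier with a hypothetical train stage. Everything in the train stage (script, parameter keys, model and metrics paths) is an assumption made for illustration.

```shell
# Write a two-stage pipeline by hand. The train stage is hypothetical.
cat > dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: python scripts/prepare.py
    deps:
      - scripts/prepare.py
      - data/raw/dataset.csv
    outs:
      - data/prepared/train.csv
      - data/prepared/test.csv
  train:
    cmd: python scripts/train.py
    deps:
      - scripts/train.py
      - data/prepared/train.csv
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics/train.json:
          cache: false
EOF

# In an initialized DVC repository, the pipeline could then be run and inspected:
#   dvc repro   # run stages in dependency order and write dvc.lock
#   dvc dag     # print the stage graph
```

Because train depends on an output of prepare, DVC runs prepare first and skips any stage whose inputs have not changed.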
A full ML pipeline chains several stages, each consuming the outputs of the one before it.
Best Practices
Small, focused stages
Break pipelines into logical steps. Each stage should do one thing well.
Declare all dependencies
Include scripts, data files, and config files as dependencies for accurate tracking.
Use parameters
Store hyperparameters in params.yaml for easy experimentation.
Version control dvc.yaml
Commit dvc.yaml and dvc.lock to Git to share pipelines with your team.
Next Steps
Running Experiments
Run multiple pipeline variations with different parameters
Remote Storage
Store pipeline outputs and intermediate results remotely