What are Pipelines?
Pipelines in DVC are directed acyclic graphs (DAGs) of data processing stages. Each stage represents a command that transforms inputs into outputs, with dependencies automatically tracked. Pipelines enable reproducible, automated workflows from raw data to final results.
Key Concept: Pipelines are defined in dvc.yaml files, with stages connected through dependencies. DVC automatically determines execution order and only runs stages when their dependencies change.
Why Pipelines Matter
- Reproducibility: Codify the entire workflow from raw data to results
- Automation: Run only what changed with dvc repro
- Visibility: Visualize and understand complex workflows with dvc dag
- Collaboration: Share workflows as code, not documentation
- Version control: Track how processing logic evolves alongside data
Pipeline Structure
Pipelines are defined in dvc.yaml files. A typical machine learning pipeline is written as a set of named stages under a top-level stages key.
Stage Anatomy
Each stage in dvc.yaml has several components, implemented in dvc/stage/__init__.py:
Command (cmd)
The command to execute. It can be any shell command. The corresponding code is at dvc/stage/__init__.py:130-154.
Dependencies (deps)
Files or directories the stage needs. If any dependency changes, the stage is considered outdated. Besides local files and directories, dependencies can be:
- Remote files: URLs or cloud storage paths
- Outputs from other stages: Automatic pipeline chaining
- External repo files: From other DVC/Git repositories
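The checksum comparison behind "the stage is considered outdated" can be sketched as follows. This is a minimal illustration that assumes an MD5 hash over file contents; the helper names file_md5 and dependency_changed are hypothetical and are not DVC's API, and directory or remote dependencies are left out.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dependency_changed(path: Path, recorded_md5: str) -> bool:
    """Hypothetical check: changed if the file is missing or its hash differs."""
    if not path.exists():
        return True
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "raw.csv"
    data.write_text("a,b\n1,2\n")
    recorded = file_md5(data)                          # hash stored at the last run
    unchanged = dependency_changed(data, recorded)     # same content, same hash
    data.write_text("a,b\n1,3\n")
    changed = dependency_changed(data, recorded)       # edited content, new hash
```

Comparing hashes rather than timestamps means that touching a file without editing it does not mark the stage as outdated.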
Parameters (params)
Values from parameter files (like params.yaml) used as dependencies. See dvc/stage/__init__.py:197-200.
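To illustrate why parameters behave like fine-grained dependencies, here is a small sketch that compares only the tracked keys of a parsed parameter file against previously recorded values. The dotted-key convention (train.lr) and the function names are assumptions made for this example, not DVC internals.

```python
from typing import Any

_MISSING = object()  # marks a tracked key that no longer exists


def get_param(params: dict, dotted_key: str) -> Any:
    """Look up a nested value such as 'train.lr' in a parsed params file."""
    value: Any = params
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value


def changed_params(current: dict, locked: dict) -> list:
    """Return tracked keys whose current value differs from the recorded one."""
    return [key for key, old in locked.items() if get_param(current, key) != old]


# Hypothetical parsed params.yaml; train.epochs is present but not tracked.
params = {"train": {"lr": 0.01, "epochs": 10}, "seed": 42}
locked = {"train.lr": 0.001, "seed": 42}  # values recorded at the last run
outdated_keys = changed_params(params, locked)
```

Because only the listed keys are compared, editing an untracked value such as train.epochs in this sketch would not mark the stage as outdated.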
Outputs (outs)
Files or directories the stage produces. They are automatically tracked with DVC.
Metrics
Numerical outputs for comparing experiments. Metrics typically have cache: false since they’re small text files that change frequently.
Plots
Data files for visualization.
Pipeline Execution Flow
When you run dvc repro, DVC follows this process:
1. Dependency Resolution
DVC builds a dependency graph by analyzing all stages. Code in dvc/stage/utils.py checks for circular dependencies.
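A generic way to detect a circular dependency is a depth-first search that reports a cycle when it reaches a stage that is still on the current path. The sketch below shows that technique on a plain dictionary; it is not the code in dvc/stage/utils.py, and find_cycle is a hypothetical name.

```python
def find_cycle(graph: dict) -> list:
    """Return one cycle as a list of stage names, or [] if the graph is acyclic.

    `graph` maps each stage name to the stages it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return []

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return []


acyclic = {"prepare": [], "train": ["prepare"], "evaluate": ["train"]}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
no_cycle = find_cycle(acyclic)
cycle = find_cycle(cyclic)
```

Returning the offending path, rather than just a boolean, makes it possible to tell the user which stages form the loop.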
2. Stage Status Check
For each stage, DVC checks if it needs to run. A stage is outdated if:
- Any dependency has changed (checksum differs)
- Any parameter has changed
- The command has changed
- Outputs are missing
- Stage is marked as always_changed
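The conditions above can be combined into a single check. The following sketch uses a hypothetical StageState record to compare the current state of a stage with the state recorded at its last run; the field names are chosen for illustration and do not mirror DVC's Stage class.

```python
from dataclasses import dataclass, field


@dataclass
class StageState:
    """Hypothetical snapshot of everything the status check looks at."""
    cmd: str
    dep_hashes: dict = field(default_factory=dict)
    params: dict = field(default_factory=dict)
    outs_present: bool = True
    always_changed: bool = False


def outdated_reasons(current: StageState, locked: StageState) -> list:
    """Return every reason the stage must rerun; an empty list means up to date."""
    reasons = []
    if current.always_changed:
        reasons.append("always_changed")
    if current.cmd != locked.cmd:
        reasons.append("command changed")
    if current.dep_hashes != locked.dep_hashes:
        reasons.append("dependency changed")
    if current.params != locked.params:
        reasons.append("parameter changed")
    if not current.outs_present:
        reasons.append("outputs missing")
    return reasons


locked = StageState(
    cmd="python train.py",
    dep_hashes={"data/clean.csv": "abc"},
    params={"train.lr": 0.001},
)
current = StageState(
    cmd="python train.py",
    dep_hashes={"data/clean.csv": "def"},  # the input file was edited
    params={"train.lr": 0.001},
)
reasons = outdated_reasons(current, locked)
up_to_date = outdated_reasons(locked, locked)
```

Collecting all reasons instead of stopping at the first one is a design choice for this sketch; it makes status output more informative at negligible cost.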
Stage Change Detection Logic
The change detection logic is implemented in dvc/stage/__init__.py:239-263. Callback stages (commands with no dependencies or outputs) always run, which is useful for notifications or logging.
3. Execution Order
Stages run in topological order based on dependencies. If stage B depends on stage A’s output, A runs first:
stage_a → stage_b
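As a sketch of how an execution order can be derived, the example below links each stage to the stages that produce its dependencies and then sorts the result with Python's standard graphlib module. The stage names and file paths are made up for illustration, and DVC's own graph construction is not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical stages, described only by what they read and write.
stages = {
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
}

# Map every output path to the stage that produces it.
producer = {out: name for name, spec in stages.items() for out in spec["outs"]}

# A stage depends on whichever stages produce its inputs. Inputs with no
# producer (such as data/raw.csv) are plain files and add no edge.
graph = {
    name: {producer[dep] for dep in spec["deps"] if dep in producer}
    for name, spec in stages.items()
}

# TopologicalSorter expects a mapping of node -> predecessors.
order = list(TopologicalSorter(graph).static_order())
```

The stages are deliberately declared out of order to show that the resulting order comes from the dependency edges, not from the order of declaration.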
4. Stage Execution
When a stage runs, DVC:
- Changes to the stage’s working directory
- Executes the command
- Hashes all outputs and updates the lockfile
- Caches outputs (unless cache: false)
This flow is implemented by run_stage in dvc/stage/run.py.
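The steps above can be sketched in a few lines. This is a simplified illustration, not the run_stage implementation: the function name run_stage_sketch, the in-memory lock dictionary, and the two-level cache layout are assumptions for the example, and the command is passed as an argument list rather than a shell string so that it runs the same way on any platform.

```python
import hashlib
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_stage_sketch(cmd: list, wdir: Path, outs: list, cache_dir: Path) -> dict:
    """Run a command in wdir, then hash and cache each declared output."""
    # Steps 1 and 2: execute the command from the stage's working directory.
    subprocess.run(cmd, cwd=wdir, check=True)

    lock = {}
    for out in outs:
        src = wdir / out
        # Step 3: hash the output and record it (stands in for the lockfile).
        digest = hashlib.md5(src.read_bytes()).hexdigest()
        lock[out] = digest
        # Step 4: store a copy under a content-addressed path (assumed layout).
        target = cache_dir / digest[:2] / digest[2:]
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)
    return lock


with tempfile.TemporaryDirectory() as tmp:
    wdir = Path(tmp)
    cmd = [sys.executable, "-c", "open('out.txt', 'w').write('hello')"]
    lock = run_stage_sketch(cmd, wdir, ["out.txt"], wdir / "cache")
    digest = lock["out.txt"]
    cached = (wdir / "cache" / digest[:2] / digest[2:]).exists()
```

Using check=True makes a failing command raise an exception before anything is hashed, so a broken run never updates the recorded state.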
The Lockfile (dvc.lock)
After running, DVC generates dvc.lock with exact versions of all dependencies and outputs. The lockfile is handled in dvc/dvcfile.py:394-476. It serves as:
- Version snapshot: Records exact state of all inputs/outputs
- Reproducibility record: Lets you check that a rerun on the same inputs yields the same outputs
- Change detection: DVC compares current state to lockfile
Pipeline Features
Frozen Stages
Prevent stages from running even if outdated by setting frozen: true on the stage.
Stage Descriptions
Document what each stage does with the desc field.
Working Directory
Run commands from specific directories with the wdir field.
Pipeline Visualization
Visualize your pipeline as an ASCII DAG with dvc dag. DVC uses dvc/dagascii.py to render the graph.
Multiple Pipelines
You can have multiple dvc.yaml files in different directories. Each dvc.yaml creates an independent pipeline with its own dvc.lock.
Advanced: Foreach Stages
Run the same stage with different parameters. A train stage that iterates over model_a, model_b, and model_c generates three stages: train@model_a, train@model_b, train@model_c.
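To make the naming scheme concrete, the sketch below expands a foreach/do definition into one stage per item and substitutes ${item} in string fields. The expand_foreach helper and the train.py command are hypothetical; DVC's own resolver supports more than this simple list case.

```python
def expand_foreach(name: str, definition: dict) -> dict:
    """Expand a stage with a `foreach` list into one concrete stage per item."""
    template = definition["do"]
    expanded = {}
    for item in definition["foreach"]:
        expanded[f"{name}@{item}"] = {
            key: value.replace("${item}", str(item)) if isinstance(value, str) else value
            for key, value in template.items()
        }
    return expanded


# Mirrors the shape of a foreach stage as it would appear in dvc.yaml.
train = {
    "foreach": ["model_a", "model_b", "model_c"],
    "do": {"cmd": "python train.py --model ${item}"},
}
stages = expand_foreach("train", train)
stage_names = list(stages)
```

Each generated stage is an ordinary stage afterwards, so it gets its own status check and its own entry in the lockfile.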
Templating with Parameters
Reference parameters directly in dvc.yaml with ${...} expressions.
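The substitution itself can be approximated with a regular expression that replaces each ${dotted.key} with the matching value from a parsed parameter file. This is a rough sketch under that assumption; the interpolate function and the example keys are hypothetical, and an unknown key simply raises KeyError here.

```python
import re

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def interpolate(text: str, params: dict) -> str:
    """Replace every ${a.b} placeholder with the nested value params['a']['b']."""
    def lookup(match: re.Match) -> str:
        value = params
        for part in match.group(1).strip().split("."):
            value = value[part]  # KeyError signals an unknown parameter
        return str(value)

    return _PLACEHOLDER.sub(lookup, text)


# Hypothetical parsed params.yaml.
params = {"train": {"epochs": 10, "lr": 0.01}}
cmd = interpolate("python train.py --epochs ${train.epochs} --lr ${train.lr}", params)
```

Failing loudly on a missing key is intentional in this sketch: a silently empty substitution would produce a command that runs with the wrong arguments.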
Comparing Pipelines to Scripts
| Aspect | Shell Script | DVC Pipeline |
|---|---|---|
| Execution | Always runs everything | Only runs what changed |
| Dependencies | Manual tracking | Automatic detection |
| Reproducibility | Document-based | Code-based with versions |
| Visualization | None | dvc dag |
| Parallelization | Manual | Manual: dvc repro does not parallelize stages automatically |
Related Commands
- dvc repro - Reproduce a pipeline
- dvc dag - Visualize pipeline structure
- dvc stage - Manage pipeline stages
- dvc run - Create a new stage
- dvc status - Check pipeline status
Next Steps
Experiments
Run pipeline variations with different parameters
Data Versioning
Understand how pipeline outputs are tracked