File Types Overview
- .dvc files: single-stage files for tracking data
- dvc.yaml: multi-stage pipeline definitions
- dvc.lock: lock file for reproducibility
.dvc Files (Single-Stage Files)
.dvc files are used to track individual data files or directories. They’re created with dvc add or when defining single-stage operations.
Basic Structure
A typical .dvc file contains output metadata:
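As a minimal sketch, a .dvc file written by dvc add data.csv might look like the following. The hash and size are placeholder values rather than real output, and the hash field assumes a DVC 3.x style file.

```yaml
# data.csv.dvc (illustrative; md5 and size are placeholders)
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 14445097
  hash: md5
  path: data.csv
```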
Complete Schema
- outs: List of output files or directories tracked by this .dvc file
- deps: List of dependencies (for single-stage files with commands)
- cmd: Command to execute (for single-stage files)
- wdir: Working directory for the command
- md5: MD5 checksum of the stage definition
- frozen: Whether the stage is frozen (won’t be re-executed)
- always_changed: Always consider this stage as changed
- meta: Custom metadata for the stage
- desc: Description of the stage
Examples
Tracking a single file
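A hedged illustration, assuming a hypothetical file data/raw.csv added with dvc add: the .dvc file is created next to the target, so the path inside it is relative to the .dvc file's own directory. Hash and size are placeholders.

```yaml
# data/raw.csv.dvc (illustrative; file name, md5 and size are assumptions)
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 2048
  hash: md5
  path: raw.csv   # relative to the directory containing this .dvc file
```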
Tracking a directory
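A sketch of what tracking a hypothetical images directory might produce. The values are placeholders; nfiles records how many files the directory contained.

```yaml
# images.dvc (illustrative; md5, size and nfiles are placeholders)
outs:
- md5: 0123456789abcdef0123456789abcdef.dir
  size: 1048576
  nfiles: 42
  hash: md5
  path: images
```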
Directory checksums end with .dir and represent a hash of all files within.
Single-stage with command
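A minimal sketch of a single-stage file that carries a command, built only from the schema fields listed above. The script and file names are hypothetical, the hashes are placeholders, and this layout is assumed to follow the legacy single-stage format; current pipelines normally use dvc.yaml instead.

```yaml
# train.dvc (illustrative legacy single-stage file; all values are assumptions)
md5: 0123456789abcdef0123456789abcdef   # checksum of the stage definition
cmd: python train.py data.csv model.pkl
wdir: .
deps:
- md5: 11111111111111111111111111111111
  path: train.py
- md5: 22222222222222222222222222222222
  path: data.csv
outs:
- md5: 33333333333333333333333333333333
  path: model.pkl
```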
Output with custom remote
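As a sketch, an output can name the remote it should be pushed to and pulled from through a remote field. The remote name my-s3-remote is hypothetical and would need to exist in the DVC config; hash and size are placeholders.

```yaml
# large_dataset.bin.dvc (illustrative)
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 5368709120
  hash: md5
  path: large_dataset.bin
  remote: my-s3-remote   # hypothetical remote name defined in .dvc/config
```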
Non-cached output
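A hedged example of an output that DVC records but does not copy into its cache. The script and file names are assumptions.

```yaml
# evaluate.dvc (illustrative single-stage file; names are hypothetical)
cmd: python evaluate.py model.pkl metrics.json
deps:
- path: evaluate.py
- path: model.pkl
outs:
- path: metrics.json
  cache: false   # keep the file out of the DVC cache; it can be committed to Git instead
```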
Setting cache: false is useful for small files like metrics that don’t need caching.
dvc.yaml (Pipeline Files)
dvc.yaml files define multi-stage pipelines with dependencies, parameters, and outputs.
Basic Structure
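A minimal two-stage pipeline might be sketched as follows. Script names, data paths, and the train.lr parameter are hypothetical; the parameter is assumed to live in the default params.yaml.

```yaml
# dvc.yaml (illustrative)
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/prepared.csv
    params:
      - train.lr          # read from params.yaml
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Because train depends on data/prepared.csv, which prepare produces, DVC can infer that prepare runs first.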
Complete Schema
- stages: Dictionary of pipeline stages, where keys are stage names
- vars: Variables that can be referenced in the pipeline using ${var}
- params: Global parameter files to track
- metrics: Global metric files
- plots: Global plot definitions
- artifacts: Model registry artifacts
- datasets: Dataset definitions
Advanced Examples
Stage with detailed outputs
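Outputs can be given as mappings from a path to per-output options. The sketch below uses the desc, persist, cache, and push options; the paths and script are assumptions.

```yaml
# dvc.yaml (illustrative)
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pkl:
          desc: Trained classifier weights
          persist: true    # do not delete the file before the stage re-runs
      - logs:
          cache: false     # tracked, but not stored in the DVC cache
      - scratch:
          push: false      # cached locally, but skipped by dvc push
```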
Multi-command stage
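The cmd field also accepts a list, in which case the commands run in order and the stage stops at the first failure. The commands below are hypothetical.

```yaml
# dvc.yaml (illustrative)
stages:
  build_dataset:
    cmd:
      - mkdir -p data/processed
      - python download.py
      - python clean.py
    outs:
      - data/processed
```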
Foreach iteration
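A sketch of a foreach stage over three split names. The script and data paths are assumptions; ${item} is replaced by each list element in turn.

```yaml
# dvc.yaml (illustrative)
stages:
  process:
    foreach:
      - train
      - test
      - val
    do:
      cmd: python process.py data/${item}.csv processed/${item}.csv
      deps:
        - data/${item}.csv
      outs:
        - processed/${item}.csv
```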
This creates three stages: process@train, process@test, and process@val.
Matrix for hyperparameter sweep
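A hedged sketch of a matrix stage: DVC generates one stage per combination of the listed values, four in this case. The parameter names, values, and script are hypothetical.

```yaml
# dvc.yaml (illustrative)
stages:
  train:
    matrix:
      lr: [0.01, 0.001]
      epochs: [10, 50]
    cmd: python train.py --lr ${item.lr} --epochs ${item.epochs}
    outs:
      - models/lr_${item.lr}_epochs_${item.epochs}.pkl
```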
Using variables
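Values declared under vars can be interpolated with ${...}, including nested keys. The variable names below are assumptions chosen for illustration.

```yaml
# dvc.yaml (illustrative)
vars:
  - data_dir: data
  - model:
      name: classifier
stages:
  train:
    cmd: python train.py --data ${data_dir} --name ${model.name}
    deps:
      - ${data_dir}/prepared.csv
    outs:
      - models/${model.name}.pkl
```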
Working directory example
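With wdir set, the command runs in that directory and the stage's paths are resolved relative to it. The directory layout here is hypothetical.

```yaml
# dvc.yaml (illustrative)
stages:
  train:
    wdir: src
    cmd: python train.py
    deps:
      - train.py                 # resolves to src/train.py
      - ../data/prepared.csv
    outs:
      - ../models/model.pkl
```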
dvc.lock (Lock Files)
dvc.lock is automatically generated and should not be edited manually. It ensures reproducibility by recording exact states.
Structure
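For orientation, a lock file for a single hypothetical train stage might resemble the sketch below. Every hash, size, and parameter value is a placeholder; real files are written by DVC itself.

```yaml
# dvc.lock (illustrative; generated by DVC, not written by hand)
schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
    - path: data/prepared.csv
      hash: md5
      md5: 0123456789abcdef0123456789abcdef
      size: 2048
    params:
      params.yaml:
        train.lr: 0.001
    outs:
    - path: model.pkl
      hash: md5
      md5: fedcba9876543210fedcba9876543210
      size: 4096
```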
Schema Fields
- schema: Lock file schema version (currently “2.0”)
- stages: Locked state of each stage
- datasets: Locked dataset states
Lock File Features
DVC uses the lock file to determine if a stage needs to be re-executed:
- If dependencies or parameters change, the stage runs again
- If the lock file matches current state, the stage is skipped
File Naming Conventions
Valid .dvc filenames
- data.csv.dvc
- model.pkl.dvc
- images.dvc
- any_name.dvc
Pipeline files
- dvc.yaml (standard)
- dvc.lock (auto-generated)
- Custom: pipeline.yaml ❌
- Custom: train.dvc.yaml ❌
Pipeline files must be named exactly dvc.yaml. The .dvc extension is only for single-stage tracking files.
Best Practices
Commit all DVC files to Git
Always track these files:
- .dvc files
- dvc.yaml
- dvc.lock
- params.yaml
Do not commit:
- Actual data files
- Cache directories
- .dvc/config.local
Use descriptive stage names
Good names say what the stage does; bad names are generic or opaque.
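As an illustration with hypothetical stage names and scripts, compare names that state a purpose with names that only mark position:

```yaml
# dvc.yaml (illustrative)
stages:
  # Descriptive: each name states the stage's purpose
  extract_features:
    cmd: python features.py
  train_classifier:
    cmd: python train.py
  # Vague alternatives to avoid (shown commented out):
  # stage1:
  #   cmd: python features.py
  # run:
  #   cmd: python train.py
```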
Add descriptions to stages
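A short sketch of the desc field on a stage; the wording and script name are assumptions.

```yaml
# dvc.yaml (illustrative)
stages:
  prepare:
    desc: Clean the raw data and split it into train and test sets
    cmd: python prepare.py
    deps:
      - data/raw.csv
    outs:
      - data/prepared
```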
Organize parameters by purpose
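One possible layout, assuming hypothetical parameter names, groups keys in params.yaml by the stage or concern they belong to:

```yaml
# params.yaml (illustrative; all keys and values are assumptions)
prepare:
  test_split: 0.2
  seed: 42
train:
  lr: 0.001
  epochs: 10
evaluate:
  threshold: 0.5
```

A stage can then list a whole group (for example, train) or a single key (for example, train.lr) under its params section.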
Use meaningful metadata
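The meta field accepts arbitrary user-defined data. The keys below are hypothetical examples of what a team might record; DVC does not interpret them.

```yaml
# dvc.yaml (illustrative)
stages:
  train:
    cmd: python train.py
    meta:
      owner: ml-team          # hypothetical key
      ticket: PROJ-123        # hypothetical key
      framework: scikit-learn
```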
Related Commands
Next Steps
- Configuration: learn about DVC configuration files
- Remote Storage: configure remote storage backends