DVC uses several file formats to track data, define pipelines, and lock reproducible states. This guide explains the structure and purpose of each file type.

File Types Overview

  • .dvc files: single-stage files for tracking data
  • dvc.yaml: multi-stage pipeline definitions
  • dvc.lock: lock file for reproducibility

.dvc Files (Single-Stage Files)

.dvc files are used to track individual data files or directories. They’re created with dvc add or when defining single-stage operations.

Basic Structure

A typical .dvc file contains output metadata:
outs:
- md5: a304afb96060aad90176268345e10355
  size: 37891850
  path: model.pkl
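The md5 and size fields are derived from the file's contents. As a rough sketch of the idea (DVC's actual implementation differs in details such as chunking and directory handling):

```python
import hashlib
import os

def file_meta(path):
    """Compute the MD5 and size recorded in a .dvc 'outs' entry.

    Illustrative sketch only; DVC streams files in chunks and treats
    directories separately."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return {"md5": md5.hexdigest(), "size": os.path.getsize(path)}
```

Because the checksum depends only on content, renaming or moving a file does not change its md5, while any edit to the bytes does.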

Complete Schema

  • outs (array, required): list of output files or directories tracked by this .dvc file
  • deps (array): list of dependencies (for single-stage files with commands)
  • cmd (string): command to execute (for single-stage files)
  • wdir (string): working directory for the command
  • md5 (string): MD5 checksum of the stage definition
  • frozen (boolean, default false): whether the stage is frozen (won't be re-executed)
  • always_changed (boolean, default false): always consider this stage as changed
  • meta (object): custom metadata for the stage
  • desc (string): description of the stage

Examples

Tracking a single file:

outs:
- md5: 3d1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d
  size: 1024000
  path: data/dataset.csv
Tracking a directory:

outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6.dir
  size: 50000000
  nfiles: 1000
  path: data/images
Directory checksums end with .dir and represent a hash of all files within.
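A directory checksum can be pictured as the hash of a canonical listing of the files inside it, so it changes whenever any contained file changes. A minimal sketch of that principle (DVC's exact on-disk serialization may differ):

```python
import hashlib
import json

def dir_checksum(entries):
    """Hash a canonical JSON listing of a directory's files.

    `entries` is a list of {"md5": ..., "relpath": ...} dicts, one per
    file. Sorting makes the result independent of listing order. A
    sketch of the idea, not DVC's exact format."""
    listing = json.dumps(
        sorted(entries, key=lambda e: e["relpath"]),
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.md5(listing.encode()).hexdigest() + ".dir"
```

Two directories with identical file contents at identical relative paths produce the same .dir checksum regardless of the order in which the files were added.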
Single-stage file with a command:

cmd: python preprocess.py
deps:
- path: raw_data.csv
  md5: 5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c
  size: 2048000
outs:
- md5: a304afb96060aad90176268345e10355
  size: 1536000
  path: processed_data.csv
md5: 9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e
Output pushed to a specific remote:

outs:
- md5: e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
  size: 5000000
  path: large_model.pkl
  remote: s3-large-files
  push: true
Uncached output:

outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 1024
  path: metrics.json
  cache: false
Setting cache: false is useful for small files like metrics that don’t need caching.

dvc.yaml (Pipeline Files)

dvc.yaml files define multi-stage pipelines with dependencies, parameters, and outputs.

Basic Structure

stages:
  prepare:
    cmd: python prepare.py
    deps:
      - raw_data.csv
    outs:
      - prepared_data.csv

  train:
    cmd: python train.py
    deps:
      - prepared_data.csv
      - train.py
    params:
      - lr
      - epochs
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false

Complete Schema

  • stages (object, required): dictionary of pipeline stages, where keys are stage names
  • vars (array | object): variables that can be referenced in the pipeline using ${var}
  • params (array): global parameter files to track
  • metrics (array): global metric files
  • plots (array): global plot definitions
  • artifacts (object): model registry artifacts
  • datasets (array): dataset definitions

Advanced Examples

Stage with output, metric, and plot options:

stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
    params:
      - model.architecture
      - training.epochs
    outs:
      - model.pkl:
          desc: "Trained XGBoost model"
          remote: s3-models
    metrics:
      - metrics.json:
          cache: false
    plots:
      - plots/training_loss.csv:
          x: epoch
          y: loss
          title: "Training Loss"
Multiple commands in one stage:

stages:
  build:
    cmd:
      - echo "Building model..."
      - python build.py
      - echo "Build complete"
    outs:
      - model/
Templated stages with foreach:

stages:
  process:
    foreach:
      - train
      - test
      - val
    do:
      cmd: python process.py ${item}
      deps:
        - raw/${item}.csv
      outs:
        - processed/${item}.csv
This creates three stages: process@train, process@test, and process@val.
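Conceptually, the expansion substitutes each list item into the do block and suffixes the stage name. A simplified sketch (flat string and list-of-string fields only, not DVC's actual templating engine):

```python
def expand_foreach(name, items, do):
    """Expand a foreach stage into one concrete stage per item.

    Each list item yields a stage named '<stage>@<item>' with ${item}
    interpolated. Handles only flat fields; DVC's real interpolation is
    richer (dict items, ${item.key}, etc.)."""
    def subst(value, item):
        if isinstance(value, list):
            return [v.replace("${item}", item) for v in value]
        return value.replace("${item}", item)

    return {
        f"{name}@{item}": {field: subst(v, item) for field, v in do.items()}
        for item in items
    }
```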
A grid of stages with matrix:

stages:
  train:
    matrix:
      lr: [0.001, 0.01, 0.1]
      optimizer: [adam, sgd]
    cmd: python train.py --lr ${item.lr} --opt ${item.optimizer}
    outs:
      - models/${item.lr}-${item.optimizer}.pkl
This expands into six stages, one per combination of lr and optimizer values.
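matrix behaves like a multi-axis foreach: one stage per combination in the cross product of the axes. A sketch of that expansion (the generated stage names here are illustrative; dvc stage list shows the names DVC actually assigns):

```python
from itertools import product

def expand_matrix(name, matrix, cmd_template):
    """Expand a matrix stage into one command per axis combination.

    Interpolates only ${item.<axis>} inside cmd; a sketch, not DVC's
    templating engine. Names join axis values with '-'."""
    axes = list(matrix)
    stages = {}
    for values in product(*matrix.values()):
        cmd = cmd_template
        for axis, value in zip(axes, values):
            cmd = cmd.replace("${item.%s}" % axis, str(value))
        stages["%s@%s" % (name, "-".join(map(str, values)))] = cmd
    return stages
```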
Variables with vars:

vars:
  - data_dir: /mnt/data
  - model_name: xgboost_v2

stages:
  train:
    cmd: python train.py --data ${data_dir} --name ${model_name}
    deps:
      - ${data_dir}/train.csv
    outs:
      - models/${model_name}.pkl
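Variable references resolve by simple substitution. A minimal sketch of ${var} interpolation for the flat case above (DVC's templating also supports nested keys and values from params.yaml):

```python
import re

def interpolate(text, variables):
    """Replace ${name} references with values from `variables`.

    Covers only flat names, as in the vars example above; a sketch of
    the behavior, not DVC's parser."""
    return re.sub(
        r"\$\{([A-Za-z_]\w*)\}",
        lambda m: str(variables[m.group(1)]),
        text,
    )
```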
Custom working directory with wdir:

stages:
  train:
    wdir: ../experiments
    cmd: python train.py
    deps:
      - ../data/dataset.csv
    outs:
      - model.pkl
When wdir is set, dependency and output paths are resolved relative to that working directory, not to the dvc.yaml location.

dvc.lock (Lock Files)

dvc.lock is automatically generated and should not be edited manually. It ensures reproducibility by recording exact states.

Structure

schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
    - path: data/train.csv
      md5: a304afb96060aad90176268345e10355
      size: 1536000
    - path: train.py
      md5: 5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c
      size: 4096
    params:
      params.yaml:
        lr: 0.001
        epochs: 100
    outs:
    - path: model.pkl
      md5: e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
      size: 5000000

Schema Fields

  • schema (string, required): lock file schema version (currently "2.0")
  • stages (object): locked state of each stage
  • datasets (array): locked dataset states

Lock File Features

DVC uses the lock file to determine if a stage needs to be re-executed:
  • If dependencies or parameters change, the stage runs again
  • If the lock file matches current state, the stage is skipped
Always commit dvc.lock to version control. It’s essential for reproducibility and collaboration.
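The comparison above amounts to recomputing each dependency's checksum and matching it against the value recorded in dvc.lock. A sketch of that check, using a hypothetical read_bytes callable (DVC's real change detection also covers parameters, outputs, and the command itself):

```python
import hashlib

def stage_is_stale(locked_deps, read_bytes):
    """Return True if any dependency changed since dvc.lock was written.

    `locked_deps` mirrors the deps list of a stage in dvc.lock;
    `read_bytes` is a hypothetical callable mapping a path to its
    current contents. Illustrative sketch only."""
    for dep in locked_deps:
        current_md5 = hashlib.md5(read_bytes(dep["path"])).hexdigest()
        if current_md5 != dep["md5"]:
            return True
    return False
```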

File Naming Conventions

Valid .dvc filenames

  • data.csv.dvc
  • model.pkl.dvc
  • images.dvc
  • any_name.dvc

Pipeline files

  • dvc.yaml (standard)
  • dvc.lock (auto-generated)
  • Invalid: pipeline.yaml
  • Invalid: train.dvc.yaml
Pipeline files must be named exactly dvc.yaml. The .dvc extension is only for single-stage tracking files.

Best Practices

Always track these files:
  • .dvc files
  • dvc.yaml
  • dvc.lock
  • params.yaml
Never track:
  • Actual data files
  • Cache directories
  • .dvc/config.local
Use descriptive stage names.
Good:
stages:
  preprocess_data:
  train_model:
  evaluate_model:
Bad:
stages:
  step1:
  step2:
  step3:
Document stages with desc:

stages:
  train:
    desc: |
      Train XGBoost model using preprocessed data.
      Outputs model.pkl and training metrics.
    cmd: python train.py
Track the specific parameters each stage reads:

stages:
  train:
    params:
      - model.type
      - model.hyperparameters
      - training.epochs
      - training.batch_size
Record custom metadata with meta:

stages:
  train:
    meta:
      author: data-science-team
      model_version: v2.1
      experiment_id: exp-2024-001
Useful commands:

# Create .dvc file
dvc add data/dataset.csv

# Create pipeline stage
dvc stage add -n train -d data.csv -o model.pkl python train.py

# Run pipeline and update dvc.lock
dvc repro

# Visualize the pipeline (also fails fast on invalid dvc.yaml)
dvc dag

# Show pipeline structure
dvc dag --md

Next Steps

  • Configuration: learn about DVC configuration files
  • Remote Storage: configure remote storage backends
