DVC uses several file formats to track data, define pipelines, and lock reproducible states. This guide explains the structure and purpose of each file type.

File Types Overview

  • .dvc files: single-stage files for tracking data
  • dvc.yaml: multi-stage pipeline definitions
  • dvc.lock: lock file for reproducibility

.dvc Files (Single-Stage Files)

.dvc files are used to track individual data files or directories. They’re created with dvc add or when defining single-stage operations.

Basic Structure

A typical .dvc file contains output metadata:
outs:
- md5: a304afb96060aad90176268345e10355
  size: 37891850
  path: model.pkl
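The md5 and size fields are derived from the file's contents. As a rough sketch of the idea (DVC's actual implementation differs in details such as chunking and directory handling):

```python
import hashlib
import os

def file_meta(path):
    """Compute the MD5 and size recorded in a .dvc 'outs' entry.

    Illustrative sketch only; DVC streams files in chunks and treats
    directories separately."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return {"md5": md5.hexdigest(), "size": os.path.getsize(path)}
```

Because the checksum depends only on content, renaming or moving a file does not change its md5, while any edit to the bytes does.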

Complete Schema

  • outs (array, required): list of output files or directories tracked by this .dvc file
  • deps (array): list of dependencies (for single-stage files with commands)
  • cmd (string): command to execute (for single-stage files)
  • wdir (string): working directory for the command
  • md5 (string): MD5 checksum of the stage definition
  • frozen (boolean, default false): whether the stage is frozen (won't be re-executed)
  • always_changed (boolean, default false): always consider this stage as changed
  • meta (object): custom metadata for the stage
  • desc (string): description of the stage

Examples

Tracking a single file:

outs:
- md5: 3d1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d
  size: 1024000
  path: data/dataset.csv
Tracking a directory:

outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6.dir
  size: 50000000
  nfiles: 1000
  path: data/images
Directory checksums end with .dir and represent a hash of all files within.
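A directory checksum can be pictured as the hash of a canonical listing of the files inside it, so it changes whenever any contained file changes. A minimal sketch of that principle (DVC's exact on-disk serialization may differ):

```python
import hashlib
import json

def dir_checksum(entries):
    """Hash a canonical JSON listing of a directory's files.

    `entries` is a list of {"md5": ..., "relpath": ...} dicts, one per
    file. Sorting makes the result independent of listing order. A
    sketch of the idea, not DVC's exact format."""
    listing = json.dumps(
        sorted(entries, key=lambda e: e["relpath"]),
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.md5(listing.encode()).hexdigest() + ".dir"
```

Two directories with identical file contents at identical relative paths produce the same .dir checksum regardless of the order in which the files were added.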
Single-stage file with a command:

cmd: python preprocess.py
deps:
- path: raw_data.csv
  md5: 5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c
  size: 2048000
outs:
- md5: a304afb96060aad90176268345e10355
  size: 1536000
  path: processed_data.csv
md5: 9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e
Output pushed to a specific remote:

outs:
- md5: e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
  size: 5000000
  path: large_model.pkl
  remote: s3-large-files
  push: true
Uncached output:

outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 1024
  path: metrics.json
  cache: false
Setting cache: false is useful for small files like metrics that don’t need caching.

dvc.yaml (Pipeline Files)

dvc.yaml files define multi-stage pipelines with dependencies, parameters, and outputs.

Basic Structure

stages:
  prepare:
    cmd: python prepare.py
    deps:
      - raw_data.csv
    outs:
      - prepared_data.csv

  train:
    cmd: python train.py
    deps:
      - prepared_data.csv
      - train.py
    params:
      - lr
      - epochs
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false

Complete Schema

  • stages (object, required): dictionary of pipeline stages, where keys are stage names
  • vars (array | object): variables that can be referenced in the pipeline using ${var}
  • params (array): global parameter files to track
  • metrics (array): global metric files
  • plots (array): global plot definitions
  • artifacts (object): model registry artifacts
  • datasets (array): dataset definitions

Advanced Examples

Stage with output, metric, and plot options:

stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
    params:
      - model.architecture
      - training.epochs
    outs:
      - model.pkl:
          desc: "Trained XGBoost model"
          remote: s3-models
    metrics:
      - metrics.json:
          cache: false
    plots:
      - plots/training_loss.csv:
          x: epoch
          y: loss
          title: "Training Loss"
Multiple commands in one stage:

stages:
  build:
    cmd:
      - echo "Building model..."
      - python build.py
      - echo "Build complete"
    outs:
      - model/
Templated stages with foreach:

stages:
  process:
    foreach:
      - train
      - test
      - val
    do:
      cmd: python process.py ${item}
      deps:
        - raw/${item}.csv
      outs:
        - processed/${item}.csv
This creates three stages: process@train, process@test, and process@val.
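Conceptually, the expansion substitutes each list item into the do block and suffixes the stage name. A simplified sketch (flat string and list-of-string fields only, not DVC's actual templating engine):

```python
def expand_foreach(name, items, do):
    """Expand a foreach stage into one concrete stage per item.

    Each list item yields a stage named '<stage>@<item>' with ${item}
    interpolated. Handles only flat fields; DVC's real interpolation is
    richer (dict items, ${item.key}, etc.)."""
    def subst(value, item):
        if isinstance(value, list):
            return [v.replace("${item}", item) for v in value]
        return value.replace("${item}", item)

    return {
        f"{name}@{item}": {field: subst(v, item) for field, v in do.items()}
        for item in items
    }
```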
A grid of stages with matrix:

stages:
  train:
    matrix:
      lr: [0.001, 0.01, 0.1]
      optimizer: [adam, sgd]
    cmd: python train.py --lr ${item.lr} --opt ${item.optimizer}
    outs:
      - models/${item.lr}-${item.optimizer}.pkl
This expands into six stages, one per combination of lr and optimizer values.
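matrix behaves like a multi-axis foreach: one stage per combination in the cross product of the axes. A sketch of that expansion (the generated stage names here are illustrative; dvc stage list shows the names DVC actually assigns):

```python
from itertools import product

def expand_matrix(name, matrix, cmd_template):
    """Expand a matrix stage into one command per axis combination.

    Interpolates only ${item.<axis>} inside cmd; a sketch, not DVC's
    templating engine. Names join axis values with '-'."""
    axes = list(matrix)
    stages = {}
    for values in product(*matrix.values()):
        cmd = cmd_template
        for axis, value in zip(axes, values):
            cmd = cmd.replace("${item.%s}" % axis, str(value))
        stages["%s@%s" % (name, "-".join(map(str, values)))] = cmd
    return stages
```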
Variables with vars:

vars:
  - data_dir: /mnt/data
  - model_name: xgboost_v2

stages:
  train:
    cmd: python train.py --data ${data_dir} --name ${model_name}
    deps:
      - ${data_dir}/train.csv
    outs:
      - models/${model_name}.pkl
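Variable references resolve by simple substitution. A minimal sketch of ${var} interpolation for the flat case above (DVC's templating also supports nested keys and values from params.yaml):

```python
import re

def interpolate(text, variables):
    """Replace ${name} references with values from `variables`.

    Covers only flat names, as in the vars example above; a sketch of
    the behavior, not DVC's parser."""
    return re.sub(
        r"\$\{([A-Za-z_]\w*)\}",
        lambda m: str(variables[m.group(1)]),
        text,
    )
```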
Custom working directory with wdir:

stages:
  train:
    wdir: ../experiments
    cmd: python train.py
    deps:
      - ../data/dataset.csv
    outs:
      - model.pkl
When wdir is set, dependency and output paths are resolved relative to that working directory, not to the dvc.yaml location.

dvc.lock (Lock Files)

dvc.lock is automatically generated and should not be edited manually. It ensures reproducibility by recording exact states.

Structure

schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
    - path: data/train.csv
      md5: a304afb96060aad90176268345e10355
      size: 1536000
    - path: train.py
      md5: 5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c
      size: 4096
    params:
      params.yaml:
        lr: 0.001
        epochs: 100
    outs:
    - path: model.pkl
      md5: e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
      size: 5000000

Schema Fields

  • schema (string, required): lock file schema version (currently "2.0")
  • stages (object): locked state of each stage
  • datasets (array): locked dataset states

Lock File Features

DVC uses the lock file to determine if a stage needs to be re-executed:
  • If dependencies or parameters change, the stage runs again
  • If the lock file matches current state, the stage is skipped
Always commit dvc.lock to version control. It’s essential for reproducibility and collaboration.
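The comparison above amounts to recomputing each dependency's checksum and matching it against the value recorded in dvc.lock. A sketch of that check, using a hypothetical read_bytes callable (DVC's real change detection also covers parameters, outputs, and the command itself):

```python
import hashlib

def stage_is_stale(locked_deps, read_bytes):
    """Return True if any dependency changed since dvc.lock was written.

    `locked_deps` mirrors the deps list of a stage in dvc.lock;
    `read_bytes` is a hypothetical callable mapping a path to its
    current contents. Illustrative sketch only."""
    for dep in locked_deps:
        current_md5 = hashlib.md5(read_bytes(dep["path"])).hexdigest()
        if current_md5 != dep["md5"]:
            return True
    return False
```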

File Naming Conventions

Valid .dvc filenames

  • data.csv.dvc
  • model.pkl.dvc
  • images.dvc
  • any_name.dvc

Pipeline files

  • dvc.yaml (standard)
  • dvc.lock (auto-generated)
  • Invalid: pipeline.yaml
  • Invalid: train.dvc.yaml
Pipeline files must be named exactly dvc.yaml. The .dvc extension is only for single-stage tracking files.

Best Practices

Always track these files:
  • .dvc files
  • dvc.yaml
  • dvc.lock
  • params.yaml
Never track:
  • Actual data files
  • Cache directories
  • .dvc/config.local
Use descriptive stage names.
Good:
stages:
  preprocess_data:
  train_model:
  evaluate_model:
Bad:
stages:
  step1:
  step2:
  step3:
Document stages with desc:

stages:
  train:
    desc: |
      Train XGBoost model using preprocessed data.
      Outputs model.pkl and training metrics.
    cmd: python train.py
Track the specific parameters each stage reads:

stages:
  train:
    params:
      - model.type
      - model.hyperparameters
      - training.epochs
      - training.batch_size
Record custom metadata with meta:

stages:
  train:
    meta:
      author: data-science-team
      model_version: v2.1
      experiment_id: exp-2024-001
Useful commands:

# Create .dvc file
dvc add data/dataset.csv

# Create pipeline stage
dvc stage add -n train -d data.csv -o model.pkl python train.py

# Run pipeline and update dvc.lock
dvc repro

# Visualize the pipeline (also fails fast on invalid dvc.yaml)
dvc dag

# Show pipeline structure
dvc dag --md

Next Steps

  • Configuration: learn about DVC configuration files
  • Remote Storage: configure remote storage backends
