
Overview

DVC version-controls data files and directories that are too large for Git. Instead of putting the data itself in Git, DVC stores it in a local cache and creates a small .dvc file that acts as a pointer, recording the data’s path and content hash. Git tracks only the .dvc file.

Basic Workflow

1. Add data files to DVC

Use dvc add to start tracking a file or directory:
dvc add data/dataset.csv
This command:
  • Stores the file’s contents in DVC’s cache (.dvc/cache) and links or copies them back into your workspace
  • Creates a data/dataset.csv.dvc file with metadata
  • Adds the data file to data/.gitignore automatically
Track entire directories the same way: dvc add data/images/
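To make the effects listed above concrete, here is a toy sketch of content-addressed caching built from standard shell tools. It is not DVC’s implementation: the toy_cache directory, the .pointer suffix, and the two-character prefix layout are assumptions chosen for illustration, and md5sum assumes GNU coreutils.

```shell
# Toy illustration only: mimics the idea behind `dvc add`, not its real implementation.
# Assumes GNU coreutils (md5sum); toy_cache/ and the .pointer suffix are hypothetical names.
cd "$(mktemp -d)"                      # work in a scratch directory
mkdir -p data toy_cache
printf 'id,value\n1,10\n' > data/dataset.csv

# 1. Hash the file content; the hash identifies this exact version.
hash=$(md5sum data/dataset.csv | cut -d' ' -f1)

# 2. Copy the content into the cache under a path derived from the hash.
cached="toy_cache/$(printf '%s' "$hash" | cut -c1-2)/$(printf '%s' "$hash" | cut -c3-)"
mkdir -p "$(dirname "$cached")"
cp data/dataset.csv "$cached"

# 3. Write a small pointer file that Git can track instead of the data.
printf 'outs:\n- md5: %s\n  path: dataset.csv\n' "$hash" > data/dataset.csv.pointer

# 4. Keep the data file itself out of Git.
printf '/dataset.csv\n' >> data/.gitignore

cat data/dataset.csv.pointer
```

The pointer file stays a few lines long no matter how large the data is, which is why it is cheap to commit.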
2. Commit the .dvc file to Git

Now commit the .dvc file to version control:
git add data/dataset.csv.dvc data/.gitignore
git commit -m "Add dataset to DVC"
The .dvc file is small (typically just a few lines) and safe to commit to Git.
3. Update your data

When your data changes, run dvc add again:
dvc add data/dataset.csv
git add data/dataset.csv.dvc
git commit -m "Update dataset with new records"
Re-running dvc add updates the hash in the .dvc file and caches the new content. The previous version stays in the cache, so earlier commits can still be restored.
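The old version survives because cache entries are keyed by content hash, so changed content lands under a new key instead of overwriting the old one. A minimal sketch with standard tools (a toy model with a hypothetical toy_cache directory and store helper, not DVC’s own code; md5sum assumes GNU coreutils):

```shell
# Toy model: each distinct content gets its own cache entry, keyed by its MD5 hash.
# Assumes GNU coreutils (md5sum); toy_cache/ and store() are hypothetical names.
cd "$(mktemp -d)"
mkdir -p toy_cache

store() {
  # Copy file $1 into toy_cache/ under its content hash and print the hash.
  h=$(md5sum "$1" | cut -d' ' -f1)
  cp "$1" "toy_cache/$h"
  printf '%s\n' "$h"
}

printf 'id,value\n1,10\n' > dataset.csv
v1=$(store dataset.csv)

printf '2,20\n' >> dataset.csv          # the data changes
v2=$(store dataset.csv)

ls toy_cache                            # two entries: both versions are kept
```

In a real project, checking out an older commit of the .dvc file and running dvc checkout should bring back the matching data from the cache.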

Understanding .dvc Files

When you run dvc add data/dataset.csv, DVC creates a data/dataset.csv.dvc file:
data/dataset.csv.dvc
outs:
- md5: a3d2f7c8b9e1d4f5a6c7b8d9e0f1a2b3
  size: 1048576
  hash: md5
  path: dataset.csv
The .dvc file contains:
  • md5: Hash of the file content (for detecting changes)
  • size: File size in bytes
  • hash: Name of the hash algorithm used (md5 here)
  • path: Path to the data file, relative to the .dvc file
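You can sanity-check these fields with standard tools. The sketch below assumes GNU coreutils and that the md5 field is the MD5 of the file’s raw bytes, which should hold for single files in recent DVC releases. Older releases may normalize line endings in text files, and directories are hashed differently, so values can differ in those cases.

```shell
# Compute the values that the md5 and size fields can be compared against.
# Assumes GNU coreutils; the sample file below is a stand-in for your real data.
cd "$(mktemp -d)"
printf 'hello' > dataset.csv

md5=$(md5sum dataset.csv | cut -d' ' -f1)
size=$(wc -c < dataset.csv | tr -d ' ')

printf 'md5:  %s\nsize: %s\n' "$md5" "$size"
```

If the computed hash no longer matches the one recorded in the .dvc file, the data has changed since it was last added.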

Advanced Options

Track Multiple Files with Glob Patterns

Use the --glob flag to track files matching a pattern:
dvc add --glob 'data/*.csv'
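Quoting the pattern matters: with quotes, the shell passes data/*.csv to DVC unexpanded and DVC resolves it. To preview which files a simple pattern covers before adding them, you can let the shell expand it. Treat this as an approximation, since DVC’s matching may differ from your shell’s for patterns such as **.

```shell
# Preview the files a simple glob matches before running `dvc add --glob`.
# The sample files are placeholders created in a scratch directory.
cd "$(mktemp -d)"
mkdir -p data
touch data/train.csv data/test.csv data/notes.txt

# Unquoted: the shell expands the pattern into matching paths.
matches=$(printf '%s\n' data/*.csv)
printf '%s\n' "$matches"

# Quoted: the pattern stays literal, which is what `dvc add --glob` receives.
literal='data/*.csv'
printf '%s\n' "$literal"
```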

Track Files Without Caching

Use --no-commit to create the .dvc file without copying data to cache:
dvc add --no-commit large_file.bin
With --no-commit, your data is not yet stored in the cache, so it isn’t protected until you run dvc commit.

Force Overwrite Existing Tracking

If you need to re-add a file that’s already tracked:
dvc add --force data/dataset.csv

Using .dvcignore

Like .gitignore, you can create a .dvcignore file to exclude files from DVC operations:
.dvcignore
# Ignore temporary files
*.tmp
*.temp

# Ignore specific directories
/data/cache/
/data/temp/

# Ignore patterns
**/.DS_Store
**/Thumbs.db
.dvcignore uses the same syntax as .gitignore. Patterns are applied to all DVC commands.
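Because the pattern syntax matches .gitignore, one way to try patterns out is to feed them to Git’s own matcher in a scratch repository. This is a hedged sketch: it assumes git is installed and relies on the syntax equivalence stated above, so treat the result as a preview, not as DVC’s definitive answer. DVC also provides its own dvc check-ignore command for checking paths inside a DVC project.

```shell
# Preview ignore patterns with Git's matcher in a throwaway repository.
# Assumes git is installed; the file names are placeholders.
cd "$(mktemp -d)"
git init -q .
printf '*.tmp\n/data/cache/\n' > .gitignore   # same patterns you would put in .dvcignore

mkdir -p data/cache
touch data/scratch.tmp data/cache/blob.bin data/train.csv

for f in data/scratch.tmp data/cache/blob.bin data/train.csv; do
  if git check-ignore -q "$f"; then
    echo "ignored: $f"
  else
    echo "kept:    $f"
  fi
done
```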

Common Commands

dvc add data/train.csv

Example Output

When you run dvc add, you’ll see output like this:
$ dvc add data/dataset.csv

100% Adding...|████████████████████████████████████|1/1 [00:00, 12.34file/s]

To track the changes with git, run:

    git add data/dataset.csv.dvc data/.gitignore

To enable auto staging, run:

    dvc config core.autostage true

Best Practices

Keep data organized

Store data in dedicated directories such as data/raw/, data/processed/, and data/external/

Track at the right level

Track a directory when it holds many related files; track individual files when they change independently

Commit .dvc files

Always commit .dvc files to Git so your team can track data versions

Use .dvcignore

Exclude temporary or generated files from DVC operations to keep your cache clean

Next Steps

Remote Storage

Set up remote storage to share data with your team

Building Pipelines

Create reproducible ML pipelines with your tracked data
