Tracking Data

Overview

DVC lets you version large data files and directories that are too big for Git. When you track data with DVC, the data itself goes into a local cache, and DVC writes a small .dvc file that Git tracks in its place.

The .dvc file acts as a pointer: it records the data's hash, size, and path, so each Git commit identifies an exact version of the data without storing the data in Git.

Basic Workflow

Add data files to DVC

Use dvc add to start tracking a file or directory:

dvc add data/dataset.csv

This command:

Stores the file's contents in DVC's cache (.dvc/cache) and links or copies the file back into your workspace
Creates a data/dataset.csv.dvc file with metadata
Adds the data file to .gitignore automatically

Track entire directories the same way: dvc add data/images/
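The caching step above can be pictured as content-addressed storage: the file's hash decides where its copy lives. The following is a minimal sketch of that idea, not DVC's implementation. The function names and the two-character prefix layout are assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 of a file, read in chunks so large files are not loaded at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_in_cache(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into `cache_dir` under a name derived from its hash."""
    md5 = file_md5(path)
    # Hypothetical layout: a two-character prefix directory, then the rest.
    target = cache_dir / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    return target


# Demo in a throwaway directory standing in for a project workspace.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    data_file = workspace / "dataset.csv"
    data_file.write_text("id,label\n1,cat\n")
    cached = store_in_cache(data_file, workspace / "cache")
    print(cached.relative_to(workspace))
```

Because the cache path depends only on the content, adding the same file twice does not create a second copy.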

Commit the .dvc file to Git

Now commit the .dvc file to version control:

git add data/dataset.csv.dvc data/.gitignore
git commit -m "Add dataset to DVC"

The .dvc file is small (typically just a few lines) and safe to commit to Git.

Update your data

When your data changes, run dvc add again:

dvc add data/dataset.csv
git add data/dataset.csv.dvc
git commit -m "Update dataset with new records"

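Each dvc add records the new content hash in the .dvc file. The previous version stays in the cache, so you can return to it by checking out the older .dvc file from Git and running dvc checkout.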
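To see why the old version survives an update, here is a small in-memory sketch. It assumes a cache keyed by content hash, with a dictionary standing in for .dvc/cache and a variable standing in for the hash stored in the .dvc file. The names are illustrative and not part of DVC.

```python
import hashlib

# Stand-in for .dvc/cache: maps content hash -> content.
cache: dict[str, bytes] = {}


def add(data: bytes) -> str:
    """Store `data` under its MD5 and return the hash (the "pointer")."""
    key = hashlib.md5(data).hexdigest()
    cache.setdefault(key, data)  # existing entries are never overwritten
    return key


# First version of the dataset.
pointer_v1 = add(b"id,label\n1,cat\n")

# The dataset gains a record; adding it again yields a different hash.
pointer_v2 = add(b"id,label\n1,cat\n2,dog\n")

# The pointer committed to Git now holds pointer_v2, but both versions
# remain retrievable from the cache by their hashes.
print(pointer_v1 != pointer_v2)
print(len(cache))
```

Only the pointer changes between Git commits. The cache keeps accumulating versions, which is what makes switching between them possible.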

Understanding .dvc Files

When you run dvc add data/dataset.csv, DVC creates a data/dataset.csv.dvc file:

data/dataset.csv.dvc

outs:
- md5: a3d2f7c8b9e1d4f5a6c7b8d9e0f1a2b3
  size: 1048576
  hash: md5
  path: dataset.csv

The .dvc file contains:

md5: Hash of the file content (for detecting changes)
size: File size in bytes
hash: Name of the hash algorithm used (md5 here)
path: Path to the data file, relative to the .dvc file
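The md5 and size fields are enough to tell whether a workspace file still matches what was recorded. A minimal sketch of such a check follows, assuming the recorded values have already been read from the .dvc file. The function name matches_record is hypothetical, and this is not how DVC itself performs the comparison.

```python
import hashlib
import os
import tempfile


def matches_record(path: str, recorded_md5: str, recorded_size: int) -> bool:
    """Return True if the file at `path` matches the recorded size and MD5."""
    # Comparing sizes first avoids hashing when the file obviously changed.
    if os.path.getsize(path) != recorded_size:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == recorded_md5


# Demo: record a file, then change it and check again.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.csv")
    original = b"id,label\n1,cat\n"
    with open(path, "wb") as fh:
        fh.write(original)

    recorded_md5 = hashlib.md5(original).hexdigest()
    recorded_size = len(original)
    print(matches_record(path, recorded_md5, recorded_size))

    with open(path, "ab") as fh:
        fh.write(b"2,dog\n")
    print(matches_record(path, recorded_md5, recorded_size))
```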

Advanced Options

Track Multiple Files with Glob Patterns

Use the --glob flag to track files matching a pattern:

dvc add --glob 'data/*.csv'
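Before running a glob-based add, it can help to preview which files a pattern selects. The sketch below uses Python's standard glob module as an approximation. DVC's own matching may differ in edge cases, and preview_glob is a hypothetical helper, not a DVC function.

```python
import glob
import os
import tempfile


def preview_glob(root: str, pattern: str) -> list[str]:
    """List paths under `root` that match `pattern`, relative to `root`."""
    hits = glob.glob(os.path.join(root, pattern))
    return sorted(os.path.relpath(hit, root) for hit in hits)


# Demo: build a small data directory and preview the pattern.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data"))
    for name in ("train.csv", "test.csv", "notes.txt"):
        with open(os.path.join(root, "data", name), "w") as fh:
            fh.write("placeholder\n")

    for match in preview_glob(root, "data/*.csv"):
        print(match)
```

Quoting the pattern on the command line, as in the example above, keeps the shell from expanding it before DVC sees it.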

Track Files Without Caching

Use --no-commit to create the .dvc file without copying data to cache:

dvc add --no-commit large_file.bin

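With --no-commit, DVC records the hash in the .dvc file but does not store the data in the cache. Your data has no cached copy until you run dvc commit, and dvc push can only upload data that is already in the cache.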

Force Overwrite Existing Tracking

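Re-running dvc add on a tracked file normally updates it without extra flags (see Update your data). If you need to overwrite an existing local file or directory, pass --force: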

dvc add --force data/dataset.csv

Using .dvcignore

Like .gitignore, you can create a .dvcignore file to exclude files from DVC operations:

.dvcignore

# Ignore temporary files
*.tmp
*.temp

# Ignore specific directories
/data/cache/
/data/temp/

# Ignore patterns
**/.DS_Store
**/Thumbs.db

.dvcignore uses the same syntax as .gitignore. Patterns are applied to all DVC commands.
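As a rough illustration of how the simplest patterns behave, the sketch below handles only slash-free patterns such as *.tmp, which match a file or directory name at any depth. It skips comments, blank lines, negated patterns, and any pattern containing a slash, so it is a partial approximation and not a substitute for full .gitignore semantics. Both function names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def load_simple_patterns(text: str) -> list[str]:
    """Keep only slash-free, non-negated patterns from ignore-file text."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        if "/" in line:
            continue  # anchored and directory patterns are not handled here
        patterns.append(line)
    return patterns


def is_ignored(path: str, patterns: list[str]) -> bool:
    """True if any component of `path` matches one of the patterns."""
    parts = PurePosixPath(path).parts
    return any(fnmatchcase(part, pat) for part in parts for pat in patterns)


ignore_text = """\
# Ignore temporary files
*.tmp
*.temp
/data/cache/
"""

patterns = load_simple_patterns(ignore_text)
for candidate in ("data/run.tmp", "data/train.csv", "scratch.temp/part.csv"):
    print(candidate, is_ignored(candidate, patterns))
```

Checking every path component, not only the file name, reflects that a name pattern matching a directory excludes everything beneath it.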

Common Commands

dvc add data/train.csv

Example Output

When you run dvc add, you’ll see output like this:

$ dvc add data/dataset.csv

100% Adding...|████████████████████████████████████|1/1 [00:00, 12.34file/s]

To track the changes with git, run:

    git add data/dataset.csv.dvc data/.gitignore

To enable auto staging, run:

    dvc config core.autostage true

Best Practices

Keep data organized

Store data in dedicated directories such as data/raw/, data/processed/, and data/external/.

Track at the right level

Track a directory when it holds many related files; track individual files when they change independently.

Commit .dvc files

Always commit .dvc files to Git so your team can follow data versions.

Use .dvcignore

Exclude temporary or generated files from DVC operations to keep your cache clean.


Next Steps

Remote Storage

Building Pipelines
