Overview
DVC helps you version control large data files and directories that are too big for Git. When you track data with DVC, it stores the actual data in a cache and creates small .dvc files that Git can track.
DVC doesn’t store your data in Git. Instead, it creates lightweight .dvc files (similar to pointers) that track your data’s location and version.
Basic Workflow
Add data files to DVC
Use dvc add to start tracking a file or directory, for example dvc add data/dataset.csv. This command:
- Moves the file’s content to DVC’s cache (.dvc/cache)
- Creates a data/dataset.csv.dvc file with metadata
- Adds the data file to .gitignore automatically
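The effects listed above, together with the Git commit described in the next step, can be sketched as a short shell session. This is a minimal sketch rather than a canonical recipe: it assumes Git and DVC are installed and that the project was already initialized with git init and dvc init. The tiny CSV and the commit message are illustrative stand-ins.

```shell
# Assumes an initialized project (one-time setup: git init && dvc init).
# The CSV below is a tiny stand-in for a real, much larger dataset.
mkdir -p data
printf 'id,value\n1,10\n2,20\n' > data/dataset.csv

# Start tracking the file with DVC.
dvc add data/dataset.csv

# DVC writes the pointer file next to the data and ignores the data in Git.
ls -a data    # look for dataset.csv.dvc and .gitignore

# Hand the small pointer file (not the data itself) to Git.
git add data/dataset.csv.dvc data/.gitignore
git commit -m "Track data/dataset.csv with DVC"
```

Only data/dataset.csv.dvc and data/.gitignore should end up in the Git commit; the dataset itself stays out of Git.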
Commit the .dvc file to Git
Now commit the .dvc file to version control with git add and git commit.
The .dvc file is small (typically just a few lines) and safe to commit to Git.
Understanding .dvc Files
When you run dvc add data/dataset.csv, DVC creates a data/dataset.csv.dvc file:
data/dataset.csv.dvc
The .dvc file contains:
- md5: Hash of the file content (for detecting changes)
- size: File size in bytes
- path: Relative path to the data file
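To make these fields concrete, the sketch below recomputes the values that md5 and size are expected to hold, then prints the .dvc file for comparison if it exists. It assumes a Linux-style environment where md5sum is available (macOS ships md5 instead). It also assumes a recent DVC release that hashes the raw file bytes; older releases normalized line endings for text files, so their md5 may differ. The field layout shown in the comments is schematic, with placeholders rather than real values.

```shell
# Stand-in dataset (the same illustrative path used with `dvc add`).
mkdir -p data
printf 'id,value\n1,10\n2,20\n' > data/dataset.csv

# Values the md5 and size fields are expected to correspond to.
md5sum data/dataset.csv | cut -d' ' -f1    # content hash (32 hex characters)
wc -c < data/dataset.csv                   # size in bytes

# Print what DVC recorded, if the file has already been added.
# Schematic layout (placeholders, not real output):
#   outs:
#   - md5: <hash>
#     size: <bytes>
#     path: dataset.csv
if [ -f data/dataset.csv.dvc ]; then
  cat data/dataset.csv.dvc
fi
```

Because the hash depends only on the content, editing the data changes md5, which is how DVC detects that a tracked file has been modified.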
Advanced Options
Track Multiple Files with Glob Patterns
Use the --glob flag to track files matching a pattern.
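As an illustration, the sketch below tracks every CSV file in a directory with one command. The directory and file names are hypothetical. The pattern is quoted so that DVC, not the shell, expands it; each matched file should then get its own .dvc file.

```shell
# Hypothetical layout: two small CSV files and one file that should not match.
mkdir -p data/raw
printf 'a\n1\n' > data/raw/train.csv
printf 'a\n2\n' > data/raw/test.csv
printf 'notes\n' > data/raw/README.txt

# Preview which files the pattern matches (here the shell does the expansion).
ls data/raw/*.csv

# Let DVC expand the quoted pattern and track each matching file.
dvc add --glob 'data/raw/*.csv'
```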
Track Files Without Caching
Use --no-commit to create the .dvc file without copying data to the cache.
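A hedged sketch of the two-step flow this enables: --no-commit records the file in a .dvc file right away, and a later dvc commit stores the data in the cache. The path is illustrative, and the snippet assumes an initialized DVC project.

```shell
# Stand-in dataset; replace with your own file.
mkdir -p data
printf 'id,value\n1,10\n' > data/dataset.csv

# Create data/dataset.csv.dvc but leave the data out of the cache for now.
dvc add --no-commit data/dataset.csv

# ...inspect or keep editing the data here...

# Later, store the current data in the cache and update the .dvc file.
dvc commit data/dataset.csv.dvc
```

This can be useful when the data is still changing and you do not want every intermediate version copied into the cache.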
Force Overwrite Existing Tracking
If you need to re-add a file that’s already tracked:
Using .dvcignore
As with .gitignore in Git, you can create a .dvcignore file to exclude files from DVC operations:
.dvcignore
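The original listing is not reproduced here. As a stand-in, the sketch below writes a hypothetical .dvcignore from the shell; the patterns are examples only and should be adapted to your project.

```shell
# Write an example .dvcignore at the project root (patterns are illustrative).
# Note: `>` replaces any existing .dvcignore; use `>>` to append instead.
cat > .dvcignore <<'EOF'
# Temporary and editor files
*.tmp
*.swp

# Scratch directory that should never be tracked
data/scratch/

# OS metadata
.DS_Store
EOF

cat .dvcignore    # review the result
```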
.dvcignore uses the same syntax as .gitignore. Patterns are applied to all DVC commands.
Common Commands
Example Output
When you run dvc add, you’ll see output like this:
Best Practices
Keep data organized
Store data in dedicated directories like data/raw/, data/processed/, data/external/
Track at the right level
Track directories when you have many related files, individual files when they change independently
Commit .dvc files
Always commit .dvc files to Git so your team can track data versions
Use .dvcignore
Exclude temporary or generated files from DVC operations to keep your cache clean
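Taken together, the practices above might look like the following in a new project. This is only a sketch: the file names under data/ are hypothetical, and it assumes an initialized Git and DVC repository.

```shell
# Dedicated directories for each stage of the data.
mkdir -p data/raw data/processed data/external

# Hypothetical contents: several related raw files, one independent file.
printf 'a\n1\n' > data/raw/part-001.csv
printf 'a\n2\n' > data/raw/part-002.csv
printf 'f\n3\n' > data/processed/features.csv

# Keep temporary files out of DVC operations (appends to any existing rules).
printf '*.tmp\n' >> .dvcignore

# Track related files as one directory, and the independent file on its own.
dvc add data/raw
dvc add data/processed/features.csv

# Commit the pointer files so teammates can follow data versions.
git add data/raw.dvc data/processed/features.csv.dvc \
        data/.gitignore data/processed/.gitignore .dvcignore
git commit -m "Track raw data and features with DVC"
```

Tracking data/raw as a directory produces a single data/raw.dvc file, while the independently changing file gets its own features.csv.dvc.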
Next Steps
Remote Storage
Set up remote storage to share data with your team
Building Pipelines
Create reproducible ML pipelines with your tracked data