What is Data Versioning?
Data versioning in DVC is a lightweight mechanism to track large files and directories without storing them in Git. Instead of bloating your Git repository with massive datasets or model files, DVC stores metadata pointers (.dvc files) in Git while the actual data lives in a cache.
Key Concept: DVC uses content-addressable storage, meaning files are identified by their hash (checksum), not their name. If the content changes, the hash changes, creating a new version.
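To make the idea concrete, here is a minimal sketch of content-addressable storage in plain Python. `ContentStore` and `content_hash` are hypothetical names used only for illustration and are not part of DVC's API; the sketch only shows that identical content maps to one key and edited content maps to a new one.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the MD5 hex digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


class ContentStore:
    """Toy in-memory store keyed by content hash (illustrative, not DVC's API)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = content_hash(data)
        self._objects[key] = data  # identical content always lands on the same key
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
v1 = store.put(b"id,label\n1,cat\n")
v1_copy = store.put(b"id,label\n1,cat\n")  # same content, as if saved under another name
v2 = store.put(b"id,label\n1,dog\n")       # edited content

print(v1 == v1_copy)  # True: the name plays no role, only the bytes do
print(v1 != v2)       # True: changed content produces a new version
print(len(store))     # 2: the duplicate was stored once
```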
Why Data Versioning Matters
Traditional version control systems like Git aren’t designed for large binary files. DVC solves this by:
- Keeping Git lightweight: Only small metadata files are tracked in Git
- Enabling true reproducibility: Every code version links to specific data versions
- Supporting massive datasets: Handle gigabytes or terabytes of data efficiently
- Providing data lineage: Track how data evolves over time
How Data Versioning Works
When you add a file to DVC using dvc add, several things happen under the hood:
1. Content Hashing
DVC computes a hash (checksum) of your file or directory. The implementation can be found in dvc/output.py.
DVC 3.x uses md5 as the default hash algorithm. For directories, DVC creates a .dir file containing hashes of all files inside.
2. Cache Storage
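The actual file content is moved (or linked) to the DVC cache directory at .dvc/cache/files/md5/. The cache uses a content-addressable structure: the first two characters of the hash name a subdirectory, and the remaining characters name the file inside it.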
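The hashing and cache-placement steps can be sketched as follows. This is an approximation under stated assumptions, not DVC's code: `file_md5` and `cache_path` are hypothetical helpers, and the layout assumed here splits the digest into a two-character subdirectory plus the remaining characters as the file name.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex: str, cache_root: Path = Path(".dvc/cache/files/md5")) -> Path:
    """Map a digest to its assumed cache location: <root>/<first 2 chars>/<rest>."""
    return cache_root / md5_hex[:2] / md5_hex[2:]


# Demo on a throwaway file; nothing is written to a real DVC cache.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv"
    sample.write_bytes(b"id,label\n1,cat\n")
    digest = file_md5(sample)

print(cache_path(digest))
```

Because the path is derived only from the digest, two files with identical bytes resolve to the same cache entry.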
3. DVC File Creation
A .dvc file is created with metadata about your tracked file, such as its hash, size, and path.
This .dvc file goes into Git, serving as a pointer to the actual data in the cache.
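As a rough illustration of what such a pointer holds, the sketch below renders .dvc-style text for a file. `render_dvc_file` is a hypothetical helper, and the field layout (md5, size, hash, path under outs) is an assumption modeled on DVC 3.x files; real files may carry additional fields.

```python
import hashlib


def render_dvc_file(path: str, data: bytes) -> str:
    """Build .dvc-style YAML text for one tracked file (illustrative layout only)."""
    md5_hex = hashlib.md5(data).hexdigest()
    lines = [
        "outs:",
        f"- md5: {md5_hex}",      # content hash: the key into the cache
        f"  size: {len(data)}",   # size in bytes
        "  hash: md5",            # hash algorithm used
        f"  path: {path}",        # where the file lives in the workspace
    ]
    return "\n".join(lines) + "\n"


print(render_dvc_file("data.csv", b"id,label\n1,cat\n"))
```

The pointer stays a few lines long no matter how large the tracked file is, which is what keeps the Git repository small.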
4. Workspace Linking
DVC creates a link (reflink, hardlink, symlink, or copy) from your workspace to the cached file. The implementation in dvc/output.py:752-800 handles this.
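The fallback idea can be sketched with the standard library. This is not DVC's implementation: `link_from_cache` is a hypothetical helper, the order of link types is an assumption chosen for the example, and reflinks are left out because Python's standard library has no portable call for them.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_from_cache(cached: Path, target: Path,
                    link_types=("hardlink", "symlink", "copy")) -> str:
    """Try each link type in order and return the name of the one that worked."""
    linkers = {
        "hardlink": lambda: os.link(cached, target),
        "symlink": lambda: os.symlink(cached.resolve(), target),
        "copy": lambda: shutil.copyfile(cached, target),
    }
    for kind in link_types:
        if target.exists() or target.is_symlink():
            target.unlink()  # start clean before each attempt
        try:
            linkers[kind]()
            return kind
        except OSError:
            continue  # unsupported on this filesystem or platform: try the next type
    raise OSError(f"could not link {cached} to {target}")


# Demo in a temporary directory; the cache path here is made up.
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "cache" / "d4" / "example"
    cached.parent.mkdir(parents=True)
    cached.write_bytes(b"id,label\n1,cat\n")
    target = Path(tmp) / "data.csv"
    used = link_from_cache(cached, target)
    restored = target.read_bytes()

print(used)
```

Links avoid duplicating data on disk, while a plain copy is the slowest but most widely supported option.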
Directory Versioning
Directories are handled specially in DVC (see dvc/output.py:220-232):
- Each file inside is hashed individually
- A .dir file is created in cache containing JSON with all file hashes
- The directory hash is the hash of this .dir file
- The .dvc file stores just the directory hash and metadata
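The steps above can be sketched as follows. `dir_listing` and `dir_hash` are hypothetical helpers, and the JSON field names (md5, relpath) and serialization are assumptions, so the digests produced here are not expected to match the ones DVC computes.

```python
import hashlib
import json


def dir_listing(files: dict[str, bytes]) -> str:
    """Serialize per-file hashes as JSON, sorted by relative path for stability."""
    entries = [
        {"md5": hashlib.md5(data).hexdigest(), "relpath": relpath}
        for relpath, data in sorted(files.items())
    ]
    return json.dumps(entries, sort_keys=True)


def dir_hash(files: dict[str, bytes]) -> str:
    """Hash the listing itself; the '.dir' suffix marks a directory object."""
    listing = dir_listing(files)
    return hashlib.md5(listing.encode("utf-8")).hexdigest() + ".dir"


before = {"images/a.png": b"\x89PNG-a", "labels.csv": b"a,cat\n"}
after = {"images/a.png": b"\x89PNG-a", "labels.csv": b"a,dog\n"}

print(dir_listing(before))
print(dir_hash(before) != dir_hash(after))  # True: one changed file changes the directory hash
```

Since unchanged files keep their individual hashes, a new directory version only needs new cache entries for the files that actually changed.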
Example: Directory .dvc File
The .dir file in cache contains a JSON list with the hash and relative path of each file in the directory.
Version Switching
To switch between data versions, check out the Git revision you want and then run dvc checkout. The dvc checkout command reads the .dvc files in your workspace and restores the matching files from cache.
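The restore step can be pictured with a toy model. In this sketch, `add` and `checkout` are hypothetical functions that only mimic the idea: the "pointer" is just a hash string held in a variable, whereas in the real workflow it would come from the .dvc file at the Git revision you checked out.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add(workspace_file: Path, cache: Path) -> str:
    """Copy the file into the toy cache under its hash; return the hash as a pointer."""
    digest = hashlib.md5(workspace_file.read_bytes()).hexdigest()
    cache.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(workspace_file, cache / digest)
    return digest


def checkout(pointer: str, workspace_file: Path, cache: Path) -> None:
    """Restore the workspace file from the cache entry the pointer names."""
    shutil.copyfile(cache / pointer, workspace_file)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data, cache = root / "data.csv", root / "cache"

    data.write_text("v1\n")
    pointer_v1 = add(data, cache)   # pointer recorded for the first version

    data.write_text("v2\n")
    pointer_v2 = add(data, cache)   # pointer recorded for the second version

    checkout(pointer_v1, data, cache)  # switch back to the first version
    restored = data.read_text()

print(repr(restored))
```

Both versions stay in the cache, so switching is a lookup by hash rather than a recomputation.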
Data Lineage
Track Changes
Use Git history to see how .dvc files changed
Compare Versions
View differences between data versions
File Status and Changes
DVC tracks whether files have changed from what’s recorded in .dvc files (implemented in dvc/output.py:599-642). Two commands report changes:
- dvc status - Shows which tracked files have changed
- dvc diff - Shows differences between versions
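Conceptually, change detection comes down to comparing hashes. The sketch below compares two mappings from relative path to hash; `diff_listings` and its category names are hypothetical and chosen for illustration, and the hash values are placeholders rather than real digests.

```python
def diff_listings(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths by comparing their recorded hashes in two versions."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in old.keys() & new.keys() if old[path] != new[path]
        ),
    }


# Placeholder hash values, not real digests.
recorded = {"train.csv": "aaa", "test.csv": "bbb", "old.csv": "ccc"}
current = {"train.csv": "aaa", "test.csv": "ddd", "extra.csv": "eee"}

changes = diff_listings(recorded, current)
print(changes)
```

No file contents are read during the comparison, which is why checking even very large datasets for changes can stay cheap once the hashes are known.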
Output Types
DVC supports different output types with special handling:

| Type | Purpose | Schema Field |
|---|---|---|
| Regular Output | Standard tracked files | cache: true |
| Metric | Numerical results for comparison | metric: true |
| Plot | Data for visualization | plot: true |
| No-cache | Tracked but not cached | cache: false |
| Persist | Not removed on dvc repro | persist: true |
These schema fields are defined in dvc/output.py:281-303.
Related Commands
- dvc add - Start tracking a file or directory
- dvc checkout - Restore tracked files from cache
- dvc status - Check for changes in tracked files
- dvc push - Upload tracked data to remote storage
- dvc pull - Download tracked data from remote storage
Next Steps
Remote Storage
Learn how to share versioned data with your team
Pipelines
Connect versioned data with processing steps