
Overview

Data Version Control (DVC) extends Git’s capabilities to handle large datasets, models, and ML artifacts. The AI Data Science Service uses DVC to ensure reproducible experiments, efficient data sharing, and complete lineage tracking.
Core Principle: “You can’t reproduce an ML experiment without versioning the data.” DVC makes datasets as version-controlled as code, enabling true reproducibility.

Why DVC?

The Problem with Git Alone

Git excels at tracking code but struggles with large files:

Git Limitations

  • Large files bloat repository size
  • Slow clone and fetch operations
  • GitHub file size limits (100MB)
  • Expensive storage costs

DVC Solutions

  • Lightweight .dvc pointers in Git
  • Data stored in cloud (S3, Azure, GCS)
  • Fast operations with selective sync
  • Unlimited dataset sizes

DVC Benefits

Reproducibility

Every experiment can be exactly reproduced with the correct data version:
# Checkout specific experiment
git checkout experiment-branch

# Pull exact data version
dvc pull

# Reproduce results
python training/training.py --config config.yaml
Guarantees:
  • Same code + same data = same results
  • Complete experiment lineage
  • Audit trail for compliance

Collaboration

Team members efficiently share datasets without repository bloat:
# Team member A adds new dataset
dvc add datasets/new_data.csv
git add datasets/new_data.csv.dvc
git commit -m "Add new training data"
dvc push
git push

# Team member B downloads only what they need
git pull
dvc pull datasets/new_data.csv.dvc
Advantages:
  • Selective data downloads
  • Parallel workflows
  • No data duplication

Storage Efficiency

DVC deduplicates data and uses remote storage:
Repository Size Comparison:

Without DVC:
├── Git repo: 10GB (code + data)
└── Clone time: 20 minutes

With DVC:
├── Git repo: 50MB (code + .dvc files)
├── Remote storage: 10GB (S3/DagsHub)
└── Clone time: 30 seconds + selective data pull
Cost Savings:
  • S3 storage: ~$0.023/GB/month
  • GitHub LFS: ~$5/50GB/month
  • DagsHub: Free tier available

Data Lineage

Track data evolution alongside code:
# View data version history
git log -- datasets/credit_data.csv.dvc

# Compare data between branches
git diff main experiment -- datasets/credit_data.csv.dvc

# Rollback data to previous version
git checkout HEAD~1 -- datasets/credit_data.csv.dvc
dvc checkout

DVC Architecture

How DVC Works

Key Concept: .dvc files are small text files (~100 bytes) containing checksums and metadata. They’re committed to Git, while actual data lives in remote storage.

DVC File Anatomy

german_credit_risk_v1.0.0_training_23012026.csv.dvc
outs:
- md5: 3086216ff1ff32f7626554e730cccc91
  size: 53393
  hash: md5
  path: german_credit_risk_v1.0.0_training_23012026.csv
Components:
  • md5: Content hash for integrity verification
  • size: File size in bytes
  • hash: Hashing algorithm used
  • path: Local file path reference
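For a single file, the md5 recorded in the .dvc pointer is simply the checksum of the file’s contents, so integrity can be checked by hand. A minimal sketch (the helper names are illustrative, not part of DVC’s API):

```python
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file's contents, reading in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_against_dvc(path, expected_md5):
    """Return True if the local file matches the checksum from its .dvc pointer."""
    return file_md5(path) == expected_md5
```

This mirrors what `dvc status` does internally when it compares the workspace against the recorded hashes.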

Project Implementation

Dataset Structure

datasets/
└── credit_score_dataset/
    ├── german_credit_risk_v1.0.0_training_23012026.csv      # Actual data (not in Git)
    └── german_credit_risk_v1.0.0_training_23012026.csv.dvc  # DVC pointer (in Git)

DVC Configuration

.dvcignore
# Add patterns of files dvc should ignore
# Improves performance by excluding unnecessary files

*.pyc
__pycache__/
.git/
.env
mlruns/
*.log

Remote Storage Setup

DVC supports multiple storage backends:
# Configure S3 remote
dvc remote add -d storage s3://my-bucket/dvc-store
dvc remote modify storage region us-west-2

# Set credentials (use AWS IAM roles in production)
dvc remote modify storage access_key_id YOUR_ACCESS_KEY
dvc remote modify storage secret_access_key YOUR_SECRET_KEY
Best Practices:
  • Use IAM roles instead of access keys
  • Enable server-side encryption
  • Configure lifecycle policies for cost optimization

Common Workflows

Adding New Datasets

# 1. Add dataset to DVC tracking
dvc add datasets/credit_score_dataset/new_training_data.csv

# 2. Commit the .dvc file to Git
git add datasets/credit_score_dataset/new_training_data.csv.dvc
git add datasets/credit_score_dataset/.gitignore  # Auto-generated
git commit -m "Add new training dataset v2.0.0"

# 3. Push data to remote storage
dvc push

# 4. Push code changes to Git
git push
DVC automatically updates .gitignore to exclude the actual data file, preventing accidental Git commits.

Retrieving Data

# Clone repository (gets code + .dvc files)
git clone https://github.com/org/project.git
cd project

# Pull all tracked data
dvc pull

# Or pull specific files
dvc pull datasets/credit_score_dataset/german_credit_risk_v1.0.0_training_23012026.csv.dvc

Updating Datasets

# Modify the dataset
python scripts/update_data.py

# Update DVC tracking
dvc add datasets/credit_score_dataset/german_credit_risk_v1.0.0_training_23012026.csv

# Commit changes
git add datasets/credit_score_dataset/german_credit_risk_v1.0.0_training_23012026.csv.dvc
git commit -m "Update training data: added 1000 new samples"

# Push updated data
dvc push
git push

Switching Data Versions

# Checkout previous data version
git checkout v1.0 -- datasets/credit_score_dataset/german_credit_risk_v1.0.0_training_23012026.csv.dvc

# Sync data to match .dvc file
dvc checkout

# Verify data version
dvc status

Integration with Training

Reproducible Training Pipeline

training/training.py
import os
import logging

import mlflow

logger = logging.getLogger(__name__)

def train(args):
    # Reference specific dataset version
    dataset_path = os.path.join(
        os.path.dirname(__file__),
        "..",
        "..",
        "..",
        "datasets",
        "credit_score_dataset",
        "german_credit_risk_v1.0.0_training_23012026.csv"  # Versioned filename
    )
    dataset_path = os.path.abspath(dataset_path)

    # Fail fast if the data has not been pulled
    if not os.path.exists(dataset_path):
        logger.error(f"Dataset not found at {dataset_path}")
        logger.info("Run 'dvc pull' to download dataset")
        raise FileNotFoundError(f"Dataset missing: {dataset_path}")

    logger.info(f"Loading data from {dataset_path}")
    df = load_data(dataset_path)  # project helper

    # Log dataset metadata to MLflow
    mlflow.log_param("dataset_path", dataset_path)
    mlflow.log_param("dataset_size", len(df))
    mlflow.log_param("dataset_md5", get_file_md5(dataset_path))  # project helper

    # Continue training...
Best Practices:
  • Include dataset version in filename
  • Log dataset metadata to MLflow
  • Fail fast if data is missing
  • Document data requirements in README
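Rather than recomputing a checksum at training time, the dataset metadata can also be read straight from the .dvc pointer, since it is plain YAML. A minimal sketch, assuming PyYAML is installed (the function name is illustrative):

```python
import yaml

def read_dvc_metadata(dvc_file):
    """Return the (md5, size) recorded for the first output in a .dvc file."""
    with open(dvc_file) as f:
        meta = yaml.safe_load(f)
    out = meta["outs"][0]
    return out["md5"], out["size"]
```

The returned values can be passed directly to `mlflow.log_param` so each run records exactly which data version it consumed.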

Version Naming Convention

<dataset_name>_v<major>.<minor>.<patch>_<purpose>_<date>.csv

Examples:
german_credit_risk_v1.0.0_training_23012026.csv
german_credit_risk_v1.1.0_validation_15022026.csv
german_credit_risk_v2.0.0_training_01032026.csv

Version Semantics:
- Major: Breaking schema changes
- Minor: Backward-compatible additions
- Patch: Bug fixes, data corrections
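A filename that follows this convention can be parsed back into its components, for example to validate names in CI. A minimal sketch (the regex and field names are illustrative):

```python
import re

# Pattern mirroring <dataset_name>_v<major>.<minor>.<patch>_<purpose>_<date>.csv
PATTERN = re.compile(
    r"(?P<name>.+)_v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"_(?P<purpose>[a-z]+)_(?P<date>\d{8})\.csv"
)

def parse_dataset_filename(filename):
    """Split a versioned dataset filename into its named components."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"Filename does not follow convention: {filename}")
    return match.groupdict()
```

Running it against the first example above yields the dataset name, the three version numbers, the purpose, and the date as separate fields.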

Advanced Patterns

DVC Pipelines

Define reproducible data processing workflows:
dvc.yaml
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - scripts/preprocess.py
      - datasets/raw/german_credit_risk_raw.csv
    outs:
      - datasets/processed/german_credit_risk_v1.0.0_training_23012026.csv
    params:
      - preprocess.remove_outliers
      - preprocess.handle_missing
  
  train:
    cmd: python training/training.py --config config/models-configs/model_config_001.yaml
    deps:
      - training/training.py
      - datasets/processed/german_credit_risk_v1.0.0_training_23012026.csv
      - config/models-configs/model_config_001.yaml
    outs:
      - model/model_weights_001.pth
    metrics:
      - mlruns/metrics.json
Run Pipeline:
# Execute entire pipeline
dvc repro

# Run specific stage
dvc repro train

# Visualize pipeline
dvc dag

Data Registry

Centralize datasets across projects:
# Create data registry repository
git init data-registry
cd data-registry
dvc init
# (configure a default remote so 'dvc push' has a destination)

# Add datasets
dvc add datasets/credit_score/german_credit_risk_v1.0.0.csv
dvc add datasets/energy_imports/energy_data_v1.0.0.csv
git add .
git commit -m "Initialize data registry"
dvc push

# Import data in projects
cd /path/to/ml-project
dvc import https://github.com/org/data-registry datasets/credit_score/german_credit_risk_v1.0.0.csv

Experiment Tracking Integration

Combine DVC with MLflow for complete lineage:
import dvc.api
import mlflow
import pandas as pd

# Get dataset metadata from DVC
with dvc.api.open(
    'datasets/credit_score_dataset/german_credit_risk_v1.0.0_training_23012026.csv',
    mode='r'
) as f:
    df = pd.read_csv(f)

# Log to MLflow
with mlflow.start_run():
    mlflow.log_param("dataset_url", dvc.api.get_url(
        'datasets/credit_score_dataset/german_credit_risk_v1.0.0_training_23012026.csv'
    ))
    mlflow.log_param("dvc_version", dvc.__version__)
    mlflow.log_param("git_commit", get_git_commit())
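The snippet above assumes a `get_git_commit` helper. A minimal sketch using subprocess (assumes the code runs inside a Git checkout with `git` on PATH):

```python
import subprocess

def get_git_commit():
    """Return the current Git commit hash, linking the run to an exact code version."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
```

Logging the commit alongside the DVC dataset URL ties each MLflow run to one code version and one data version.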

Best Practices

Version Everything

  • Track all datasets, models, and artifacts
  • Include version in filenames
  • Document data lineage
  • Tag releases in Git

Selective Sync

  • Pull only required datasets
  • Use .dvcignore for temporary files
  • Organize datasets by project
  • Clean unused cache regularly

Remote Storage

  • Never commit large files to Git
  • Use appropriate storage tier (hot/cold)
  • Enable encryption at rest
  • Configure backup policies

Access Control

  • Use IAM roles instead of keys
  • Separate dev/prod remotes
  • Audit data access logs
  • Rotate credentials regularly

Troubleshooting

Missing Data Files

Symptom:
FileNotFoundError: Dataset not found at datasets/credit_score_dataset/german_credit_risk_v1.0.0_training_23012026.csv
Solution:
# Check DVC status
dvc status

# Pull missing data
dvc pull

# Verify data integrity
dvc checkout

Remote Push Access Denied

Symptom:
ERROR: failed to push data to remote - access denied
Solutions:
# Check remote configuration
dvc remote list
dvc remote list --local

# Verify credentials
dvc remote modify storage --local access_key_id YOUR_KEY

# Test connection
dvc pull --remote storage

Cache Growing Too Large

Check cache size:
# View cache location
dvc cache dir

# Check cache size
du -sh .dvc/cache
Clean unused cache:
# Remove unreferenced cache
dvc gc --workspace

# Remove all cache (use with caution)
dvc gc --all-branches --all-tags --all-commits

Merge Conflicts in .dvc Files

When .dvc files conflict:
# Accept one version
git checkout --ours datasets/data.csv.dvc
# or
git checkout --theirs datasets/data.csv.dvc

# Sync data
dvc checkout

# Complete merge
git add datasets/data.csv.dvc
git commit

Quick Reference

# Initialize DVC in repository
dvc init

# Add dataset to tracking
dvc add datasets/data.csv

# Configure remote storage
dvc remote add -d myremote s3://bucket/path

# Push data to remote
dvc push

# Pull data from remote
dvc pull

# Pull specific files
dvc pull datasets/data.csv.dvc

# Check status
dvc status

# Update tracked file
dvc add datasets/data.csv

# Switch to specific version
git checkout <commit> -- datasets/data.csv.dvc
dvc checkout

# Remove from tracking
dvc remove datasets/data.csv.dvc

# Show the remote storage URL of a tracked file without pulling it
dvc get . datasets/data.csv --show-url

# Import from another repository
dvc import <repo_url> datasets/data.csv

Next Steps

MLOps Architecture

Integrate DVC with MLflow for complete lineage

Project Structure

Understand where data fits in the project
