Overview

DVC is designed to work seamlessly with Git, enabling teams to collaborate on ML projects just like software projects. While Git tracks code and metadata, DVC tracks data, models, and pipeline outputs.
The key to DVC collaboration: Git tracks the small metadata files (.dvc files, dvc.yaml, dvc.lock), while DVC remote storage holds the actual data.

The Collaboration Workflow

1

Initial setup

The first team member initializes the project:
# Initialize DVC
dvc init
git add .dvc .dvcignore
git commit -m "Initialize DVC"

# Configure remote storage
dvc remote add -d storage s3://team-bucket/project-data
git add .dvc/config
git commit -m "Configure DVC remote"

# Track data
dvc add data/dataset.csv
git add data/dataset.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"

# Push everything
git push origin main
dvc push
2

Team members clone

Other team members clone and get data:
# Clone repository
git clone https://github.com/team/project.git
cd project

# Pull data from DVC remote
dvc pull
If the remote requires credentials, configure them locally before running dvc pull: dvc remote modify --local storage profile myprofile
3

Make changes

Team members work independently:
# Create feature branch
git checkout -b feature/new-model

# Modify pipeline and data
dvc repro

# Track changes
git add dvc.yaml dvc.lock params.yaml
git commit -m "Improve model architecture"

# Push code and data
git push origin feature/new-model
dvc push
4

Review and merge

Team reviews and merges changes:
# Create pull request in GitHub/GitLab
gh pr create --title "Improve model architecture"

# After approval, merge
git checkout main
git merge feature/new-model

# Pull new data
dvc pull

Sharing Data

Adding New Data

When you add data to the project:
# Track new data
dvc add data/new_dataset.csv

# Commit metadata to Git
git add data/new_dataset.csv.dvc data/.gitignore
git commit -m "Add new dataset"

# Share with team
git push origin main
dvc push
Team members get the data:
git pull
dvc pull

Updating Existing Data

# Modify data (e.g., add more samples)
echo "new,data,rows" >> data/dataset.csv

# Update DVC tracking
dvc add data/dataset.csv

# Commit and push
git add data/dataset.csv.dvc
git commit -m "Update dataset with new samples"
git push
dvc push
Each dvc add records a new version of the data. Old versions remain in the cache and in remote storage until you explicitly remove them with dvc gc.
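Because each version's hash is stored in the Git history of the .dvc file, a single file can be rolled back without checking out a whole commit. The sketch below wraps the two commands in a hypothetical helper, restore_data_version (the name and argument order are illustrative, not part of DVC). It assumes the old version is still in the local cache.

```shell
# Hypothetical helper: restore one DVC-tracked file as of an earlier Git revision.
# Usage: restore_data_version <git-rev> <path-to-data-file>
restore_data_version() {
    # Bring back the old .dvc metafile, which records the old content hash
    git checkout "$1" -- "$2.dvc" || return 1
    # Sync the workspace file with the hash recorded in that metafile
    dvc checkout "$2.dvc"
}
```

For example, restore_data_version HEAD~3 data/dataset.csv would restore the dataset as of three commits ago. If the cache no longer holds that version, running dvc pull data/dataset.csv.dvc afterwards should fetch it from the remote.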

Sharing Pipelines

Creating a Pipeline

One team member creates a pipeline:
# Build pipeline
dvc stage add -n preprocess \
  -d data/raw.csv \
  -o data/processed.csv \
  python preprocess.py

dvc stage add -n train \
  -d data/processed.csv \
  -p train.lr,train.epochs \
  -o models/model.pkl \
  python train.py

# Run and track
dvc repro

# Commit pipeline definition and lock file
git add dvc.yaml dvc.lock
git commit -m "Create preprocessing and training pipeline"

# Push code and outputs
git push
dvc push

Running a Shared Pipeline

Team members reproduce the pipeline:
# Get latest code
git pull

# Get pipeline outputs from remote
dvc pull

# Or run the pipeline locally
dvc repro
Use dvc pull if you just need the results. Use dvc repro if you want to re-run the pipeline.
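When it is unclear which of the two applies, it can help to preview what DVC considers out of date before running anything. The following is a minimal sketch; preview_pipeline is a hypothetical helper name, and only dvc status and dvc repro --dry are real DVC commands.

```shell
# Hypothetical helper: show what is out of date without executing any stage.
preview_pipeline() {
    # Lists stages whose dependencies, parameters, or outputs have changed
    dvc status || return 1
    # Prints the commands dvc repro would run, without running them
    dvc repro --dry
}
```

If the preview lists no stages to run, dvc pull alone is enough to get the results.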

Sharing Experiments

Push Experiments to Git Remote

# Run experiments
dvc exp run -n "baseline" -S train.lr=0.001
dvc exp run -n "high-lr" -S train.lr=0.01

# Push experiments to Git remote
dvc exp push origin baseline high-lr

# Or push all experiments
dvc exp push origin --all-commits

Pull Team Members’ Experiments

# List experiments on remote
dvc exp list origin

# Pull specific experiments
dvc exp pull origin baseline high-lr

# Or pull all experiments
dvc exp pull origin --all-commits

# View experiments, including the ones just pulled
dvc exp show
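After pulling a teammate's experiments, a common next step is to compare one against a reference and bring it into the workspace. The sketch below assumes both experiments exist locally; adopt_experiment is a hypothetical helper name, while dvc exp diff and dvc exp apply are the underlying DVC commands.

```shell
# Hypothetical helper: compare two experiments, then apply the second one.
# Usage: adopt_experiment <reference-exp> <candidate-exp>
adopt_experiment() {
    # Show parameter and metric differences between the two experiments
    dvc exp diff "$1" "$2" || return 1
    # Bring the candidate experiment's changes into the workspace
    dvc exp apply "$2"
}
```

For example, adopt_experiment baseline high-lr prints the differences and then applies high-lr. Applying changes workspace files, so it is best run on a clean working tree.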

Branch-Based Collaboration

Feature Branch Workflow

# Create feature branch
git checkout -b feature/data-augmentation

# Modify pipeline
dvc stage add -n augment \
  -d data/raw.csv \
  -o data/augmented.csv \
  python augment.py

dvc repro

# Commit and push
git add dvc.yaml dvc.lock
git commit -m "Add data augmentation stage"
git push origin feature/data-augmentation
dvc push

Merging Branches

# Switch to main
git checkout main

# Merge feature
git merge feature/data-augmentation

# Resolve any conflicts in dvc.yaml or dvc.lock
# Then pull corresponding data
dvc pull
If both branches modified the same pipeline stage, you may have merge conflicts in dvc.yaml and dvc.lock. Resolve them like any Git merge conflict.

Handling Merge Conflicts

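Conflicts in dvc.lock and .dvc Files

When two branches modify the same data or the same stage, Git reports a conflict in the corresponding metadata file:
# dvc.lock shows conflict (simplified)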
<<<<<<< HEAD
  prepare:
    cmd: python prepare.py --version 1
    md5: abc123
=======
  prepare:
    cmd: python prepare.py --version 2
    md5: def456
>>>>>>> feature/new-approach
Resolve by:
1

Choose the version you want

Edit the file to keep one version or combine them.
2

Re-run the pipeline

dvc repro
3

Commit resolved conflict

git add dvc.lock
git commit -m "Resolve pipeline conflict"

Conflicts in params.yaml

# params.yaml shows conflict
<<<<<<< HEAD
train:
  lr: 0.001
  epochs: 10
=======
train:
  lr: 0.01
  epochs: 20
>>>>>>> feature/hyperparams
Resolve, then re-run:
# Edit params.yaml to resolve
vim params.yaml

# Re-run pipeline with resolved params
dvc repro

# Commit
git add params.yaml dvc.lock
git commit -m "Resolve parameter conflict"
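Before re-running the pipeline, it is worth confirming that no conflict markers were left behind, since dvc repro cannot parse a file that still contains them. The check below is a small sketch in plain shell; has_conflict_markers is a hypothetical helper, not a DVC or Git command.

```shell
# Hypothetical helper: exit 0 if any given file still contains Git conflict markers.
# Usage: has_conflict_markers <file>...
has_conflict_markers() {
    # Match the three marker lines Git writes: "<<<<<<< ", "=======", ">>>>>>> "
    grep -qE '^(<<<<<<< |=======$|>>>>>>> )' "$@"
}
```

A guard such as has_conflict_markers dvc.lock params.yaml dvc.yaml && echo "unresolved conflicts remain" can be run just before dvc repro.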

Working with Data Versions

Switch to a Previous Data Version

# Checkout old commit
git checkout HEAD~5

# Get corresponding data
dvc checkout
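dvc checkout only restores data that is already in the local cache. If the older version was never downloaded on this machine, it has to come from the remote instead. A minimal sketch of that fallback, using a hypothetical helper named checkout_with_data:

```shell
# Hypothetical helper: switch Git revision and bring the matching data along.
# Usage: checkout_with_data <git-rev>
checkout_with_data() {
    git checkout "$1" || return 1
    # Use the local cache when possible; otherwise download from the remote
    dvc checkout || dvc pull
}
```

For example, checkout_with_data HEAD~5 mirrors the two commands above and falls back to dvc pull when the cache is incomplete.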

Compare Data Across Branches

# Compare metrics between branches
dvc metrics diff main feature/new-model

# Compare DVC-tracked files between branches
dvc diff main feature/new-model

# Or inspect the recorded hashes directly
git diff main feature/new-model -- dvc.lock

Team Best Practices

Always push data

Run dvc push after git push to ensure team members can access your data

Pull before starting work

Run git pull && dvc pull to get the latest code and data

Use feature branches

Create branches for experiments and features, merge to main when ready

Document pipelines

Add descriptions to stages with --desc for team clarity

Share experiments

Push experiments with dvc exp push origin --all-commits so the team can review

Automate with CI/CD

Set up CI/CD to run dvc repro and validate pipelines automatically
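Several of these habits can be partly automated with the Git hooks that dvc install sets up in a clone. The sketch below shows a possible one-time setup; setup_clone is a hypothetical helper name, and it assumes the default remote and credentials are already configured.

```shell
# Hypothetical one-time setup to run inside each fresh clone.
setup_clone() {
    # Install DVC's Git hooks: post-checkout runs dvc checkout,
    # pre-push runs dvc push, pre-commit runs dvc status
    dvc install || return 1
    # Fetch the data that matches the current commit
    dvc pull
}
```

With the hooks in place, git push also triggers dvc push, which makes it harder to publish a commit whose data is missing from the remote.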

Setting Up CI/CD

GitHub Actions Example

.github/workflows/train.yml
name: Train Model

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      
      - name: Install dependencies
        run: |
          pip install "dvc[s3]" -r requirements.txt
      
      - name: Configure DVC
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc remote modify storage --local access_key_id $AWS_ACCESS_KEY_ID
          dvc remote modify storage --local secret_access_key $AWS_SECRET_ACCESS_KEY
      
      - name: Pull data
        run: dvc pull
      
      - name: Run pipeline
        run: dvc repro
      
      - name: Show metrics
        run: dvc metrics show

GitLab CI Example

.gitlab-ci.yml
stages:
  - train

train_model:
  stage: train
  image: python:3.10
  script:
    - pip install "dvc[s3]" -r requirements.txt
    - dvc remote modify storage --local access_key_id $AWS_ACCESS_KEY_ID
    - dvc remote modify storage --local secret_access_key $AWS_SECRET_ACCESS_KEY
    - dvc pull
    - dvc repro
    - dvc metrics show
  only:
    - main
    - merge_requests
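
Either CI job could additionally check that the committed dvc.lock matches what the pipeline actually produces. One possible approach, sketched below, is to fail the job when dvc repro leaves dvc.lock modified. verify_lock_committed is a hypothetical helper, and the check is only meaningful when the stages are deterministic.

```shell
# Hypothetical CI check: fail if dvc repro had to change dvc.lock,
# which suggests the committed lock file was stale.
verify_lock_committed() {
    dvc repro || return 1
    # git diff --exit-code returns 1 when the file differs from the index
    git diff --exit-code -- dvc.lock
}
```

Calling verify_lock_committed in place of the plain dvc repro step would make the job fail on an out-of-date lock file.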

Multi-Team Scenarios

Data Science Team + Engineering Team

1

Data science team

# DS team works on experiments
dvc exp run -n "bert-model" -S model.type=bert
dvc exp push origin bert-model

# When satisfied, promote to branch
dvc exp branch bert-model production-candidate
git push origin production-candidate
2

Engineering team

# Engineers pull candidate model
git checkout production-candidate
dvc pull

# Test in production environment
python test_production.py

# Deploy if successful (assumes a DVC remote named "production" is configured)
dvc push -r production

Regional Teams with Different Data

# Configure multiple remotes
dvc remote add us-data s3://us-bucket/data
dvc remote add eu-data s3://eu-bucket/data

# US team (--local keeps the choice out of the shared .dvc/config)
dvc remote default --local us-data
dvc pull

# EU team
dvc remote default --local eu-data
dvc pull

Access Control

Read-Only Access

Give some team members read-only access:
# Team member with read-only credentials
dvc pull  # Works
dvc push   # Fails with permission error

Separate Credentials

Each team member uses their own credentials:
# Configure credentials locally (not committed)
dvc remote modify --local storage profile alice-profile
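The --local flag writes to .dvc/config.local, which DVC's own .gitignore is expected to exclude from Git. The sketch below sets the profile and then confirms that the file is in fact ignored; set_local_profile is a hypothetical helper, and the remote and profile names are whatever your team uses.

```shell
# Hypothetical helper: set a per-user profile and confirm it cannot be committed.
# Usage: set_local_profile <remote-name> <profile-name>
set_local_profile() {
    dvc remote modify --local "$1" profile "$2" || return 1
    # Exits 0 only if Git ignores the local config file
    git check-ignore -q .dvc/config.local
}
```

For example, set_local_profile storage alice-profile returns a non-zero status if .dvc/config.local could end up in a commit.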

Troubleshooting Collaboration Issues

If data files are missing or stale after git pull, remember to run dvc pull as well:
git pull
dvc pull
Or combine them:
git pull && dvc pull
For merge conflicts in dvc.lock, it is usually safe to accept one version and regenerate the file:
# Accept their version
git checkout --theirs dvc.lock

# Re-run pipeline
dvc repro

# Commit
git add dvc.lock
git commit -m "Resolve dvc.lock conflict"
If the local cache is out of sync:
# Remove the local cache (default location shown)
rm -rf .dvc/cache

# Re-download the data for the current commit from the remote
dvc pull -f
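If push or pull fails with a permission error, check the remote configuration and credentials:
# View configuration
dvc remote list
dvc config -l

# Test remote access by comparing the local cache with the remote
dvc status --cloud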

Complete Team Workflow Example

1

Project lead initializes

dvc init
dvc remote add -d storage s3://team-bucket/ml-project
git add .dvc .dvcignore .dvc/config
git commit -m "Initialize DVC"
git push
2

Data engineer adds data

git pull
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Add raw dataset"
git push
dvc push
3

ML engineer builds pipeline

git pull
dvc pull

# Assumes params.yaml defines train.lr
dvc stage add -n train -d data/raw/dataset.csv -p train.lr -o models/model.pkl python train.py
dvc repro

git add dvc.yaml dvc.lock params.yaml
git commit -m "Add training pipeline"
git push
dvc push
4

Team runs experiments

git pull
dvc pull

dvc exp run -n "exp1" -S train.lr=0.01
dvc exp run -n "exp2" -S train.lr=0.001

dvc exp push origin --all-commits
5

Team reviews results

dvc exp pull origin --all-commits
# Assumes the pipeline reports a metric named accuracy
dvc exp show --sort-by accuracy

# Promote best experiment
dvc exp branch exp1 production
git checkout production
git push origin production

Next Steps

CI/CD Integration

Automate your ML workflows with continuous integration

Command Reference

Explore all DVC commands for advanced collaboration
