
Overview

This tutorial walks you through a complete DVC workflow:
  1. Initialize DVC in a Git repository
  2. Track datasets with version control
  3. Build a reproducible ML pipeline
  4. Set up remote storage and share data
The whole workflow takes about 10 minutes. You’ll create a simple ML project that trains a model, tracks its data and model files, and pushes everything to remote storage.

Prerequisites

Before starting, ensure you have:

1. DVC installed. Follow the installation guide to install DVC on your system, then verify:
   # Verify installation
   dvc version

2. Git installed. DVC works with Git repositories:
   git --version

3. Python 3.9+. This tutorial uses basic Python scripts.

Step 1: Initialize a DVC Project

Start by creating a new project and initializing Git and DVC:
# Create project directory
mkdir ml-project
cd ml-project

# Initialize Git
git init

# Initialize DVC
dvc init

# Commit DVC configuration
git commit -m "Initialize DVC"
The dvc init command created:
  • .dvc/ directory with configuration and cache
  • .dvc/.gitignore to exclude cache from Git
  • .dvc/config for DVC settings
  • .dvcignore for files DVC should ignore
These files were automatically staged in Git. DVC stores configuration in Git but keeps data separate.
DVC enables anonymous usage analytics by default to help improve the tool. You can opt out at any time with dvc config core.analytics false. See the analytics documentation for details.

Step 2: Track Your First Dataset

Let’s create a sample dataset and track it with DVC:
# Create a data directory
mkdir data

# Create a sample dataset (or use your own)
echo "feature1,feature2,label" > data/train.csv
for i in {1..1000}; do
  echo "$RANDOM,$RANDOM,$((RANDOM % 2))" >> data/train.csv
done
Now track this file with DVC:
# Add the dataset to DVC
dvc add data/train.csv
DVC created two new files:
  • data/train.csv.dvc — metadata file tracked by Git
  • data/.gitignore — tells Git to ignore the actual data file
# Check what DVC created
cat data/train.csv.dvc
You’ll see output like:
outs:
- md5: a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5
  size: 50000
  hash: md5
  path: train.csv
Commit the metadata to Git:
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training dataset"
The actual data/train.csv file is now in .dvc/cache (content-addressable storage) and linked to your workspace. Git only tracks the small .dvc file, keeping your repository lightweight.
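That content-addressable layout is simple: the file’s MD5 doubles as its address in the cache. The sketch below illustrates the idea (the files/md5 prefix matches DVC 3.x; older versions place objects directly under .dvc/cache — this is an illustration, not DVC’s actual code):

```python
# Sketch: content-addressable lookup in a DVC-style cache. The file's
# MD5 becomes its cache path: the first two hex characters form a
# directory, the rest the filename. Layout assumes DVC 3.x conventions.
import hashlib
from pathlib import Path

def md5_of(path: str) -> str:
    """Hash a file in chunks so large datasets don't load into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def cache_path(digest: str, cache_dir: str = ".dvc/cache/files/md5") -> Path:
    # a3d0e7... -> .dvc/cache/files/md5/a3/d0e7...
    return Path(cache_dir) / digest[:2] / digest[2:]
```

With the hash from the .dvc file above, cache_path("a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5") points at .dvc/cache/files/md5/a3/d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5.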

Step 3: Create Training Scripts

Create simple training and preprocessing scripts:

Create preprocess.py

preprocess.py
import pandas as pd

# Read raw data
df = pd.read_csv('data/train.csv')

# Simple preprocessing
df['feature1_norm'] = (df['feature1'] - df['feature1'].mean()) / df['feature1'].std()
df['feature2_norm'] = (df['feature2'] - df['feature2'].mean()) / df['feature2'].std()

# Save processed data
df.to_csv('data/processed.csv', index=False)

print(f"Processed {len(df)} rows")

Create train.py

train.py
import pandas as pd
import json
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load processed data
df = pd.read_csv('data/processed.csv')
X = df[['feature1_norm', 'feature2_norm']]
y = df['label']

# Train model
model = LogisticRegression()
model.fit(X, y)

# Save model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Calculate and save metrics
predictions = model.predict(X)
accuracy = accuracy_score(y, predictions)

metrics = {'accuracy': accuracy}
with open('metrics.json', 'w') as f:
    json.dump(metrics, f)

print(f"Model accuracy: {accuracy:.4f}")

Install dependencies

pip install pandas scikit-learn
Commit your code:
git add preprocess.py train.py
git commit -m "Add training scripts"

Step 4: Build a DVC Pipeline

Instead of running scripts manually, create a DVC pipeline that tracks dependencies:
# Add preprocessing stage
dvc stage add -n preprocess \
  -d data/train.csv \
  -d preprocess.py \
  -o data/processed.csv \
  python preprocess.py

# Add training stage
dvc stage add -n train \
  -d data/processed.csv \
  -d train.py \
  -o model.pkl \
  -M metrics.json \
  python train.py
  • -n — Name of the stage
  • -d — Dependencies (if any change, stage will rerun)
  • -o — Outputs (tracked by DVC)
  • -M — Metrics file (tracked but not cached)
DVC created a dvc.yaml file defining your pipeline:
cat dvc.yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
    - data/train.csv
    - preprocess.py
    outs:
    - data/processed.csv
  train:
    cmd: python train.py
    deps:
    - data/processed.csv
    - train.py
    outs:
    - model.pkl
    metrics:
    - metrics.json:
        cache: false
Commit the pipeline definition (dvc.lock doesn’t exist yet; it’s created when you run the pipeline in the next step):
git add dvc.yaml .gitignore
git commit -m "Create ML pipeline"

Step 5: Run the Pipeline

Execute your pipeline with a single command:
dvc repro
DVC will:
  1. Analyze dependencies
  2. Run stages in the correct order
  3. Track outputs
  4. Create dvc.lock with exact versions
The dvc.lock file, created by dvc repro, records the exact hashes of all dependencies and outputs, ensuring reproducibility. Always commit it to Git:
git add dvc.lock
git commit -m "Record pipeline run"
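For reference, a dvc.lock for this pipeline looks roughly like the sketch below. The data/train.csv hash reuses the value shown earlier; the bracketed hashes are placeholders, and the exact schema can vary between DVC versions:

```yaml
schema: '2.0'
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
    - path: data/train.csv
      hash: md5
      md5: a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5
      size: 50000
    - path: preprocess.py
      hash: md5
      md5: <md5 of preprocess.py>
    outs:
    - path: data/processed.csv
      hash: md5
      md5: <md5 of data/processed.csv>
```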
Check your metrics:
dvc metrics show
Output:
Path          accuracy
metrics.json  0.8723

Step 6: Make Changes and Reproduce

Let’s modify the training script and see DVC’s smart caching:
# Edit train.py to change model parameters
# For example, change: LogisticRegression() -> LogisticRegression(C=0.5)

# Reproduce the pipeline
dvc repro
DVC only reruns the train stage because preprocess hasn’t changed. This saves time on long-running pipelines.
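Under the hood, this decision is a hash comparison against dvc.lock. The following minimal Python sketch captures the idea (stage_is_stale is an illustrative name, not a DVC API):

```python
# Sketch of hash-based staleness checking, the idea behind `dvc repro`
# skipping unchanged stages. Not DVC's actual implementation.
import hashlib
from pathlib import Path

def file_md5(path: str) -> str:
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def stage_is_stale(deps: list[str], recorded: dict[str, str]) -> bool:
    """A stage must rerun if any dependency's current hash differs from
    the hash recorded after the last successful run."""
    return any(file_md5(d) != recorded.get(d) for d in deps)
```

Editing train.py changes its hash, so only the train stage is stale; the preprocess stage’s dependencies still match what the lock file recorded.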
Commit your changes:
git add train.py dvc.lock
git commit -m "Update model parameters"

Step 7: Set Up Remote Storage

To share data with your team, configure remote storage. DVC supports many storage types (Amazon S3, Google Cloud Storage, Azure Blob Storage, SSH, and more); for this tutorial, a local directory stands in for a real remote:
# Create a local "remote" directory
mkdir -p /tmp/dvc-storage

# Add it as a remote
dvc remote add -d myremote /tmp/dvc-storage

# Commit the configuration
git add .dvc/config
git commit -m "Configure local remote storage"
Check your .dvc/config file:
cat .dvc/config
Output:
[core]
    remote = myremote
['remote "myremote"']
    url = /tmp/dvc-storage

Step 8: Push Data to Remote

Upload your data and models to remote storage:
dvc push
DVC uploads:
  • data/train.csv
  • data/processed.csv
  • model.pkl
These files are now backed up and shareable.
Push data after committing to Git so teammates can access data at any commit:
git add . && git commit -m "Changes"
dvc push
git push

Step 9: Simulate Collaboration

Let’s see how a teammate would use your project:
# Clone repository (teammate's machine)
cd /tmp
git clone /path/to/ml-project ml-project-copy
cd ml-project-copy

# Pull data from remote
dvc pull
Now all data and models are downloaded from remote storage. Your teammate can:
  • View the exact data you used
  • Reproduce your results with dvc repro
  • Make their own changes
The dvc pull command downloads data based on .dvc files in the current Git commit. This ensures everyone works with consistent data versions.
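To make that concrete, here is a toy sketch of what resolving a .dvc file involves. The parser is a deliberately naive illustration for single-output .dvc files (parse_dvc_file is a hypothetical name), not DVC’s implementation:

```python
# Sketch: resolving a .dvc metadata file to the object `dvc pull` must
# fetch. Deliberately naive; handles only flat `key: value` lines of a
# one-output .dvc file.

def parse_dvc_file(text: str) -> dict:
    """Collect the simple key/value pairs from a one-output .dvc file."""
    meta = {}
    for line in text.splitlines():
        line = line.strip().lstrip("- ")
        key, _, value = line.partition(":")
        if value.strip():
            meta[key.strip()] = value.strip()
    return meta

sample = """\
outs:
- md5: a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5
  size: 50000
  hash: md5
  path: train.csv
"""
meta = parse_dvc_file(sample)
# meta == {'md5': 'a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5', 'size': '50000',
#          'hash': 'md5', 'path': 'train.csv'}
```

From the md5, DVC can derive both the local cache path and the object to download, since a remote typically mirrors the cache’s content-addressed layout.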

Step 10: Track Experiments

DVC includes built-in experiment tracking:
# Run an experiment
dvc exp run -n baseline

# Modify hyperparameters in train.py
# Run another experiment
dvc exp run -n experiment-1

# Compare experiments
dvc exp show
Output:
┌────────────────────┬──────────┬───────┐
│ Experiment         │ accuracy │ Model │
├────────────────────┼──────────┼───────┤
│ workspace          │ 0.8723   │ -     │
│ baseline           │ 0.8723   │ model │
│ experiment-1       │ 0.8845   │ model │
└────────────────────┴──────────┴───────┘
Experiments are stored as Git commits that you can apply, compare, or branch from. Use dvc exp apply to restore an experiment to your workspace.

Common Workflows

Updating Data

When your dataset changes:
# Update the file
echo "new,data,row" >> data/train.csv

# Track the new version
dvc add data/train.csv

# Commit and push
git add data/train.csv.dvc
git commit -m "Update training data"
dvc push

Checking Status

See what’s changed:
# Check pipeline status
dvc status

# Check remote sync status
dvc status --cloud

Comparing Data Versions

View differences between commits:
# Show what changed
dvc diff

# Compare specific commits
dvc diff HEAD~1 HEAD

What’s Next?

You’ve learned the core DVC workflow! Explore more:

Core Concepts

Deep dive into how DVC works internally.

Command Reference

Explore all available DVC commands.

Building Pipelines

Learn advanced pipeline features and best practices.

Running Experiments

Master experiment tracking and comparison.

Remote Storage Guide

Configure and optimize remote storage.

Python API

Use DVC programmatically in your scripts.

Summary

In this tutorial, you:
  1. Initialized DVC: set up DVC in a Git repository with dvc init
  2. Tracked data: versioned datasets using dvc add
  3. Built a pipeline: created reproducible stages with dvc stage add
  4. Ran the pipeline: executed and reproduced results with dvc repro
  5. Configured a remote: set up remote storage with dvc remote add
  6. Shared data: pushed data to remote storage with dvc push
  7. Collaborated: pulled data on another machine with dvc pull
Join the DVC community on Discord to ask questions, share projects, and learn from other users.
