
Overview

This tutorial walks you through a complete DVC workflow:
  1. Initialize DVC in a Git repository
  2. Track datasets with version control
  3. Build a reproducible ML pipeline
  4. Set up remote storage and share data
The whole workflow takes about 10 minutes. You’ll create a simple ML project that trains a model, tracks its data and model files, and pushes everything to remote storage.

Prerequisites

Before starting, ensure you have:

1. DVC installed. Follow the installation guide to install DVC on your system, then verify:
   # Verify installation
   dvc version

2. Git installed. DVC works with Git repositories:
   git --version

3. Python 3.9+. This tutorial uses basic Python scripts.

Step 1: Initialize a DVC Project

Start by creating a new project and initializing Git and DVC:
# Create project directory
mkdir ml-project
cd ml-project

# Initialize Git
git init

# Initialize DVC
dvc init

# Commit DVC configuration
git commit -m "Initialize DVC"
The dvc init command created:
  • .dvc/ directory with configuration and cache
  • .dvc/.gitignore to exclude cache from Git
  • .dvc/config for DVC settings
  • .dvcignore for files DVC should ignore
These files were automatically staged in Git. DVC stores configuration in Git but keeps data separate.
DVC enables anonymous usage analytics by default to help improve the tool. You can opt out at any time with dvc config core.analytics false. See the analytics documentation for details.

Step 2: Track Your First Dataset

Let’s create a sample dataset and track it with DVC:
# Create a data directory
mkdir data

# Create a sample dataset (or use your own)
echo "feature1,feature2,label" > data/train.csv
for i in {1..1000}; do
  echo "$RANDOM,$RANDOM,$((RANDOM % 2))" >> data/train.csv
done
Now track this file with DVC:
# Add the dataset to DVC
dvc add data/train.csv
DVC created two new files:
  • data/train.csv.dvc — metadata file tracked by Git
  • data/.gitignore — tells Git to ignore the actual data file
# Check what DVC created
cat data/train.csv.dvc
You’ll see output like:
outs:
- md5: a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5
  size: 50000
  hash: md5
  path: train.csv
Commit the metadata to Git:
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training dataset"
The actual data/train.csv file is now in .dvc/cache (content-addressable storage) and linked to your workspace. Git only tracks the small .dvc file, keeping your repository lightweight.
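That content-addressable layout is simple: the file’s MD5 doubles as its address in the cache. The sketch below illustrates the idea (the files/md5 prefix matches DVC 3.x; older versions place objects directly under .dvc/cache — this is an illustration, not DVC’s actual code):

```python
# Sketch: content-addressable lookup in a DVC-style cache. The file's
# MD5 becomes its cache path: the first two hex characters form a
# directory, the rest the filename. Layout assumes DVC 3.x conventions.
import hashlib
from pathlib import Path

def md5_of(path: str) -> str:
    """Hash a file in chunks so large datasets don't load into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def cache_path(digest: str, cache_dir: str = ".dvc/cache/files/md5") -> Path:
    # a3d0e7... -> .dvc/cache/files/md5/a3/d0e7...
    return Path(cache_dir) / digest[:2] / digest[2:]
```

With the hash from the .dvc file above, cache_path("a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5") points at .dvc/cache/files/md5/a3/d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5.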

Step 3: Create Training Scripts

Create simple training and preprocessing scripts:

Create preprocess.py

preprocess.py
import pandas as pd

# Read raw data
df = pd.read_csv('data/train.csv')

# Simple preprocessing
df['feature1_norm'] = (df['feature1'] - df['feature1'].mean()) / df['feature1'].std()
df['feature2_norm'] = (df['feature2'] - df['feature2'].mean()) / df['feature2'].std()

# Save processed data
df.to_csv('data/processed.csv', index=False)

print(f"Processed {len(df)} rows")

Create train.py

train.py
import pandas as pd
import json
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load processed data
df = pd.read_csv('data/processed.csv')
X = df[['feature1_norm', 'feature2_norm']]
y = df['label']

# Train model
model = LogisticRegression()
model.fit(X, y)

# Save model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Calculate and save metrics
predictions = model.predict(X)
accuracy = accuracy_score(y, predictions)

metrics = {'accuracy': accuracy}
with open('metrics.json', 'w') as f:
    json.dump(metrics, f)

print(f"Model accuracy: {accuracy:.4f}")

Install dependencies

pip install pandas scikit-learn
Commit your code:
git add preprocess.py train.py
git commit -m "Add training scripts"

Step 4: Build a DVC Pipeline

Instead of running scripts manually, create a DVC pipeline that tracks dependencies:
# Add preprocessing stage
dvc stage add -n preprocess \
  -d data/train.csv \
  -d preprocess.py \
  -o data/processed.csv \
  python preprocess.py

# Add training stage
dvc stage add -n train \
  -d data/processed.csv \
  -d train.py \
  -o model.pkl \
  -M metrics.json \
  python train.py
  • -n — Name of the stage
  • -d — Dependencies (if any change, stage will rerun)
  • -o — Outputs (tracked by DVC)
  • -M — Metrics file (tracked but not cached)
DVC created a dvc.yaml file defining your pipeline:
cat dvc.yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
    - data/train.csv
    - preprocess.py
    outs:
    - data/processed.csv
  train:
    cmd: python train.py
    deps:
    - data/processed.csv
    - train.py
    outs:
    - model.pkl
    metrics:
    - metrics.json:
        cache: false
Commit the pipeline definition (dvc.lock doesn’t exist yet; it’s created when you run the pipeline in the next step):
git add dvc.yaml .gitignore
git commit -m "Create ML pipeline"

Step 5: Run the Pipeline

Execute your pipeline with a single command:
dvc repro
DVC will:
  1. Analyze dependencies
  2. Run stages in the correct order
  3. Track outputs
  4. Create dvc.lock with exact versions
The dvc.lock file, created by dvc repro, records the exact hashes of all dependencies and outputs, ensuring reproducibility. Always commit it to Git:
git add dvc.lock
git commit -m "Record pipeline run"
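For reference, a dvc.lock for this pipeline looks roughly like the sketch below. The data/train.csv hash reuses the value shown earlier; the bracketed hashes are placeholders, and the exact schema can vary between DVC versions:

```yaml
schema: '2.0'
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
    - path: data/train.csv
      hash: md5
      md5: a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5
      size: 50000
    - path: preprocess.py
      hash: md5
      md5: <md5 of preprocess.py>
    outs:
    - path: data/processed.csv
      hash: md5
      md5: <md5 of data/processed.csv>
```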
Check your metrics:
dvc metrics show
Output:
Path          accuracy
metrics.json  0.8723

Step 6: Make Changes and Reproduce

Let’s modify the training script and see DVC’s smart caching:
# Edit train.py to change model parameters
# For example, change: LogisticRegression() -> LogisticRegression(C=0.5)

# Reproduce the pipeline
dvc repro
DVC only reruns the train stage because preprocess hasn’t changed. This saves time on long-running pipelines.
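Under the hood, this decision is a hash comparison against dvc.lock. The following minimal Python sketch captures the idea (stage_is_stale is an illustrative name, not a DVC API):

```python
# Sketch of hash-based staleness checking, the idea behind `dvc repro`
# skipping unchanged stages. Not DVC's actual implementation.
import hashlib
from pathlib import Path

def file_md5(path: str) -> str:
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def stage_is_stale(deps: list[str], recorded: dict[str, str]) -> bool:
    """A stage must rerun if any dependency's current hash differs from
    the hash recorded after the last successful run."""
    return any(file_md5(d) != recorded.get(d) for d in deps)
```

Editing train.py changes its hash, so only the train stage is stale; the preprocess stage’s dependencies still match what the lock file recorded.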
Commit your changes:
git add train.py dvc.lock
git commit -m "Update model parameters"

Step 7: Set Up Remote Storage

To share data with your team, configure remote storage. DVC supports many storage types (Amazon S3, Google Cloud Storage, Azure Blob Storage, SSH, and more); for this tutorial, a local directory stands in for a real remote:
# Create a local "remote" directory
mkdir -p /tmp/dvc-storage

# Add it as a remote
dvc remote add -d myremote /tmp/dvc-storage

# Commit the configuration
git add .dvc/config
git commit -m "Configure local remote storage"
Check your .dvc/config file:
cat .dvc/config
Output:
[core]
    remote = myremote
['remote "myremote"']
    url = /tmp/dvc-storage

Step 8: Push Data to Remote

Upload your data and models to remote storage:
dvc push
DVC uploads:
  • data/train.csv
  • data/processed.csv
  • model.pkl
These files are now backed up and shareable.
Push data after committing to Git so teammates can access data at any commit:
git add . && git commit -m "Changes"
dvc push
git push

Step 9: Simulate Collaboration

Let’s see how a teammate would use your project:
# Clone repository (teammate's machine)
cd /tmp
git clone /path/to/ml-project ml-project-copy
cd ml-project-copy

# Pull data from remote
dvc pull
Now all data and models are downloaded from remote storage. Your teammate can:
  • View the exact data you used
  • Reproduce your results with dvc repro
  • Make their own changes
The dvc pull command downloads data based on .dvc files in the current Git commit. This ensures everyone works with consistent data versions.
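To make that concrete, here is a toy sketch of what resolving a .dvc file involves. The parser is a deliberately naive illustration for single-output .dvc files (parse_dvc_file is a hypothetical name), not DVC’s implementation:

```python
# Sketch: resolving a .dvc metadata file to the object `dvc pull` must
# fetch. Deliberately naive; handles only flat `key: value` lines of a
# one-output .dvc file.

def parse_dvc_file(text: str) -> dict:
    """Collect the simple key/value pairs from a one-output .dvc file."""
    meta = {}
    for line in text.splitlines():
        line = line.strip().lstrip("- ")
        key, _, value = line.partition(":")
        if value.strip():
            meta[key.strip()] = value.strip()
    return meta

sample = """\
outs:
- md5: a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5
  size: 50000
  hash: md5
  path: train.csv
"""
meta = parse_dvc_file(sample)
# meta == {'md5': 'a3d0e7d8c6b5f4e3d2c1b0a9f8e7d6c5', 'size': '50000',
#          'hash': 'md5', 'path': 'train.csv'}
```

From the md5, DVC can derive both the local cache path and the object to download, since a remote typically mirrors the cache’s content-addressed layout.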

Step 10: Track Experiments

DVC includes built-in experiment tracking:
# Run an experiment
dvc exp run -n baseline

# Modify hyperparameters in train.py
# Run another experiment
dvc exp run -n experiment-1

# Compare experiments
dvc exp show
Output:
┌────────────────────┬──────────┬───────┐
│ Experiment         │ accuracy │ Model │
├────────────────────┼──────────┼───────┤
│ workspace          │ 0.8723   │ -     │
│ baseline           │ 0.8723   │ model │
│ experiment-1       │ 0.8845   │ model │
└────────────────────┴──────────┴───────┘
Experiments are stored as Git commits that you can apply, compare, or branch from. Use dvc exp apply to restore an experiment to your workspace.

Common Workflows

Updating Data

When your dataset changes:
# Update the file
echo "new,data,row" >> data/train.csv

# Track the new version
dvc add data/train.csv

# Commit and push
git add data/train.csv.dvc
git commit -m "Update training data"
dvc push

Checking Status

See what’s changed:
# Check pipeline status
dvc status

# Check remote sync status
dvc status --cloud

Comparing Data Versions

View differences between commits:
# Show what changed
dvc diff

# Compare specific commits
dvc diff HEAD~1 HEAD

What’s Next?

You’ve learned the core DVC workflow! Explore more:

Core Concepts

Deep dive into how DVC works internally.

Command Reference

Explore all available DVC commands.

Building Pipelines

Learn advanced pipeline features and best practices.

Running Experiments

Master experiment tracking and comparison.

Remote Storage Guide

Configure and optimize remote storage.

Python API

Use DVC programmatically in your scripts.

Summary

In this tutorial, you:
  1. Initialized DVC: set up DVC in a Git repository with dvc init
  2. Tracked data: versioned datasets using dvc add
  3. Built a pipeline: created reproducible stages with dvc stage add
  4. Ran the pipeline: executed and reproduced results with dvc repro
  5. Configured a remote: set up remote storage with dvc remote add
  6. Shared data: pushed data to remote storage with dvc push
  7. Collaborated: pulled data on another machine with dvc pull
Join the DVC community on Discord to ask questions, share projects, and learn from other users.
