This tutorial walks you through a complete DVC workflow:
Initialize DVC in a Git repository
Track datasets with version control
Build a reproducible ML pipeline
Set up remote storage and share data
This tutorial takes about 10 minutes. You’ll create a simple ML project that trains a model, tracks data and models, and pushes everything to remote storage.
Running dvc init creates the .dvc directory and a .dvcignore file, and stages them in Git automatically. DVC stores its configuration in Git but keeps data separate.
DVC enables anonymous usage analytics by default to help improve the tool. You can opt out at any time by running dvc config core.analytics false. See the analytics documentation for details.
Let’s create a sample dataset and track it with DVC:
# Create a data directory
mkdir data
# Create a sample dataset (or use your own)
echo "feature1,feature2,label" > data/train.csv
for i in {1..1000}; do
  echo "$RANDOM,$RANDOM,$((RANDOM % 2))" >> data/train.csv
done
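Before tracking a generated file, it can help to confirm it has the shape you expect. The following is a minimal sketch, not part of DVC: it builds a comparable dataset in a temporary directory (using awk instead of $RANDOM so it also runs in shells other than bash) and checks the header, the row count, and the label values. The variable names are illustrative assumptions.

```shell
# Work in a throwaway directory so the real project is untouched.
workdir=$(mktemp -d)
csv="$workdir/train.csv"

# Generate 1000 random rows; awk stands in for the bash-only $RANDOM loop.
echo "feature1,feature2,label" > "$csv"
awk 'BEGIN {
  srand()
  for (i = 0; i < 1000; i++)
    printf "%d,%d,%d\n", int(rand() * 32768), int(rand() * 32768), int(rand() * 2)
}' >> "$csv"

# Collect a few basic facts about the file.
header=$(head -n 1 "$csv")
rows=$(awk 'END { print NR - 1 }' "$csv")
# Count data rows that do not have 3 fields or whose label is not 0 or 1.
bad=$(awk -F, 'NR > 1 && (NF != 3 || ($3 != 0 && $3 != 1)) { n++ } END { print n + 0 }' "$csv")

echo "header=$header rows=$rows bad_rows=$bad"
```

To check your own dataset instead, point csv at data/train.csv and skip the generation step.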
Now track this file with DVC:
# Add the dataset to DVC
dvc add data/train.csv
DVC created two new files:
data/train.csv.dvc — metadata file tracked by Git
data/.gitignore — tells Git to ignore the actual data file
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training dataset"
The actual data/train.csv file is now in .dvc/cache (content-addressable storage) and linked to your workspace. Git only tracks the small .dvc file, keeping your repository lightweight.
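To make "content-addressable storage" concrete, here is a rough sketch of the idea using only standard tools. It is not DVC's implementation: the cache directory name and layout below are simplified assumptions, and the real layout under .dvc/cache varies between DVC versions. The point is only that a file's storage location is derived from a hash of its contents. It assumes md5sum is available (on macOS the equivalent tool is md5).

```shell
# Throwaway directory with a tiny stand-in for data/train.csv.
workdir=$(mktemp -d)
printf 'feature1,feature2,label\n1,2,0\n' > "$workdir/train.csv"

# Hash the file contents; md5sum prints "<hash>  <path>", so keep field 1.
hash=$(md5sum "$workdir/train.csv" | cut -d ' ' -f 1)

# Hypothetical cache layout: the first two hex characters become a directory,
# the remaining characters become the file name.
prefix=$(printf '%s' "$hash" | cut -c 1-2)
rest=$(printf '%s' "$hash" | cut -c 3-)
cache_path="$workdir/cache/$prefix/$rest"

mkdir -p "$workdir/cache/$prefix"
cp "$workdir/train.csv" "$cache_path"

echo "stored as: cache/$prefix/$rest"
```

Because the path depends only on the contents, two identical files map to the same cache entry, and any change to the data produces a new entry rather than overwriting the old one.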
To share data with your team, configure remote storage. DVC supports many storage types:
Local Remote (for testing)
AWS S3
Google Cloud Storage
Azure Blob Storage
SSH/SFTP
# Create a local "remote" directory
mkdir -p /tmp/dvc-storage
# Add it as a remote
dvc remote add -d myremote /tmp/dvc-storage
# Commit the configuration
git add .dvc/config
git commit -m "Configure local remote storage"
# Run an experiment
dvc exp run -n baseline
# Modify hyperparameters in train.py
# Run another experiment
dvc exp run -n experiment-1
# Compare experiments
dvc exp show
Output:
┌────────────────────┬──────────┬───────┐
│ Experiment         │ accuracy │ Model │
├────────────────────┼──────────┼───────┤
│ workspace          │ 0.8723   │ -     │
│ baseline           │ 0.8723   │ model │
│ experiment-1       │ 0.8845   │ model │
└────────────────────┴──────────┴───────┘
Experiments are stored as Git commits that you can apply, compare, or branch from. Use dvc exp apply to restore an experiment to your workspace.
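When there are more than a handful of experiments, picking the best one by eye gets tedious. The sketch below shows one possible way to select the experiment with the highest accuracy. The names and values are hardcoded from the sample output above purely for illustration; in a real project you would likely feed in machine-readable output such as dvc exp show --csv, whose exact columns depend on your metrics and DVC version.

```shell
# Experiment names and accuracies, copied from the sample table above.
results="baseline,0.8723
experiment-1,0.8845"

# Keep the name of the row with the largest accuracy value.
best=$(printf '%s\n' "$results" | awk -F, '$2 > max { max = $2; name = $1 } END { print name }')

echo "best experiment: $best"

# Inside a DVC project, restoring it to the workspace would then be:
#   dvc exp apply "$best"
```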
# Update the file
echo "new,data,row" >> data/train.csv
# Track the new version
dvc add data/train.csv
# Commit and push
git add data/train.csv.dvc
git commit -m "Update training data"
dvc push
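The reason data/train.csv.dvc changes after the update is that the file's content hash changes. A small sketch with standard tools illustrates this; it uses a stand-in file in a temporary directory and assumes md5sum is available, and it is not a reproduction of what dvc add does internally.

```shell
# Throwaway copy standing in for data/train.csv.
workdir=$(mktemp -d)
csv="$workdir/train.csv"
printf 'feature1,feature2,label\n1,2,0\n' > "$csv"

# Hash before the change.
before=$(md5sum "$csv" | cut -d ' ' -f 1)

# Append a row, as in the update step.
echo "new,data,row" >> "$csv"

# Hash after the change.
after=$(md5sum "$csv" | cut -d ' ' -f 1)

if [ "$before" != "$after" ]; then
  echo "content changed: $before -> $after"
fi
```

Since each version of the data has its own hash recorded in Git history, you can return to an earlier version by checking out the older data/train.csv.dvc with git checkout and then running dvc checkout to sync the workspace file.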