
Synopsis

dvc pull [options] [<targets>...]

Description

The dvc pull command downloads DVC-tracked files from remote storage to your local cache and checks them out to your workspace. It’s a combination of dvc fetch and dvc checkout. This is analogous to git pull but for your data files. Use dvc pull to:
  • Get data after cloning a repository
  • Sync data after pulling Git changes
  • Download data for a specific branch or experiment
  • Restore missing or deleted data files
The command:
  1. Downloads missing files from remote storage to local cache
  2. Creates links (or copies) from cache to workspace
  3. Updates your workspace to match the .dvc file specifications
You must configure a remote storage location before using dvc pull. Use dvc remote add to set up a remote, or check .dvc/config if your team has already configured one.

Options

targets
path
Limit command scope to specific tracked files/directories, .dvc files, or stage names. If not specified, pulls all tracked data.
dvc pull data/train.csv models/model.pkl
-r, --remote
string
Remote storage to pull from. If not specified, uses the default remote configured in .dvc/config.
dvc pull --remote s3storage
-j, --jobs
integer
default:"4 * cpu_count()"
Number of jobs to run simultaneously. Higher values increase parallelism but use more resources.
dvc pull --jobs 8
-a, --all-branches
boolean
default:"false"
Fetch cache for all Git branches.
dvc pull --all-branches
-T, --all-tags
boolean
default:"false"
Fetch cache for all Git tags.
dvc pull --all-tags
-A, --all-commits
boolean
default:"false"
Fetch cache for all Git commits.
dvc pull --all-commits
This can download a very large amount of data. Use with caution.
-f, --force
boolean
default:"false"
Do not prompt when removing working directory files. Forces overwrite of modified files.
dvc pull --force
This will discard any local modifications to tracked files.
-d, --with-deps
boolean
default:"false"
Fetch cache for all dependencies of the specified target.
dvc pull --with-deps evaluate.dvc
-R, --recursive
boolean
default:"false"
Pull cache for all targets found in the specified directory and its subdirectories.
dvc pull --recursive experiments/
--run-cache
boolean
default:"false"
Fetch run history for all stages.
dvc pull --run-cache
--allow-missing
boolean
default:"false"
Ignore errors when some files or directories are missing from the remote.
dvc pull --allow-missing
Useful in CI/CD pipelines that don't need every data file.

Examples

Basic pull

Pull all tracked data from the default remote:
dvc pull
A       data/train.csv
A       data/test.csv
A       models/model.pkl
3 files fetched
Or if everything is up to date:
Everything is up to date.

Initial setup after cloning

Common workflow after cloning a repository:
# Clone the repository
git clone <repo-url>
cd <repo-name>

# Pull the data
dvc pull
Fetching data from remote...
A       data/dataset.csv
A       models/model.pkl
2 files fetched

Pull after Git changes

Sync data after pulling Git changes:
# Pull Git changes
git pull

# Pull corresponding data
dvc pull
M       data/train.csv
M       models/model.pkl
2 files fetched

Pull specific files

Pull only specific targets:
dvc pull data/train.csv.dvc
A       data/train.csv
1 file fetched

Pull from specific remote

Pull from a named remote:
dvc pull --remote backup

Pull with higher parallelism

Speed up pull with more concurrent jobs:
dvc pull --jobs 16

Force pull

Overwrite local changes:
dvc pull --force
This discards any uncommitted local changes to tracked files.

Pull with dependencies

Pull a pipeline stage and all its dependencies:
dvc pull --with-deps train.dvc
A       data/raw.csv
A       data/processed.csv
A       models/model.pkl
3 files fetched

Pull all branches

Fetch data for all branches (useful for caching):
dvc pull --all-branches
main:
        2 files fetched
experiment-1:
        3 files fetched
experiment-2:
        1 file fetched

Example workflows

Workflow 1: New team member

# 1. Clone repository
git clone https://github.com/company/ml-project.git
cd ml-project

# 2. Pull all data
dvc pull

# 3. Verify data is present
ls data/
ls models/

# 4. Start working
python train.py

Workflow 2: Switch branches

# Switch to experiment branch
git checkout experiment-branch

# Pull corresponding data
dvc pull

# Run experiments
python experiment.py

Workflow 3: Sync with team changes

# Get latest changes from team
git pull origin main

# Sync data
dvc pull

# Verify what changed
git log --oneline -5
dvc diff HEAD~1

Workflow 4: CI/CD pipeline

#!/bin/bash
# ci-script.sh

# Setup
git clone $REPO_URL
cd project

# Pull only necessary data
dvc pull --with-deps model_training.dvc

# Run training
python train.py

# Run tests
python test_model.py

Workflow 5: Selective data loading

# Pull only training data (not test or validation)
dvc pull data/train.csv.dvc

# Train model
python train.py

# Pull test data only when needed
dvc pull data/test.csv.dvc

# Evaluate
python evaluate.py

Setting up remotes

Before using dvc pull, ensure a remote is configured:
# Check configured remotes
dvc remote list

# Add a remote if needed
dvc remote add -d myremote s3://mybucket/path

# Or check team configuration
cat .dvc/config
Common remote types:
# Amazon S3
dvc remote add myremote s3://mybucket/dvc-storage

# Google Cloud Storage
dvc remote add myremote gs://mybucket/dvc-storage

# Azure Blob Storage
dvc remote add myremote azure://mycontainer/path

# SSH
dvc remote add myremote ssh://user@host/path/to/storage

Understanding pull output

File status indicators:
| Symbol | Meaning |
| --- | --- |
| A | Added (new file created) |
| M | Modified (file was updated) |
| D | Deleted (file was removed) |
Summary line:
5 files fetched
Or when everything is synced:
Everything is up to date.
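When scripting around dvc pull (for example in CI), these status lines are easy to tally. A minimal sketch, using a hard-coded sample of the output shown above; in practice you would pipe the real command output into the same awk program:

```shell
# Tally dvc pull status lines from captured output (sample shown here;
# in practice: dvc pull | awk '...').
printf 'A\tdata/train.csv\nM\tmodels/model.pkl\n2 files fetched\n' |
  awk '$1 == "A" {a++} $1 == "M" {m++} $1 == "D" {d++}
       END {printf "added=%d modified=%d deleted=%d\n", a, m, d}'
```

This prints `added=1 modified=1 deleted=0` for the sample input, which can gate later CI steps (for example, skip retraining when nothing was added or modified).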

Error handling

No remote configured

ERROR: no remote provided and no default remote set
Solution: Configure a remote:
dvc remote add -d origin <remote-url>

Authentication errors

ERROR: failed to pull data from the cloud
Solution: Configure credentials. For S3:
# Using environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

# Or configure in DVC
dvc remote modify origin access_key_id your_key
dvc remote modify origin secret_access_key your_secret

Missing files in remote

ERROR: failed to pull data - file not found in remote
Solution: Either:
  1. Ask teammate to push: dvc push
  2. Use --allow-missing to skip missing files:
    dvc pull --allow-missing
    

Network interruption

If pull is interrupted, simply run it again:
dvc pull
DVC will resume from where it left off.

Difference between pull, fetch, and checkout

| Command | Downloads from remote | Updates workspace |
| --- | --- | --- |
| dvc pull | yes | yes |
| dvc fetch | yes | no |
| dvc checkout | no | yes |

Use dvc pull for most cases; it performs both fetch and checkout. Use dvc fetch to pre-download data without changing the workspace. Use dvc checkout when the data is already in the cache and you only need to update the workspace.
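The equivalence can be made explicit: running the two commands below in sequence produces the same end state as a plain dvc pull.

```shell
# Equivalent to `dvc pull`:
dvc fetch      # remote -> cache (workspace untouched)
dvc checkout   # cache  -> workspace
```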

Performance tips

Increase parallelism - Use --jobs for faster downloads:
dvc pull --jobs 16
Pull selectively - Only pull what you need:
dvc pull data/train.csv.dvc models/
Use --allow-missing - In CI/CD, skip missing files to avoid errors:
dvc pull --allow-missing
Pre-fetch in CI - Cache DVC data between CI runs to speed up builds.
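One way to do this, sketched for GitHub Actions with the actions/cache action; the step layout and cache key below are illustrative assumptions, not part of DVC itself:

```yaml
# Illustrative CI snippet: persist the DVC object store between runs so
# dvc pull only downloads objects that changed since the last build.
- uses: actions/cache@v4
  with:
    path: .dvc/cache                           # default DVC cache location
    key: dvc-${{ hashFiles('dvc.lock', '**/*.dvc') }}
- run: dvc pull --allow-missing                # tolerate objects not yet pushed
```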

Best practices

  1. Always pull after git pull: Keep data in sync with code
  2. Pull before starting work: Ensure you have latest data
  3. Use specific targets in CI: Only pull data needed for tests
  4. Configure credentials securely: Use environment variables or IAM roles

Related commands

  • dvc push - Upload data to remote storage
  • dvc fetch - Download to cache only
  • dvc checkout - Update workspace from cache
  • dvc status - Check sync status with remote
