What is Remote Storage?

Remote storage in DVC is where you store the actual data files, models, and artifacts tracked by DVC - separate from Git. While Git repositories contain .dvc files (metadata pointers), remote storage holds the real data. This enables teams to share large files without bloating Git repos.
Key Concept: Remote storage acts like a centralized cache. Team members push/pull data to/from remotes, similar to how Git push/pull works for code.

Why Remote Storage Matters

  • Collaboration: Share datasets and models with your team
  • Backup: Protect against data loss with cloud storage
  • Storage efficiency: Only download data you need
  • Version consistency: Ensure everyone uses the same data versions
  • Scale: Store terabytes of data without Git performance issues

How Remote Storage Works

The remote storage system is implemented in dvc/data_cloud.py. When you run commands like dvc push or dvc pull, DVC transfers data between your local cache and remote storage.

Architecture Overview

┌─────────────────┐
│  Your Workspace │  (working files)
│   data/         │
└────────┬────────┘
         │ dvc add/checkout

┌─────────────────┐
│  Local Cache    │  (.dvc/cache)
│  Content-based  │
│  storage        │
└────────┬────────┘
         │ dvc push/pull

┌─────────────────┐
│ Remote Storage  │  (S3, GCS, Azure, etc.)
│  Team's shared  │
│  cache          │
└─────────────────┘
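The cache layer in the middle is content-addressed: a file's location under .dvc/cache is derived from the hash of its contents. A minimal sketch of that mapping, assuming DVC's current default layout of files/md5/<first two hex chars>/<remaining chars> (cache_path_for is a hypothetical helper, not DVC API):

```python
import hashlib

def cache_path_for(data: bytes, cache_dir: str = ".dvc/cache") -> str:
    # Hypothetical helper, not DVC API: derive a content-addressed cache
    # path from the md5 digest (files/md5/<first 2 chars>/<rest>).
    digest = hashlib.md5(data).hexdigest()
    return f"{cache_dir}/files/md5/{digest[:2]}/{digest[2:]}"

path = cache_path_for(b"hello world")
```

Because the path depends only on content, two identical files always resolve to the same cache entry, which is what makes push/pull deduplication possible.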

Supported Storage Types

DVC supports many storage backends:

  • Amazon S3: AWS S3 buckets
  • Google Cloud: GCS buckets
  • Azure Blob: Azure Blob Storage containers
  • SSH/SFTP: remote servers
  • HDFS: Hadoop filesystem
  • HTTP/HTTPS: web servers
  • Local/NFS: local or network drives
  • WebDAV: WebDAV servers
  • OSS: Alibaba Cloud OSS

Setting Up a Remote

Add a remote storage location:
# Amazon S3
dvc remote add -d myremote s3://mybucket/dvc-storage

# Google Cloud Storage
dvc remote add -d myremote gs://mybucket/dvc-storage

# Azure Blob Storage
dvc remote add -d myremote azure://mycontainer/path

# SSH
dvc remote add -d myremote ssh://user@example.com/path/to/storage

# Local or network drive
dvc remote add -d myremote /mnt/shared-storage
The -d flag sets it as the default remote.
Remote configurations are stored in .dvc/config (project) or .dvc/config.local (user-specific).

The DataCloud Class

Remote operations are managed by the DataCloud class in dvc/data_cloud.py:67-125:
from typing import Optional

class DataCloud:
    """Class that manages dvc remotes.
    
    Args:
        repo (dvc.repo.Repo): repo instance that belongs to the repo that
            we are working on.
    
    Raises:
        config.ConfigError: thrown when config has invalid format.
    """
    
    def __init__(self, repo):
        self.repo = repo
    
    def get_remote(
        self,
        name: Optional[str] = None,
        command: str = "<command>",
    ) -> "Remote":
        if not name:
            name = self.repo.config["core"].get("remote")
        
        if name:
            from dvc.fs import get_cloud_fs
            
            cls, config, fs_path = get_cloud_fs(self.repo.config, name=name)
            # ... create and return Remote instance
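The name fallback in get_remote (an explicit name wins; otherwise the core.remote default from config is used) can be sketched as follows. resolve_remote_name is an illustrative stand-in, not the real method:

```python
from typing import Optional

def resolve_remote_name(config: dict, name: Optional[str] = None) -> Optional[str]:
    # Illustrative stand-in: mirror get_remote's fallback to core.remote.
    return name or config.get("core", {}).get("remote")

config = {"core": {"remote": "myremote"}}
default = resolve_remote_name(config)        # falls back to core.remote
explicit = resolve_remote_name(config, "backup")  # explicit name wins
```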

Remote Class

Each remote is represented by a Remote object from dvc/data_cloud.py:21-50:
from functools import cached_property

class Remote:
    def __init__(self, name: str, path: str, fs: "FileSystem", *, index=None, **config):
        self.path = path
        self.fs = fs
        self.name = name
        self.index = index
        
        self.worktree: bool = config.pop("worktree", False)
        self.config = config
    
    @cached_property
    def odb(self) -> "HashFileDB":
        from dvc.cachemgr import CacheManager
        from dvc_data.hashfile.db import get_odb
        from dvc_data.hashfile.hash import DEFAULT_ALGORITHM
        
        path = self.path
        if self.worktree:
            path = self.fs.join(path, ".dvc", CacheManager.FILES_DIR, DEFAULT_ALGORITHM)
        else:
            path = self.fs.join(path, CacheManager.FILES_DIR, DEFAULT_ALGORITHM)
        return get_odb(self.fs, path, hash_name=DEFAULT_ALGORITHM, **self.config)
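The two branches above differ only in the .dvc prefix used for worktree remotes. A minimal sketch of the join logic (odb_path is a hypothetical helper; files_dir assumes CacheManager.FILES_DIR is "files"):

```python
import posixpath

def odb_path(remote_path: str, worktree: bool,
             files_dir: str = "files", algo: str = "md5") -> str:
    # Hypothetical helper mirroring Remote.odb's path construction;
    # files_dir assumes CacheManager.FILES_DIR == "files".
    if worktree:
        return posixpath.join(remote_path, ".dvc", files_dir, algo)
    return posixpath.join(remote_path, files_dir, algo)

plain = odb_path("s3://bucket/store", worktree=False)
nested = odb_path("s3://bucket/store", worktree=True)
```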

Pushing Data

Upload tracked files to remote storage:
# Push all tracked data
dvc push

# Push specific files
dvc push data/train.csv.dvc

# Push to specific remote
dvc push -r myremote

# Push with multiple parallel jobs
dvc push -j 8
The push implementation in dvc/data_cloud.py:168-198:
def push(
    self,
    objs: Iterable["HashInfo"],
    jobs: Optional[int] = None,
    remote: Optional[str] = None,
    odb: Optional["HashFileDB"] = None,
) -> "TransferResult":
    """Push data items in a cloud-agnostic way.
    
    Args:
        objs: objects to push to the cloud.
        jobs: number of jobs that can be running simultaneously.
        remote: optional name of remote to push to.
            By default remote from core.remote config option is used.
        odb: optional ODB to push to. Overrides remote.
    """
    if odb is not None:
        return self._push(objs, jobs=jobs, odb=odb)
    legacy_objs, default_objs = _split_legacy_hash_infos(objs)
    result = TransferResult(set(), set())
    if legacy_objs:
        odb = self.get_remote_odb(remote, "push", hash_name="md5-dos2unix")
        t, f = self._push(legacy_objs, jobs=jobs, odb=odb)
        result.transferred.update(t)
        result.failed.update(f)
    if default_objs:
        odb = self.get_remote_odb(remote, "push")
        t, f = self._push(default_objs, jobs=jobs, odb=odb)
        result.transferred.update(t)
        result.failed.update(f)
    return result
Use dvc push -j 16 to speed up uploads with parallel transfers. Adjust based on your network and storage.

Pulling Data

Download tracked files from remote storage:
# Pull all tracked data
dvc pull

# Pull specific files
dvc pull data/train.csv.dvc

# Pull from specific remote
dvc pull -r myremote

# Pull with multiple parallel jobs
dvc pull -j 8
DVC only downloads files that:
  • Are missing from your local cache
  • Have checksums different from what’s in cache
  • Are required by your current .dvc files
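That selection amounts to a set difference between the hashes the current .dvc files require and the hashes already in the local cache. A simplified sketch (objects_to_pull is illustrative, not DVC API):

```python
def objects_to_pull(required: dict[str, str], local_cache: set[str]) -> dict[str, str]:
    # Illustrative: of the hashes required by the current .dvc files,
    # keep only those whose hash is missing from the local cache.
    return {path: h for path, h in required.items() if h not in local_cache}

missing = objects_to_pull(
    {"data/train.csv": "aaa111", "model.pkl": "bbb222"},
    local_cache={"aaa111"},
)
# only model.pkl (hash bbb222) still needs downloading
```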

Checking Status

See what would be pushed/pulled:
# Check status against default remote
dvc status -c

# Check against specific remote
dvc status -r myremote -c
Output shows:
  • Files that would be pushed
  • Files that would be pulled
  • Files not in cache

Remote Configuration

Configure remote settings in .dvc/config:
[core]
    remote = myremote

['remote "myremote"']
    url = s3://mybucket/dvc-storage
    region = us-west-2
    profile = myprofile
Or use commands:
# Set remote-specific options
dvc remote modify myremote region us-west-2
dvc remote modify myremote profile myprofile

# For S3
dvc remote modify myremote access_key_id YOUR_KEY
dvc remote modify myremote secret_access_key YOUR_SECRET
Security: Never commit credentials to Git. Use .dvc/config.local for sensitive settings or environment variables.

Authentication

AWS S3

# Use AWS credentials file
dvc remote modify myremote profile myprofile

# Use environment variables
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."

# Use IAM role (on EC2)
# No configuration needed

Google Cloud Storage

# Use service account
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"

# Or configure explicitly
dvc remote modify myremote credentialpath path/to/credentials.json

Azure Blob Storage

# Use connection string
dvc remote modify myremote connection_string "..."

# Or use account name and key
dvc remote modify myremote account_name myaccount
dvc remote modify myremote account_key "..."

SSH

# Use SSH key
dvc remote modify myremote keyfile ~/.ssh/id_rsa

# Use password (not recommended)
dvc remote modify myremote password mypassword

# Prompt for the password interactively instead of storing it
dvc remote modify myremote ask_password true

Advanced Features

Version-Aware Remotes

For cloud storage with versioning (S3, GCS, Azure):
dvc remote modify myremote version_aware true
This enables:
  • Tracking specific object versions
  • Time travel to previous data states
  • Protection against accidental overwrites

Worktree Remotes

Store data alongside another DVC repository:
dvc remote add -d shared /mnt/shared-dvc-repo
dvc remote modify shared worktree true
This treats the remote as a full DVC workspace, not just a cache.

Read-Only Remotes

Prevent accidental pushes:
dvc remote modify myremote read_only true

Custom Storage Paths

Organize remote storage:
# Store by branch
dvc remote modify myremote url s3://bucket/${DVC_EXP_NAME}

# Store by user
dvc remote modify myremote url s3://bucket/${USER}

Transfer Optimization

Parallel Jobs

# Use more parallel transfers
dvc push -j 16
dvc pull -j 16

Partial Downloads

Only download what you need:
# Pull specific pipeline stage
dvc repro --pull train

# Pull only the run cache
dvc pull --run-cache

Retry Configuration

# Increase retry attempts for unreliable connections
dvc remote modify myremote retry_count 10

# Increase timeout
dvc remote modify myremote timeout 300

Hash Algorithm Handling

DVC handles different hash algorithms for different storage types. From dvc/data_cloud.py:52-64:
def _split_legacy_hash_infos(
    hash_infos: Iterable["HashInfo"],
) -> tuple[set["HashInfo"], set["HashInfo"]]:
    from dvc.cachemgr import LEGACY_HASH_NAMES
    
    legacy = set()
    default = set()
    for hi in hash_infos:
        if hi.name in LEGACY_HASH_NAMES:
            legacy.add(hi)
        else:
            default.add(hi)
    return legacy, default
This ensures compatibility between DVC versions and storage types:
  • Local: Uses md5 or md5-dos2unix
  • S3/GCS: Can use etag for efficiency
  • HDFS: Uses native checksum
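A small demo of that partitioning, using a namedtuple stand-in for dvc_data's HashInfo (the LEGACY_HASH_NAMES set shown here is assumed for illustration; the real set lives in dvc.cachemgr):

```python
from collections import namedtuple

HashInfo = namedtuple("HashInfo", ["name", "value"])  # stand-in for dvc_data's HashInfo
LEGACY_HASH_NAMES = {"md5-dos2unix"}  # assumed subset; the real set is in dvc.cachemgr

def split_legacy(hash_infos):
    # Same partitioning as _split_legacy_hash_infos above.
    legacy, default = set(), set()
    for hi in hash_infos:
        (legacy if hi.name in LEGACY_HASH_NAMES else default).add(hi)
    return legacy, default

legacy, default = split_legacy(
    [HashInfo("md5", "a1"), HashInfo("md5-dos2unix", "b2")]
)
```

Each group is then pushed or pulled against an ODB configured with the matching hash name, which is why push calls get_remote_odb twice.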

Data Transfer Flow

When you run dvc push, the data flow is:
  1. Collect objects: Gather all tracked files needing upload
  2. Check remote: Query which files already exist remotely
  3. Transfer: Upload missing files using storage backend
  4. Verify: Confirm successful uploads
From dvc/data_cloud.py:157-166:
def transfer(
    self,
    src_odb: "HashFileDB",
    dest_odb: "HashFileDB",
    objs: Iterable["HashInfo"],
    **kwargs,
) -> "TransferResult":
    from dvc_data.hashfile.transfer import transfer
    
    return transfer(src_odb, dest_odb, objs, **kwargs)
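Modeling the remote as a dict keyed by content hash, the four-step flow above can be sketched as follows (push_missing is illustrative, not DVC API):

```python
def push_missing(local_objects: dict[str, bytes], remote: dict[str, bytes]) -> set[str]:
    # Illustrative sketch of the push flow against an in-memory "remote".
    missing = set(local_objects) - set(remote)  # step 2: check what exists remotely
    for h in missing:                           # step 3: transfer missing objects
        remote[h] = local_objects[h]
    assert set(local_objects) <= set(remote)    # step 4: verify
    return missing                              # transferred hashes

local = {"a1": b"data-a", "b2": b"data-b"}
remote: dict[str, bytes] = {}
first = push_missing(local, remote)   # uploads both objects
second = push_missing(local, remote)  # nothing left to upload
```

The existence check before transfer is what makes repeated pushes cheap: objects already present remotely are never re-uploaded.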

Multiple Remotes

You can configure multiple remotes for different purposes:
# Default remote for team
dvc remote add -d team s3://team-bucket/dvc-storage

# Personal backup
dvc remote add backup gs://my-personal-bucket/backup

# Local cache for quick access
dvc remote add local /mnt/fast-storage

# Push to all remotes
dvc push -r team
dvc push -r backup

Storage Costs and Optimization

Deduplication

DVC’s content-addressable storage means identical files are stored once, even across projects

Compression

Some storage backends support transparent compression (configure per-remote)

Lifecycle Policies

Use cloud provider features to archive or delete old data automatically

Regional Storage

Store data in regions close to compute for faster access

Troubleshooting

Connection Issues

# Test remote connectivity
dvc remote list
dvc status -c -r myremote

# Increase verbosity
dvc push -v
dvc pull -vv

Permission Errors

# Verify credentials
aws s3 ls s3://mybucket/  # For S3
gsutil ls gs://mybucket/  # For GCS

# Check DVC configuration
dvc config remote.myremote.url

Large File Performance

# Use more parallel jobs
dvc push -j 32

# Skip checksum verification (faster but risky)
dvc push --no-verify

Related Commands

  • dvc remote - Manage remote storage configurations
  • dvc push - Upload data to remote storage
  • dvc pull - Download data from remote storage
  • dvc fetch - Download to cache without checking out
  • dvc status - Check data status vs remote

Next Steps

Data Versioning

Understand how data is tracked and versioned locally

Experiments

Share experiment results via remote storage
