What is Remote Storage?

Remote storage in DVC is where you store the actual data files, models, and artifacts tracked by DVC - separate from Git. While Git repositories contain .dvc files (metadata pointers), remote storage holds the real data. This enables teams to share large files without bloating Git repos.
Key Concept: Remote storage acts like a centralized cache. Team members push/pull data to/from remotes, similar to how Git push/pull works for code.

Why Remote Storage Matters

  • Collaboration: Share datasets and models with your team
  • Backup: Protect against data loss with cloud storage
  • Storage efficiency: Only download data you need
  • Version consistency: Ensure everyone uses the same data versions
  • Scale: Store terabytes of data without Git performance issues

How Remote Storage Works

The remote storage system is implemented in dvc/data_cloud.py. When you run commands like dvc push or dvc pull, DVC transfers data between your local cache and remote storage.

Architecture Overview

┌─────────────────┐
│  Your Workspace │  (working files)
│   data/         │
└────────┬────────┘
         │ dvc add/checkout

┌─────────────────┐
│  Local Cache    │  (.dvc/cache)
│  Content-based  │
│  storage        │
└────────┬────────┘
         │ dvc push/pull

┌─────────────────┐
│ Remote Storage  │  (S3, GCS, Azure, etc.)
│  Team's shared  │
│  cache          │
└─────────────────┘
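The cache layer in the middle is content-addressed: a file's location under .dvc/cache is derived from the hash of its contents. A minimal sketch of that mapping, assuming DVC's current default layout of files/md5/<first two hex chars>/<remaining chars> (cache_path_for is a hypothetical helper, not DVC API):

```python
import hashlib

def cache_path_for(data: bytes, cache_dir: str = ".dvc/cache") -> str:
    # Hypothetical helper, not DVC API: derive a content-addressed cache
    # path from the md5 digest (files/md5/<first 2 chars>/<rest>).
    digest = hashlib.md5(data).hexdigest()
    return f"{cache_dir}/files/md5/{digest[:2]}/{digest[2:]}"

path = cache_path_for(b"hello world")
```

Because the path depends only on content, two identical files always resolve to the same cache entry, which is what makes push/pull deduplication possible.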

Supported Storage Types

DVC supports many storage backends:

  • Amazon S3: AWS S3 buckets
  • Google Cloud: GCS buckets
  • Azure Blob: Azure Blob Storage containers
  • SSH/SFTP: remote servers
  • HDFS: Hadoop filesystem
  • HTTP/HTTPS: web servers
  • Local/NFS: local or network drives
  • WebDAV: WebDAV servers
  • OSS: Alibaba Cloud OSS

Setting Up a Remote

Add a remote storage location:
# Amazon S3
dvc remote add -d myremote s3://mybucket/dvc-storage

# Google Cloud Storage
dvc remote add -d myremote gs://mybucket/dvc-storage

# Azure Blob Storage
dvc remote add -d myremote azure://mycontainer/path

# SSH
dvc remote add -d myremote ssh://user@example.com/path/to/storage

# Local or network drive
dvc remote add -d myremote /mnt/shared-storage
The -d flag sets it as the default remote.
Remote configurations are stored in .dvc/config (project) or .dvc/config.local (user-specific).

The DataCloud Class

Remote operations are managed by the DataCloud class in dvc/data_cloud.py:67-125:
from typing import Optional

class DataCloud:
    """Class that manages dvc remotes.
    
    Args:
        repo (dvc.repo.Repo): repo instance that belongs to the repo that
            we are working on.
    
    Raises:
        config.ConfigError: thrown when config has invalid format.
    """
    
    def __init__(self, repo):
        self.repo = repo
    
    def get_remote(
        self,
        name: Optional[str] = None,
        command: str = "<command>",
    ) -> "Remote":
        if not name:
            name = self.repo.config["core"].get("remote")
        
        if name:
            from dvc.fs import get_cloud_fs
            
            cls, config, fs_path = get_cloud_fs(self.repo.config, name=name)
            # ... create and return Remote instance
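The name fallback in get_remote (an explicit name wins; otherwise the core.remote default from config is used) can be sketched as follows. resolve_remote_name is an illustrative stand-in, not the real method:

```python
from typing import Optional

def resolve_remote_name(config: dict, name: Optional[str] = None) -> Optional[str]:
    # Illustrative stand-in: mirror get_remote's fallback to core.remote.
    return name or config.get("core", {}).get("remote")

config = {"core": {"remote": "myremote"}}
default = resolve_remote_name(config)        # falls back to core.remote
explicit = resolve_remote_name(config, "backup")  # explicit name wins
```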

Remote Class

Each remote is represented by a Remote object from dvc/data_cloud.py:21-50:
from functools import cached_property

class Remote:
    def __init__(self, name: str, path: str, fs: "FileSystem", *, index=None, **config):
        self.path = path
        self.fs = fs
        self.name = name
        self.index = index
        
        self.worktree: bool = config.pop("worktree", False)
        self.config = config
    
    @cached_property
    def odb(self) -> "HashFileDB":
        from dvc.cachemgr import CacheManager
        from dvc_data.hashfile.db import get_odb
        from dvc_data.hashfile.hash import DEFAULT_ALGORITHM
        
        path = self.path
        if self.worktree:
            path = self.fs.join(path, ".dvc", CacheManager.FILES_DIR, DEFAULT_ALGORITHM)
        else:
            path = self.fs.join(path, CacheManager.FILES_DIR, DEFAULT_ALGORITHM)
        return get_odb(self.fs, path, hash_name=DEFAULT_ALGORITHM, **self.config)
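The two branches above differ only in the .dvc prefix used for worktree remotes. A minimal sketch of the join logic (odb_path is a hypothetical helper; files_dir assumes CacheManager.FILES_DIR is "files"):

```python
import posixpath

def odb_path(remote_path: str, worktree: bool,
             files_dir: str = "files", algo: str = "md5") -> str:
    # Hypothetical helper mirroring Remote.odb's path construction;
    # files_dir assumes CacheManager.FILES_DIR == "files".
    if worktree:
        return posixpath.join(remote_path, ".dvc", files_dir, algo)
    return posixpath.join(remote_path, files_dir, algo)

plain = odb_path("s3://bucket/store", worktree=False)
nested = odb_path("s3://bucket/store", worktree=True)
```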

Pushing Data

Upload tracked files to remote storage:
# Push all tracked data
dvc push

# Push specific files
dvc push data/train.csv.dvc

# Push to specific remote
dvc push -r myremote

# Push with multiple parallel jobs
dvc push -j 8
The push implementation in dvc/data_cloud.py:168-198:
def push(
    self,
    objs: Iterable["HashInfo"],
    jobs: Optional[int] = None,
    remote: Optional[str] = None,
    odb: Optional["HashFileDB"] = None,
) -> "TransferResult":
    """Push data items in a cloud-agnostic way.
    
    Args:
        objs: objects to push to the cloud.
        jobs: number of jobs that can be running simultaneously.
        remote: optional name of remote to push to.
            By default remote from core.remote config option is used.
        odb: optional ODB to push to. Overrides remote.
    """
    if odb is not None:
        return self._push(objs, jobs=jobs, odb=odb)
    legacy_objs, default_objs = _split_legacy_hash_infos(objs)
    result = TransferResult(set(), set())
    if legacy_objs:
        odb = self.get_remote_odb(remote, "push", hash_name="md5-dos2unix")
        t, f = self._push(legacy_objs, jobs=jobs, odb=odb)
        result.transferred.update(t)
        result.failed.update(f)
    if default_objs:
        odb = self.get_remote_odb(remote, "push")
        t, f = self._push(default_objs, jobs=jobs, odb=odb)
        result.transferred.update(t)
        result.failed.update(f)
    return result
Use dvc push -j 16 to speed up uploads with parallel transfers. Adjust based on your network and storage.

Pulling Data

Download tracked files from remote storage:
# Pull all tracked data
dvc pull

# Pull specific files
dvc pull data/train.csv.dvc

# Pull from specific remote
dvc pull -r myremote

# Pull with multiple parallel jobs
dvc pull -j 8
DVC only downloads files that:
  • Are missing from your local cache
  • Have checksums different from what’s in cache
  • Are required by your current .dvc files
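That selection amounts to a set difference between the hashes the current .dvc files require and the hashes already in the local cache. A simplified sketch (objects_to_pull is illustrative, not DVC API):

```python
def objects_to_pull(required: dict[str, str], local_cache: set[str]) -> dict[str, str]:
    # Illustrative: of the hashes required by the current .dvc files,
    # keep only those whose hash is missing from the local cache.
    return {path: h for path, h in required.items() if h not in local_cache}

missing = objects_to_pull(
    {"data/train.csv": "aaa111", "model.pkl": "bbb222"},
    local_cache={"aaa111"},
)
# only model.pkl (hash bbb222) still needs downloading
```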

Checking Status

See what would be pushed/pulled:
# Check status against default remote
dvc status -c

# Check against specific remote
dvc status -r myremote -c
Output shows:
  • Files that would be pushed
  • Files that would be pulled
  • Files not in cache

Remote Configuration

Configure remote settings in .dvc/config:
[core]
    remote = myremote

['remote "myremote"']
    url = s3://mybucket/dvc-storage
    region = us-west-2
    profile = myprofile
Or use commands:
# Set remote-specific options
dvc remote modify myremote region us-west-2
dvc remote modify myremote profile myprofile

# For S3
dvc remote modify myremote access_key_id YOUR_KEY
dvc remote modify myremote secret_access_key YOUR_SECRET
Security: Never commit credentials to Git. Use .dvc/config.local for sensitive settings or environment variables.

Authentication

AWS S3

# Use AWS credentials file
dvc remote modify myremote profile myprofile

# Use environment variables
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."

# Use IAM role (on EC2)
# No configuration needed

Google Cloud Storage

# Use service account
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"

# Or configure explicitly
dvc remote modify myremote credentialpath path/to/credentials.json

Azure Blob Storage

# Use connection string
dvc remote modify myremote connection_string "..."

# Or use account name and key
dvc remote modify myremote account_name myaccount
dvc remote modify myremote account_key "..."

SSH

# Use SSH key
dvc remote modify myremote keyfile ~/.ssh/id_rsa

# Use password (not recommended)
dvc remote modify myremote password mypassword

# Prompt for the password interactively instead of storing it
dvc remote modify myremote ask_password true

Advanced Features

Version-Aware Remotes

For cloud storage with versioning (S3, GCS, Azure):
dvc remote modify myremote version_aware true
This enables:
  • Tracking specific object versions
  • Time travel to previous data states
  • Protection against accidental overwrites

Worktree Remotes

Store data alongside another DVC repository:
dvc remote add -d shared /mnt/shared-dvc-repo
dvc remote modify shared worktree true
This treats the remote as a full DVC workspace, not just a cache.

Read-Only Remotes

Prevent accidental pushes:
dvc remote modify myremote read_only true

Custom Storage Paths

Organize remote storage:
# Store by branch
dvc remote modify myremote url s3://bucket/${DVC_EXP_NAME}

# Store by user
dvc remote modify myremote url s3://bucket/${USER}

Transfer Optimization

Parallel Jobs

# Use more parallel transfers
dvc push -j 16
dvc pull -j 16

Partial Downloads

Only download what you need:
# Pull specific pipeline stage
dvc repro --pull train

# Pull only the run cache
dvc pull --run-cache

Retry Configuration

# Increase retry attempts for unreliable connections
dvc remote modify myremote retry_count 10

# Increase timeout
dvc remote modify myremote timeout 300

Hash Algorithm Handling

DVC handles different hash algorithms for different storage types. From dvc/data_cloud.py:52-64:
def _split_legacy_hash_infos(
    hash_infos: Iterable["HashInfo"],
) -> tuple[set["HashInfo"], set["HashInfo"]]:
    from dvc.cachemgr import LEGACY_HASH_NAMES
    
    legacy = set()
    default = set()
    for hi in hash_infos:
        if hi.name in LEGACY_HASH_NAMES:
            legacy.add(hi)
        else:
            default.add(hi)
    return legacy, default
This ensures compatibility between DVC versions and storage types:
  • Local: Uses md5 or md5-dos2unix
  • S3/GCS: Can use etag for efficiency
  • HDFS: Uses native checksum
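A small demo of that partitioning, using a namedtuple stand-in for dvc_data's HashInfo (the LEGACY_HASH_NAMES set shown here is assumed for illustration; the real set lives in dvc.cachemgr):

```python
from collections import namedtuple

HashInfo = namedtuple("HashInfo", ["name", "value"])  # stand-in for dvc_data's HashInfo
LEGACY_HASH_NAMES = {"md5-dos2unix"}  # assumed subset; the real set is in dvc.cachemgr

def split_legacy(hash_infos):
    # Same partitioning as _split_legacy_hash_infos above.
    legacy, default = set(), set()
    for hi in hash_infos:
        (legacy if hi.name in LEGACY_HASH_NAMES else default).add(hi)
    return legacy, default

legacy, default = split_legacy(
    [HashInfo("md5", "a1"), HashInfo("md5-dos2unix", "b2")]
)
```

Each group is then pushed or pulled against an ODB configured with the matching hash name, which is why push calls get_remote_odb twice.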

Data Transfer Flow

When you run dvc push, the data flow is:
  1. Collect objects: Gather all tracked files needing upload
  2. Check remote: Query which files already exist remotely
  3. Transfer: Upload missing files using storage backend
  4. Verify: Confirm successful uploads
From dvc/data_cloud.py:157-166:
def transfer(
    self,
    src_odb: "HashFileDB",
    dest_odb: "HashFileDB",
    objs: Iterable["HashInfo"],
    **kwargs,
) -> "TransferResult":
    from dvc_data.hashfile.transfer import transfer
    
    return transfer(src_odb, dest_odb, objs, **kwargs)
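Modeling the remote as a dict keyed by content hash, the four-step flow above can be sketched as follows (push_missing is illustrative, not DVC API):

```python
def push_missing(local_objects: dict[str, bytes], remote: dict[str, bytes]) -> set[str]:
    # Illustrative sketch of the push flow against an in-memory "remote".
    missing = set(local_objects) - set(remote)  # step 2: check what exists remotely
    for h in missing:                           # step 3: transfer missing objects
        remote[h] = local_objects[h]
    assert set(local_objects) <= set(remote)    # step 4: verify
    return missing                              # transferred hashes

local = {"a1": b"data-a", "b2": b"data-b"}
remote: dict[str, bytes] = {}
first = push_missing(local, remote)   # uploads both objects
second = push_missing(local, remote)  # nothing left to upload
```

The existence check before transfer is what makes repeated pushes cheap: objects already present remotely are never re-uploaded.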

Multiple Remotes

You can configure multiple remotes for different purposes:
# Default remote for team
dvc remote add -d team s3://team-bucket/dvc-storage

# Personal backup
dvc remote add backup gs://my-personal-bucket/backup

# Local cache for quick access
dvc remote add local /mnt/fast-storage

# Push to all remotes
dvc push -r team
dvc push -r backup

Storage Costs and Optimization

Deduplication

DVC’s content-addressable storage means identical files are stored once, even across projects

Compression

Some storage backends support transparent compression (configure per-remote)

Lifecycle Policies

Use cloud provider features to archive or delete old data automatically

Regional Storage

Store data in regions close to compute for faster access

Troubleshooting

Connection Issues

# Test remote connectivity
dvc remote list
dvc status -c -r myremote

# Increase verbosity
dvc push -v
dvc pull -vv

Permission Errors

# Verify credentials
aws s3 ls s3://mybucket/  # For S3
gsutil ls gs://mybucket/  # For GCS

# Check DVC configuration
dvc config remote.myremote.url

Large File Performance

# Use more parallel jobs
dvc push -j 32

# Skip checksum verification (faster but risky)
dvc push --no-verify

Related Commands

  • dvc remote - Manage remote storage configurations
  • dvc push - Upload data to remote storage
  • dvc pull - Download data from remote storage
  • dvc fetch - Download to cache without checking out
  • dvc status - Check data status vs remote

Next Steps

Data Versioning

Understand how data is tracked and versioned locally

Experiments

Share experiment results via remote storage
