Remote storage in DVC is where the actual data files, models, and artifacts tracked by DVC are stored, separate from Git. While Git repositories contain .dvc files (metadata pointers), remote storage holds the real data. This lets teams share large files without bloating their Git repositories.
Key Concept: Remote storage acts like a centralized cache. Team members push/pull data to/from remotes, similar to how Git push/pull works for code.
The remote storage system is implemented in dvc/data_cloud.py. When you run commands like dvc push or dvc pull, DVC transfers data between your local cache and remote storage.
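As a concrete starting point, a minimal push/pull workflow looks like this (the remote name and bucket URL below are placeholders, not values from the source):

```bash
# Configure a default remote (URL is a placeholder)
dvc remote add -d storage s3://example-bucket/dvc-store

# Upload data from the local cache to the remote
dvc push

# On another machine: download data from the remote into the local cache
dvc pull
```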
Remote operations are managed by the DataCloud class in dvc/data_cloud.py:67-125:
```python
class DataCloud:
    """Class that manages dvc remotes.

    Args:
        repo (dvc.repo.Repo): repo instance that belongs to the
            repo that we are working on.

    Raises:
        config.ConfigError: thrown when config has invalid format.
    """

    def __init__(self, repo):
        self.repo = repo

    def get_remote(
        self,
        name: Optional[str] = None,
        command: str = "<command>",
    ) -> "Remote":
        if not name:
            name = self.repo.config["core"].get("remote")

        if name:
            from dvc.fs import get_cloud_fs

            cls, config, fs_path = get_cloud_fs(self.repo.config, name=name)
            # ... create and return Remote instance
```
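The `core.remote` config option that `get_remote` falls back to is the remote marked as default from the CLI. For example (remote name and URL are illustrative):

```bash
# Add a remote and mark it as the default in one step
dvc remote add -d myremote s3://example-bucket/path

# Or set core.remote for an existing remote
dvc remote default myremote

# The resulting .dvc/config contains something like:
# [core]
#     remote = myremote
# ['remote "myremote"']
#     url = s3://example-bucket/path
```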
```bash
# Push all tracked data
dvc push

# Push specific files
dvc push data/train.csv.dvc

# Push to specific remote
dvc push -r myremote

# Push with multiple parallel jobs
dvc push -j 8
```
The push implementation in dvc/data_cloud.py:168-198:
```python
def push(
    self,
    objs: Iterable["HashInfo"],
    jobs: Optional[int] = None,
    remote: Optional[str] = None,
    odb: Optional["HashFileDB"] = None,
) -> "TransferResult":
    """Push data items in a cloud-agnostic way.

    Args:
        objs: objects to push to the cloud.
        jobs: number of jobs that can be running simultaneously.
        remote: optional name of remote to push to.
            By default remote from core.remote config option is used.
        odb: optional ODB to push to. Overrides remote.
    """
    if odb is not None:
        return self._push(objs, jobs=jobs, odb=odb)

    legacy_objs, default_objs = _split_legacy_hash_infos(objs)
    result = TransferResult(set(), set())

    if legacy_objs:
        odb = self.get_remote_odb(remote, "push", hash_name="md5-dos2unix")
        t, f = self._push(legacy_objs, jobs=jobs, odb=odb)
        result.transferred.update(t)
        result.failed.update(f)

    if default_objs:
        odb = self.get_remote_odb(remote, "push")
        t, f = self._push(default_objs, jobs=jobs, odb=odb)
        result.transferred.update(t)
        result.failed.update(f)

    return result
```
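`_split_legacy_hash_infos` itself is not shown in the excerpt. Judging from the hash names used above, it partitions objects by hash type so that legacy `md5-dos2unix` hashes and current `md5` hashes each go to an ODB configured with the matching hash. A simplified, self-contained sketch (this `HashInfo` stand-in is illustrative, not DVC's actual class):

```python
from typing import Iterable, NamedTuple, Set, Tuple


class HashInfo(NamedTuple):
    """Illustrative stand-in for DVC's HashInfo: a hash name plus its value."""
    name: str
    value: str


def split_legacy_hash_infos(
    objs: Iterable[HashInfo],
) -> Tuple[Set[HashInfo], Set[HashInfo]]:
    """Separate legacy "md5-dos2unix" hashes from default ones so each
    group can be pushed to a remote ODB using the right hash name."""
    legacy: Set[HashInfo] = set()
    default: Set[HashInfo] = set()
    for obj in objs:
        if obj.name == "md5-dos2unix":
            legacy.add(obj)
        else:
            default.add(obj)
    return legacy, default


objs = [
    HashInfo("md5-dos2unix", "aaa"),  # legacy-style hash
    HashInfo("md5", "bbb"),           # current-style hash
]
legacy, default = split_legacy_hash_infos(objs)
```

This mirrors why `push` makes up to two `_push` calls: each group needs a remote ODB created with a different `hash_name`.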
Use dvc push -j 16 to speed up uploads with parallel transfers. Adjust based on your network and storage.
```bash
# Use AWS credentials file
dvc remote modify myremote profile myprofile

# Use environment variables
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."

# Use IAM role (on EC2)
# No configuration needed
```
```bash
# Use service account
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"

# Or configure explicitly
dvc remote modify myremote credentialpath path/to/credentials.json
```
```bash
# Use connection string
dvc remote modify myremote connection_string "..."

# Or use account name and key
dvc remote modify myremote account_name myaccount
dvc remote modify myremote account_key "..."
```
You can configure multiple remotes for different purposes:
```bash
# Default remote for team
dvc remote add -d team s3://team-bucket/dvc-storage

# Personal backup
dvc remote add backup gs://my-personal-bucket/backup

# Local cache for quick access
dvc remote add local /mnt/fast-storage

# Push to each remote (dvc push targets one remote per invocation)
dvc push -r team
dvc push -r backup
```