
Description

Returns the complete contents of a file tracked by DVC or Git. This is a convenience function that reads the entire file at once without requiring a context manager. For Git repositories, HEAD is used unless a rev argument is supplied. The default remote is tried unless a remote argument is supplied.
For large files, consider using dvc.api.open() instead to stream data and avoid loading the entire file into memory.

Signature

dvc.api.read(
    path: str,
    repo: Optional[str] = None,
    rev: Optional[str] = None,
    remote: Optional[str] = None,
    mode: str = "r",
    encoding: Optional[str] = None,
    config: Optional[dict[str, Any]] = None,
    remote_config: Optional[dict[str, Any]] = None,
) -> Union[str, bytes]

Parameters

path
str
required
Location and filename of the target file, relative to the root of the repository.
path="data/train.csv"
path="configs/params.yaml"
path="models/weights.pkl"
repo
str
default: None
Location of the DVC or Git repository. Defaults to the current project (found by walking up from the current working directory). Can be:
  • A URL to a Git repository (HTTP or SSH)
  • A local file system path
  • None to use the current repository
# Remote repository
repo="https://github.com/iterative/example-get-started"

# SSH URL
repo="git@github.com:user/repo.git"

# Local path
repo="/home/user/my-dvc-project"
rev
str
default: None
Any Git revision such as a branch name, tag name, commit hash, or DVC experiment name.
  • Defaults to HEAD for Git repositories
  • For local repositories, uses the working directory if not specified
  • Ignored if repo is not a Git repository
rev="main"               # Branch
rev="v2.0.0"             # Tag
rev="a3f5c2d"            # Commit hash  
rev="exp-best-model"     # Experiment
remote
str
default: None
Name of the DVC remote to use for fetching data. Defaults to the repository’s default remote. For local projects, the cache is checked before the default remote.
remote="myremote"
remote="aws-s3-storage"
mode
str
default: "r"
Mode in which to open the file. Defaults to "r" (read text mode). Only reading modes are supported:
  • "r" - Read text mode (returns str)
  • "rb" - Read binary mode (returns bytes)
mode="r"   # For text files
mode="rb"  # For binary files
encoding
str
default: None
Text encoding to use (e.g., "utf-8", "latin-1"). Only applicable in text mode (mode="r"). Mirrors the encoding parameter in Python’s built-in open().
encoding="utf-8"
encoding="iso-8859-1"
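When the encoding is not known in advance, one option (a plain-Python pattern, not part of dvc.api) is to read in binary mode and attempt decodings in order:

```python
# Stand-in bytes, representing what dvc.api.read(path, mode="rb")
# would return for a UTF-8 encoded file.
raw = b"caf\xc3\xa9"

# Try UTF-8 first, then fall back to Latin-1 (which accepts any byte).
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = raw.decode("latin-1")
```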
config
dict
default: None
DVC config dictionary to pass to the repository.
config={"cache": {"dir": "/tmp/dvc-cache"}}
remote_config
dict
default: None
Remote configuration dictionary to pass to the repository.
remote_config={"url": "s3://my-bucket/dvc-storage"}

Returns

contents
Union[str, bytes]
The complete contents of the file:
  • Returns str when mode="r" (text mode)
  • Returns bytes when mode="rb" (binary mode)
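Because the return type depends on mode, downstream code that accepts either can normalize with a small helper (a sketch; as_text is not part of dvc.api):

```python
from typing import Union

def as_text(contents: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Coerce read() output to str: decode bytes, pass str through."""
    if isinstance(contents, bytes):
        return contents.decode(encoding)
    return contents
```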

Raises

FileMissingError
exception
Raised when the specified file does not exist in the repository.
OutputNotFoundError
exception
Raised when the file is not tracked by DVC.
ValueError
exception
Raised when a non-read mode is specified.

Examples

Basic Text File Reading

import dvc.api

# Read a CSV file
data = dvc.api.read(
    'data/train.csv',
    repo='https://github.com/iterative/example-get-started'
)
print(data)

Read Configuration File

import dvc.api
import yaml

# Read YAML parameters
params_yaml = dvc.api.read('params.yaml')
params = yaml.safe_load(params_yaml)

print(f"Learning rate: {params['train']['lr']}")
print(f"Epochs: {params['train']['epochs']}")

Read JSON Metrics

import dvc.api
import json

# Read metrics from a specific branch
metrics_json = dvc.api.read(
    'metrics/accuracy.json',
    rev='experiment-branch'
)
metrics = json.loads(metrics_json)

print(f"Accuracy: {metrics['accuracy']}")
print(f"F1 Score: {metrics['f1_score']}")

Binary File Reading

import dvc.api
import pickle

# Read a pickled model (binary mode)
model_bytes = dvc.api.read(
    'models/classifier.pkl',
    mode='rb',
    rev='production'
)
model = pickle.loads(model_bytes)

predictions = model.predict(X_test)  # X_test: your evaluation features

Read from Specific Tag

import dvc.api

# Get data from a released version
data_v1 = dvc.api.read(
    'data/dataset.csv',
    repo='https://github.com/user/ml-project',
    rev='v1.0.0'
)

data_v2 = dvc.api.read(
    'data/dataset.csv',
    repo='https://github.com/user/ml-project',
    rev='v2.0.0'
)

print(f"V1 length: {len(data_v1)} characters")
print(f"V2 length: {len(data_v2)} characters")

Private Repository with SSH

import dvc.api

# Access private repository (requires SSH keys configured)
data = dvc.api.read(
    'sensitive/data.txt',
    repo='git@github.com:company/private-repo.git',
    rev='main'
)

Read with Custom Encoding

import dvc.api

# Read file with specific encoding
data = dvc.api.read(
    'data/international.txt',
    encoding='utf-16'
)
print(data)

Read NumPy Array

import dvc.api
import numpy as np
from io import BytesIO

# Read binary NumPy file
array_bytes = dvc.api.read(
    'data/features.npy',
    mode='rb'
)
array = np.load(BytesIO(array_bytes))

print(f"Shape: {array.shape}")
print(f"Dtype: {array.dtype}")

Read from Local Repository

import dvc.api

# Read from local project by path
data = dvc.api.read(
    'data/processed.csv',
    repo='/path/to/my/project'
)

Error Handling

import dvc.api
from dvc.exceptions import FileMissingError, OutputNotFoundError

try:
    data = dvc.api.read(
        'data/missing.csv',
        repo='https://github.com/user/repo'
    )
except OutputNotFoundError:
    print("File is not tracked by DVC")
except FileMissingError:
    print("File does not exist in the repository")
except Exception as e:
    print(f"Unexpected error: {e}")

Use Cases

Configuration Loading

Load parameters, configs, or metadata files for experiments.

Small Data Files

Read datasets that fit comfortably in memory.

Model Loading

Load serialized models for inference or evaluation.

Metrics Retrieval

Fetch experiment metrics for analysis and comparison.

Comparison with dvc.api.open()

read() is a convenience wrapper around open() that reads the entire file and returns its contents.
Feature     dvc.api.read()                    dvc.api.open()
Usage       Simple function call              Context manager (with statement)
Returns     Complete file contents            File object for streaming
Memory      Loads entire file                 Streams incrementally
Best for    Small files                       Large files
Code        data = dvc.api.read('file.csv')   with dvc.api.open('file.csv') as f: ...
# Using read() - Simpler for small files
data = dvc.api.read('small_config.json')
config = json.loads(data)

# Using open() - Better for large files
with dvc.api.open('large_dataset.csv') as f:
    for line in f:
        process(line)

Performance Considerations

read() loads the entire file into memory. For large files (>100MB), use dvc.api.open() to stream data instead.
# Efficient for small files
data = dvc.api.read('config.json')

Best Practices

read() is ideal for configuration files, parameters, and small datasets:
# ✅ Good - Small config file
params = yaml.safe_load(dvc.api.read('params.yaml'))

# ❌ Bad - Large dataset (use open() instead)
data = dvc.api.read('huge_dataset.csv')  # May cause memory issues
Use text mode for text files and binary mode for binary data:
# Text files
text = dvc.api.read('data.txt', mode='r')

# Binary files
data = dvc.api.read('model.pkl', mode='rb')
Remember to parse the returned string/bytes:
# JSON
json_str = dvc.api.read('data.json')
data = json.loads(json_str)

# YAML
yaml_str = dvc.api.read('config.yaml')
config = yaml.safe_load(yaml_str)

# CSV (use open() for large CSVs)
csv_str = dvc.api.read('small.csv')
lines = csv_str.splitlines()
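For structured parsing of a small CSV string, the standard-library csv module with io.StringIO avoids manual splitting (the sample content below is illustrative, standing in for a read() result):

```python
import csv
import io

# Illustrative content, representing dvc.api.read('small.csv').
csv_str = "name,score\nalice,0.91\nbob,0.87\n"

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_str)))
```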
Always catch potential exceptions:
from dvc.exceptions import FileMissingError, OutputNotFoundError

try:
    data = dvc.api.read('data.csv')
except OutputNotFoundError:
    print("Not tracked by DVC")
except FileMissingError:
    print("File not found")

open()

Stream files with context manager

get_url()

Get remote storage URL

DVCFileSystem

Low-level file system access
