
Description

Opens a DVC-tracked file and returns a file object for streaming. This function must be used as a context manager with the with keyword. Unlike dvc.api.read(), this function streams file contents directly from remote storage, allowing you to process data incrementally without loading the entire file into memory.

Signature

dvc.api.open(
    path: str,
    repo: Optional[str] = None,
    rev: Optional[str] = None,
    remote: Optional[str] = None,
    mode: str = "r",
    encoding: Optional[str] = None,
    config: Optional[dict[str, Any]] = None,
    remote_config: Optional[dict[str, Any]] = None,
)

Parameters

path (str, required)
Location and filename of the target file, relative to the root of the repository.
# Examples
path="data/train.csv"
path="models/model.pkl"
path="features/embeddings.npy"
repo (str, default: None)
Location of the DVC or Git repository. Defaults to the current project, found by walking up from the current working directory. Can be:
  • A URL to a Git repository (HTTP and SSH protocols are supported)
  • A local file system path
# Remote repository
repo="https://github.com/iterative/example-get-started"

# Private repository via SSH
repo="git@github.com:user/private-repo.git"

# Local repository
repo="/path/to/local/repo"
rev (str, default: None)
Git revision such as a branch name, tag name, commit hash, or DVC experiment name.
  • Defaults to HEAD for Git repositories
  • For local repositories without rev, reads from the working directory
  • Ignored if repo is not a Git repository
rev="main"              # Branch
rev="v1.0.0"            # Tag
rev="abc123"            # Commit hash
rev="exp-random-forest" # Experiment name
remote (str, default: None)
Name of the DVC remote to use. Defaults to the repository's default remote. For local projects, the cache is tried before the default remote.
remote="myremote"
remote="s3-storage"
mode (str, default: "r")
Mode in which to open the file. Defaults to "r" (text read mode). Only read modes are supported.
mode="r"   # Read text mode
mode="rb"  # Read binary mode
encoding (str, default: None)
Text encoding to use (e.g., "utf-8", "latin-1"). Only applicable in text mode. Mirrors the encoding parameter of Python's built-in open().
encoding="utf-8"
encoding="latin-1"
config (dict, default: None)
DVC config dictionary to pass to the repository.
config={"cache": {"type": "symlink"}}
remote_config (dict, default: None)
Remote configuration dictionary to pass to the repository.
remote_config={"url": "s3://mybucket/path"}
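The two dictionaries can be combined to point a single call at ad-hoc storage without editing .dvc/config. A minimal sketch that only assembles the arguments (the repository URL is hypothetical, and actually opening the file would additionally require network access and appropriate credentials):

```python
# Build call-time configuration for dvc.api.open(); the values reuse the
# option shapes shown above (cache type, remote URL).
config = {"cache": {"type": "symlink"}}
remote_config = {"url": "s3://mybucket/dvc-store"}

open_kwargs = dict(
    repo="https://github.com/user/ml-project",  # hypothetical repository
    config=config,
    remote_config=remote_config,
)

# The call would then look like:
#   with dvc.api.open("data/train.csv", **open_kwargs) as f:
#       data = f.read()
print(sorted(open_kwargs))  # ['config', 'remote_config', 'repo']
```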

Returns

file_object (_OpenContextManager)
A context manager that yields a file object. The exact type depends on the mode:
  • Text mode (mode="r"): yields a text file object
  • Binary mode (mode="rb"): yields a binary file object
The file object supports standard file operations such as read(), readline(), and iteration.
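Because the yielded object follows the standard file protocol, helpers written against ordinary file objects work unchanged. A sketch, demonstrated with an in-memory stream since the helper (head(), a name chosen here for illustration) does not care where the stream comes from:

```python
import io

def head(f, n=3):
    """Return the first n lines of an open text file object, newline-stripped."""
    lines = []
    for i, line in enumerate(f):
        if i >= n:
            break
        lines.append(line.rstrip("\n"))
    return lines

# The same helper would accept the object yielded by dvc.api.open():
#   with dvc.api.open("data/train.csv") as f:
#       print(head(f))
sample = io.StringIO("id,label\n1,cat\n2,dog\n3,bird\n")
print(head(sample, 2))  # ['id,label', '1,cat']
```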

Raises

AttributeError
Raised when the return value is used without a with statement.
ValueError
Raised when a non-read mode is specified (e.g., mode="w").
FileMissingError
Raised when the specified file does not exist in the repository.
OutputNotFoundError
Raised when the file is not tracked by DVC.

Examples

Basic File Reading

import dvc.api

with dvc.api.open(
    'data/train.csv',
    repo='https://github.com/iterative/example-get-started'
) as f:
    data = f.read()
    print(data)

Streaming Large Files

import dvc.api

# Process file line by line without loading entire file
with dvc.api.open('data/large_dataset.txt') as f:
    for line in f:
        process_line(line)

Using with Pandas

import dvc.api
import pandas as pd

with dvc.api.open(
    'data/features.csv',
    repo='https://github.com/user/ml-project'
) as f:
    df = pd.read_csv(f)
    print(df.head())

Binary File (Model Weights)

import dvc.api
import pickle

with dvc.api.open(
    'models/classifier.pkl',
    mode='rb',
    rev='v1.0.0'
) as f:
    model = pickle.load(f)
    predictions = model.predict(X_test)

XML Parsing with SAX

import dvc.api
from xml.sax import parse

# Memory-efficient streaming XML parsing
with dvc.api.open(
    'data/large_dataset.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    parse(fd, MySAXHandler())

Private Repository Access

import dvc.api

# Access private repo via SSH (requires configured SSH keys)
with dvc.api.open(
    'features.dat',
    repo='git@github.com:company/private-ml-repo.git',
    rev='production'
) as f:
    features = f.read()

Specific Git Revision

import dvc.api
import json

# Read from a specific experiment
with dvc.api.open(
    'metrics/results.json',
    rev='exp-tuned-hyperparams'
) as f:
    metrics = json.load(f)
    print(f"Accuracy: {metrics['accuracy']}")

Custom Encoding

import dvc.api

with dvc.api.open(
    'data/european_text.txt',
    encoding='latin-1'
) as f:
    content = f.read()

Using Specific Remote

import dvc.api

# Explicitly specify which remote to use
with dvc.api.open(
    'data/dataset.csv',
    remote='s3-backup',
    repo='/path/to/local/repo'
) as f:
    data = f.read()

Use Cases

Streaming Large Files

Process files larger than available RAM by reading incrementally.

Data Pipeline Integration

Load DVC-tracked datasets directly into training or processing pipelines.

Version-Specific Data

Access different versions of data from various branches or experiments.

Remote Data Access

Stream data directly from cloud storage without local downloads.
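For the streaming use cases above, binary data can be consumed in fixed-size chunks so peak memory stays bounded regardless of file size. A sketch (iter_chunks() and the chunk size are illustrative, not part of the DVC API), demonstrated with an in-memory stream:

```python
import io

def iter_chunks(f, chunk_size=1024 * 1024):
    """Yield successive fixed-size chunks from a binary file object."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk

# With DVC this would pair with binary mode, e.g.:
#   with dvc.api.open("models/model.pkl", mode="rb") as f:
#       for chunk in iter_chunks(f):
#           checksum.update(chunk)
data = io.BytesIO(b"abcdefghij")
print(list(iter_chunks(data, chunk_size=4)))  # [b'abcd', b'efgh', b'ij']
```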

Comparison with dvc.api.read()

Choose open() for large files or when you need streaming. Use read() for small files when you need the complete content immediately.
Feature   | dvc.api.open()              | dvc.api.read()
Usage     | Context manager (with)      | Simple function call
Memory    | Streams data incrementally  | Loads entire file
Best for  | Large files, streaming      | Small files, complete reads
Returns   | File object                 | File contents (str/bytes)
# open() - Memory efficient
with dvc.api.open('large_file.csv') as f:
    for line in f:  # Processes line by line
        process(line)

# read() - Simpler but loads everything
data = dvc.api.read('large_file.csv')  # Entire file in memory
for line in data.split('\n'):
    process(line)

Best Practices

The function must be used as a context manager. This ensures proper cleanup:
# ✅ Correct
with dvc.api.open('data.csv') as f:
    data = f.read()

# ❌ Wrong - Will raise AttributeError
f = dvc.api.open('data.csv')
data = f.read()
Use text mode for text files and binary mode for binary data:
# Text files
with dvc.api.open('data.txt', mode='r') as f:
    text = f.read()

# Binary files (models, images, etc.)
with dvc.api.open('model.pkl', mode='rb') as f:
    model = pickle.load(f)
For large files, process data incrementally instead of reading everything:
# Memory efficient
with dvc.api.open('huge_file.txt') as f:
    for line in f:
        process(line)  # Only one line in memory at a time
Wrap API calls in try-except blocks for robust error handling:
from dvc.exceptions import FileMissingError, OutputNotFoundError

try:
    with dvc.api.open('data.csv') as f:
        data = f.read()
except OutputNotFoundError:
    print("File not tracked by DVC")
except FileMissingError:
    print("File not found")

See Also

  • read(): Read complete file contents in one call
  • get_url(): Get the remote storage URL
  • DVCFileSystem: Low-level file system interface
