Source: dvc/api/data.py:102-255
Description
Opens a DVC-tracked file and returns a file object for streaming. This function must be used as a context manager with the with keyword.
Unlike dvc.api.read(), this function streams file contents directly from remote storage, allowing you to process data incrementally without loading the entire file into memory.
Signature
Parameters
path: Location and filename of the target file, relative to the root of the repository.
repo: Location of the DVC or Git repository. Defaults to the current project (found by walking up from the current working directory). Can be:
- A URL to a Git repository (HTTP and SSH protocols are supported)
- A local file system path
rev: Git revision such as a branch name, tag name, commit hash, or DVC experiment name.
- Defaults to HEAD for Git repositories
- For local repositories without rev, reads from the working directory
- Ignored if repo is not a Git repository
remote: Name of the DVC remote to use. Defaults to the repository’s default remote. For local projects, the cache is tried before the default remote.
mode: Mode in which to open the file. Defaults to "r" (read mode). Only reading modes are supported.
encoding: Text encoding to use (e.g., "utf-8", "latin-1"). Only applicable in text mode. Mirrors the encoding parameter in Python’s built-in open().
config: DVC config dictionary to pass to the repository.
remote_config: Remote configuration dictionary to pass to the repository.
Returns
A context manager that yields a file object. The exact type depends on the mode:
- Text mode (mode="r"): Returns a text file object
- Binary mode (mode="rb"): Returns a binary file object
The file object supports read(), readline(), and iteration.
Raises
Raised when the function is used without a context manager (without a with statement).
Raised when a non-read mode is specified (e.g., mode="w").
Raised when the specified file does not exist in the repository.
Raised when the file is not tracked by DVC.
Examples
Basic File Reading
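A minimal sketch of the basic pattern: open a tracked file in text mode and read it in full. The helper name `read_tracked_text` and the path in the usage comment are hypothetical; DVC is imported inside the function so that it is only required when the function is called.

```python
def read_tracked_text(path, repo=None):
    """Return the full text of a DVC-tracked file (hypothetical helper)."""
    import dvc.api  # requires the `dvc` package

    # dvc.api.open() must be used with `with`; the file is closed on exit.
    with dvc.api.open(path, repo=repo) as f:
        return f.read()


# Usage, assuming a tracked file exists at this hypothetical path:
# text = read_tracked_text("data/notes.txt")
```

Reading everything at once is only sensible for small files; for anything large, iterate over the file object instead.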
Streaming Large Files
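One way to stream a large text file is to iterate over the file object line by line, so the whole file is never held in memory at once. The sketch below counts lines containing a substring; the helper name `count_matching_lines` and its `needle` parameter are illustrative assumptions, not part of the DVC API.

```python
def count_matching_lines(path, needle, repo=None):
    """Count lines containing `needle` without loading the whole file."""
    import dvc.api  # requires the `dvc` package

    matches = 0
    with dvc.api.open(path, repo=repo) as f:
        # Iterating the file object yields one line at a time.
        for line in f:
            if needle in line:
                matches += 1
    return matches
```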
Using with Pandas
Binary File (Model Weights)
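Binary artifacts such as model weights should be opened with mode="rb". Rather than a framework-specific loader, the sketch below uses only the standard library and computes a SHA-256 checksum in fixed-size chunks, which can be used to verify a weights file before loading it. The helper name and the 1 MiB default chunk size are assumptions chosen for illustration.

```python
import hashlib


def sha256_of_tracked_file(path, repo=None, chunk_size=1024 * 1024):
    """Hash a DVC-tracked binary file in chunks (hypothetical helper)."""
    import dvc.api  # requires the `dvc` package

    digest = hashlib.sha256()
    with dvc.api.open(path, repo=repo, mode="rb") as f:
        # read(n) returns b"" at end of file, which ends the loop.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```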
XML Parsing with SAX
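Because the yielded object is file-like, it can be handed to an event-driven parser such as the standard library's xml.sax, which processes the document as it is read. The sketch below counts element names; `TagCounter` and `count_xml_tags` are hypothetical names, and binary mode is used here on the assumption that the parser should handle the encoding declared in the XML itself.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_xml_tags(path, repo=None):
    import dvc.api  # requires the `dvc` package

    handler = TagCounter()
    with dvc.api.open(path, repo=repo, mode="rb") as f:
        # xml.sax.parse accepts a file-like object and reads it incrementally.
        xml.sax.parse(f, handler)
    return handler.counts
```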
Private Repository Access
Specific Git Revision
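The rev parameter selects which version of the file to open. As a sketch, the helper below reads the same path at several revisions so the versions can be compared; the name `read_at_revisions` is hypothetical, and the revision names you pass must exist in your repository.

```python
def read_at_revisions(path, revs, repo=None):
    """Map each revision name to the file's contents at that revision."""
    import dvc.api  # requires the `dvc` package

    contents = {}
    for rev in revs:
        # `rev` may be a branch, tag, commit hash, or experiment name.
        with dvc.api.open(path, repo=repo, rev=rev) as f:
            contents[rev] = f.read()
    return contents
```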
Custom Encoding
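For text files that are not stored in the platform's default encoding, pass encoding explicitly, as you would with the built-in open(). A minimal sketch, with a hypothetical helper name:

```python
def read_with_encoding(path, encoding, repo=None):
    """Read a DVC-tracked text file using an explicit encoding."""
    import dvc.api  # requires the `dvc` package

    # `encoding` only applies in text mode, so mode="r" is spelled out.
    with dvc.api.open(path, repo=repo, mode="r", encoding=encoding) as f:
        return f.read()
```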
Using Specific Remote
Use Cases
Streaming Large Files
Process files larger than available RAM by reading incrementally.
Data Pipeline Integration
Load DVC-tracked datasets directly into training or processing pipelines.
Version-Specific Data
Access different versions of data from various branches or experiments.
Remote Data Access
Stream data directly from cloud storage without local downloads.
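As an illustration of the data pipeline use case, the sketch below wraps a tracked CSV file in a generator that yields one row at a time, so a downstream step can consume rows lazily. It uses the standard library's csv module; the helper name `iter_tracked_rows` is an assumption, not a DVC function.

```python
import csv


def iter_tracked_rows(path, repo=None):
    """Yield each CSV row of a DVC-tracked file as a dict."""
    import dvc.api  # requires the `dvc` package

    # The file stays open while the generator is being consumed
    # and is closed once iteration finishes.
    with dvc.api.open(path, repo=repo) as f:
        yield from csv.DictReader(f)
```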
Comparison with dvc.api.read()
| Feature | dvc.api.open() | dvc.api.read() |
|---|---|---|
| Usage | Context manager (with statement) | Simple function call |
| Memory | Streams data incrementally | Loads entire file |
| Best for | Large files, streaming | Small files, complete reads |
| Returns | File object | File contents (str/bytes) |
Best Practices
Always use with statement
The function must be used as a context manager. This ensures proper cleanup:
Choose appropriate mode
Use text mode for text files and binary mode for binary data:
Stream large files
For large files, process data incrementally instead of reading everything:
Handle exceptions
Wrap API calls in try-except blocks for robust error handling:
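A minimal sketch of such a wrapper follows. It assumes that an unsupported mode surfaces as a ValueError and that DVC-specific failures derive from dvc.exceptions.DvcException; check the exception types against your installed DVC version. The helper name `read_or_none` is hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def read_or_none(path, repo=None, mode="r"):
    """Return the file contents, or None if the file cannot be opened."""
    import dvc.api  # requires the `dvc` package
    from dvc.exceptions import DvcException

    try:
        with dvc.api.open(path, repo=repo, mode=mode) as f:
            return f.read()
    except ValueError as exc:
        # Assumed to be raised for non-read modes such as "w".
        logger.error("Unsupported mode %r: %s", mode, exc)
    except DvcException as exc:
        # Assumed base class for DVC errors (missing or untracked files).
        logger.error("DVC could not open %s: %s", path, exc)
    return None
```

Returning None keeps the caller simple; if the caller needs to distinguish failure causes, re-raising or returning the exception may be a better fit.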
Related Functions
read()
Read complete file contents in one call
get_url()
Get the remote storage URL
DVCFileSystem
Low-level file system interface