Overview
The datastore consists of several layers:
- FlowDataStore: Top-level store for a flow; manages TaskDataStores and content-addressed storage
- TaskDataStore: Stores artifacts, metadata, and logs for a specific task
- ContentAddressedStore: Deduplicates and stores artifacts based on content hash
- DataStoreStorage: Backend implementation (S3, Azure Blob, Local, etc.)
The datastore is normally used indirectly, through artifact assignment in task code (self.artifact = value), rather than directly. This page documents the underlying concepts and direct APIs.
Storage Architecture
Content-Addressed Storage
Artifacts in Metaflow are stored using content addressing, which means:
- Each artifact is identified by a hash of its contents
- Identical artifacts are stored only once, regardless of how many tasks produce them
- Storage is efficient and deduplication is automatic
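A minimal sketch of the idea, using a plain dict as the blob store (illustrative only; Metaflow's real implementation also handles packing, compression, and remote backends):

```python
import hashlib

store = {}  # content hash -> blob

def save(blob: bytes) -> str:
    """Store a blob under its content hash; identical content is stored once."""
    key = hashlib.sha256(blob).hexdigest()
    if key not in store:
        store[key] = blob
    return key

# Two tasks producing identical artifacts share one stored copy.
key_a = save(b"same payload")
key_b = save(b"same payload")
assert key_a == key_b and len(store) == 1
```

Because the key is derived from the content, deduplication requires no coordination between tasks: a writer that produces an already-stored blob simply computes the same key.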
Hierarchical Metadata Storage
Metadata is stored hierarchically by pathspec: flow name, then run ID, step name, and task ID.
FlowDataStore
The FlowDataStore is the top-level datastore for a flow.
Initialization
Properties
datastore_root (str): Root path for all storage
TYPE (str): Storage backend type (e.g., "s3", "azure", "local")
ca_store (ContentAddressedStore): Content-addressed storage instance
Methods
get_task_datastores
Retrieve TaskDataStore instances for specific tasks:
- run_id: Run ID to filter tasks
- steps: List of step names to include
- pathspecs: Alternative to run_id/steps; full task pathspecs
- allow_not_done: Include tasks without a DONE marker
- attempt: Specific attempt number to retrieve
- mode: "r" for read, "w" for write
Returns a list of matching TaskDataStore instances.
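A toy sketch of the selection semantics described above (a hypothetical simplification over a hard-coded task list, not Metaflow's implementation):

```python
# Each task is identified by a pathspec: "flow/run_id/step_name/task_id".
TASKS = [
    "MyFlow/7/start/1",
    "MyFlow/7/train/2",
    "MyFlow/6/start/3",
]

def get_task_datastores(run_id=None, steps=None, pathspecs=None):
    """Select tasks either by run_id/steps or by explicit pathspecs."""
    if pathspecs is not None:
        return [t for t in TASKS if t in pathspecs]
    selected = []
    for t in TASKS:
        _flow, rid, step, _task = t.split("/")
        if run_id is not None and rid != run_id:
            continue
        if steps is not None and step not in steps:
            continue
        selected.append(t)
    return selected

print(get_task_datastores(run_id="7", steps=["train"]))  # ['MyFlow/7/train/2']
```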
TaskDataStore
The TaskDataStore handles storage for a single task's artifacts, metadata, and logs.
Initialization
Usually obtained through FlowDataStore.get_task_datastores() or FlowDataStore.get_datastore_for_task().
Modes
- Read mode ('r'): Load existing artifacts and metadata
- Write mode ('w'): Save new artifacts and metadata
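The mode distinction can be sketched as follows (a toy class, not Metaflow's implementation):

```python
class TaskDataStoreSketch:
    """Toy store that enforces read ('r') vs. write ('w') mode."""

    def __init__(self, mode):
        assert mode in ("r", "w")
        self._mode = mode
        self._data = {}

    def save_artifact(self, name, value):
        if self._mode != "w":
            raise RuntimeError("datastore opened read-only")
        self._data[name] = value

writer = TaskDataStoreSketch("w")
writer.save_artifact("x", 1)       # allowed in write mode

reader = TaskDataStoreSketch("r")
try:
    reader.save_artifact("x", 2)   # rejected in read mode
except RuntimeError as exc:
    print(exc)  # datastore opened read-only
```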
Artifact Operations
save_artifacts
Save task artifacts (called automatically by the Metaflow runtime).
load_artifacts
Load task artifacts.
Metadata Operations
save_metadata
Save task metadata.
load_metadata
Load task metadata.
Log Operations
save_log
Save task logs.
load_log
Load task logs.
Lifecycle
done
Mark the task as complete.
is_done
Check whether the task is marked as done.
Storage Backends
Metaflow supports multiple storage backends:
S3 Storage
The default for AWS deployments.
Azure Blob Storage
For Azure deployments.
Local Storage
For development and testing.
Advanced Usage
Content-Addressed Store
Direct access to content-addressed storage.
Custom Metadata
Store custom metadata for tasks.
Configuration
Datastore behavior is controlled by environment variables and configuration:
- METAFLOW_DATASTORE_SYSROOT_S3: S3 root for artifact storage
- METAFLOW_DATASTORE_SYSROOT_AZURE: Azure root for artifact storage
- METAFLOW_DATASTORE_SYSROOT_LOCAL: Local root for artifact storage
- METAFLOW_DEFAULT_DATASTORE: Default storage backend type
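The resolution logic can be sketched as follows (illustrative; Metaflow itself resolves these settings through its own configuration machinery, and the "local" fallback here is an assumption for the example):

```python
def resolve_datastore(env):
    """Resolve the backend type and its sysroot from config-style variables."""
    backend = env.get("METAFLOW_DEFAULT_DATASTORE", "local")  # assumed fallback
    sysroot = env.get("METAFLOW_DATASTORE_SYSROOT_" + backend.upper())
    return backend, sysroot

print(resolve_datastore({"METAFLOW_DEFAULT_DATASTORE": "s3",
                         "METAFLOW_DATASTORE_SYSROOT_S3": "s3://bucket/metaflow"}))
# ('s3', 's3://bucket/metaflow')
```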
Artifact Serialization
Artifacts are serialized using pickle (protocol 2 by default, protocol 4 for Python 3.6+). Artifacts that cannot be pickled raise an UnpicklableArtifactException.
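The serialization step can be illustrated with plain pickle (a sketch; Metaflow adds content hashing and compression on top):

```python
import pickle

value = {"weights": [0.1, 0.2], "epoch": 3}
blob = pickle.dumps(value, protocol=4)   # artifacts become pickled blobs
assert pickle.loads(blob) == value       # and round-trip back to objects

# Objects that pickle cannot handle fail at save time.
unpicklable = False
try:
    pickle.dumps(lambda x: x)            # lambdas are not picklable
except (pickle.PicklingError, AttributeError, TypeError):
    unpicklable = True
print(unpicklable)  # True
```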
Best Practices
- Let Metaflow manage the datastore: Use self.artifact = value instead of direct datastore calls
- Keep artifacts reasonably sized: Very large artifacts (>100GB) can be slow to serialize
- Use external storage for huge datasets: For multi-terabyte datasets, use S3 client directly
- Leverage deduplication: Identical artifacts across tasks are stored once
- Use metadata for lightweight data: Store task metrics and status in metadata, not artifacts
Error Handling
DataException
Raised for general datastore errors.
UnpicklableArtifactException
Raised when an artifact cannot be serialized.
Performance Considerations
Parallel Operations
The datastore supports parallel operations for efficiency:
- Multiple tasks can read/write simultaneously
- Content-addressed storage enables efficient deduplication
- S3 backend uses multipart uploads for large artifacts
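A sketch of parallel loads with a thread pool (illustrative only; the real backends batch and parallelize transfers internally):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Toy content-addressed blobs, keyed by content hash.
blobs = {hashlib.sha256(b).hexdigest(): b for b in (b"a", b"bb", b"ccc")}

def load(key):
    return blobs[key]

# Fetch all blobs concurrently; order follows the input keys.
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = list(pool.map(load, blobs))

print(sorted(loaded))  # [b'a', b'bb', b'ccc']
```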
Caching
Metadata can be cached for performance.
Compression
Artifacts can be compressed (gzip) for storage:
- Reduces storage costs
- May increase CPU usage during serialization
- Configured per storage backend
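Gzip's effect on a serialized artifact can be sketched as:

```python
import gzip
import pickle

artifact = list(range(1000)) * 10             # highly repetitive payload
raw = pickle.dumps(artifact, protocol=4)      # serialized blob
packed = gzip.compress(raw)                   # compressed for storage

print(len(packed) < len(raw))  # True: repetitive data compresses well
assert pickle.loads(gzip.decompress(packed)) == artifact  # lossless round-trip
```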
Related
- S3 - High-level S3 client for direct cloud storage access
- IncludeFile - Include local files as flow parameters
- Artifacts - Working with artifacts in flows
