## Introduction

Streaming datasets enable efficient data loading for large-scale training by:

- Downloading data on-demand during training
- Reducing local storage requirements
- Supporting deterministic shuffling
- Enabling infinite data iteration
## Why Streaming Datasets?

- **Scalability**: Train on datasets larger than available disk space by streaming from cloud storage.
- **Efficiency**: Start training immediately without waiting for full dataset downloads.
- **Flexibility**: Shuffle deterministically across epochs with configurable seed values.
- **Cost Savings**: Reduce storage costs by caching only actively used data.
## MosaicML Streaming

MosaicML Streaming provides an efficient format for training data.

### Installation
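The library is published on PyPI as `mosaicml-streaming` (the import name is `streaming`):

```shell
pip install mosaicml-streaming
```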
### Creating a Streaming Dataset

Convert your data to MDS (Mosaic Data Shard) format (see `streaming-dataset/mock_data.py`):

- `columns`: Schema definition with field types
- `compression`: Compression algorithm (`zstd`, `gzip`, `snappy`, `none`)
- `out`: Local directory path
### Supported Data Types

| Type | Description | Example |
|---|---|---|
| `int` | Integer values | Labels, IDs |
| `str` | Text strings | Prompts, captions |
| `bytes` | Raw bytes | Custom encodings |
| `jpeg` | JPEG images | Photos |
| `png` | PNG images | Graphics |
| `pkl` | Pickle objects | Complex types |
| `json` | JSON objects | Metadata |
### Upload to Cloud Storage
### Consuming Streaming Data

Load and train from remote storage (see `streaming-dataset/mock_data.py`):
### CLI Usage

Create and consume data:

### Training Integration
Integrate with PyTorch training loops:
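The loop has the usual PyTorch shape; here a plain `IterableDataset` stands in for a `StreamingDataset` so the sketch is self-contained (all names are illustrative):

```python
# Sketch of a training loop over a streamed dataset.
import torch
from torch.utils.data import DataLoader, IterableDataset

class FakeStream(IterableDataset):
    """Stand-in for StreamingDataset: yields (features, label) pairs."""
    def __iter__(self):
        for i in range(64):
            yield torch.randn(8), torch.tensor(i % 2)

model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

loader = DataLoader(FakeStream(), batch_size=16)
for epoch in range(2):
    for x, y in loader:          # batches arrive as shards download
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```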
## Advanced Features

### Deterministic Shuffling

Control shuffle behavior across epochs. Benefits:
- Reproducible training runs
- Different shuffle per epoch
- Efficient block-level shuffling
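The idea can be sketched without the library: derive a fresh but reproducible order for each epoch from a base seed (the function name is illustrative, not the library's API):

```python
# Sketch: deterministic per-epoch shuffling from a base seed.
import random

def epoch_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    # Combine seed and epoch into one integer so each epoch gets a
    # distinct, yet fully reproducible, RNG state.
    rng = random.Random(seed * 100_003 + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order
```

Re-running with the same seed reproduces every epoch's order exactly, while successive epochs still see different orders.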
### Multi-Node Training

Shard data across workers. The loader automatically handles:
- Data sharding per worker
- Epoch boundaries
- Sample uniqueness
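The sharding guarantee can be sketched in a few lines: each rank takes a strided, non-overlapping slice of the sample indices (a simplified stand-in for what the library does internally):

```python
# Sketch: partition sample indices across distributed ranks so every
# worker sees a unique slice and no sample is duplicated.
def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    # Stride-based sharding: rank r takes r, r+world_size, r+2*world_size, ...
    return list(range(rank, num_samples, world_size))

shards = [shard_indices(10, 4, r) for r in range(4)]
```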
### Caching Strategy

Control local cache behavior. The cache is managed via LRU eviction.
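LRU eviction under a byte budget can be sketched as follows (a simplified model; the real cache tracks shard files on disk, and all names here are illustrative):

```python
# Sketch: LRU shard eviction under a fixed capacity in bytes.
from collections import OrderedDict

class ShardCache:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.shards = OrderedDict()  # shard name -> size in bytes

    def access(self, name: str, size: int) -> None:
        if name in self.shards:
            self.shards.move_to_end(name)   # mark as most recently used
            return
        self.shards[name] = size
        self.used += size
        while self.used > self.capacity:    # evict least recently used
            _, evicted_size = self.shards.popitem(last=False)
            self.used -= evicted_size
```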
### Profiling

Monitor download and processing:
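A minimal way to see where a streaming pipeline spends its time is to wrap each phase in a timer (the phase names and sleeps are stand-ins):

```python
# Sketch: accumulate per-phase wall-clock time for download vs. decode.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(phase: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = timings.get(phase, 0.0) + time.perf_counter() - start

with timed("download"):
    time.sleep(0.01)   # stand-in for a shard download
with timed("decode"):
    time.sleep(0.005)  # stand-in for sample decoding
```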
## Alternative Solutions
- TFRecord
- WebDataset
- PyTorch S3 Plugin
- DataTrove
## Best Practices

### Shard Size
- Target 50-500MB per shard
- Balance parallelism vs overhead
- Consider download time
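The first bullet's arithmetic, as a quick sanity check (the dataset size and target shard size are illustrative):

```python
# Sketch: how many shards a dataset produces at a given shard size.
import math

def num_shards(dataset_bytes: int, shard_bytes: int) -> int:
    return math.ceil(dataset_bytes / shard_bytes)

# A 1 TB dataset at 256 MB per shard -> 3726 shards.
shards = num_shards(1_000_000_000_000, 256 * 1024 * 1024)
```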
### Compression
- Use zstd for best ratio/speed
- Disable for pre-compressed data (JPEG)
- Test impact on throughput
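"Test the impact" can be as simple as timing a compression pass and checking the ratio; `zlib` stands in for `zstd` here so the sketch is stdlib-only:

```python
# Sketch: measure compression ratio and time to decide whether
# compression is worthwhile for a given payload.
import time
import zlib

payload = b"streaming sample data " * 50_000   # ~1.1 MB, highly repetitive

start = time.perf_counter()
packed = zlib.compress(payload, level=6)
elapsed = time.perf_counter() - start

ratio = len(payload) / len(packed)
# Repetitive text compresses well; pre-compressed data (JPEG) would not,
# which is why the guidance above says to disable compression for it.
```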
### Caching
- Size cache > single epoch
- Use fast local storage (SSD)
- Monitor cache hit rate
### Workers
- Match to CPU cores
- Increase for I/O-bound tasks
- Profile to find optimal count
## Resources
- MosaicML Streaming Docs
- High Performance I/O For Large Scale Deep Learning (PDF)
- Efficient PyTorch I/O Library
- Announcing CPP-based S3 IO DataPipes
## Next Steps
- Explore Vector Databases for embeddings storage
- Learn about Data Labeling workflows