Why Data Management is Critical
Data is the foundation of every ML system. Poor data management leads to:
- Unreproducible experiments: “Which dataset version produced this model?”
- Slow training: Loading data becomes the bottleneck
- Data drift: Production data diverges from training data
- Collaboration issues: Team members can’t share datasets easily
Storage Solutions
Object Storage (S3-Compatible)
Object storage is the backbone of ML data pipelines:
- S3 (AWS), GCS (Google), Azure Blob: Cloud-native options
- MinIO: Self-hosted S3-compatible storage (great for on-prem or local development)
Object storage is cheaper per gigabyte than block storage and scales far beyond any single disk, making it ideal for large datasets, model checkpoints, and artifacts.
Why MinIO for Development?
MinIO lets you run an S3-compatible server locally, so code written against the S3 API works unchanged in development.
Data Formats
Choosing the right format dramatically affects performance:
CSV
Human-readable, but slow and large
Parquet
Columnar, compressed, 10-100x faster than CSV
Arrow/Feather
Zero-copy reads, excellent for in-memory operations
| Format | Read Time | File Size |
|---|---|---|
| CSV | 12.64s | 100 MB |
| Parquet | 0.85s | 25 MB |
| Feather | 0.42s | 30 MB |
For ML pipelines, use Parquet for storage and Feather for intermediate processing. Avoid CSV except for human inspection.
Data Versioning
DVC (Data Version Control)
DVC tracks large files in Git without storing them: each file is replaced by a small .dvc pointer file containing a content hash, which Git tracks, while the actual data lives in S3/MinIO.
DVC also supports pipeline tracking, letting you version transformations and derived datasets alongside the code that created them.
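After running `dvc add` on a data file, DVC writes a small pointer file for Git to track; it looks roughly like this (the hash, size, and path here are hypothetical):

```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 104857600
  path: data/train.parquet
```

Checking out an old Git commit restores the matching pointer, and `dvc pull` fetches the exact bytes that hash refers to from the remote.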
Streaming Datasets
For datasets too large to fit in memory, streaming is essential:
- Hugging Face Datasets: Streaming + caching + memory mapping
- MosaicML Streaming: Optimized for multi-GPU training
- WebDataset: Tar-based streaming format
- TensorFlow tf.data / PyTorch TorchData: Native streaming APIs
Streaming prevents OOM errors and reduces startup time. Use it for datasets > 50GB or when training on multiple nodes.
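The libraries above differ in details, but the core pattern is the same: yield samples lazily instead of materializing the whole dataset, then group the stream into batches. A stdlib-only sketch of that pattern (the file format and batch size are illustrative):

```python
import csv
import io
from itertools import islice
from typing import Iterator

def stream_rows(path: str) -> Iterator[dict]:
    """Yield one record at a time; memory use is independent of file size."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def batched(it, batch_size: int):
    """Group any iterator into fixed-size batches for a training loop."""
    it = iter(it)
    while batch := list(islice(it, batch_size)):
        yield batch

# Demo on an in-memory "file" so the sketch is self-contained.
demo = io.StringIO("label,text\n1,good\n0,bad\n1,fine\n")
rows = csv.DictReader(demo)
batches = list(batched(rows, 2))
print(len(batches))  # 2 batches: two rows, then the final row
```

Production streaming libraries add sharding, shuffling buffers, and prefetching on top of this same lazy-iteration core.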
Efficient Batch Processing
When running inference on large datasets, parallelization is key. Module 2 Benchmark (10M rows):

| Method | Time |
|---|---|
| Single worker | 12.64s |
| ThreadPoolExecutor (16 workers) | 0.85s |
| ProcessPoolExecutor (16 workers) | 4.03s |
| Ray | 2.19s |
Threads work best for I/O-bound tasks (API calls, disk reads). Processes are better for CPU-bound tasks (heavy computation). Ray adds distributed scheduling overhead but scales across machines.
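A stdlib sketch of why threads win for I/O-bound work (the `fetch_prediction` function is a toy stand-in for an API call; for CPU-bound work you would swap in `ProcessPoolExecutor`):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_prediction(i: int) -> int:
    """Stand-in for an I/O-bound call (API request, disk read): sleeps 50 ms."""
    time.sleep(0.05)
    return i * 2

# Serial baseline: 32 calls x 50 ms each.
serial_start = time.perf_counter()
serial = [fetch_prediction(i) for i in range(32)]
serial_time = time.perf_counter() - serial_start

# Threads overlap the waits, since sleeping (like real I/O) releases the GIL.
pool_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    parallel = list(pool.map(fetch_prediction, range(32)))
pool_time = time.perf_counter() - pool_start

print(f"serial {serial_time:.2f}s vs 16 threads {pool_time:.2f}s")
```

For pure-Python computation the GIL prevents this speedup, which is why the benchmark above shows processes beating threads only on CPU-bound workloads.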
Vector Databases
For embedding-based retrieval (RAG, semantic search):
LanceDB
Disk-based, handles billions of vectors, open-source
Chroma
In-memory, easy setup, good for prototypes
Qdrant
Production-ready, supports filtering and hybrid search
Pinecone / Weaviate
Managed services with SLA and support
Data Labeling
Argilla
Argilla is an open-source platform for labeling and curating datasets:
- Human-in-the-loop labeling workflows
- Active learning integration
- Feedback collection for LLMs
- Export to Hugging Face Datasets
For production labeling, consider managed services like Scale AI, Labelbox, or Snorkel Flow. Argilla is excellent for team collaboration and rapid iteration.
Hands-On Examples
Explore data management in Module 2:
- Deploy MinIO on Kubernetes
- Benchmark CSV vs Parquet vs Feather
- Set up DVC with S3 remote
- Build streaming dataloaders
- Create vector databases with LanceDB
- Set up Argilla for labeling
Next Steps
Training Workflows
Use versioned data in training pipelines
Monitoring
Detect data drift in production