Overview
This module includes two practice sections with multiple deliverables. Complete all tasks to demonstrate mastery of data management for ML production systems.
H3: Data Storage & Processing
Learning Goals
- Deploy MinIO with multiple configuration options
- Implement and test Python storage clients
- Benchmark data format performance
- Optimize inference with parallel processing
- Create streaming datasets
- Build vector databases for RAG
Reading List
Essential Reading
Advanced Reading
Deep Dives
Tasks
PR1: MinIO Deployment
Write comprehensive README instructions for deploying MinIO.
Requirements:
- Local installation steps
- Docker deployment with docker run
- Kubernetes deployment with manifests
- Port forwarding instructions
- Access credentials and UI setup
minio_storage/README.md
Reference: See Storage documentation
PR2: MinIO Python Client
Develop a CRUD client with comprehensive tests.
Reference: See Python Client implementation
Requirements:
- Implement both native MinIO and S3FS clients
- Create, read, update, delete operations
- Environment-based configuration
- Pytest fixtures for testing
- Test upload/download functionality
Files:
minio_storage/minio_client.py
minio_storage/test_minio_client.py
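The environment-based configuration requirement can be sketched with the standard library alone. The `MinioSettings` name, the environment variable names, and the defaults below are illustrative assumptions; the actual client object would be built from these values with the `minio` or `s3fs` package:

```python
import os
from dataclasses import dataclass


@dataclass
class MinioSettings:
    """Connection settings read from the environment (names are illustrative)."""
    endpoint: str
    access_key: str
    secret_key: str
    secure: bool


def settings_from_env(env=os.environ) -> MinioSettings:
    # Fall back to typical local-dev defaults when a variable is unset.
    return MinioSettings(
        endpoint=env.get("MINIO_ENDPOINT", "localhost:9000"),
        access_key=env.get("MINIO_ACCESS_KEY", "minioadmin"),
        secret_key=env.get("MINIO_SECRET_KEY", "minioadmin"),
        secure=env.get("MINIO_SECURE", "false").lower() == "true",
    )
```

Injecting the environment as a parameter keeps the function trivially testable with pytest fixtures, which is one way to satisfy the testing requirement above.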
PR3: Data Format Benchmarks
Benchmark Pandas storage formats.
Requirements:
- Test CSV, Parquet, Feather, HDF5
- Measure save time, load time, file size
- Create visualization of results
- Document findings in README
- Recommend format for different use cases
Metrics to record:
- Write time (seconds)
- Read time (seconds)
- File size (MB)
- Compression ratio
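A generic harness for these metrics can be sketched with the standard library; `benchmark_format` and the JSON/pickle writers below are stand-in assumptions, and in the assignment the callables would wrap `df.to_csv`/`pd.read_csv`, `df.to_parquet`/`pd.read_parquet`, and so on:

```python
import json
import os
import pickle
import tempfile
import time


def benchmark_format(name, save, load, payload):
    """Time one save/load round trip and report file size.

    `save(path, payload)` and `load(path)` are callables supplied per format.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        t0 = time.perf_counter()
        save(path, payload)
        write_s = time.perf_counter() - t0
        size_mb = os.path.getsize(path) / 1e6
        t0 = time.perf_counter()
        restored = load(path)
        read_s = time.perf_counter() - t0
    return {"format": name, "write_s": write_s, "read_s": read_s,
            "size_mb": size_mb, "roundtrip_ok": restored == payload}


# Toy payload and two stdlib formats, just to exercise the harness.
rows = [{"id": i, "value": i * 0.5} for i in range(1000)]
results = [
    benchmark_format("json",
                     lambda p, d: open(p, "w").write(json.dumps(d)),
                     lambda p: json.load(open(p)), rows),
    benchmark_format("pickle",
                     lambda p, d: pickle.dump(d, open(p, "wb")),
                     lambda p: pickle.load(open(p, "rb")), rows),
]
```

Collecting one dict per format makes it straightforward to build the comparison table and visualization the requirements ask for.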
processing/format_benchmark.py
Reference: See Format Comparison
PR4: Inference Benchmarks
Benchmark parallel inference performance.
Run the benchmarks and record results in the table below.
Reference: See Inference Performance
Requirements:
- Single worker baseline
- ThreadPoolExecutor implementation
- ProcessPoolExecutor implementation
- Ray distributed processing
- Performance comparison table
| Method | Time (s) | Speedup |
|---|---|---|
| Single | X.XX | 1.0x |
| Thread | X.XX | Y.Yx |
| Process | X.XX | Y.Yx |
| Ray | X.XX | Y.Yx |
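The single-worker and thread-pool rows can be sketched with the standard library. The `predict` function below is a sleep-based stand-in for a real model call (an assumption, not your model); `ProcessPoolExecutor` and Ray follow the same map pattern over items:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def predict(x):
    """Stand-in for a model call; the sleep simulates I/O-bound latency."""
    time.sleep(0.01)
    return x * 2


def run_single(items):
    # Baseline: one item at a time.
    return [predict(x) for x in items]


def run_threaded(items, workers=8):
    # Threads overlap the simulated I/O waits.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(predict, items))


items = list(range(32))
t0 = time.perf_counter(); single = run_single(items); t_single = time.perf_counter() - t0
t0 = time.perf_counter(); threaded = run_threaded(items); t_threaded = time.perf_counter() - t0
speedup = t_single / t_threaded  # fills the "Speedup" column
```

Note the caveat for your real benchmark: threads help I/O-bound inference, while CPU-bound models typically need processes or Ray to bypass the GIL.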
PR5: Streaming Dataset (Optional)
Convert your dataset to streaming format.
Reference: See Streaming Datasets
Requirements:
- Choose format (MDS, WebDataset, TFRecord)
- Implement data writer
- Upload to S3/MinIO
- Create DataLoader for reading
- Benchmark loading speed
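If you choose WebDataset, the writer requirement amounts to packing samples into tar shards. Here is a minimal stdlib sketch of that idea (the shard naming and JSON-only samples are simplifying assumptions; the `webdataset` package adds multi-file samples, sharded URLs, and DataLoader integration):

```python
import io
import json
import os
import tarfile
import tempfile


def write_shard(path, samples):
    """Write samples into one tar shard, WebDataset-style: each sample is a
    file whose basename ('000000', '000001', ...) identifies it."""
    with tarfile.open(path, "w") as tar:
        for i, sample in enumerate(samples):
            data = json.dumps(sample).encode()
            info = tarfile.TarInfo(name=f"{i:06d}.json")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_shard(path):
    """Stream samples back out of a shard in order."""
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            yield json.loads(tar.extractfile(member).read())


# Round-trip a tiny shard to show the format.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard-000000.tar")
    write_shard(shard, [{"text": "a"}, {"text": "b"}])
    restored = list(read_shard(shard))
```

Because shards are plain tar files, uploading them to S3/MinIO is an ordinary object upload, and loaders can stream them sequentially.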
PR6: Vector Database
Transform your dataset to vector format and implement RAG.
Reference: See Vector Databases
Requirements:
- Convert text data to embeddings
- Create LanceDB/Chroma database
- Implement ingestion pipeline
- Build query interface
- Benchmark query latency
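Under the hood, the query interface you build on LanceDB or Chroma performs a nearest-neighbor search over embeddings. A brute-force stdlib sketch of that search (the two-dimensional toy embeddings and document ids are assumptions for illustration):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query, index, k=2):
    """index: list of (doc_id, embedding) pairs. Returns doc ids ranked by
    similarity to the query embedding."""
    scored = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy index: real embeddings would come from your embedding model.
index = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
hits = top_k([1.0, 0.05], index, k=2)
```

Vector databases replace this O(n) scan with approximate indexes (e.g. HNSW, IVF), which is exactly what your latency benchmark should surface.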
Google Doc: Data Section
Update your design document.
Required sections:
- Data Description
  - Dataset source and size
  - Features and labels
  - Data splits (train/val/test)
- Storage Strategy
  - Storage backend (S3/MinIO)
  - Data format choice (with justification)
  - Versioning approach (DVC)
- Processing Pipeline
  - Data loading strategy
  - Preprocessing steps
  - Performance optimizations
  - Streaming vs batch
- Infrastructure
  - Storage capacity needed
  - Compute requirements
  - Cost estimates
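A cost estimate in this section should show its arithmetic. A minimal sketch; the dataset size, replica count, and the $0.023/GB-month rate are placeholder assumptions to replace with your own figures:

```python
def storage_cost_estimate(dataset_gb, replicas, price_per_gb_month, months):
    """Back-of-envelope storage budget; all inputs are documented assumptions."""
    total_gb = dataset_gb * replicas
    return total_gb * price_per_gb_month * months


# Example: 50 GB dataset, 3 replicas, 6 months at a placeholder rate.
cost = storage_cost_estimate(dataset_gb=50, replicas=3,
                             price_per_gb_month=0.023, months=6)
```

Recording the formula alongside the numbers lets reviewers re-run the estimate when any assumption changes.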
Success Criteria
H4: Data Labeling & Validation
Learning Goals
- Write effective labeling guidelines
- Deploy annotation tools
- Generate synthetic training data
- Validate data quality
- Version control datasets
Reading List
Labeling Best Practices
Validation
Synthetic Data
Tasks
Google Doc: Labeling Section
Add comprehensive labeling documentation.
Reference: See Cost Estimation
1. Labeling Guidelines
- Task definition and objectives
- Label definitions with examples
- Edge case handling
- Quality check procedures
- Decision flowchart
2. Cost/Time Estimates
- Label 50 samples manually
- Calculate time per sample
- Estimate total time needed
- Compute labeling budget
- Include calculation methodology
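The calculation methodology can be made explicit with a short script. A sketch; the pilot timing, dataset size, and hourly rate below are placeholder assumptions, not prescribed values:

```python
def labeling_estimate(total_samples, timed_samples, total_seconds, hourly_rate):
    """Extrapolate time and budget from a timed pilot batch
    (e.g. the 50 hand-labeled samples)."""
    sec_per_sample = total_seconds / timed_samples
    total_hours = total_samples * sec_per_sample / 3600
    return {"sec_per_sample": sec_per_sample,
            "total_hours": total_hours,
            "budget_usd": total_hours * hourly_rate}


# Example: 50 samples took 25 minutes; extrapolate to 10,000 samples.
est = labeling_estimate(total_samples=10_000, timed_samples=50,
                        total_seconds=1_500, hourly_rate=15.0)
```

Paste both the inputs and the outputs into the doc so the budget can be audited and updated.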
3. Production Workflow
- Data sampling strategy
- Annotation tool setup
- Quality assurance process
- Active learning integration
- Feedback collection
PR1: DVC Dataset Versioning
Commit data with DVC.
Reference: See Dataset Versioning
Requirements:
- Initialize DVC in repository
- Add dataset files to DVC tracking
- Configure MinIO/S3 as remote
- Push data to remote storage
- Document workflow in README
PR2: Labeling Tool Deployment
Deploy Argilla or Label Studio.
Requirements:
- Choose tool (Argilla recommended)
- Create deployment configuration
- Docker/K8s deployment instructions
- Access and authentication setup
- Dataset creation example
Files:
labeling/docker-compose.yml or labeling/k8s-manifest.yaml
labeling/README.md
labeling/create_dataset.py
PR3: Synthetic Dataset (Optional)
Generate synthetic data with GPT.
Reference: See Synthetic Data Generation
Requirements:
- Design generation prompt
- Implement retry logic
- Validate generated samples
- Upload to labeling tool
- Compare with real data
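The retry-logic requirement can be sketched generically. `generate_with_retry` wraps whatever generation call you use; `flaky_generate` below is a stand-in assumption for the real GPT API call, which may raise on rate limits or timeouts:

```python
import random
import time


def generate_with_retry(generate, prompt, max_attempts=4, base_delay=0.01):
    """Call a generation function, retrying with exponential backoff
    on transient errors; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return generate(prompt)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: ~0.01s, ~0.02s, ~0.04s, ...
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))


# Simulated flaky backend: fails twice, then succeeds.
calls = {"n": 0}
def flaky_generate(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient failure")
    return f"synthetic sample for: {prompt}"


result = generate_with_retry(flaky_generate, "classify sentiment")
```

In production you would retry only on transient exception types (timeouts, rate limits) rather than the bare `Exception` used here for brevity.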
PR4: Data Validation (Optional)
Test data quality with Cleanlab or Deepchecks.
Reference: See Data Validation
Requirements:
- Load labeled dataset
- Run integrity checks
- Identify label issues
- Generate validation report
- Document findings
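To make the integrity checks concrete, here is a stdlib sketch of the kind of issues such tools automate (duplicate rows, missing labels, label imbalance); it is an illustration of the checks, not the Deepchecks or Cleanlab API, and the record schema is an assumption:

```python
from collections import Counter


def integrity_report(rows):
    """rows: list of {'text': str, 'label': str | None} records.
    Returns counts of common integrity problems."""
    texts = [r["text"] for r in rows]
    labels = [r["label"] for r in rows]
    counts = Counter(label for label in labels if label is not None)
    return {
        "n_rows": len(rows),
        "n_duplicate_texts": len(texts) - len(set(texts)),
        "n_missing_labels": sum(label is None for label in labels),
        "label_counts": dict(counts),
    }


report = integrity_report([
    {"text": "great product", "label": "pos"},
    {"text": "great product", "label": "pos"},   # duplicate text
    {"text": "broke on day one", "label": "neg"},
    {"text": "meh", "label": None},              # missing label
])
```

Your validation report should go further than this sketch: the library suites also flag outliers, train/test leakage, and likely mislabeled samples.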
Success Criteria
Submission
Code Requirements
- Formatting: Use ruff format for Python code
- Linting: Pass ruff check with no errors
- Testing: Run pytest from the repository root
- Documentation: Include a README in each directory
Pull Request Format
Title: [module-2] <concise description>
Example: [module-2] Add MinIO client with S3FS support
Body should include:
- Summary of changes
- How to test
- Performance results (for benchmarks)
- Screenshots (for UI/deployment)
Google Doc Requirements
Your design document should include:
- Data Section (H3 deliverable)
  - Dataset description
  - Storage architecture
  - Processing pipeline
  - Performance benchmarks
- Labeling Section (H4 deliverable)
  - Labeling guidelines
  - Cost/time estimates
  - Production workflow
  - Quality assurance plan
Resources
Reference Implementations
All source code available at:
Getting Help
- Documentation: Review module pages
- Code Examples: Check source directory
- Community: Ask in course discussion forum
- Office Hours: Attend weekly sessions
Next Steps
After completing Module 2:
- Ensure all PRs are merged
- Verify Google Doc is complete
- Proceed to Module 3: Model Training
Module 2 builds the data foundation for your ML system. Take time to understand storage, formats, and labeling deeply; these decisions affect every downstream component.