
Overview

This module includes two practice sections with multiple deliverables. Complete all tasks to demonstrate mastery of data management for ML production systems.

H3: Data Storage & Processing

Learning Goals

  • Deploy MinIO with multiple configuration options
  • Implement and test Python storage clients
  • Benchmark data format performance
  • Optimize inference with parallel processing
  • Create streaming datasets
  • Build vector databases for RAG

Reading List

Tasks

1. PR1: MinIO Deployment

Write comprehensive README instructions for deploying MinIO.

Requirements:
  • Local installation steps
  • Docker deployment with docker run
  • Kubernetes deployment with manifests
  • Port forwarding instructions
  • Access credentials and UI setup

Deliverable: minio_storage/README.md
Reference: See Storage documentation
2. PR2: MinIO Python Client

Develop a CRUD client with comprehensive tests.

Requirements:
  • Implement both native MinIO and S3FS clients
  • Create, read, update, delete operations
  • Environment-based configuration
  • Pytest fixtures for testing
  • Test upload/download functionality

Files:
  • minio_storage/minio_client.py
  • minio_storage/test_minio_client.py

Run tests:
pytest -s ./minio_storage/test_minio_client.py

Reference: See Python Client implementation
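As a starting point for the native client, here is a minimal sketch. It assumes a local MinIO with the stock minioadmin dev credentials; the env var names (MINIO_ENDPOINT, etc.) are placeholders you should align with your deployment. An "update" is simply an overwrite via another upload, which matches object-storage semantics.

```python
import io
import os


def load_config() -> dict:
    """Read MinIO connection settings from the environment, with local-dev defaults.

    The minioadmin defaults match a stock local MinIO; override the env vars
    for any real deployment.
    """
    return {
        "endpoint": os.getenv("MINIO_ENDPOINT", "localhost:9000"),
        "access_key": os.getenv("MINIO_ACCESS_KEY", "minioadmin"),
        "secret_key": os.getenv("MINIO_SECRET_KEY", "minioadmin"),
        "secure": os.getenv("MINIO_SECURE", "false").lower() == "true",
    }


def make_client():
    # Imported lazily so the config helper stays testable without the package.
    from minio import Minio  # pip install minio

    cfg = load_config()
    return Minio(
        cfg["endpoint"],
        access_key=cfg["access_key"],
        secret_key=cfg["secret_key"],
        secure=cfg["secure"],
    )


def upload_bytes(client, bucket: str, key: str, data: bytes) -> None:
    """Create/update: write raw bytes to an object, creating the bucket if needed."""
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)
    client.put_object(bucket, key, io.BytesIO(data), length=len(data))


def download_bytes(client, bucket: str, key: str) -> bytes:
    """Read: fetch an object's full contents."""
    resp = client.get_object(bucket, key)
    try:
        return resp.read()
    finally:
        resp.close()
        resp.release_conn()


def delete_object(client, bucket: str, key: str) -> None:
    """Delete: remove a single object."""
    client.remove_object(bucket, key)
```

Your pytest fixtures can call make_client() once per session and create/tear down a throwaway bucket around each test.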
3. PR3: Data Format Benchmarks

Benchmark Pandas storage formats.

Requirements:
  • Test CSV, Parquet, Feather, HDF5
  • Measure save time, load time, file size
  • Create visualization of results
  • Document findings in README
  • Recommend format for different use cases

Metrics to measure:
  • Write time (seconds)
  • Read time (seconds)
  • File size (MB)
  • Compression ratio

Deliverable: processing/format_benchmark.py
Reference: See Format Comparison
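The benchmark loop can be as simple as timing one round-trip per format. A sketch (Parquet/Feather need pyarrow installed, HDF5 needs tables, hence it is commented out here):

```python
import os
import time

import pandas as pd


def benchmark_format(df: pd.DataFrame, path: str, writer, reader) -> dict:
    """Time one save/load round-trip and record the on-disk size."""
    t0 = time.perf_counter()
    writer(df, path)
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    reader(path)
    read_s = time.perf_counter() - t0

    return {"write_s": write_s, "read_s": read_s, "size_mb": os.path.getsize(path) / 1e6}


FORMATS = {
    "csv": (lambda df, p: df.to_csv(p, index=False), pd.read_csv),
    "parquet": (lambda df, p: df.to_parquet(p), pd.read_parquet),  # needs pyarrow
    "feather": (lambda df, p: df.to_feather(p), pd.read_feather),  # needs pyarrow
    # "hdf5": (lambda df, p: df.to_hdf(p, key="data"), pd.read_hdf),  # needs tables
}

if __name__ == "__main__":
    import numpy as np

    df = pd.DataFrame(np.random.rand(100_000, 10), columns=[f"c{i}" for i in range(10)])
    results = {
        name: benchmark_format(df, f"bench.{name}", w, r)
        for name, (w, r) in FORMATS.items()
    }
    print(pd.DataFrame(results).T)
```

Run each format several times and report the median, since the first write pays filesystem-cache warm-up costs.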
4. PR4: Inference Benchmarks

Benchmark parallel inference performance.

Requirements:
  • Single worker baseline
  • ThreadPoolExecutor implementation
  • ProcessPoolExecutor implementation
  • Ray distributed processing
  • Performance comparison table

Expected results table:

| Method  | Time (s) | Speedup |
|---------|----------|---------|
| Single  | X.XX     | 1.0x    |
| Thread  | X.XX     | Y.Yx    |
| Process | X.XX     | Y.Yx    |
| Ray     | X.XX     | Y.Yx    |

Run benchmarks:
python processing/inference_example.py run-single-worker --inference-size 10000000
python processing/inference_example.py run-pool --inference-size 10000000
python processing/inference_example.py run-ray --inference-size 10000000

Reference: See Inference Performance
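The first three rows of the table can be produced with the standard library alone. A stdlib sketch using a dummy CPU-bound "model" (fake_inference is a placeholder, not the course's model):

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_inference(batch):
    """Stand-in for a model forward pass: a CPU-bound numeric reduction."""
    return sum(x * x for x in batch)


def run(executor_cls, batches, workers: int = 4):
    """Run fake_inference over all batches, sequentially or via an executor."""
    if executor_cls is None:  # single-worker baseline
        return [fake_inference(b) for b in batches]
    with executor_cls(max_workers=workers) as ex:
        return list(ex.map(fake_inference, batches))


if __name__ == "__main__":
    batches = [list(range(100_000))] * 16
    for name, cls in [
        ("single", None),
        ("thread", ThreadPoolExecutor),
        ("process", ProcessPoolExecutor),
    ]:
        t0 = time.perf_counter()
        run(cls, batches)
        print(f"{name:8s} {time.perf_counter() - t0:.2f}s")
```

Expect threads to show little or no speedup here: the work is pure-Python and CPU-bound, so the GIL serializes it, while processes (and Ray) sidestep the GIL at the cost of pickling each batch.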
5. PR5: Streaming Dataset (Optional)

Convert your dataset to a streaming format.

Requirements:
  • Choose format (MDS, WebDataset, TFRecord)
  • Implement data writer
  • Upload to S3/MinIO
  • Create DataLoader for reading
  • Benchmark loading speed

Example:
python streaming-dataset/convert_data.py create \
  --input-path ./raw_data \
  --output-path ./streaming_data

aws s3 cp --recursive ./streaming_data s3://datasets/my-data

python streaming-dataset/convert_data.py test \
  --remote s3://datasets/my-data

Reference: See Streaming Datasets
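If you pick WebDataset, note that a shard is just a tar archive whose member files share a basename per sample, so the writer side can be sketched with the standard library (a real training pipeline would read shards with the webdataset or streaming libraries, which add shuffling and remote reads):

```python
import io
import json
import tarfile


def write_webdataset_shard(samples, shard_path: str) -> None:
    """Write dict samples as a WebDataset-style tar shard.

    WebDataset groups files by basename: sample 000001 becomes 000001.json
    (plus e.g. 000001.jpg if the sample carries an image payload).
    """
    with tarfile.open(shard_path, "w") as tar:
        for i, sample in enumerate(samples):
            payload = json.dumps(sample).encode()
            info = tarfile.TarInfo(name=f"{i:06d}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_webdataset_shard(shard_path: str):
    """Read every .json member of a shard back into dicts."""
    with tarfile.open(shard_path) as tar:
        return [
            json.loads(tar.extractfile(m).read())
            for m in tar.getmembers()
            if m.name.endswith(".json")
        ]
```

Keeping shards in the 100 MB-1 GB range is the usual guidance, so sequential reads from S3/MinIO stay fast.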
6. PR6: Vector Database

Transform your dataset to vector format and implement RAG.

Requirements:
  • Convert text data to embeddings
  • Create LanceDB/Chroma database
  • Implement ingestion pipeline
  • Build query interface
  • Benchmark query latency

CLI commands:
# Create database
python vector-db/my_rag.py create \
  --data-path ./data \
  --table-name my_vectors \
  --num-documents 1000

# Query database
python vector-db/my_rag.py query \
  --table-name my_vectors \
  --query "your search query" \
  --top-k 5

Reference: See Vector Databases
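Before committing to LanceDB or Chroma, the query interface is worth prototyping with exact search so you have a correctness and latency baseline. This NumPy sketch is a toy stand-in, not either library's API; the real databases add ANN indexing, persistence, and metadata filtering on top of the same idea:

```python
import numpy as np


class BruteForceIndex:
    """Toy stand-in for a vector DB: exact cosine-similarity search."""

    def __init__(self, embeddings: np.ndarray, documents: list):
        # Normalize rows once so each query is a single matrix-vector product.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / np.clip(norms, 1e-12, None)
        self.documents = documents

    def query(self, vector: np.ndarray, top_k: int = 5):
        """Return the top_k (document, score) pairs by cosine similarity."""
        v = vector / max(np.linalg.norm(vector), 1e-12)
        scores = self.embeddings @ v
        top = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[i], float(scores[i])) for i in top]
```

Benchmark your LanceDB/Chroma queries against this brute-force baseline at your dataset size; at a few thousand documents, exact search is often already fast enough.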
7. Google Doc: Data Section

Update your design document.

Required sections:
  1. Data Description
    • Dataset source and size
    • Features and labels
    • Data splits (train/val/test)
  2. Storage Strategy
    • Storage backend (S3/MinIO)
    • Data format choice (with justification)
    • Versioning approach (DVC)
  3. Processing Pipeline
    • Data loading strategy
    • Preprocessing steps
    • Performance optimizations
    • Streaming vs batch
  4. Infrastructure
    • Storage capacity needed
    • Compute requirements
    • Cost estimates

Template: Design Doc Template

Success Criteria

H4: Data Labeling & Validation

Learning Goals

  • Write effective labeling guidelines
  • Deploy annotation tools
  • Generate synthetic training data
  • Validate data quality
  • Version control datasets

Reading List

Tasks

1. Google Doc: Labeling Section

Add comprehensive labeling documentation.

1. Labeling Guidelines
  • Task definition and objectives
  • Label definitions with examples
  • Edge case handling
  • Quality check procedures
  • Decision flowchart
2. Cost & Time Estimation
  • Label 50 samples manually
  • Calculate time per sample
  • Estimate total time needed
  • Compute labeling budget
  • Include calculation methodology
3. Production Workflow
  • Data sampling strategy
  • Annotation tool setup
  • Quality assurance process
  • Active learning integration
  • Feedback collection

Example calculation:
Pilot: 50 samples in 30 minutes = 36 seconds/sample
Dataset: 10,000 samples
Time: 10,000 × 36 s = 100 hours
Cost: 100 hours × $15/hour = $1,500
With QC overhead (20%): $1,800

Reference: See Cost Estimation
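The arithmetic above generalizes to a small helper worth including with your methodology; the rate and QC surcharge are the example's placeholder numbers, to be replaced with your vendor's quote:

```python
def labeling_cost(
    n_samples: int,
    seconds_per_sample: float,
    hourly_rate: float,
    qc_overhead: float = 0.20,
) -> dict:
    """Estimate labeling hours and budget, with a QC surcharge on top."""
    hours = n_samples * seconds_per_sample / 3600
    base = hours * hourly_rate
    return {"hours": hours, "base_cost": base, "total_cost": base * (1 + qc_overhead)}


# The worked example from above: 10,000 samples at 36 s each, $15/hour
print(labeling_cost(10_000, 36, 15))
```

Re-run it after your pilot to replace the assumed seconds_per_sample with your measured value.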
2. PR1: DVC Dataset Versioning

Commit data with DVC.

Requirements:
  • Initialize DVC in repository
  • Add dataset files to DVC tracking
  • Configure MinIO/S3 as remote
  • Push data to remote storage
  • Document workflow in README

Commands:
# Initialize
dvc init --subdir

# Track data
dvc add ./data/dataset.csv
git add data/.gitignore data/dataset.csv.dvc

# Configure remote
dvc remote add -d storage s3://ml-data
dvc remote modify storage endpointurl $AWS_ENDPOINT_URL

# Push
dvc push

Reference: See Dataset Versioning
3. PR2: Labeling Tool Deployment

Deploy Argilla or Label Studio.

Requirements:
  • Choose tool (Argilla recommended)
  • Create deployment configuration
  • Docker/K8s deployment instructions
  • Access and authentication setup
  • Dataset creation example

Argilla Docker:
docker run -it --rm --name argilla -p 6900:6900 \
  argilla/argilla-quickstart:v2.0.0rc1

Files:
  • labeling/docker-compose.yml or labeling/k8s-manifest.yaml
  • labeling/README.md
  • labeling/create_dataset.py

Reference: See Argilla Deployment
4. PR3: Synthetic Dataset (Optional)

Generate synthetic data with GPT.

Requirements:
  • Design generation prompt
  • Implement retry logic
  • Validate generated samples
  • Upload to labeling tool
  • Compare with real data

Example:
import json
from typing import Dict

from openai import OpenAI

client = OpenAI()

def generate_sample(prompt_template: str) -> Dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Generate 1000 samples
samples = [generate_sample(template) for _ in range(1000)]
Reference: See Synthetic Data Generation
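For the retry-logic requirement, a small exponential-backoff decorator around generate_sample is enough; this stdlib-only sketch is one way to do it (libraries like tenacity provide the same pattern off the shelf):

```python
import time
from functools import wraps


def with_retries(max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky call (e.g. an LLM API request) with exponential backoff."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the original error
                    time.sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ...

        return wrapper

    return decorator
```

In production you would narrow the `except` clause to the API's transient error types (rate limits, timeouts) rather than retrying every exception.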
5. PR4: Data Validation (Optional)

Test data quality with Cleanlab or Deepchecks.

Requirements:
  • Load labeled dataset
  • Run integrity checks
  • Identify label issues
  • Generate validation report
  • Document findings

Cleanlab example:
from cleanlab.classification import CleanLearning

cl = CleanLearning(clf=model)
cl.fit(X_train, labels)

issues = cl.get_label_issues()
n_flagged = issues["is_label_issue"].sum()
print(f"Found {n_flagged} potential label errors")

# Save report
issues.to_csv("label_issues.csv")

Deepchecks example:
from deepchecks.tabular.suites import data_integrity

suite = data_integrity()
result = suite.run(dataset)
result.save_as_html("validation_report.html")

Reference: See Data Validation

Success Criteria

Submission

Code Requirements

  • Formatting: Use ruff format for Python code
  • Linting: Pass ruff check with no errors
  • Testing: Run pytest from repository root
  • Documentation: Include README in each directory

Pull Request Format

Title: [module-2] <concise description>
Example: [module-2] Add MinIO client with S3FS support

Body should include:
  • Summary of changes
  • How to test
  • Performance results (for benchmarks)
  • Screenshots (for UI/deployment)

Google Doc Requirements

Your design document should include:
  1. Data Section (H3 deliverable)
    • Dataset description
    • Storage architecture
    • Processing pipeline
    • Performance benchmarks
  2. Labeling Section (H4 deliverable)
    • Labeling guidelines
    • Cost/time estimates
    • Production workflow
    • Quality assurance plan

Resources

Reference Implementations

All source code available at:
~/workspace/source/module-2/
├── minio_storage/
├── processing/
├── streaming-dataset/
├── vector-db/
└── labeling/

Getting Help

  • Documentation: Review module pages
  • Code Examples: Check source directory
  • Community: Ask in course discussion forum
  • Office Hours: Attend weekly sessions

Next Steps

After completing Module 2:
  1. Ensure all PRs are merged
  2. Verify Google Doc is complete
  3. Proceed to Module 3: Model Training
Module 2 builds the data foundation for your ML system. Take time to understand storage, formats, and labeling deeply: these decisions impact every downstream component.
