
Overview

This module includes two practice sections with multiple deliverables. Complete all tasks to demonstrate mastery of data management for ML production systems.

H3: Data Storage & Processing

Learning Goals

  • Deploy MinIO with multiple configuration options
  • Implement and test Python storage clients
  • Benchmark data format performance
  • Optimize inference with parallel processing
  • Create streaming datasets
  • Build vector databases for RAG

Reading List

Tasks

1. PR1: MinIO Deployment

Write comprehensive README instructions for deploying MinIO.

Requirements:
  • Local installation steps
  • Docker deployment with docker run
  • Kubernetes deployment with manifests
  • Port forwarding instructions
  • Access credentials and UI setup

Deliverable: minio_storage/README.md
Reference: See Storage documentation
2. PR2: MinIO Python Client

Develop a CRUD client with comprehensive tests.

Requirements:
  • Implement both native MinIO and S3FS clients
  • Create, read, update, delete operations
  • Environment-based configuration
  • Pytest fixtures for testing
  • Test upload/download functionality

Files:
  • minio_storage/minio_client.py
  • minio_storage/test_minio_client.py

Run tests:
pytest -s ./minio_storage/test_minio_client.py

Reference: See Python Client implementation
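As a starting point for the native client, here is a minimal sketch. It assumes a local MinIO with the stock minioadmin dev credentials; the env var names (MINIO_ENDPOINT, etc.) are placeholders you should align with your deployment. An "update" is simply an overwrite via another upload, which matches object-storage semantics.

```python
import io
import os


def load_config() -> dict:
    """Read MinIO connection settings from the environment, with local-dev defaults.

    The minioadmin defaults match a stock local MinIO; override the env vars
    for any real deployment.
    """
    return {
        "endpoint": os.getenv("MINIO_ENDPOINT", "localhost:9000"),
        "access_key": os.getenv("MINIO_ACCESS_KEY", "minioadmin"),
        "secret_key": os.getenv("MINIO_SECRET_KEY", "minioadmin"),
        "secure": os.getenv("MINIO_SECURE", "false").lower() == "true",
    }


def make_client():
    # Imported lazily so the config helper stays testable without the package.
    from minio import Minio  # pip install minio

    cfg = load_config()
    return Minio(
        cfg["endpoint"],
        access_key=cfg["access_key"],
        secret_key=cfg["secret_key"],
        secure=cfg["secure"],
    )


def upload_bytes(client, bucket: str, key: str, data: bytes) -> None:
    """Create/update: write raw bytes to an object, creating the bucket if needed."""
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)
    client.put_object(bucket, key, io.BytesIO(data), length=len(data))


def download_bytes(client, bucket: str, key: str) -> bytes:
    """Read: fetch an object's full contents."""
    resp = client.get_object(bucket, key)
    try:
        return resp.read()
    finally:
        resp.close()
        resp.release_conn()


def delete_object(client, bucket: str, key: str) -> None:
    """Delete: remove a single object."""
    client.remove_object(bucket, key)
```

Your pytest fixtures can call make_client() once per session and create/tear down a throwaway bucket around each test.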
3. PR3: Data Format Benchmarks

Benchmark Pandas storage formats.

Requirements:
  • Test CSV, Parquet, Feather, HDF5
  • Measure save time, load time, file size
  • Create visualization of results
  • Document findings in README
  • Recommend format for different use cases

Metrics to measure:
  • Write time (seconds)
  • Read time (seconds)
  • File size (MB)
  • Compression ratio

Deliverable: processing/format_benchmark.py
Reference: See Format Comparison
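The benchmark loop can be as simple as timing one round-trip per format. A sketch (Parquet/Feather need pyarrow installed, HDF5 needs tables, hence it is commented out here):

```python
import os
import time

import pandas as pd


def benchmark_format(df: pd.DataFrame, path: str, writer, reader) -> dict:
    """Time one save/load round-trip and record the on-disk size."""
    t0 = time.perf_counter()
    writer(df, path)
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    reader(path)
    read_s = time.perf_counter() - t0

    return {"write_s": write_s, "read_s": read_s, "size_mb": os.path.getsize(path) / 1e6}


FORMATS = {
    "csv": (lambda df, p: df.to_csv(p, index=False), pd.read_csv),
    "parquet": (lambda df, p: df.to_parquet(p), pd.read_parquet),  # needs pyarrow
    "feather": (lambda df, p: df.to_feather(p), pd.read_feather),  # needs pyarrow
    # "hdf5": (lambda df, p: df.to_hdf(p, key="data"), pd.read_hdf),  # needs tables
}

if __name__ == "__main__":
    import numpy as np

    df = pd.DataFrame(np.random.rand(100_000, 10), columns=[f"c{i}" for i in range(10)])
    results = {
        name: benchmark_format(df, f"bench.{name}", w, r)
        for name, (w, r) in FORMATS.items()
    }
    print(pd.DataFrame(results).T)
```

Run each format several times and report the median, since the first write pays filesystem-cache warm-up costs.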
4. PR4: Inference Benchmarks

Benchmark parallel inference performance.

Requirements:
  • Single worker baseline
  • ThreadPoolExecutor implementation
  • ProcessPoolExecutor implementation
  • Ray distributed processing
  • Performance comparison table

Expected results table:

| Method  | Time (s) | Speedup |
|---------|----------|---------|
| Single  | X.XX     | 1.0x    |
| Thread  | X.XX     | Y.Yx    |
| Process | X.XX     | Y.Yx    |
| Ray     | X.XX     | Y.Yx    |

Run benchmarks:
python processing/inference_example.py run-single-worker --inference-size 10000000
python processing/inference_example.py run-pool --inference-size 10000000
python processing/inference_example.py run-ray --inference-size 10000000

Reference: See Inference Performance
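The first three rows of the table can be produced with the standard library alone. A stdlib sketch using a dummy CPU-bound "model" (fake_inference is a placeholder, not the course's model):

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_inference(batch):
    """Stand-in for a model forward pass: a CPU-bound numeric reduction."""
    return sum(x * x for x in batch)


def run(executor_cls, batches, workers: int = 4):
    """Run fake_inference over all batches, sequentially or via an executor."""
    if executor_cls is None:  # single-worker baseline
        return [fake_inference(b) for b in batches]
    with executor_cls(max_workers=workers) as ex:
        return list(ex.map(fake_inference, batches))


if __name__ == "__main__":
    batches = [list(range(100_000))] * 16
    for name, cls in [
        ("single", None),
        ("thread", ThreadPoolExecutor),
        ("process", ProcessPoolExecutor),
    ]:
        t0 = time.perf_counter()
        run(cls, batches)
        print(f"{name:8s} {time.perf_counter() - t0:.2f}s")
```

Expect threads to show little or no speedup here: the work is pure-Python and CPU-bound, so the GIL serializes it, while processes (and Ray) sidestep the GIL at the cost of pickling each batch.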
5. PR5: Streaming Dataset (Optional)

Convert your dataset to a streaming format.

Requirements:
  • Choose format (MDS, WebDataset, TFRecord)
  • Implement data writer
  • Upload to S3/MinIO
  • Create DataLoader for reading
  • Benchmark loading speed

Example:
python streaming-dataset/convert_data.py create \
  --input-path ./raw_data \
  --output-path ./streaming_data

aws s3 cp --recursive ./streaming_data s3://datasets/my-data

python streaming-dataset/convert_data.py test \
  --remote s3://datasets/my-data

Reference: See Streaming Datasets
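If you pick WebDataset, note that a shard is just a tar archive whose member files share a basename per sample, so the writer side can be sketched with the standard library (a real training pipeline would read shards with the webdataset or streaming libraries, which add shuffling and remote reads):

```python
import io
import json
import tarfile


def write_webdataset_shard(samples, shard_path: str) -> None:
    """Write dict samples as a WebDataset-style tar shard.

    WebDataset groups files by basename: sample 000001 becomes 000001.json
    (plus e.g. 000001.jpg if the sample carries an image payload).
    """
    with tarfile.open(shard_path, "w") as tar:
        for i, sample in enumerate(samples):
            payload = json.dumps(sample).encode()
            info = tarfile.TarInfo(name=f"{i:06d}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_webdataset_shard(shard_path: str):
    """Read every .json member of a shard back into dicts."""
    with tarfile.open(shard_path) as tar:
        return [
            json.loads(tar.extractfile(m).read())
            for m in tar.getmembers()
            if m.name.endswith(".json")
        ]
```

Keeping shards in the 100 MB-1 GB range is the usual guidance, so sequential reads from S3/MinIO stay fast.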
6. PR6: Vector Database

Transform your dataset to vector format and implement RAG.

Requirements:
  • Convert text data to embeddings
  • Create LanceDB/Chroma database
  • Implement ingestion pipeline
  • Build query interface
  • Benchmark query latency

CLI commands:
# Create database
python vector-db/my_rag.py create \
  --data-path ./data \
  --table-name my_vectors \
  --num-documents 1000

# Query database
python vector-db/my_rag.py query \
  --table-name my_vectors \
  --query "your search query" \
  --top-k 5

Reference: See Vector Databases
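Before committing to LanceDB or Chroma, the query interface is worth prototyping with exact search so you have a correctness and latency baseline. This NumPy sketch is a toy stand-in, not either library's API; the real databases add ANN indexing, persistence, and metadata filtering on top of the same idea:

```python
import numpy as np


class BruteForceIndex:
    """Toy stand-in for a vector DB: exact cosine-similarity search."""

    def __init__(self, embeddings: np.ndarray, documents: list):
        # Normalize rows once so each query is a single matrix-vector product.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / np.clip(norms, 1e-12, None)
        self.documents = documents

    def query(self, vector: np.ndarray, top_k: int = 5):
        """Return the top_k (document, score) pairs by cosine similarity."""
        v = vector / max(np.linalg.norm(vector), 1e-12)
        scores = self.embeddings @ v
        top = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[i], float(scores[i])) for i in top]
```

Benchmark your LanceDB/Chroma queries against this brute-force baseline at your dataset size; at a few thousand documents, exact search is often already fast enough.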
7. Google Doc: Data Section

Update your design document.

Required sections:
  1. Data Description
    • Dataset source and size
    • Features and labels
    • Data splits (train/val/test)
  2. Storage Strategy
    • Storage backend (S3/MinIO)
    • Data format choice (with justification)
    • Versioning approach (DVC)
  3. Processing Pipeline
    • Data loading strategy
    • Preprocessing steps
    • Performance optimizations
    • Streaming vs batch
  4. Infrastructure
    • Storage capacity needed
    • Compute requirements
    • Cost estimates

Template: Design Doc Template

Success Criteria

H4: Data Labeling & Validation

Learning Goals

  • Write effective labeling guidelines
  • Deploy annotation tools
  • Generate synthetic training data
  • Validate data quality
  • Version control datasets

Reading List

Tasks

1. Google Doc: Labeling Section

Add comprehensive labeling documentation.

1. Labeling Guidelines
  • Task definition and objectives
  • Label definitions with examples
  • Edge case handling
  • Quality check procedures
  • Decision flowchart
2. Cost & Time Estimation
  • Label 50 samples manually
  • Calculate time per sample
  • Estimate total time needed
  • Compute labeling budget
  • Include calculation methodology
3. Production Workflow
  • Data sampling strategy
  • Annotation tool setup
  • Quality assurance process
  • Active learning integration
  • Feedback collection

Example calculation:
Pilot: 50 samples in 30 minutes = 36 seconds/sample
Dataset: 10,000 samples
Time: 10,000 × 36 s = 100 hours
Cost: 100 hours × $15/hour = $1,500
With QC overhead (20%): $1,800

Reference: See Cost Estimation
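The arithmetic above generalizes to a small helper worth including with your methodology; the rate and QC surcharge are the example's placeholder numbers, to be replaced with your vendor's quote:

```python
def labeling_cost(
    n_samples: int,
    seconds_per_sample: float,
    hourly_rate: float,
    qc_overhead: float = 0.20,
) -> dict:
    """Estimate labeling hours and budget, with a QC surcharge on top."""
    hours = n_samples * seconds_per_sample / 3600
    base = hours * hourly_rate
    return {"hours": hours, "base_cost": base, "total_cost": base * (1 + qc_overhead)}


# The worked example from above: 10,000 samples at 36 s each, $15/hour
print(labeling_cost(10_000, 36, 15))
```

Re-run it after your pilot to replace the assumed seconds_per_sample with your measured value.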
2. PR1: DVC Dataset Versioning

Commit data with DVC.

Requirements:
  • Initialize DVC in repository
  • Add dataset files to DVC tracking
  • Configure MinIO/S3 as remote
  • Push data to remote storage
  • Document workflow in README

Commands:
# Initialize
dvc init --subdir

# Track data
dvc add ./data/dataset.csv
git add data/.gitignore data/dataset.csv.dvc

# Configure remote
dvc remote add -d storage s3://ml-data
dvc remote modify storage endpointurl $AWS_ENDPOINT_URL

# Push
dvc push

Reference: See Dataset Versioning
3. PR2: Labeling Tool Deployment

Deploy Argilla or Label Studio.

Requirements:
  • Choose tool (Argilla recommended)
  • Create deployment configuration
  • Docker/K8s deployment instructions
  • Access and authentication setup
  • Dataset creation example

Argilla Docker:
docker run -it --rm --name argilla -p 6900:6900 \
  argilla/argilla-quickstart:v2.0.0rc1

Files:
  • labeling/docker-compose.yml or labeling/k8s-manifest.yaml
  • labeling/README.md
  • labeling/create_dataset.py

Reference: See Argilla Deployment
4. PR3: Synthetic Dataset (Optional)

Generate synthetic data with GPT.

Requirements:
  • Design generation prompt
  • Implement retry logic
  • Validate generated samples
  • Upload to labeling tool
  • Compare with real data

Example:
import json
from typing import Dict

from openai import OpenAI

client = OpenAI()

def generate_sample(prompt_template: str) -> Dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Generate 1000 samples
samples = [generate_sample(template) for _ in range(1000)]
Reference: See Synthetic Data Generation
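For the retry-logic requirement, a small exponential-backoff decorator around generate_sample is enough; this stdlib-only sketch is one way to do it (libraries like tenacity provide the same pattern off the shelf):

```python
import time
from functools import wraps


def with_retries(max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky call (e.g. an LLM API request) with exponential backoff."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the original error
                    time.sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ...

        return wrapper

    return decorator
```

In production you would narrow the `except` clause to the API's transient error types (rate limits, timeouts) rather than retrying every exception.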
5. PR4: Data Validation (Optional)

Test data quality with Cleanlab or Deepchecks.

Requirements:
  • Load labeled dataset
  • Run integrity checks
  • Identify label issues
  • Generate validation report
  • Document findings

Cleanlab example:
from cleanlab.classification import CleanLearning

cl = CleanLearning(clf=model)
cl.fit(X_train, labels)

issues = cl.get_label_issues()
n_flagged = issues["is_label_issue"].sum()
print(f"Found {n_flagged} potential label errors")

# Save report
issues.to_csv("label_issues.csv")

Deepchecks example:
from deepchecks.tabular.suites import data_integrity

suite = data_integrity()
result = suite.run(dataset)
result.save_as_html("validation_report.html")

Reference: See Data Validation

Success Criteria

Submission

Code Requirements

  • Formatting: Use ruff format for Python code
  • Linting: Pass ruff check with no errors
  • Testing: Run pytest from repository root
  • Documentation: Include README in each directory

Pull Request Format

Title: [module-2] <concise description>
Example: [module-2] Add MinIO client with S3FS support

Body should include:
  • Summary of changes
  • How to test
  • Performance results (for benchmarks)
  • Screenshots (for UI/deployment)

Google Doc Requirements

Your design document should include:
  1. Data Section (H3 deliverable)
    • Dataset description
    • Storage architecture
    • Processing pipeline
    • Performance benchmarks
  2. Labeling Section (H4 deliverable)
    • Labeling guidelines
    • Cost/time estimates
    • Production workflow
    • Quality assurance plan

Resources

Reference Implementations

All source code available at:
~/workspace/source/module-2/
├── minio_storage/
├── processing/
├── streaming-dataset/
├── vector-db/
└── labeling/

Getting Help

  • Documentation: Review module pages
  • Code Examples: Check source directory
  • Community: Ask in course discussion forum
  • Office Hours: Attend weekly sessions

Next Steps

After completing Module 2:
  1. Ensure all PRs are merged
  2. Verify Google Doc is complete
  3. Proceed to Module 3: Model Training
Module 2 builds the data foundation for your ML system. Take time to understand storage, formats, and labeling deeply: these decisions impact every downstream component.
