
Overview

This module covers essential data management practices for machine learning in production. You’ll learn how to deploy storage systems, work with various data formats, handle streaming datasets, implement vector databases for RAG applications, and set up data labeling workflows.

What You’ll Learn

Data Storage

Deploy MinIO locally and on Kubernetes, implement S3-compatible storage, and manage datasets with DVC

Data Formats

Compare storage formats, benchmark pandas performance, and optimize data loading/saving

Streaming Datasets

Create and consume streaming datasets for efficient data loading during training

Vector Databases

Build RAG applications with LanceDB and implement semantic search

Data Labeling

Deploy Argilla for data annotation and create synthetic datasets

Practice Tasks

Complete hands-on exercises to reinforce your learning

Learning Objectives

By the end of this module, you will be able to:
  • Deploy and configure object storage systems (MinIO, S3)
  • Implement Python clients for cloud storage with comprehensive tests
  • Benchmark and select appropriate data formats for your use case
  • Create streaming datasets for efficient training pipelines
  • Build vector databases for semantic search and RAG applications
  • Set up data labeling workflows with annotation tools
  • Version control datasets using DVC
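
The objective of writing a storage client with tests can be sketched without a running MinIO server. The example below is a minimal sketch under stated assumptions: `ObjectStore`, `InMemoryStore`, and `roundtrip` are hypothetical names introduced for illustration, not part of the MinIO or boto3 APIs. It shows one possible way to define a small CRUD interface and exercise it against an in-memory stand-in, so the same test could later run against a real S3-compatible client that implements the same methods.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal CRUD interface; NOT the MinIO or boto3 API."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...
    def list_keys(self, prefix: str = "") -> list[str]: ...


class InMemoryStore:
    """Dictionary-backed stand-in for an object store, intended for tests only."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Create or overwrite the object stored under `key`.
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        # Raises KeyError if the object does not exist.
        return self._objects[key]

    def delete(self, key: str) -> None:
        # Deleting a missing key is treated as a no-op.
        self._objects.pop(key, None)

    def list_keys(self, prefix: str = "") -> list[str]:
        # Return keys that start with `prefix`, in sorted order.
        return sorted(k for k in self._objects if k.startswith(prefix))


def roundtrip(store: ObjectStore, key: str, data: bytes) -> bool:
    """Write, read back, and delete one object; True if every step behaved."""
    store.put(key, data)
    matches = store.get(key) == data
    store.delete(key)
    return matches and key not in store.list_keys()
```

A test would call `roundtrip(InMemoryStore(), "datasets/train.csv", b"a,b\n1,2\n")` and assert that the result is true. Keeping the interface this small is a design choice for the sketch; a real client would also need error handling and bucket management.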

Module Structure

This module is divided into two main sections:

Data Storage & Processing

This section focuses on storage systems, data formats, and processing performance:
  • MinIO deployment (Docker, Kubernetes)
  • CRUD operations with Python clients
  • Data format benchmarking
  • Parallel inference optimization
  • Streaming datasets
  • Vector databases
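
As a rough illustration of the parallel inference topic listed above, the following sketch splits inputs into batches and runs them through a worker pool from `concurrent.futures`. It is not the module's implementation: `predict` is a placeholder that doubles each value and stands in for a real model call, and `chunked` and `parallel_predict` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(batch: list[float]) -> list[float]:
    """Placeholder model (assumption): doubles each input value."""
    return [2.0 * x for x in batch]


def chunked(items: list[float], size: int) -> list[list[float]]:
    """Split `items` into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_predict(
    items: list[float], batch_size: int = 4, max_workers: int = 4
) -> list[float]:
    """Run `predict` over batches in a thread pool and flatten the results."""
    batches = chunked(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input batches.
        results = pool.map(predict, batches)
        return [y for batch in results for y in batch]
```

For example, `parallel_predict([1.0, 2.0, 3.0, 4.0, 5.0], batch_size=2)` returns the doubled values in their original order. Threads tend to help when the model call releases the GIL or waits on I/O; for pure-Python CPU-bound work, a process pool or Ray is usually the better fit.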

Data Labeling & Validation

This section focuses on data quality and annotation:
  • Labeling guidelines development
  • Argilla deployment and usage
  • Synthetic data generation
  • Data validation techniques
  • Dataset versioning with DVC

Prerequisites

  • Python 3.10+
  • Docker and Kubernetes basics
  • Understanding of pandas and NumPy
  • Familiarity with S3 storage concepts

Key Technologies

  • Storage: MinIO, S3, DVC
  • Formats: Parquet, Feather, HDF5, CSV
  • Streaming: MosaicML Streaming, WebDataset
  • Vector DB: LanceDB, sentence-transformers
  • Labeling: Argilla, Label Studio
  • Processing: Ray, multiprocessing, concurrent.futures

Start with the storage section to set up your infrastructure, then progress through formats and streaming before tackling vector databases and labeling.

Next Steps

Begin with Data Storage to learn how to deploy MinIO and implement S3-compatible storage clients.
