Module 2: Data Management

Overview

This module covers data management practices for machine learning in production. You’ll learn how to deploy object storage, compare data formats, build streaming datasets, use vector databases for retrieval-augmented generation (RAG) applications, and set up data labeling workflows.

What You’ll Learn

Data Storage

Deploy MinIO locally and on Kubernetes, implement S3-compatible storage, and manage datasets with DVC

Data Formats

Compare storage formats, benchmark pandas performance, and optimize data loading/saving

Streaming Datasets

Create and consume streaming datasets for efficient data loading during training

Vector Databases

Build RAG applications with LanceDB and implement semantic search

Data Labeling

Deploy Argilla for data annotation and create synthetic datasets

Practice Tasks

Complete hands-on exercises to reinforce your learning

Learning Objectives

By the end of this module, you will be able to:

Deploy and configure object storage systems (MinIO, S3)
Implement Python clients for object storage and cover them with tests
Benchmark and select appropriate data formats for your use case
Create streaming datasets for efficient training pipelines
Build vector databases for semantic search and RAG applications
Set up data labeling workflows with annotation tools
Version control datasets using DVC
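
To give a feel for the format-benchmarking objective, here is a minimal sketch of a benchmark harness. It is an illustration under stated assumptions, not the module's own code: it uses only standard-library formats (CSV, JSON, pickle) as stand-ins for Parquet, Feather, and HDF5, and the record schema and the names `make_rows` and `benchmark` are hypothetical. The same measure-size, time-the-write, time-the-read loop should carry over when the dump and load functions are swapped for pandas readers and writers.

```python
import csv
import io
import json
import pickle
import time

FIELDS = ["id", "score", "label"]  # hypothetical schema for illustration


def make_rows(n):
    """Build n small synthetic records."""
    return [{"id": i, "score": i * 0.5, "label": f"class_{i % 3}"} for i in range(n)]


def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def from_csv(blob):
    reader = csv.DictReader(io.StringIO(blob.decode("utf-8")))
    # CSV stores everything as text, so types must be restored by hand.
    return [
        {"id": int(r["id"]), "score": float(r["score"]), "label": r["label"]}
        for r in reader
    ]


# Each entry maps a format name to a (serialize, deserialize) pair.
FORMATS = {
    "csv": (to_csv, from_csv),
    "json": (
        lambda rows: json.dumps(rows).encode("utf-8"),
        lambda blob: json.loads(blob.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}


def benchmark(rows, repeats=3):
    """Return {format: {"bytes", "write_s", "read_s"}}, best of `repeats` runs."""
    results = {}
    for name, (dump, load) in FORMATS.items():
        write_s = read_s = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blob = dump(rows)
            t1 = time.perf_counter()
            restored = load(blob)
            t2 = time.perf_counter()
            write_s = min(write_s, t1 - t0)
            read_s = min(read_s, t2 - t1)
        # A format that loses data is disqualified regardless of speed.
        assert restored == rows, f"{name} did not round-trip"
        results[name] = {"bytes": len(blob), "write_s": write_s, "read_s": read_s}
    return results


if __name__ == "__main__":
    for name, stats in benchmark(make_rows(10_000)).items():
        print(f"{name:>6}: {stats['bytes']:>8} B  "
              f"write {stats['write_s']:.4f}s  read {stats['read_s']:.4f}s")
```

Timings depend on the machine and the data shape, so the numbers are only meaningful relative to each other on the same dataset.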

Module Structure

This module is divided into two main sections:

H3: Data Storage & Processing

Focus on storage systems, data formats, and processing performance:

MinIO deployment (Docker, Kubernetes)
CRUD operations with Python clients
Data format benchmarking
Parallel inference optimization
Streaming datasets
Vector databases
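
The CRUD item above can be pictured with a small interface sketch. This is a hypothetical in-memory stand-in, not the MinIO or S3 client API: the class name `InMemoryObjectStore` and its method names are assumptions chosen for illustration. One plausible use is as a test double, so that client-facing code can be unit-tested without a running storage server.

```python
class InMemoryObjectStore:
    """Hypothetical stand-in for an S3-compatible client (not the MinIO API)."""

    def __init__(self):
        # bucket name -> {object key -> bytes}
        self._buckets = {}

    def create_bucket(self, bucket):
        # Creating an existing bucket is a no-op in this sketch.
        self._buckets.setdefault(bucket, {})

    def put(self, bucket, key, data):
        """Create or overwrite an object."""
        if bucket not in self._buckets:
            raise KeyError(f"bucket not found: {bucket}")
        self._buckets[bucket][key] = bytes(data)

    def get(self, bucket, key):
        """Read an object; raises KeyError if bucket or key is missing."""
        return self._buckets[bucket][key]

    def list(self, bucket, prefix=""):
        """List keys under a prefix, sorted for deterministic output."""
        return sorted(k for k in self._buckets[bucket] if k.startswith(prefix))

    def delete(self, bucket, key):
        # Deleting a missing key is a no-op in this sketch.
        self._buckets[bucket].pop(key, None)


if __name__ == "__main__":
    store = InMemoryObjectStore()
    store.create_bucket("datasets")
    store.put("datasets", "raw/train.csv", b"id,label\n1,cat\n")  # create
    print(store.get("datasets", "raw/train.csv"))                 # read
    store.put("datasets", "raw/train.csv", b"id,label\n1,dog\n")  # update
    store.delete("datasets", "raw/train.csv")                     # delete
    print(store.list("datasets", prefix="raw/"))
```

A real client would keep the same four operations but replace the dictionary with calls to the storage service.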

H4: Data Labeling & Validation

Focus on data quality and annotation:

Labeling guidelines development
Argilla deployment and usage
Synthetic data generation
Data validation techniques
Dataset versioning with DVC
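
As a rough illustration of the data validation item, the sketch below checks labeled text records before they reach training. It is not tied to Argilla or any other tool from this module; the label set, the record fields `text` and `label`, and the function names are all hypothetical.

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}  # hypothetical label set


def validate_record(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text is missing or empty")
    label = record.get("label")
    if label not in ALLOWED_LABELS:
        problems.append(f"unknown label: {label!r}")
    return problems


def validate_dataset(records):
    """Map record index -> problems, including case-insensitive duplicates."""
    report = {}
    seen = {}  # normalized text -> index of first occurrence
    for i, rec in enumerate(records):
        problems = validate_record(rec)
        text = rec.get("text")
        if isinstance(text, str) and text.strip():
            key = text.strip().lower()
            if key in seen:
                problems.append(f"duplicate of record {seen[key]}")
            else:
                seen[key] = i
        if problems:
            report[i] = problems
    return report


if __name__ == "__main__":
    sample = [
        {"text": "Great product", "label": "positive"},
        {"text": "", "label": "negative"},
        {"text": "great product ", "label": "positive"},
        {"text": "Meh", "label": "mixed"},
    ]
    for index, problems in validate_dataset(sample).items():
        print(index, problems)
```

Returning a report instead of raising on the first error makes it easier to review every problem in a batch at once, which tends to suit annotation workflows.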

Prerequisites

Python 3.10+
Docker and Kubernetes basics
Understanding of pandas and NumPy
Familiarity with S3 storage concepts

Key Technologies

Storage: MinIO, S3, DVC
Formats: Parquet, Feather, HDF5, CSV
Streaming: MosaicML Streaming, WebDataset
Vector DB: LanceDB, sentence-transformers
Labeling: Argilla, Label Studio
Processing: Ray, multiprocessing, concurrent.futures
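
To make the processing entry concrete, here is a minimal sketch of batched parallel inference with `concurrent.futures`. The `predict` function is a hypothetical placeholder for a real model call, and the batch size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(batch):
    """Hypothetical model call: returns the length of each input text."""
    return [len(text) for text in batch]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parallel_predict(items, batch_size=2, max_workers=4):
    """Run `predict` over batches concurrently and flatten the results."""
    batches = list(chunked(items, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so outputs line up with inputs.
        results = pool.map(predict, batches)
        return [y for batch_out in results for y in batch_out]


if __name__ == "__main__":
    print(parallel_predict(["a", "bb", "ccc", "dddd", "eeeee"]))
```

Threads help mainly when the model call is I/O-bound or releases the GIL. For CPU-bound pure-Python work, `ProcessPoolExecutor` exposes the same `map` interface and is usually the better fit.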

Start with the storage section to set up your infrastructure, then progress through formats and streaming before tackling vector databases and labeling.

Next Steps

Begin with Data Storage to learn how to deploy MinIO and implement S3-compatible storage clients.
