Introduction

Proper data storage is critical for ML systems. This guide covers deploying MinIO (an S3-compatible object store), implementing Python clients for it, and versioning datasets with DVC.

MinIO Setup

MinIO provides S3-compatible object storage that can run locally, in Docker, or on Kubernetes.

Docker Deployment

The simplest way to get started:
docker run -it -p 9000:9000 -p 9001:9001 \
  quay.io/minio/minio server /data --console-address ":9001"
  • Port 9000: API endpoint
  • Port 9001: Web console UI
  • Default credentials: minioadmin / minioadmin
  • Data in /data is lost when the container is removed; mount a volume (-v) to persist it
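The container takes a moment to come up, so scripts should wait for readiness before talking to it. A minimal sketch (the `wait_for_minio` helper and its `probe` parameter are illustrative, not part of any SDK) that polls whatever health check you supply, e.g. an HTTP GET against MinIO's /minio/health/live endpoint:

```python
import time
from typing import Callable


def wait_for_minio(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Pass in any callable that returns True once the server answers; the helper returns False if the timeout elapses first.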

Kubernetes Deployment

1. Create Kind Cluster

kind create cluster --name ml-in-production

2. Deploy MinIO

kubectl create -f minio_storage/minio-standalone-dev.yaml

3. Access Services

Port-forward the API and console (each command blocks, so run them in separate terminals):
kubectl port-forward --address=0.0.0.0 pod/minio 9000:9000
kubectl port-forward --address=0.0.0.0 pod/minio 9001:9001

4. Monitor with k9s

k9s -A
If you encounter UI access issues, see this MinIO console issue.

S3 Access via AWS CLI

MinIO is fully S3-compatible, so you can use the AWS CLI:

Configuration

export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_ENDPOINT_URL=http://127.0.0.1:9000

Common Operations

# List buckets
aws s3 ls

# Create bucket
aws s3api create-bucket --bucket test

# Upload files
aws s3 cp --recursive . s3://test/

Python Client Implementation

There are two common approaches to implementing a MinIO client in Python.

Native MinIO Client

Using the official MinIO SDK:
minio_storage/minio_client.py
import os
from pathlib import Path
from minio import Minio

ACCESS_KEY = os.getenv("AWS_ACCESS_KEY_ID")
SECRET_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
ENDPOINT = "127.0.0.1:9000"  # host:port only; the native client takes no URL scheme

class MinioClientNative:
    def __init__(self, bucket_name: str) -> None:
        client = Minio(
            ENDPOINT, 
            access_key=ACCESS_KEY, 
            secret_key=SECRET_KEY, 
            secure=False
        )
        self.client = client
        self.bucket_name = bucket_name

    def upload_file(self, file_path: Path) -> None:
        self.client.fput_object(
            bucket_name=self.bucket_name,
            object_name=file_path.name,
            file_path=str(file_path),
        )

    def download_file(self, object_name: str, file_path: Path) -> None:
        self.client.fget_object(
            bucket_name=self.bucket_name,
            object_name=object_name,
            file_path=str(file_path),
        )
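upload_file above stores every object under its bare file name, so files from different directories can collide. If you need to mirror a directory tree into a bucket, one option is to derive keys from relative paths; this `object_key` helper is a hypothetical sketch, not part of the MinIO SDK:

```python
from pathlib import Path, PurePosixPath


def object_key(base_dir: Path, file_path: Path) -> str:
    """Map a local file to an S3 object key, preserving the directory layout."""
    # Object keys always use forward slashes, regardless of the local OS.
    return str(PurePosixPath(*file_path.relative_to(base_dir).parts))
```

Uploading with `object_key(data_dir, path)` instead of `path.name` keeps `raw/train.csv` and `clean/train.csv` as distinct objects.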

S3FS Client

Using the s3fs library for S3-compatible access:
minio_storage/minio_client.py
import s3fs
from pathlib import Path

class MinioClientS3:
    def __init__(self, bucket_name: str) -> None:
        fs = s3fs.S3FileSystem(
            key=ACCESS_KEY,
            secret=SECRET_KEY,
            use_ssl=False,
            client_kwargs={"endpoint_url": f"http://{ENDPOINT}"},
        )
        self.client = fs
        self.bucket_name = bucket_name

    def upload_file(self, file_path: Path) -> None:
        s3_file_path = f"s3://{self.bucket_name}/{file_path.name}"
        self.client.put(str(file_path), s3_file_path)

    def download_file(self, object_name: str, file_path: Path) -> None:
        s3_file_path = f"s3://{self.bucket_name}/{object_name}"
        self.client.download(s3_file_path, str(file_path))
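Both classes expose the same two methods, so calling code can treat them interchangeably. One way to make that contract explicit (this `StorageClient` protocol is an illustrative addition, not something either library provides):

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class StorageClient(Protocol):
    """Structural interface shared by MinioClientNative and MinioClientS3."""

    def upload_file(self, file_path: Path) -> None: ...

    def download_file(self, object_name: str, file_path: Path) -> None: ...
```

Either client then satisfies `isinstance(client, StorageClient)` and can be injected into code that depends only on the protocol, which also makes it easy to swap in a fake for tests.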

Testing

A test suite using pytest:
minio_storage/test_minio_client.py
import uuid
from pathlib import Path
import pytest
from minio_client import MinioClientNative, MinioClientS3

@pytest.fixture()
def bucket_name() -> str:
    return "test"

@pytest.fixture()
def minio_client_native(bucket_name: str) -> MinioClientNative:
    # Assumes the MinIO server from the previous sections is running locally.
    return MinioClientNative(bucket_name=bucket_name)

@pytest.fixture()
def file_to_save(tmp_path: Path) -> Path:
    _file_to_save = tmp_path / f"{uuid.uuid4()}.mock"
    _file_to_save.touch()
    return _file_to_save

class TestMinioClientNative:
    def test_upload_file(
        self, 
        minio_client_native: MinioClientNative, 
        file_to_save: Path, 
        tmp_path: Path
    ):
        # Upload file
        minio_client_native.upload_file(file_to_save)
        
        # Download and verify
        path_to_save = tmp_path / "saved_file.mock"
        minio_client_native.download_file(
            object_name=file_to_save.name, 
            file_path=path_to_save
        )
        assert path_to_save.exists()

Run Tests

pytest -s ./minio_storage/test_minio_client.py

Dataset Versioning with DVC

DVC (Data Version Control) tracks large files and datasets using Git-like semantics.

Initialize DVC

dvc init --subdir
git status
git commit -m "Initialize DVC"

Add Data Files

1. Create data

mkdir data
touch ./data/big-data.csv

2. Track with DVC

dvc add ./data/big-data.csv
git add data/.gitignore data/big-data.csv.dvc
git commit -m "Add raw data"
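dvc add moves the file into DVC's cache and leaves a small big-data.csv.dvc pointer in Git that records, among other fields, the file's MD5 checksum. A sketch of the checksum DVC records for a single file (`dvc_style_md5` is an illustrative name; DVC's real implementation also handles directories and other hash types):

```python
import hashlib
from pathlib import Path


def dvc_style_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's bytes, read in chunks so large files don't fill memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the pointer file stores only this hash and the path, Git history stays small while each data version remains addressable in the cache or remote.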

Configure MinIO as Remote Storage

1. Set credentials

export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_ENDPOINT_URL=http://127.0.0.1:9000

2. Create bucket

aws s3api create-bucket --bucket ml-data

3. Add remote

dvc remote add -d minio s3://ml-data
dvc remote modify minio endpointurl $AWS_ENDPOINT_URL

4. Commit configuration

git add .dvc/config
git commit -m "Configure remote storage"
git push

5. Push data

dvc push

Pull Data

Team members can fetch the data:
git pull
dvc pull

Best Practices

Security
  • Use strong credentials in production
  • Enable SSL/TLS for remote access
  • Implement IAM policies for bucket access
  • Rotate access keys regularly

Performance
  • Use multipart uploads for large files
  • Enable compression for text data
  • Implement connection pooling
  • Cache frequently accessed objects

Organization
  • Use consistent naming conventions
  • Organize by project/experiment/version
  • Tag objects with metadata
  • Implement lifecycle policies
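The naming and organization points above can be enforced in code rather than left to convention. A hypothetical helper (`dataset_key` is not from any library) that builds keys in project/experiment/version order and rejects components that would corrupt the hierarchy:

```python
def dataset_key(project: str, experiment: str, version: str, filename: str) -> str:
    """Build an object key like 'churn/exp-042/v1/train.parquet'."""
    parts = (project, experiment, version, filename)
    # Empty or slash-containing components would silently change the key hierarchy.
    if any(not p or "/" in p for p in parts):
        raise ValueError("key components must be non-empty and must not contain '/'")
    return "/".join(parts)
```

Routing every upload through one such function keeps bucket listings predictable and makes prefix-based lifecycle policies straightforward to write.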
