NemoDatastore
NemoDatastore provides a Git-based storage solution for managing AI training datasets, model files, and experiment artifacts. Built on Gitea, it offers version control, collaboration features, and integration with object storage.

Overview

NemoDatastore enables you to:
  • Version control datasets with Git LFS
  • Store and track model files and checkpoints
  • Collaborate on dataset curation
  • Integrate with S3/MinIO for large file storage
  • Maintain reproducible experiment artifacts
  • Access data via Git or HTTP API

When to Use NemoDatastore

  • Dataset Versioning: track changes to training datasets and maintain version history.
  • Team Collaboration: enable multiple team members to contribute to dataset curation.
  • Artifact Storage: store model checkpoints, configs, and experiment outputs.
  • Reproducibility: ensure experiments can be reproduced with exact dataset versions.

Architecture

NemoDatastore uses Gitea with Git LFS and optional object storage. Git repositories and metadata live on a persistent volume, while files tracked with Git LFS can be offloaded to S3-compatible object storage (S3 or MinIO).

Configuration

Complete Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NemoDatastore
metadata:
  name: nemodatastore-sample
  namespace: nemo
spec:
  # Container image configuration
  image:
    repository: nvcr.io/nvidia/nemo-microservices/datastore
    tag: "25.08"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  
  # Replica configuration
  replicas: 1
  
  # Required secrets (must be created beforehand)
  secrets:
    # Gitea configuration scripts
    datastoreConfigSecret: "nemo-ms-nemo-datastore"
    datastoreInitSecret: "nemo-ms-nemo-datastore-init"
    datastoreInlineConfigSecret: "nemo-ms-nemo-datastore-inline-config"
    # Gitea admin credentials
    giteaAdminSecret: "gitea-admin-credentials"
    # Git LFS JWT secret
    lfsJwtSecret: "nemo-ms-nemo-datastore--lfs-jwt"
  
  # PostgreSQL database configuration
  databaseConfig:
    credentials:
      user: ndsuser
      secretName: datastore-pg-existing-secret
      passwordKey: password
    host: datastore-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: ndsdb
  
  # Optional: Object storage for Git LFS
  objectStoreConfig:
    # S3/MinIO credentials
    credentials:
      user: datastore-user
      secretName: minio-credentials
      passwordKey: password
    # S3 endpoint
    endpoint: minio.nemo.svc.cluster.local:9000
    bucketName: nemo-datastore-lfs
    region: us-east-1
    ssl: false
    serveDirect: true
  
  # Persistent storage
  pvc:
    name: "pvc-shared-data"
    create: true
    storageClass: ""
    volumeAccessMode: ReadWriteOnce
    size: "10Gi"
    # subPath: "datastore"  # Optional subdirectory
  
  # Resource limits
  resources:
    requests:
      memory: "256Mi"
      cpu: "500m"
    limits:
      memory: "512Mi"
      cpu: "1"
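
Once the manifest is saved (as `nemodatastore.yaml` here, an assumed filename), apply it and wait for the operator to bring the service up. The commands below assume the operator names its workloads after the custom resource, as in this sample:

```shell
# Create the NemoDatastore resource
kubectl apply -f nemodatastore.yaml

# Wait for the Deployment created by the operator (name assumed to match the CR)
kubectl wait --for=condition=Available deployment/nemodatastore-sample -n nemo --timeout=300s

# Confirm the custom resource and its service exist
kubectl get nemodatastore nemodatastore-sample -n nemo
kubectl get svc nemodatastore-sample -n nemo
```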

Key Configuration Fields

spec.secrets (object, required)

Pre-created Kubernetes secrets for Gitea configuration:
  • giteaAdminSecret (string, required): secret containing GITEA_ADMIN_USERNAME and GITEA_ADMIN_PASSWORD keys.
  • lfsJwtSecret (string, required): secret containing a jwtSecret key for Git LFS authentication.
  • datastoreConfigSecret (string, required): secret containing the config_environment.sh script.
  • datastoreInitSecret (string, required): secret containing initialization scripts.
  • datastoreInlineConfigSecret (string, required): secret containing inline configuration values.

spec.objectStoreConfig (object, optional)

S3/MinIO configuration for Git LFS storage (recommended for large files):
  • serveDirect (boolean, default: true): allow clients to upload/download directly to S3, bypassing Gitea.

spec.pvc (object)

Persistent volume for Git repositories and metadata.

Prerequisites

Create Required Secrets

1. Gitea Admin Credentials

kubectl create secret generic gitea-admin-credentials \
  --from-literal=GITEA_ADMIN_USERNAME=admin \
  --from-literal=GITEA_ADMIN_PASSWORD=$(openssl rand -base64 32) \
  -n nemo

2. LFS JWT Secret

kubectl create secret generic nemo-ms-nemo-datastore--lfs-jwt \
  --from-literal=jwtSecret=$(openssl rand -base64 32) \
  -n nemo

3. Configuration Secrets

Create secrets from Helm chart templates or documentation:
# These are typically provided by the NIM Operator installation
kubectl create secret generic nemo-ms-nemo-datastore --from-file=config_environment.sh -n nemo
kubectl create secret generic nemo-ms-nemo-datastore-init --from-file=init/ -n nemo
kubectl create secret generic nemo-ms-nemo-datastore-inline-config --from-file=inline-config/ -n nemo
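
Before applying the NemoDatastore manifest, it is worth confirming that every secret referenced in `spec.secrets` actually exists. A quick sketch, using the secret names from the sample manifest:

```shell
# Report any missing secrets referenced by spec.secrets
for s in gitea-admin-credentials \
         nemo-ms-nemo-datastore--lfs-jwt \
         nemo-ms-nemo-datastore \
         nemo-ms-nemo-datastore-init \
         nemo-ms-nemo-datastore-inline-config; do
  kubectl get secret "$s" -n nemo >/dev/null 2>&1 && echo "OK: $s" || echo "MISSING: $s"
done
```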

Object Storage Integration

Using MinIO

Deploy MinIO for S3-compatible storage:
# Generate and save the root password so you can log in later
MINIO_ROOT_PASSWORD=$(openssl rand -base64 32)

helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --set rootUser=admin \
  --set rootPassword=$MINIO_ROOT_PASSWORD \
  --set persistence.size=100Gi \
  -n nemo
Create bucket and user:
# Access MinIO pod
kubectl exec -it minio-0 -n nemo -- sh

# Create bucket
mc alias set local http://localhost:9000 admin <password>
mc mb local/nemo-datastore-lfs

# Create service account
mc admin user add local datastore-user <password>
mc admin policy attach local readwrite --user datastore-user
Create credentials secret:
kubectl create secret generic minio-credentials \
  --from-literal=password=<datastore-user-password> \
  -n nemo
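
To confirm the new user can actually reach the bucket, run a quick listing with its credentials while still inside the MinIO pod (the `datastore` alias name is arbitrary):

```shell
# Authenticate as the service user and list the LFS bucket
mc alias set datastore http://localhost:9000 datastore-user <password>
mc ls datastore/nemo-datastore-lfs
```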

Using AWS S3

objectStoreConfig:
  credentials:
    user: AKIAIOSFODNN7EXAMPLE
    secretName: aws-s3-credentials
    passwordKey: secret-key
  endpoint: s3.amazonaws.com
  bucketName: my-nemo-datastore
  region: us-west-2
  ssl: true
  serveDirect: true
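
The `secretName` above points at a secret that holds the AWS secret access key under the configured `passwordKey`. A hedged example of creating it, with key names matching the snippet above:

```shell
kubectl create secret generic aws-s3-credentials \
  --from-literal=secret-key=<your-aws-secret-access-key> \
  -n nemo
```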

Storage Options

PVC-Only Storage

For small deployments without object storage:
spec:
  pvc:
    create: true
    size: "50Gi"  # Must accommodate all Git repos and LFS files
  # objectStoreConfig: null  # Omit object storage
PVC-only storage may have performance limitations for large files and high concurrency.

Hybrid Storage

Recommended for production:
spec:
  pvc:
    create: true
    size: "10Gi"  # Just for metadata and small files
  objectStoreConfig:
    # LFS files stored in S3/MinIO
    endpoint: minio.nemo.svc.cluster.local:9000
    bucketName: nemo-datastore-lfs

Usage

Access Gitea UI

Port-forward to access the web interface:
kubectl port-forward svc/nemodatastore-sample 8000:8000 -n nemo
Open http://localhost:8000 and log in with admin credentials.

Git Clone via HTTP

# Get admin password
PASSWORD=$(kubectl get secret gitea-admin-credentials -n nemo -o jsonpath='{.data.GITEA_ADMIN_PASSWORD}' | base64 -d)

# Clone repository
git clone http://admin:$PASSWORD@nemodatastore-sample.nemo.svc.cluster.local:8000/datasets/my-dataset.git
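
The in-cluster DNS name above resolves only from inside the cluster. From a workstation, you can clone through the port-forward set up earlier instead:

```shell
# Requires: kubectl port-forward svc/nemodatastore-sample 8000:8000 -n nemo
git clone http://admin:$PASSWORD@localhost:8000/datasets/my-dataset.git
```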

Create Repository

Via API:
curl -X POST http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/user/repos \
  -H "Content-Type: application/json" \
  -u admin:$PASSWORD \
  -d '{
    "name": "training-datasets",
    "description": "Fine-tuning datasets",
    "private": true
  }'
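
To confirm the repository was created, list the repositories owned by the authenticated user (a standard Gitea endpoint; `jq` is assumed to be installed):

```shell
curl -s -u admin:$PASSWORD \
  http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/user/repos | \
  jq '.[].full_name'
```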

Upload Dataset with Git LFS

git lfs install
git clone http://admin:$PASSWORD@nemodatastore-sample.nemo.svc.cluster.local:8000/datasets/training-datasets.git
cd training-datasets

# Track large files with LFS
git lfs track "*.jsonl"
git lfs track "*.parquet"

# Add and commit
cp /path/to/dataset.jsonl .
git add .gitattributes dataset.jsonl
git commit -m "Add training dataset v1.0"
git push origin main
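
After pushing, you can verify that the large files went through LFS rather than into regular Git objects:

```shell
# Files tracked by LFS in the current checkout
git lfs ls-files

# The blob stored in Git should be a small pointer stub, not the data itself
git show HEAD:dataset.jsonl | head -3
```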

Integration with NemoCustomizer

NemoCustomizer fetches datasets from NemoDatastore:
# NemoCustomizer config
spec:
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
Reference datasets in customization requests:
curl -X POST http://nemocustomizer.nemo.svc.cluster.local:8000/v1/customizations \
  -d '{
    "dataset_repo": "datasets/training-datasets",
    "dataset_path": "dataset.jsonl",
    "dataset_ref": "main"
  }'
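
A dataset file can also be fetched directly over HTTP via Gitea's raw-content API, which is useful for quick manual checks. The owner (`admin`), repository, and file names below follow the earlier examples; the endpoint shape is Gitea's and may vary by version:

```shell
curl -s -u admin:$PASSWORD \
  "http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/repos/admin/training-datasets/raw/dataset.jsonl?ref=main" \
  -o dataset.jsonl
```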

Best Practices

Storage:
  • Use object storage (S3/MinIO) for large files
  • Enable serveDirect for better performance
  • Size the PVC appropriately for metadata
  • Use a fast storage class for the PVC

Security:
  • Create separate users for different services
  • Use private repositories for sensitive data
  • Rotate admin credentials regularly
  • Use organization/team features for collaboration

Dataset management:
  • Use semantic versioning for dataset tags
  • Document dataset contents in README files
  • Store preprocessing scripts alongside data
  • Include data quality reports

Repository hygiene:
  • Use Git LFS for files larger than 1 MB
  • Avoid committing generated artifacts
  • Clean up old branches and tags
  • Monitor storage usage
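
A starter `.gitattributes` reflecting the LFS guidance above; the patterns are illustrative, so adjust them to your own data and checkpoint formats:

```
# Track common large dataset and model formats with Git LFS
*.jsonl       filter=lfs diff=lfs merge=lfs -text
*.parquet     filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.ckpt        filter=lfs diff=lfs merge=lfs -text
```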

Backup and Recovery

Backup Strategy

# Backup PostgreSQL database (no TTY, so the dump is not corrupted)
kubectl exec -i datastore-pg-postgresql-0 -n nemo -- \
  pg_dump -U ndsuser ndsdb > datastore-backup.sql

# Backup PVC data
kubectl exec deployment/nemodatastore-sample -n nemo -- \
  tar czf - /data > datastore-pvc-backup.tar.gz

# Backup object storage (if using MinIO)
mc mirror local/nemo-datastore-lfs /backup/lfs/

Restore

# Restore database
kubectl exec -i datastore-pg-postgresql-0 -n nemo -- \
  psql -U ndsuser ndsdb < datastore-backup.sql

# Restore PVC
kubectl exec -i deployment/nemodatastore-sample -n nemo -- \
  tar xzf - -C / < datastore-pvc-backup.tar.gz

Troubleshooting

Pod fails to start. Check that:
  • All required secrets exist
  • Secret keys match the expected names
  • The database is accessible
  • The PVC mounted successfully

Git LFS uploads fail. Verify that:
  • Object storage credentials are correct
  • The bucket exists and is accessible
  • There is network connectivity to S3/MinIO
  • The LFS JWT secret is valid

Slow performance. Try the following:
  • Enable object storage direct serving (serveDirect)
  • Use a faster storage class for the PVC
  • Increase resource limits
  • Check database performance
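
Whatever the symptom, the same first diagnostics usually apply. The label selector below is an assumption; adjust it to whatever labels your operator sets:

```shell
# Pod status and recent events
kubectl describe pod -l app.kubernetes.io/instance=nemodatastore-sample -n nemo
kubectl get events -n nemo --sort-by=.lastTimestamp | tail -20

# Container logs
kubectl logs deployment/nemodatastore-sample -n nemo --tail=100
```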

Next Steps

  • Upload Datasets: learn dataset management
  • Configure Customizer: integrate with NemoCustomizer
  • Git LFS Guide: learn Git LFS
  • API Reference: Gitea API documentation
