NemoDatastore
NemoDatastore provides a Git-based storage solution for managing AI training datasets, model files, and experiment artifacts. Built on Gitea, it offers version control, collaboration features, and integration with object storage.

Overview

NemoDatastore enables you to:
  • Version control datasets with Git LFS
  • Store and track model files and checkpoints
  • Collaborate on dataset curation
  • Integrate with S3/MinIO for large file storage
  • Maintain reproducible experiment artifacts
  • Access data via Git or HTTP API

When to Use NemoDatastore

  • Dataset Versioning: track changes to training datasets and maintain version history.
  • Team Collaboration: enable multiple team members to contribute to dataset curation.
  • Artifact Storage: store model checkpoints, configs, and experiment outputs.
  • Reproducibility: ensure experiments can be reproduced with exact dataset versions.

Architecture

NemoDatastore uses Gitea with Git LFS and optional object storage. Git repositories and metadata live on a persistent volume, while files tracked with Git LFS can be offloaded to S3-compatible object storage (S3 or MinIO).

Configuration

Complete Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NemoDatastore
metadata:
  name: nemodatastore-sample
  namespace: nemo
spec:
  # Container image configuration
  image:
    repository: nvcr.io/nvidia/nemo-microservices/datastore
    tag: "25.08"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  
  # Replica configuration
  replicas: 1
  
  # Required secrets (must be created beforehand)
  secrets:
    # Gitea configuration scripts
    datastoreConfigSecret: "nemo-ms-nemo-datastore"
    datastoreInitSecret: "nemo-ms-nemo-datastore-init"
    datastoreInlineConfigSecret: "nemo-ms-nemo-datastore-inline-config"
    # Gitea admin credentials
    giteaAdminSecret: "gitea-admin-credentials"
    # Git LFS JWT secret
    lfsJwtSecret: "nemo-ms-nemo-datastore--lfs-jwt"
  
  # PostgreSQL database configuration
  databaseConfig:
    credentials:
      user: ndsuser
      secretName: datastore-pg-existing-secret
      passwordKey: password
    host: datastore-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: ndsdb
  
  # Optional: Object storage for Git LFS
  objectStoreConfig:
    # S3/MinIO credentials
    credentials:
      user: datastore-user
      secretName: minio-credentials
      passwordKey: password
    # S3 endpoint
    endpoint: minio.nemo.svc.cluster.local:9000
    bucketName: nemo-datastore-lfs
    region: us-east-1
    ssl: false
    serveDirect: true
  
  # Persistent storage
  pvc:
    name: "pvc-shared-data"
    create: true
    storageClass: ""
    volumeAccessMode: ReadWriteOnce
    size: "10Gi"
    # subPath: "datastore"  # Optional subdirectory
  
  # Resource limits
  resources:
    requests:
      memory: "256Mi"
      cpu: "500m"
    limits:
      memory: "512Mi"
      cpu: "1"
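
Once the manifest is saved (as `nemodatastore.yaml` here, an assumed filename), apply it and wait for the operator to bring the service up. The commands below assume the operator names its workloads after the custom resource, as in this sample:

```shell
# Create the NemoDatastore resource
kubectl apply -f nemodatastore.yaml

# Wait for the Deployment created by the operator (name assumed to match the CR)
kubectl wait --for=condition=Available deployment/nemodatastore-sample -n nemo --timeout=300s

# Confirm the custom resource and its service exist
kubectl get nemodatastore nemodatastore-sample -n nemo
kubectl get svc nemodatastore-sample -n nemo
```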

Key Configuration Fields

spec.secrets (object, required)

Pre-created Kubernetes secrets for Gitea configuration:
  • giteaAdminSecret (string, required): secret containing GITEA_ADMIN_USERNAME and GITEA_ADMIN_PASSWORD keys.
  • lfsJwtSecret (string, required): secret containing a jwtSecret key for Git LFS authentication.
  • datastoreConfigSecret (string, required): secret containing the config_environment.sh script.
  • datastoreInitSecret (string, required): secret containing initialization scripts.
  • datastoreInlineConfigSecret (string, required): secret containing inline configuration values.

spec.objectStoreConfig (object, optional)

S3/MinIO configuration for Git LFS storage (recommended for large files):
  • serveDirect (boolean, default: true): allow clients to upload/download directly to S3, bypassing Gitea.

spec.pvc (object)

Persistent volume for Git repositories and metadata.

Prerequisites

Create Required Secrets

1. Gitea Admin Credentials

kubectl create secret generic gitea-admin-credentials \
  --from-literal=GITEA_ADMIN_USERNAME=admin \
  --from-literal=GITEA_ADMIN_PASSWORD=$(openssl rand -base64 32) \
  -n nemo

2. LFS JWT Secret

kubectl create secret generic nemo-ms-nemo-datastore--lfs-jwt \
  --from-literal=jwtSecret=$(openssl rand -base64 32) \
  -n nemo

3. Configuration Secrets

Create secrets from Helm chart templates or documentation:
# These are typically provided by the NIM Operator installation
kubectl create secret generic nemo-ms-nemo-datastore --from-file=config_environment.sh -n nemo
kubectl create secret generic nemo-ms-nemo-datastore-init --from-file=init/ -n nemo
kubectl create secret generic nemo-ms-nemo-datastore-inline-config --from-file=inline-config/ -n nemo
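
Before applying the NemoDatastore manifest, it is worth confirming that every secret referenced in `spec.secrets` actually exists. A quick sketch, using the secret names from the sample manifest:

```shell
# Report any missing secrets referenced by spec.secrets
for s in gitea-admin-credentials \
         nemo-ms-nemo-datastore--lfs-jwt \
         nemo-ms-nemo-datastore \
         nemo-ms-nemo-datastore-init \
         nemo-ms-nemo-datastore-inline-config; do
  kubectl get secret "$s" -n nemo >/dev/null 2>&1 && echo "OK: $s" || echo "MISSING: $s"
done
```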

Object Storage Integration

Using MinIO

Deploy MinIO for S3-compatible storage:
# Generate and save the root password so you can log in later
MINIO_ROOT_PASSWORD=$(openssl rand -base64 32)

helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --set rootUser=admin \
  --set rootPassword=$MINIO_ROOT_PASSWORD \
  --set persistence.size=100Gi \
  -n nemo
Create bucket and user:
# Access MinIO pod
kubectl exec -it minio-0 -n nemo -- sh

# Create bucket
mc alias set local http://localhost:9000 admin <password>
mc mb local/nemo-datastore-lfs

# Create service account
mc admin user add local datastore-user <password>
mc admin policy attach local readwrite --user datastore-user
Create credentials secret:
kubectl create secret generic minio-credentials \
  --from-literal=password=<datastore-user-password> \
  -n nemo
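
To confirm the new user can actually reach the bucket, run a quick listing with its credentials while still inside the MinIO pod (the `datastore` alias name is arbitrary):

```shell
# Authenticate as the service user and list the LFS bucket
mc alias set datastore http://localhost:9000 datastore-user <password>
mc ls datastore/nemo-datastore-lfs
```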

Using AWS S3

objectStoreConfig:
  credentials:
    user: AKIAIOSFODNN7EXAMPLE
    secretName: aws-s3-credentials
    passwordKey: secret-key
  endpoint: s3.amazonaws.com
  bucketName: my-nemo-datastore
  region: us-west-2
  ssl: true
  serveDirect: true
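
The `secretName` above points at a secret that holds the AWS secret access key under the configured `passwordKey`. A hedged example of creating it, with key names matching the snippet above:

```shell
kubectl create secret generic aws-s3-credentials \
  --from-literal=secret-key=<your-aws-secret-access-key> \
  -n nemo
```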

Storage Options

PVC-Only Storage

For small deployments without object storage:
spec:
  pvc:
    create: true
    size: "50Gi"  # Must accommodate all Git repos and LFS files
  # objectStoreConfig: null  # Omit object storage
PVC-only storage may have performance limitations for large files and high concurrency.

Hybrid Storage

Recommended for production:
spec:
  pvc:
    create: true
    size: "10Gi"  # Just for metadata and small files
  objectStoreConfig:
    # LFS files stored in S3/MinIO
    endpoint: minio.nemo.svc.cluster.local:9000
    bucketName: nemo-datastore-lfs

Usage

Access Gitea UI

Port-forward to access the web interface:
kubectl port-forward svc/nemodatastore-sample 8000:8000 -n nemo
Open http://localhost:8000 and log in with admin credentials.

Git Clone via HTTP

# Get admin password
PASSWORD=$(kubectl get secret gitea-admin-credentials -n nemo -o jsonpath='{.data.GITEA_ADMIN_PASSWORD}' | base64 -d)

# Clone repository
git clone http://admin:$PASSWORD@nemodatastore-sample.nemo.svc.cluster.local:8000/datasets/my-dataset.git
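
The in-cluster DNS name above resolves only from inside the cluster. From a workstation, you can clone through the port-forward set up earlier instead:

```shell
# Requires: kubectl port-forward svc/nemodatastore-sample 8000:8000 -n nemo
git clone http://admin:$PASSWORD@localhost:8000/datasets/my-dataset.git
```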

Create Repository

Via API:
curl -X POST http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/user/repos \
  -H "Content-Type: application/json" \
  -u admin:$PASSWORD \
  -d '{
    "name": "training-datasets",
    "description": "Fine-tuning datasets",
    "private": true
  }'
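
To confirm the repository was created, list the repositories owned by the authenticated user (a standard Gitea endpoint; `jq` is assumed to be installed):

```shell
curl -s -u admin:$PASSWORD \
  http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/user/repos | \
  jq '.[].full_name'
```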

Upload Dataset with Git LFS

git lfs install
git clone http://admin:$PASSWORD@nemodatastore-sample.nemo.svc.cluster.local:8000/datasets/training-datasets.git
cd training-datasets

# Track large files with LFS
git lfs track "*.jsonl"
git lfs track "*.parquet"

# Add and commit
cp /path/to/dataset.jsonl .
git add .gitattributes dataset.jsonl
git commit -m "Add training dataset v1.0"
git push origin main
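
After pushing, you can verify that the large files went through LFS rather than into regular Git objects:

```shell
# Files tracked by LFS in the current checkout
git lfs ls-files

# The blob stored in Git should be a small pointer stub, not the data itself
git show HEAD:dataset.jsonl | head -3
```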

Integration with NemoCustomizer

NemoCustomizer fetches datasets from NemoDatastore:
# NemoCustomizer config
spec:
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
Reference datasets in customization requests:
curl -X POST http://nemocustomizer.nemo.svc.cluster.local:8000/v1/customizations \
  -d '{
    "dataset_repo": "datasets/training-datasets",
    "dataset_path": "dataset.jsonl",
    "dataset_ref": "main"
  }'
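
A dataset file can also be fetched directly over HTTP via Gitea's raw-content API, which is useful for quick manual checks. The owner (`admin`), repository, and file names below follow the earlier examples; the endpoint shape is Gitea's and may vary by version:

```shell
curl -s -u admin:$PASSWORD \
  "http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/repos/admin/training-datasets/raw/dataset.jsonl?ref=main" \
  -o dataset.jsonl
```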

Best Practices

Storage:
  • Use object storage (S3/MinIO) for large files
  • Enable serveDirect for better performance
  • Size the PVC appropriately for metadata
  • Use a fast storage class for the PVC

Security:
  • Create separate users for different services
  • Use private repositories for sensitive data
  • Rotate admin credentials regularly
  • Use organization/team features for collaboration

Dataset management:
  • Use semantic versioning for dataset tags
  • Document dataset contents in README files
  • Store preprocessing scripts alongside data
  • Include data quality reports

Repository hygiene:
  • Use Git LFS for files larger than 1 MB
  • Avoid committing generated artifacts
  • Clean up old branches and tags
  • Monitor storage usage
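
A starter `.gitattributes` reflecting the LFS guidance above; the patterns are illustrative, so adjust them to your own data and checkpoint formats:

```
# Track common large dataset and model formats with Git LFS
*.jsonl       filter=lfs diff=lfs merge=lfs -text
*.parquet     filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.ckpt        filter=lfs diff=lfs merge=lfs -text
```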

Backup and Recovery

Backup Strategy

# Backup PostgreSQL database (no TTY, so the dump is not corrupted)
kubectl exec -i datastore-pg-postgresql-0 -n nemo -- \
  pg_dump -U ndsuser ndsdb > datastore-backup.sql

# Backup PVC data
kubectl exec deployment/nemodatastore-sample -n nemo -- \
  tar czf - /data > datastore-pvc-backup.tar.gz

# Backup object storage (if using MinIO)
mc mirror local/nemo-datastore-lfs /backup/lfs/

Restore

# Restore database
kubectl exec -i datastore-pg-postgresql-0 -n nemo -- \
  psql -U ndsuser ndsdb < datastore-backup.sql

# Restore PVC
kubectl exec -i deployment/nemodatastore-sample -n nemo -- \
  tar xzf - -C / < datastore-pvc-backup.tar.gz

Troubleshooting

Pod fails to start. Check that:
  • All required secrets exist
  • Secret keys match the expected names
  • The database is accessible
  • The PVC mounted successfully

Git LFS uploads fail. Verify that:
  • Object storage credentials are correct
  • The bucket exists and is accessible
  • There is network connectivity to S3/MinIO
  • The LFS JWT secret is valid

Slow performance. Try the following:
  • Enable object storage direct serving (serveDirect)
  • Use a faster storage class for the PVC
  • Increase resource limits
  • Check database performance
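
Whatever the symptom, the same first diagnostics usually apply. The label selector below is an assumption; adjust it to whatever labels your operator sets:

```shell
# Pod status and recent events
kubectl describe pod -l app.kubernetes.io/instance=nemodatastore-sample -n nemo
kubectl get events -n nemo --sort-by=.lastTimestamp | tail -20

# Container logs
kubectl logs deployment/nemodatastore-sample -n nemo --tail=100
```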

Next Steps

  • Upload Datasets: learn dataset management
  • Configure Customizer: integrate with NemoCustomizer
  • Git LFS Guide: learn Git LFS
  • API Reference: Gitea API documentation
