NemoDatastore provides a Git-based storage solution for managing AI training datasets, model files, and experiment artifacts. Built on Gitea, it offers version control, collaboration features, and integration with object storage.
Overview
NemoDatastore enables you to:
Version control datasets with Git LFS
Store and track model files and checkpoints
Collaborate on dataset curation
Integrate with S3/MinIO for large file storage
Maintain reproducible experiment artifacts
Access data via Git or HTTP API
When to Use NemoDatastore
Dataset Versioning: Track changes to training datasets and maintain version history.
Team Collaboration: Enable multiple team members to contribute to dataset curation.
Artifact Storage: Store model checkpoints, configs, and experiment outputs.
Reproducibility: Ensure experiments can be reproduced with exact dataset versions.
Architecture
NemoDatastore runs Gitea with Git LFS enabled, backed by a PostgreSQL database and, optionally, S3-compatible object storage for LFS content.
Configuration
Complete Example
apiVersion: apps.nvidia.com/v1alpha1
kind: NemoDatastore
metadata:
  name: nemodatastore-sample
  namespace: nemo
spec:
  # Container image configuration
  image:
    repository: nvcr.io/nvidia/nemo-microservices/datastore
    tag: "25.08"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  # Replica configuration
  replicas: 1
  # Required secrets (must be created beforehand)
  secrets:
    # Gitea configuration scripts
    datastoreConfigSecret: "nemo-ms-nemo-datastore"
    datastoreInitSecret: "nemo-ms-nemo-datastore-init"
    datastoreInlineConfigSecret: "nemo-ms-nemo-datastore-inline-config"
    # Gitea admin credentials
    giteaAdminSecret: "gitea-admin-credentials"
    # Git LFS JWT secret
    lfsJwtSecret: "nemo-ms-nemo-datastore--lfs-jwt"
  # PostgreSQL database configuration
  databaseConfig:
    credentials:
      user: ndsuser
      secretName: datastore-pg-existing-secret
      passwordKey: password
    host: datastore-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: ndsdb
  # Optional: Object storage for Git LFS
  objectStoreConfig:
    # S3/MinIO credentials
    credentials:
      user: datastore-user
      secretName: minio-credentials
      passwordKey: password
    # S3 endpoint
    endpoint: minio.nemo.svc.cluster.local:9000
    bucketName: nemo-datastore-lfs
    region: us-east-1
    ssl: false
    serveDirect: true
  # Persistent storage
  pvc:
    name: "pvc-shared-data"
    create: true
    storageClass: ""
    volumeAccessMode: ReadWriteOnce
    size: "10Gi"
    # subPath: "datastore" # Optional subdirectory
  # Resource limits
  resources:
    requests:
      memory: "256Mi"
      cpu: "500m"
    limits:
      memory: "512Mi"
      cpu: "1"
Key Configuration Fields
secrets: Pre-created Kubernetes secrets for Gitea configuration.
giteaAdminSecret: Secret containing GITEA_ADMIN_USERNAME and GITEA_ADMIN_PASSWORD keys.
lfsJwtSecret: Secret containing a jwtSecret key for Git LFS authentication.
datastoreConfigSecret: Secret containing the config_environment.sh script.
datastoreInitSecret: Secret containing initialization scripts.
datastoreInlineConfigSecret: Secret containing inline configuration values.
objectStoreConfig: Optional S3/MinIO configuration for Git LFS storage (recommended for large files).
serveDirect: Allow clients to upload and download LFS objects directly against S3, bypassing Gitea.
pvc: Persistent volume for Git repositories and metadata.
Prerequisites
Create Required Secrets
Gitea Admin Credentials
kubectl create secret generic gitea-admin-credentials \
  --from-literal=GITEA_ADMIN_USERNAME=admin \
  --from-literal=GITEA_ADMIN_PASSWORD=$(openssl rand -base64 32) \
  -n nemo
LFS JWT Secret
kubectl create secret generic nemo-ms-nemo-datastore--lfs-jwt \
  --from-literal=jwtSecret=$(openssl rand -base64 32) \
  -n nemo
Configuration Secrets
Create secrets from the Helm chart templates or documentation:
# These secrets are typically created by the NIM Operator installation
kubectl create secret generic nemo-ms-nemo-datastore --from-file=config_environment.sh -n nemo
kubectl create secret generic nemo-ms-nemo-datastore-init --from-file=init/ -n nemo
kubectl create secret generic nemo-ms-nemo-datastore-inline-config --from-file=inline-config/ -n nemo
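Before creating the NemoDatastore resource, it can help to confirm that every secret referenced in the spec exists. A short sketch, assuming the secret names from the sample spec above (adjust if yours differ):

```shell
# Check each secret referenced in spec.secrets (names from the sample spec)
for s in gitea-admin-credentials \
         nemo-ms-nemo-datastore--lfs-jwt \
         nemo-ms-nemo-datastore \
         nemo-ms-nemo-datastore-init \
         nemo-ms-nemo-datastore-inline-config; do
  kubectl get secret "$s" -n nemo >/dev/null 2>&1 || echo "MISSING: $s"
done
```

If the loop prints nothing, all referenced secrets are in place.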
Object Storage Integration
Using MinIO
Deploy MinIO for S3-compatible storage:
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
--set rootUser=admin \
--set rootPassword=$(openssl rand -base64 32) \
--set persistence.size=100Gi \
-n nemo
Create bucket and user:
# Access MinIO pod
kubectl exec -it minio-0 -n nemo -- sh
# Create bucket
mc alias set local http://localhost:9000 admin <password>
mc mb local/nemo-datastore-lfs
# Create service account
mc admin user add local datastore-user <password>
mc admin policy attach local readwrite --user datastore-user
Create credentials secret:
kubectl create secret generic minio-credentials \
  --from-literal=password=<datastore-user-password> \
  -n nemo
Using AWS S3
objectStoreConfig:
  credentials:
    user: AKIAIOSFODNN7EXAMPLE
    secretName: aws-s3-credentials
    passwordKey: secret-key
  endpoint: s3.amazonaws.com
  bucketName: my-nemo-datastore
  region: us-west-2
  ssl: true
  serveDirect: true
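The aws-s3-credentials secret referenced above must exist before deployment. A minimal sketch, with the key name matching passwordKey and a placeholder value:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-s3-credentials
  namespace: nemo
type: Opaque
stringData:
  secret-key: "<aws-secret-access-key>"  # placeholder; supply the real key
```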
Storage Options
PVC-Only Storage
For small deployments without object storage:
spec:
  pvc:
    create: true
    size: "50Gi"  # Must accommodate all Git repos and LFS files
  # objectStoreConfig: null  # Omit object storage
PVC-only storage may have performance limitations for large files and high concurrency.
Hybrid Storage
Recommended for production:
spec:
  pvc:
    create: true
    size: "10Gi"  # Just for metadata and small files
  objectStoreConfig:
    # LFS files stored in S3/MinIO
    endpoint: minio.nemo.svc.cluster.local:9000
    bucketName: nemo-datastore-lfs
Usage
Access Gitea UI
Port-forward to access the web interface:
kubectl port-forward svc/nemodatastore-sample 8000:8000 -n nemo
Open http://localhost:8000 and log in with admin credentials.
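With the port-forward running, a quick liveness check is possible; this assumes the service exposes Gitea's standard health endpoint:

```shell
# Health probe against the forwarded port (Gitea's endpoint, if exposed)
curl -s http://localhost:8000/api/healthz
```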
Git Clone via HTTP
# Get admin password
PASSWORD=$(kubectl get secret gitea-admin-credentials -n nemo -o jsonpath='{.data.GITEA_ADMIN_PASSWORD}' | base64 -d)
# Clone repository
git clone http://admin:$PASSWORD@nemodatastore-sample.nemo.svc.cluster.local:8000/datasets/my-dataset.git
Create Repository
Via API:
curl -X POST http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/user/repos \
-H "Content-Type: application/json" \
-u "admin:$PASSWORD" \
-d '{
"name": "training-datasets",
"description": "Fine-tuning datasets",
"private": true
}'
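To confirm the repository was created, Gitea's repository search endpoint can be queried with the same credentials (URL follows the example above):

```shell
# List repositories matching the new name via Gitea's search API
curl -s -u "admin:$PASSWORD" \
  "http://nemodatastore-sample.nemo.svc.cluster.local:8000/api/v1/repos/search?q=training-datasets"
```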
Upload Dataset with Git LFS
git lfs install
git clone http://admin:$PASSWORD@nemodatastore-sample.nemo.svc.cluster.local:8000/datasets/training-datasets.git
cd training-datasets
# Track large files with LFS
git lfs track "*.jsonl"
git lfs track "*.parquet"
# Add and commit
cp /path/to/dataset.jsonl .
git add .gitattributes dataset.jsonl
git commit -m "Add training dataset v1.0"
git push origin main
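After pushing, it is worth verifying that the large files went through LFS rather than plain Git; `git lfs ls-files` lists tracked objects, and the committed blob should be a small pointer file:

```shell
# Files stored via LFS (shows OID prefix and path)
git lfs ls-files
# The committed content is an LFS pointer (version/oid/size), not the data itself
git show HEAD:dataset.jsonl | head -3
```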
Integration with NemoCustomizer
NemoCustomizer fetches datasets from NemoDatastore:
# NemoCustomizer config
spec:
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
Reference datasets in customization requests:
curl -X POST http://nemocustomizer.nemo.svc.cluster.local:8000/v1/customizations \
-d '{
"dataset_repo": "datasets/training-datasets",
"dataset_path": "dataset.jsonl",
"dataset_ref": "main"
}'
Best Practices
Use object storage (S3/MinIO) for large files
Enable serveDirect for better performance
Size PVC appropriately for metadata
Use fast storage class for PVC
Create separate users for different services
Use private repositories for sensitive data
Rotate admin credentials regularly
Use organization/team features for collaboration
Use semantic versioning for dataset tags
Document dataset contents in README files
Store preprocessing scripts alongside data
Include data quality reports
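The dataset-tagging practice above can be sketched with an annotated, semantically versioned tag. This is a self-contained local demo (repository contents are illustrative); against the datastore, the final step is a push:

```shell
set -e
# Local demo: tag a dataset commit with a semantic version
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m "Add training dataset v1.0"
git tag -a v1.0.0 -m "Initial curated dataset release"
git tag -l   # prints: v1.0.0
# Against the datastore, publish the tag with: git push origin v1.0.0
```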
Backup and Recovery
Backup Strategy
# Backup PostgreSQL database (no -t flag, so the redirected dump is not corrupted by TTY translation)
kubectl exec datastore-pg-postgresql-0 -n nemo -- \
  pg_dump -U ndsuser ndsdb > datastore-backup.sql
# Backup PVC data
kubectl exec deployment/nemodatastore-sample -n nemo -- \
  tar czf - /data > datastore-pvc-backup.tar.gz
# Backup object storage (if using MinIO)
mc mirror local/nemo-datastore-lfs /backup/lfs/
Restore
# Restore database
kubectl exec -i datastore-pg-postgresql-0 -n nemo -- \
psql -U ndsuser ndsdb < datastore-backup.sql
# Restore PVC
kubectl exec -i deployment/nemodatastore-sample -n nemo -- \
tar xzf - -C / < datastore-pvc-backup.tar.gz
Troubleshooting
If the pod fails to start, check that:
All required secrets exist
Secret keys match expected names
Database is accessible
PVC mounted successfully
If Git LFS uploads or downloads fail, verify that:
Object storage credentials are correct
Bucket exists and is accessible
Network connectivity to S3/MinIO
LFS JWT secret is valid
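A minimal set of commands for working through these checks (resource names follow the sample deployment; the lowercase CRD name used with kubectl is an assumption):

```shell
# Pod status and events for the datastore workload
kubectl get pods -n nemo | grep nemodatastore-sample
kubectl describe nemodatastore nemodatastore-sample -n nemo
# Recent logs, including Gitea and LFS errors
kubectl logs deployment/nemodatastore-sample -n nemo --tail=100
# Confirm the referenced secrets and PVC exist
kubectl get secrets,pvc -n nemo
```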
Next Steps
Upload Datasets: Learn dataset management.
Configure Customizer: Integrate with NemoCustomizer.
Git LFS Guide: Learn Git LFS.
API Reference: Gitea API documentation.