Overview
The NemoCustomizer resource provides a customization service for fine-tuning NVIDIA NeMo models. It manages training jobs, model downloads, and integrations with MLflow, Weights & Biases, and external data stores.
API Group: apps.nvidia.com
API Version: v1alpha1
Kind: NemoCustomizer
Spec Fields
Container image configuration for the customizer service.
  Container image repository
  Image pull policy (e.g., IfNotPresent, Always, Never)
  List of image pull secret names
PostgreSQL database connection configuration.
  Database hostname (minimum length: 1)
  Database name (minimum length: 1)
  Database credentials configuration
    Non-root database username
    Secret containing database password
    Key in secret for password
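As a rough illustration, the database settings above correspond to the `databaseConfig` block used in this page's example manifest. The field names and values are copied from that example; pairing each name with a description is an inference, not something the field list states.

```yaml
# Sketch only: names and values come from the example manifest on this page.
# Which description belongs to which field is an assumption.
spec:
  databaseConfig:
    host: customizer-pg-postgresql.nemo.svc.cluster.local  # database hostname
    port: 5432                # used in the example; not described in the field list
    databaseName: ncsdb       # database name
    credentials:
      user: ncsuser                              # non-root database username
      secretName: customizer-pg-existing-secret  # Secret holding the password
      passwordKey: password                      # key inside that Secret
```

The password itself never appears in the resource; only the Secret name and key are referenced.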
NeMo Datastore service endpoint.
  HTTP(S) URL to datastore service (pattern: ^http, minimum length: 1)
NeMo Entitystore service endpoint.
  HTTP(S) URL to entitystore service (pattern: ^http, minimum length: 1)
MLflow tracking server endpoint.
  HTTP(S) URL to MLflow tracking server (pattern: ^http, minimum length: 1)
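The three endpoint settings share the same shape. A minimal sketch, reusing the block names and in-cluster URLs from this page's example manifest (the URLs are sample values, not required ones):

```yaml
# Sketch only: each URL must match the documented pattern ^http,
# so it has to start with http:// or https://.
spec:
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  mlflow:
    endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80
```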
Weights & Biases configuration for experiment tracking.
  Secret containing WandB API key (minimum length: 1)
  Key in secret holding WandB API key
  encryptionKey (string, default: "encryptionKey"): Optional encryption key for WandB credentials
  WandB username or team name for logging runs
  projectName (string, default: "nvidia-nemo-customizer"): WandB project name
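A possible Weights & Biases block is sketched below. `secretName`, `apiKeyKey`, and `encryptionKey` appear in this page's example manifest; placing `projectName` in the same block is an assumption based on how the fields are grouped above. The field for the username or team name is left out because its name is not given in this section.

```yaml
# Sketch only: wandb-secret is the sample Secret name from the example manifest.
spec:
  wandb:
    secretName: wandb-secret             # Secret containing the WandB API key
    apiKeyKey: apiKey                    # key in the Secret holding the API key
    encryptionKey: encryptionKey         # documented default
    projectName: nvidia-nemo-customizer  # documented default; placement is assumed
```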
Configuration for data store CLI tools.
  Container image for data store tools (minimum length: 1)
Configuration for model download jobs.
  Container image for download jobs (minimum length: 1)
  imagePullPolicy (string, default: "IfNotPresent"): Image pull policy (enum: Always, IfNotPresent, Never)
  NGC API key secret configuration
  Hugging Face token secret configuration
  Pod security context for download jobs
  Time to live after job finishes in seconds (minimum: 60)
  Polling interval for download status in seconds (minimum: 15)
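The download job settings might be combined as follows. All names except `imagePullPolicy` come from the `modelDownloadJobs` block in this page's example manifest; nesting `imagePullPolicy` there is an assumption based on the grouping above. The Hugging Face secret is omitted because its field name is not given in this section.

```yaml
# Sketch only: values are the sample values from the example manifest.
spec:
  modelDownloadJobs:
    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.08"
    imagePullPolicy: IfNotPresent   # documented default; placement is assumed
    ngcAPISecret:
      name: ngc-api-secret          # Secret holding the NGC API key
      key: "NGC_API_KEY"
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      fsGroup: 1000
    ttlSecondsAfterFinished: 600    # must be at least 60
    pollIntervalSeconds: 15         # must be at least 15
```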
Training job configuration.
  Optional ConfigMap with training configuration
  PVC for model artifacts
    Whether to create the PVC
    Volume access mode (e.g., ReadWriteOnce)
    Storage size (e.g., "50Gi")
  Workspace PVC configuration for training jobs
    mountPath (string, default: "/pvc/workspace"): Mount path within training job
  Environment variables for training jobs
  Network configuration for multi-node training
  Node selector labels for training jobs
  Tolerations for training jobs
  Affinity rules for training jobs
  Resource requirements for training jobs
  TTL after training job finishes (seconds)
  Timeout for training job (seconds)
  Run.AI scheduler queue (only if scheduler is runai)
  Max size of shared memory volume for training jobs
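To make the storage and lifetime settings concrete, here is a sketch of the corresponding part of `trainingConfig`, using the names and sample values from this page's example manifest. Fields whose names are not shown in that example (node selector, affinity, resources, Run.AI queue, shared memory limit) are deliberately omitted rather than guessed.

```yaml
# Sketch only: pairing of names with descriptions is inferred from the example manifest.
spec:
  trainingConfig:
    configMap:
      name: nemo-training-config     # optional ConfigMap with training configuration
    modelPVC:
      create: true                   # whether to create the PVC
      name: finetuning-ms-models-pvc
      storageClass: ""
      volumeAccessMode: ReadWriteOnce
      size: 50Gi
    workspacePVC:
      storageClass: "local-path"
      volumeAccessMode: ReadWriteOnce
      size: 10Gi
      mountPath: /pvc/workspace      # documented default
    ttlSecondsAfterFinished: 3600    # TTL after the training job finishes (seconds)
    timeout: 3600                    # training job timeout (seconds)
```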
ConfigMap containing model definitions.
  ConfigMap name (minimum length: 1)
Scheduler configuration for training jobs.
  Scheduler type (enum: volcano, runai)
OpenTelemetry configuration.
  Enable OpenTelemetry tracing
  OTLP collector endpoint URL
  Disable Python logging auto-instrumentation
  Exporter configuration
    Traces exporter (enum: otlp, console, none)
    Metrics exporter (enum: otlp, console, none)
    Logs exporter (enum: otlp, console, none)
  excludedUrls (array, default: ["health"]): URLs to exclude from tracing
  Log level (enum: INFO, DEBUG)
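A minimal tracing setup could look like the sketch below. `enabled` and `exporterOtlpEndpoint` are taken from the `otel` block in this page's example manifest; putting `excludedUrls` directly under `otel` is an assumption, since the field list does not show nesting. The exporter and log level fields are omitted because their names are not given in this section.

```yaml
# Sketch only: the collector URL is the sample value from the example manifest.
spec:
  otel:
    enabled: true
    exporterOtlpEndpoint: http://customizer-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
    excludedUrls:      # placement under otel is assumed
      - health         # documented default
```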
Service exposure configuration.
  Service configuration
    Service type (e.g., ClusterIP, LoadBalancer)
  Ingress/Gateway router configuration
Number of replicas (minimum: 1). Cannot be set when autoscaling is enabled.
Autoscaling configuration.
  Enable horizontal pod autoscaling
Metrics collection configuration.
  Enable metrics collection
  Prometheus ServiceMonitor configuration
Override container command
Additional environment variables
Resource requirements (CPU, memory, GPU)
User ID for container security context (default: 1000)
Group ID for container security context (default: 2000)
Status Fields
Number of available replicas
Current state (Pending, NotReady, Ready, Failed)
Example
apiVersion: apps.nvidia.com/v1alpha1
kind: NemoCustomizer
metadata:
  name: nemocustomizer-sample
  namespace: nemo
spec:
  scheduler:
    type: "volcano"
  wandb:
    secretName: wandb-secret
    apiKeyKey: apiKey
    encryptionKey: encryptionKey
  otel:
    enabled: true
    exporterOtlpEndpoint: http://customizer-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
  databaseConfig:
    credentials:
      user: ncsuser
      secretName: customizer-pg-existing-secret
      passwordKey: password
    host: customizer-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: ncsdb
  expose:
    service:
      type: ClusterIP
      port: 8000
  image:
    repository: nvcr.io/nvidia/nemo-microservices/customizer-api
    tag: "25.08"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
  mlflow:
    endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80
  nemoDatastoreTools:
    image: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.08
  modelDownloadJobs:
    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.08"
    ngcAPISecret:
      name: ngc-api-secret
      key: "NGC_API_KEY"
    securityContext:
      fsGroup: 1000
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
    ttlSecondsAfterFinished: 600
    pollIntervalSeconds: 15
  modelConfig:
    name: nemo-model-config
  trainingConfig:
    configMap:
      name: nemo-training-config
    modelPVC:
      create: true
      name: finetuning-ms-models-pvc
      storageClass: ""
      volumeAccessMode: ReadWriteOnce
      size: 50Gi
    workspacePVC:
      storageClass: "local-path"
      volumeAccessMode: ReadWriteOnce
      size: 10Gi
      mountPath: /pvc/workspace
    image:
      repository: nvcr.io/nvidia/nemo-microservices/customizer
      tag: "25.08"
    env:
      - name: LOG_LEVEL
        value: INFO
    networkConfig:
      - name: NCCL_IB_SL
        value: "0"
      - name: UCX_TLS
        value: TCP
    ttlSecondsAfterFinished: 3600
    timeout: 3600
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"