Overview

The NemoCustomizer resource provides a customization service for fine-tuning NVIDIA NeMo models. It manages training jobs, model downloads, and integrations with MLflow, Weights & Biases, and external data stores.

API Group: apps.nvidia.com
API Version: v1alpha1
Kind: NemoCustomizer

Spec Fields

image
object
required
Container image configuration for the customizer service.
databaseConfig
object
required
PostgreSQL database connection configuration.
datastore
object
required
NeMo Datastore service endpoint.
entitystore
object
required
NeMo Entitystore service endpoint.
mlflow
object
required
MLflow tracking server endpoint.
wandb
object
required
Weights & Biases configuration for experiment tracking.
nemoDatastoreTools
object
required
Configuration for data store CLI tools.
modelDownloadJobs
object
required
Configuration for model download jobs.
trainingConfig
object
required
Training job configuration.
modelConfig
object
required
ConfigMap containing model definitions.
scheduler
object
Scheduler configuration for training jobs.
otel
object
OpenTelemetry configuration.
expose
object
Service exposure configuration.
replicas
integer
default: 1
Number of replicas (minimum: 1). Cannot be set when autoscaling is enabled.
scale
object
Autoscaling configuration.
metrics
object
Metrics collection configuration.
command
array
Overrides the container command.
args
array
Arguments passed to the container.
env
array
Additional environment variables.
resources
object
Resource requirements (CPU, memory, GPU).
nodeSelector
object
Node selector labels.
tolerations
array
Pod tolerations.
affinity
object
Pod affinity rules.
labels
object
Custom labels.
annotations
object
Custom annotations.
userID
integer
User ID for the container security context (default: 1000).
groupID
integer
Group ID for the container security context (default: 2000).
runtimeClass
string
Runtime class name.
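
The optional pod-level fields listed above can be combined in a spec fragment such as the one below. This is a minimal sketch, assuming that env, resources, nodeSelector, and tolerations follow the standard Kubernetes container and pod shapes. The environment variable, resource quantities, label, and annotation values are illustrative and are not taken from this reference.

```yaml
spec:
  # Leave replicas unset if autoscaling (scale) is enabled.
  replicas: 1
  env:
    - name: LOG_LEVEL          # illustrative variable
      value: INFO
  resources:                   # assumed: standard Kubernetes requests/limits
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  nodeSelector:
    kubernetes.io/os: linux    # illustrative node label
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  labels:
    app.kubernetes.io/part-of: nemo   # illustrative label
  annotations:
    example.com/owner: ml-platform    # hypothetical annotation
  userID: 1000                 # documented default
  groupID: 2000                # documented default
```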

Status Fields

conditions
array
Current state conditions.
availableReplicas
integer
Number of available replicas.
state
string
Current state: one of Pending, NotReady, Ready, or Failed.
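
For orientation, the status of a healthy resource might look roughly like the fragment below. This is a sketch only: it assumes that conditions follow the standard Kubernetes condition shape (type, status, reason, message, lastTransitionTime), and the condition type, reason, and timestamp shown are hypothetical placeholders rather than values confirmed by this reference.

```yaml
status:
  state: Ready                  # one of Pending, NotReady, Ready, Failed
  availableReplicas: 1
  conditions:
    - type: Ready               # hypothetical condition type
      status: "True"
      reason: Ready             # hypothetical reason
      message: ""
      lastTransitionTime: "2025-01-01T00:00:00Z"   # placeholder timestamp
```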

Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NemoCustomizer
metadata:
  name: nemocustomizer-sample
  namespace: nemo
spec:
  scheduler:
    type: "volcano"
  wandb:
    secretName: wandb-secret
    apiKeyKey: apiKey
    encryptionKey: encryptionKey
  otel:
    enabled: true
    exporterOtlpEndpoint: http://customizer-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
  databaseConfig:
    credentials:
      user: ncsuser
      secretName: customizer-pg-existing-secret
      passwordKey: password
    host: customizer-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: ncsdb
  expose:
    service:
      type: ClusterIP
      port: 8000
  image:
    repository: nvcr.io/nvidia/nemo-microservices/customizer-api
    tag: "25.08"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
  mlflow:
    endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80
  nemoDatastoreTools:
    image: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.08
  modelDownloadJobs:
    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.08"
    ngcAPISecret:
      name: ngc-api-secret
      key: "NGC_API_KEY"
    securityContext:
      fsGroup: 1000
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
    ttlSecondsAfterFinished: 600
    pollIntervalSeconds: 15
  modelConfig:
    name: nemo-model-config
  trainingConfig:
    configMap:
      name: nemo-training-config
    modelPVC:
      create: true
      name: finetuning-ms-models-pvc
      storageClass: ""
      volumeAccessMode: ReadWriteOnce
      size: 50Gi
    workspacePVC:
      storageClass: "local-path"
      volumeAccessMode: ReadWriteOnce
      size: 10Gi
      mountPath: /pvc/workspace
    image:
      repository: nvcr.io/nvidia/nemo-microservices/customizer
      tag: "25.08"
    env:
      - name: LOG_LEVEL
        value: INFO
    networkConfig:
      - name: NCCL_IB_SL
        value: "0"
      - name: UCX_TLS
        value: TCP
    ttlSecondsAfterFinished: 3600
    timeout: 3600
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
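
The example above refers to several Secrets by name and key (wandb-secret, customizer-pg-existing-secret, and ngc-api-secret). The manifests below sketch what those Secrets could look like, assuming plain Opaque Secrets in the same nemo namespace. All values are placeholders to be replaced. The image pull secret ngc-secret is a registry credential and is not shown here.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: wandb-secret
  namespace: nemo
type: Opaque
stringData:
  apiKey: "<wandb-api-key>"            # key referenced by wandb.apiKeyKey
  encryptionKey: "<encryption-key>"    # key referenced by wandb.encryptionKey
---
apiVersion: v1
kind: Secret
metadata:
  name: customizer-pg-existing-secret
  namespace: nemo
type: Opaque
stringData:
  password: "<postgres-password>"      # key referenced by databaseConfig.credentials.passwordKey
---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-secret
  namespace: nemo
type: Opaque
stringData:
  NGC_API_KEY: "<ngc-api-key>"         # key referenced by modelDownloadJobs.ngcAPISecret.key
```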
