NemoCustomizer

Overview

The NemoCustomizer resource provides a customization service for fine-tuning NVIDIA NeMo models. It manages training jobs, model downloads, and integrations with MLflow, Weights & Biases, and external data stores.

API Group: apps.nvidia.com
API Version: v1alpha1
Kind: NemoCustomizer

Spec Fields

image

object

required

Container image configuration for the customizer service.

Properties of image:

repository

string

required

Container image repository

tag

string

required

Image tag

pullPolicy

string

Image pull policy (e.g., IfNotPresent, Always, Never)

pullSecrets

array

List of image pull secret names
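
As a minimal sketch, the fragment below sets all four image properties. The repository, tag, and secret name are borrowed from the Example section and stand in for your own values; the rest of the manifest is omitted.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  image:
    repository: nvcr.io/nvidia/nemo-microservices/customizer-api
    tag: "25.08"              # quoted so YAML reads a string, not the number 25.08
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret            # name of an image pull secret assumed to exist in the namespace
```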

databaseConfig

object

required

PostgreSQL database connection configuration.

Properties of databaseConfig:

host

string

required

Database hostname (minimum length: 1)

port

integer

default:"5432"

Database port (1-65535)

databaseName

string

required

Database name (minimum length: 1)

credentials

object

required

Database credentials configuration

Properties of databaseConfig.credentials:

user

string

required

Non-root database username

secretName

string

required

Secret containing database password

passwordKey

string

default:"password"

Key in secret for password
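
The sketch below pairs a Secret with a databaseConfig block that relies on the documented defaults for port and passwordKey. The host, database, user, and Secret names come from the Example section; the password value is a placeholder, not a real credential.

```yaml
# Secret holding the database password (placeholder value).
apiVersion: v1
kind: Secret
metadata:
  name: customizer-pg-existing-secret
  namespace: nemo
type: Opaque
stringData:
  password: "<database-password>"
---
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  databaseConfig:
    host: customizer-pg-postgresql.nemo.svc.cluster.local
    databaseName: ncsdb
    # port is omitted, so the default 5432 applies
    credentials:
      user: ncsuser
      secretName: customizer-pg-existing-secret
      # passwordKey is omitted, so the default "password" applies
```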

datastore

object

required

NeMo Datastore service endpoint.

Properties of datastore:

endpoint

string

required

HTTP(S) URL to datastore service (pattern: ^http, minimum length: 1)

entitystore

object

required

NeMo Entitystore service endpoint.

Properties of entitystore:

endpoint

string

required

HTTP(S) URL to entitystore service (pattern: ^http, minimum length: 1)

mlflow

object

required

MLflow tracking server endpoint.

Properties of mlflow:

endpoint

string

required

HTTP(S) URL to MLflow tracking server (pattern: ^http, minimum length: 1)
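
The three endpoint blocks share one shape. In the sketch below, the service hostnames are the in-cluster names used in the Example section and are assumptions about your deployment. Because each endpoint must match the pattern ^http, a bare host:port value without a scheme would be rejected.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  mlflow:
    endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80
```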

wandb

object

required

Weights & Biases configuration for experiment tracking.

Properties of wandb:

secretName

string

required

Secret containing WandB API key (minimum length: 1)

apiKeyKey

string

default:"apiKey"

Key in secret holding WandB API key

encryptionKey

string

default:"encryptionKey"

Key in secret holding an optional encryption key for WandB credentials

entity

string

WandB username or team name for logging runs

projectName

string

default:"nvidia-nemo-customizer"

WandB project name
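
A sketch of a wandb block together with the Secret it references. The Secret name comes from the Example section; the entity value my-team and both secret values are hypothetical placeholders. The keys apiKey and encryptionKey match the documented defaults, so they could also be left out of the spec.

```yaml
# Secret holding the WandB API key and an optional encryption key (placeholder values).
apiVersion: v1
kind: Secret
metadata:
  name: wandb-secret
  namespace: nemo
type: Opaque
stringData:
  apiKey: "<wandb-api-key>"
  encryptionKey: "<encryption-key>"
---
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  wandb:
    secretName: wandb-secret
    apiKeyKey: apiKey                      # same as the default
    encryptionKey: encryptionKey           # same as the default
    entity: my-team                        # hypothetical WandB team name
    projectName: nvidia-nemo-customizer    # same as the default
```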

nemoDatastoreTools

object

required

Configuration for data store CLI tools.

Properties of nemoDatastoreTools:

image

string

required

Container image for data store tools (minimum length: 1)

modelDownloadJobs

object

required

Configuration for model download jobs.

Properties of modelDownloadJobs:

image

string

required

Container image for download jobs (minimum length: 1)

imagePullPolicy

string

default:"IfNotPresent"

Image pull policy (enum: Always, IfNotPresent, Never)

ngcAPISecret

object

NGC API key secret configuration

Properties of modelDownloadJobs.ngcAPISecret:

name

string

Secret name

key

string

Key in secret

hfSecret

object

Hugging Face token secret configuration

securityContext

object

Pod security context for download jobs

ttlSecondsAfterFinished

integer

required

Time to live after job finishes in seconds (minimum: 60)

pollIntervalSeconds

integer

required

Polling interval for download status in seconds (minimum: 15)
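
The sketch below shows a modelDownloadJobs block that respects the documented minimums for the two required timing fields. The image and secret names are taken from the Example section. hfSecret is left out because its properties are not listed on this page.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  modelDownloadJobs:
    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.08"
    imagePullPolicy: IfNotPresent    # one of Always, IfNotPresent, Never
    ngcAPISecret:
      name: ngc-api-secret           # Secret assumed to exist in the namespace
      key: "NGC_API_KEY"
    ttlSecondsAfterFinished: 600     # must be at least 60
    pollIntervalSeconds: 15          # must be at least 15
```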

trainingConfig

object

required

Training job configuration.

Properties of trainingConfig:

configMap

object

Optional ConfigMap with training configuration

Properties of trainingConfig.configMap:

name

string

required

ConfigMap name

modelPVC

object

required

PVC for model artifacts

Properties of trainingConfig.modelPVC:

name

string

PVC name

create

boolean

Whether to create the PVC

storageClass

string

Storage class name

volumeAccessMode

string

Volume access mode (e.g., ReadWriteOnce)

size

string

Storage size (e.g., “50Gi”)

workspacePVC

object

required

Workspace PVC configuration for training jobs

Properties of trainingConfig.workspacePVC:

storageClass

string

Storage class for PVC

size

string

Workspace size (e.g., “10Gi”)

volumeAccessMode

string

Volume access mode

mountPath

string

default:"/pvc/workspace"

Mount path within training job

image

object

required

Training container image

env

array

Environment variables for training jobs

networkConfig

array

Network configuration for multi-node training

nodeSelector

object

Node selector labels for training jobs

tolerations

array

Tolerations for training jobs

affinity

object

Affinity rules for training jobs

resources

object

Resource requirements for training jobs

ttlSecondsAfterFinished

integer

TTL after training job finishes (seconds)

timeout

integer

Timeout for training job (seconds)

runaiQueue

string

default:"default"

Run:ai scheduler queue (used only when the scheduler type is runai)

sharedMemorySizeLimit

string

Max size of shared memory volume for training jobs
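
As a sketch of a trainingConfig variant that differs from the Example section, the fragment below points modelPVC at a PVC that already exists instead of creating one. It rests on several assumptions: that create: false makes the operator reuse the named PVC, that sharedMemorySizeLimit accepts a Kubernetes quantity string, and that the node label shown is present on your GPU nodes.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  trainingConfig:
    modelPVC:
      create: false                      # assumption: reuse an existing PVC
      name: finetuning-ms-models-pvc
    workspacePVC:
      storageClass: "local-path"
      volumeAccessMode: ReadWriteOnce
      size: 10Gi
      # mountPath is omitted, so the default /pvc/workspace applies
    image:
      repository: nvcr.io/nvidia/nemo-microservices/customizer
      tag: "25.08"
    nodeSelector:
      nvidia.com/gpu.present: "true"     # hypothetical label; adjust to your nodes
    sharedMemorySizeLimit: 16Gi          # assumption: Kubernetes quantity format
    ttlSecondsAfterFinished: 3600
    timeout: 3600
```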

modelConfig

object

required

ConfigMap containing model definitions.

Properties of modelConfig:

name

string

required

ConfigMap name (minimum length: 1)

scheduler

object

Scheduler configuration for training jobs.

Properties of scheduler:

type

string

default:"volcano"

Scheduler type (enum: volcano, runai)
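
The scheduler type interacts with the runaiQueue property of trainingConfig. A sketch of switching from the default volcano scheduler to runai follows; the queue name team-a is hypothetical.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  scheduler:
    type: runai            # one of volcano, runai; volcano is the default
  trainingConfig:
    runaiQueue: team-a     # hypothetical queue; used only with the runai scheduler
```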

otel

object

OpenTelemetry configuration.

Properties of otel:

enabled

boolean

Enable OpenTelemetry tracing

exporterOtlpEndpoint

string

OTLP collector endpoint URL

disableLogging

boolean

Disable Python logging auto-instrumentation

exporterConfig

object

Exporter configuration

Properties of otel.exporterConfig:

tracesExporter

string

default:"otlp"

Traces exporter (enum: otlp, console, none)

metricsExporter

string

default:"otlp"

Metrics exporter (enum: otlp, console, none)

logsExporter

string

default:"otlp"

Logs exporter (enum: otlp, console, none)

excludedUrls

array

default:["health"]

URLs to exclude from tracing

logLevel

string

default:"INFO"

Log level (enum: INFO, DEBUG)
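
A sketch of an otel block that overrides some defaults for debugging. The collector endpoint comes from the Example section, and the extra excluded URL is hypothetical. The placement of excludedUrls and logLevel directly under otel is inferred from the order of the listing above.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  otel:
    enabled: true
    exporterOtlpEndpoint: http://customizer-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
    disableLogging: false
    exporterConfig:
      tracesExporter: otlp       # default
      metricsExporter: none      # turn metrics export off
      logsExporter: console      # print logs instead of sending them to the collector
    excludedUrls:
      - health                   # default entry
      - metrics                  # hypothetical additional entry
    logLevel: DEBUG              # one of INFO, DEBUG
```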

expose

object

Service exposure configuration.

Properties of expose:

service

object

Service configuration

Properties of expose.service:

type

string

Service type (e.g., ClusterIP, LoadBalancer)

port

integer

default:"8000"

Service port (1-65535)

annotations

object

Service annotations

router

object

Ingress/Gateway router configuration
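
A sketch of exposing the service through a LoadBalancer rather than the ClusterIP type used in the Example section. The annotation key and value are hypothetical. router is left out because its properties are not listed on this page.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  expose:
    service:
      type: LoadBalancer
      port: 8000                          # default; must be within 1-65535
      annotations:
        example.com/owner: ml-platform    # hypothetical annotation
```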

replicas

integer

default:"1"

Number of replicas (minimum: 1). Cannot be set when autoscaling is enabled.

scale

object

Autoscaling configuration.

Properties of scale:

enabled

boolean

Enable horizontal pod autoscaling

hpa

object

HPA specification

Properties of scale.hpa:

minReplicas

integer

default:"1"

Minimum replicas

maxReplicas

integer

required

Maximum replicas

metrics

array

Metrics for autoscaling

behavior

object

Scaling behavior

annotations

object

HPA annotations
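
A sketch of enabling autoscaling. replicas is deliberately absent because it cannot be set when autoscaling is enabled. The entry under metrics assumes that the field follows the Kubernetes autoscaling/v2 MetricSpec shape, which this page does not confirm; the replica counts and utilization target are illustrative.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  # replicas is not set: it conflicts with scale.enabled
  scale:
    enabled: true
    hpa:
      minReplicas: 1          # default
      maxReplicas: 3          # required
      metrics:                # assumption: autoscaling/v2 MetricSpec entries
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
```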

metrics

object

Metrics collection configuration.

Properties of metrics:

enabled

boolean

Enable metrics collection

serviceMonitor

object

Prometheus ServiceMonitor configuration

command

array

Override container command

args

array

Container arguments

env

array

Additional environment variables

resources

object

Resource requirements (CPU, memory, GPU)

nodeSelector

object

Node selector labels

tolerations

array

Pod tolerations

affinity

object

Pod affinity rules

labels

object

Custom labels

annotations

object

Custom annotations

userID

integer

User ID for container security context (default: 1000)

groupID

integer

Group ID for container security context (default: 2000)

runtimeClass

string

Runtime class name
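
The pod-level fields above can be combined as in the sketch below. It assumes that resources and tolerations follow the standard Kubernetes pod spec shapes; the label and annotation values are hypothetical, and the resource figures are illustrative rather than recommended sizes.

```yaml
# Fragment of a NemoCustomizer manifest; other required fields are omitted.
spec:
  resources:                               # assumption: Kubernetes ResourceRequirements
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  nodeSelector:
    kubernetes.io/os: linux
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  labels:
    app.kubernetes.io/part-of: nemo        # hypothetical label
  annotations:
    example.com/owner: ml-platform         # hypothetical annotation
  userID: 1000                             # default
  groupID: 2000                            # default
```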

Status Fields

conditions

array

Current state conditions

availableReplicas

integer

Number of available replicas

state

string

Current state (Pending, NotReady, Ready, Failed)
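
For orientation, a status block on a healthy resource might look like the sketch below. Only the three field names and the state values come from this page. The condition entry assumes the standard Kubernetes condition shape, and its type, reason, message, and timestamp are hypothetical.

```yaml
# Illustrative status as it might be reported by the operator; not user-settable.
status:
  state: Ready                  # one of Pending, NotReady, Ready, Failed
  availableReplicas: 1
  conditions:
    - type: Ready               # hypothetical condition type
      status: "True"
      reason: Ready             # hypothetical reason
      message: ""               # hypothetical message
      lastTransitionTime: "2025-01-01T00:00:00Z"   # placeholder timestamp
```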

Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NemoCustomizer
metadata:
  name: nemocustomizer-sample
  namespace: nemo
spec:
  scheduler:
    type: "volcano"
  wandb:
    secretName: wandb-secret
    apiKeyKey: apiKey
    encryptionKey: encryptionKey
  otel:
    enabled: true
    exporterOtlpEndpoint: http://customizer-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
  databaseConfig:
    credentials:
      user: ncsuser
      secretName: customizer-pg-existing-secret
      passwordKey: password
    host: customizer-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: ncsdb
  expose:
    service:
      type: ClusterIP
      port: 8000
  image:
    repository: nvcr.io/nvidia/nemo-microservices/customizer-api
    tag: "25.08"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
  mlflow: 
    endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80
  nemoDatastoreTools:
    image: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.08
  modelDownloadJobs:
    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.08"
    ngcAPISecret:
      name: ngc-api-secret
      key: "NGC_API_KEY"
    securityContext:
      fsGroup: 1000
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
    ttlSecondsAfterFinished: 600
    pollIntervalSeconds: 15
  modelConfig:
    name: nemo-model-config
  trainingConfig:
    configMap:
      name: nemo-training-config
    modelPVC:
      create: true
      name: finetuning-ms-models-pvc
      storageClass: ""
      volumeAccessMode: ReadWriteOnce
      size: 50Gi
    workspacePVC:
      storageClass: "local-path"
      volumeAccessMode: ReadWriteOnce
      size: 10Gi
      mountPath: /pvc/workspace
    image:
      repository: nvcr.io/nvidia/nemo-microservices/customizer
      tag: "25.08"
    env:
      - name: LOG_LEVEL
        value: INFO
    networkConfig:
      - name: NCCL_IB_SL
        value: "0"
      - name: UCX_TLS
        value: TCP
    ttlSecondsAfterFinished: 3600
    timeout: 3600
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
