NemoCustomizer is a production-ready service for fine-tuning large language models using Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA. It provides a scalable API for creating, managing, and deploying customized model adapters.

Overview

NemoCustomizer enables you to:
  • Fine-tune foundation models on proprietary datasets
  • Use LoRA/PEFT for memory-efficient training
  • Orchestrate multi-GPU training jobs with Volcano or Run.AI
  • Track experiments with Weights & Biases
  • Store and version adapters in NemoEntitystore
  • Serve multiple fine-tuned adapters from a single NIM instance

When to Use NemoCustomizer

Domain Adaptation

Adapt general models to specialized domains like healthcare, legal, or finance

Instruction Tuning

Train models to follow specific instruction formats or conversation styles

Style Transfer

Customize output style, tone, or formatting for brand consistency

Multi-Tenancy

Create customer-specific adapters for SaaS deployments

Architecture

NemoCustomizer consists of several components:
  • API Service: REST API for managing customization jobs
  • Training Jobs: Kubernetes Jobs/VolcanoJobs for model training
  • Model Storage: PVC for base models and training artifacts
  • Database: PostgreSQL for job metadata and state
  • Datastore Integration: Fetch training datasets
  • Entitystore Integration: Store trained adapters

Configuration

Complete Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NemoCustomizer
metadata:
  name: nemocustomizer-sample
  namespace: nemo
spec:
  # Container image configuration
  image:
    repository: nvcr.io/nvidia/nemo-microservices/customizer-api
    tag: "25.08"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  
  # API service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  
  # Replica configuration
  replicas: 1
  
  # Scheduler for training jobs (volcano or runai)
  scheduler:
    type: "volcano"
  
  # PostgreSQL database connection
  databaseConfig:
    credentials:
      user: ncsuser
      secretName: customizer-pg-existing-secret
      passwordKey: password
    host: customizer-pg-postgresql.nemo.svc.cluster.local
    port: 5432
    databaseName: ncsdb
  
  # NeMo EntityStore endpoint for storing adapters
  entitystore:
    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
  
  # NeMo DataStore endpoint for fetching datasets
  datastore:
    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
  
  # MLflow tracking server
  mlflow:
    endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80
  
  # Weights & Biases configuration
  wandb:
    secretName: wandb-secret
    apiKeyKey: apiKey
    encryptionKey: encryptionKey
  
  # OpenTelemetry tracing
  otel:
    enabled: true
    exporterOtlpEndpoint: http://customizer-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
  
  # Data store CLI tools image
  nemoDatastoreTools:
    image: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.08
  
  # Model download jobs configuration
  modelDownloadJobs:
    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.08"
    ngcAPISecret:
      name: ngc-api-secret
      key: "NGC_API_KEY"
    securityContext:
      fsGroup: 1000
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
    ttlSecondsAfterFinished: 600
    pollIntervalSeconds: 15
  
  # Model configuration ConfigMap
  modelConfig:
    name: nemo-model-config
  
  # Training job configuration
  trainingConfig:
    configMap:
      name: nemo-training-config
    
    # Base model storage PVC
    modelPVC:
      create: true
      name: finetuning-ms-models-pvc
      storageClass: ""
      volumeAccessMode: ReadWriteOnce
      size: 50Gi
    
    # Per-job workspace PVC
    workspacePVC:
      storageClass: "local-path"
      volumeAccessMode: ReadWriteOnce
      size: 10Gi
      mountPath: /pvc/workspace
    
    # Training container image
    image:
      repository: nvcr.io/nvidia/nemo-microservices/customizer
      tag: "25.08"
    
    # Environment variables for training
    env:
      - name: LOG_LEVEL
        value: INFO
    
    # Multi-node networking configuration
    networkConfig:
      - name: NCCL_IB_SL
        value: "0"
      - name: NCCL_IB_TC
        value: "41"
      - name: UCX_TLS
        value: TCP
      - name: UCX_NET_DEVICES
        value: eth0
    
    # Job lifecycle
    ttlSecondsAfterFinished: 3600
    timeout: 3600
    
    # GPU node configuration
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

Key Configuration Fields

spec.scheduler.type (string, default "volcano")
  Training job scheduler. Options: volcano or runai.

spec.databaseConfig (object, required)
  PostgreSQL connection configuration for storing customization job metadata.

spec.entitystore.endpoint (string, required)
  NemoEntitystore service URL for uploading trained adapters.

spec.datastore.endpoint (string, required)
  NemoDatastore service URL for fetching training datasets.

spec.mlflow.endpoint (string, required)
  MLflow tracking server URL for experiment tracking.

spec.wandb (object)
  Weights & Biases configuration for experiment tracking and visualization.

spec.trainingConfig.modelPVC (object, required)
  Persistent volume for caching base models. Shared across training jobs.

spec.trainingConfig.workspacePVC (object, required)
  Per-job workspace configuration. Automatically created for each training job.

spec.trainingConfig.networkConfig (array)
  NCCL/networking environment variables for multi-node training.

Integration with Services

NemoDatastore Integration

NemoCustomizer fetches training datasets from NemoDatastore:
datastore:
  endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
Datasets are referenced in customization requests and downloaded by training jobs.

NemoEntitystore Integration

Trained adapters are automatically uploaded to NemoEntitystore:
entitystore:
  endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
Once uploaded, adapters can be served by NIM instances configured to poll the entitystore.
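On the NIM side, adapter discovery is typically configured through environment variables. A hedged sketch of the serving half of this handoff (the variable names NIM_PEFT_SOURCE and NIM_PEFT_REFRESH_INTERVAL are assumptions based on NIM's dynamic LoRA support; verify against your NIM version's documentation):

```yaml
# Hypothetical NIM deployment fragment: point NIM at the entitystore
# so it can discover and hot-load trained adapters. Variable names
# are assumptions; check your NIM release notes before using.
env:
  - name: NIM_PEFT_SOURCE
    value: "http://nemoentitystore-sample.nemo.svc.cluster.local:8000"
  - name: NIM_PEFT_REFRESH_INTERVAL
    value: "60"   # seconds between polls for newly registered adapters
```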

MLflow Tracking

Training metrics are logged to MLflow:
mlflow:
  endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80

Weights & Biases

For advanced experiment tracking:
1. Create W&B Secret

kubectl create secret generic wandb-secret \
  --from-literal=apiKey=<YOUR_WANDB_API_KEY> \
  --from-literal=encryptionKey=<RANDOM_KEY> \
  -n nemo
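The <RANDOM_KEY> placeholder can be any sufficiently long random string; one way to generate one locally:

```shell
# Generate a 32-byte random key, hex-encoded (64 characters)
openssl rand -hex 32
```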
2. Configure W&B in Spec

wandb:
  secretName: wandb-secret
  apiKeyKey: apiKey
  encryptionKey: encryptionKey
  entity: your-team
  projectName: nemo-customization

Training Job Schedulers

Volcano Scheduler

Default scheduler for Kubernetes-native gang scheduling:
scheduler:
  type: "volcano"
Ensure Volcano is installed:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
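Before creating customization jobs, it is worth confirming the scheduler actually came up (the volcano-system namespace and CRD name below match the default install, but may differ in a customized deployment):

```shell
# Confirm the Volcano scheduler pods are running
kubectl get pods -n volcano-system

# Confirm the VolcanoJob CRD is registered
kubectl get crd jobs.batch.volcano.sh
```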

Run.AI Scheduler

For Run.AI-managed clusters:
scheduler:
  type: "runai"

trainingConfig:
  runaiQueue: default

Storage Configuration

Model PVC

Shared storage for base models:
trainingConfig:
  modelPVC:
    create: true
    name: finetuning-ms-models-pvc
    storageClass: "fast-ssd"  # Use fast storage for models
    volumeAccessMode: ReadWriteOnce
    size: 50Gi
Use ReadWriteMany if multi-node training jobs need to mount the model PVC from different nodes.
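A multi-node variant might look like the following sketch; the storage class name is a placeholder for any RWX-capable provisioner in your cluster (NFS, CephFS, a parallel filesystem):

```yaml
trainingConfig:
  modelPVC:
    create: true
    name: finetuning-ms-models-pvc
    storageClass: "nfs-client"     # placeholder: any RWX-capable class
    volumeAccessMode: ReadWriteMany
    size: 50Gi
```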

Workspace PVC

Per-job workspace (automatically created):
trainingConfig:
  workspacePVC:
    storageClass: "local-path"
    volumeAccessMode: ReadWriteOnce
    size: 10Gi
    mountPath: /pvc/workspace

API Usage

Create Customization Job

curl -X POST http://nemocustomizer-sample.nemo.svc.cluster.local:8000/v1/customizations \
  -H "Content-Type: application/json" \
  -d '{
    "name": "finance-model-v1",
    "model": "meta/llama-3.1-8b-instruct",
    "dataset": "finance-qa-dataset",
    "hyperparameters": {
      "lora_rank": 16,
      "lora_alpha": 32,
      "learning_rate": 2e-4,
      "num_epochs": 3
    }
  }'
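The response is JSON, so follow-up automation usually extracts the job identifier with jq. The field name .id below is an assumption about the response schema; inspect a raw response from your deployment first:

```shell
# Submit a customization request from a file and capture the job id
# for later status checks. request.json is a placeholder for the
# JSON payload shown above; the .id field name is an assumption.
curl -s -X POST http://nemocustomizer-sample.nemo.svc.cluster.local:8000/v1/customizations \
  -H "Content-Type: application/json" \
  -d @request.json | jq -r '.id'
```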

Monitor Job Status

curl http://nemocustomizer-sample.nemo.svc.cluster.local:8000/v1/customizations/finance-model-v1

List Customizations

curl http://nemocustomizer-sample.nemo.svc.cluster.local:8000/v1/customizations

Best Practices

  • Use dedicated GPU nodes for training jobs
  • Configure appropriate node selectors and tolerations
  • Set reasonable timeout values to prevent stuck jobs
  • Monitor GPU utilization and adjust batch sizes
  • Use fast storage (NVMe/SSD) for model PVCs
  • Pre-download large models to avoid repeated downloads
  • Clean up workspace PVCs after job completion
  • Version your datasets in NemoDatastore
  • Use descriptive names for customization jobs
  • Tag experiments with metadata (model, dataset, purpose)
  • Monitor training metrics in W&B or MLflow
  • Keep records of successful hyperparameter configurations
  • Run multiple replicas of the API service for HA
  • Use PostgreSQL with backups for metadata
  • Configure OpenTelemetry for observability
  • Implement proper secret rotation for credentials

Troubleshooting

If training jobs fail to start, check:
  • GPU node availability and labels
  • PVC creation and mounting
  • NGC pull secrets are valid
  • The Volcano/Run.AI scheduler is running
If training runs out of GPU memory:
  • Reduce the batch size in hyperparameters
  • Move to instance types with more GPU memory
  • Enable gradient checkpointing
  • Use smaller LoRA rank values
If training is slow:
  • Use NVMe storage for the model PVC
  • Tune NCCL settings for your network
  • Check for CPU bottlenecks in data loading
  • Use mixed precision training (automatic in NeMo)
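When a job misbehaves, the standard Kubernetes debugging loop applies. The commands below assume training pods run in the nemo namespace; the pod name is a placeholder, and the vcjob shorthand applies only to Volcano-scheduled jobs:

```shell
# Inspect pods and scheduling events for a stuck training job
kubectl get pods -n nemo
kubectl describe pod <training-pod-name> -n nemo

# Tail training logs for OOM or NCCL errors
kubectl logs <training-pod-name> -n nemo --tail=100

# If using Volcano, check the VolcanoJob status as well
kubectl get vcjob -n nemo
```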

Next Steps

Deploy Entitystore

Set up adapter storage and serving

Configure NIM

Enable dynamic LoRA loading in NIM

Setup Evaluation

Evaluate your fine-tuned models

API Reference

Detailed API documentation
