Overview
NemoCustomizer enables you to:
- Fine-tune foundation models on proprietary datasets
- Use LoRA/PEFT for memory-efficient training
- Orchestrate multi-GPU training jobs with Volcano or Run.AI
- Track experiments with Weights & Biases
- Store and version adapters in NemoEntitystore
- Serve multiple fine-tuned adapters from a single NIM instance
When to Use NemoCustomizer
Domain Adaptation
Adapt general models to specialized domains like healthcare, legal, or finance
Instruction Tuning
Train models to follow specific instruction formats or conversation styles
Style Transfer
Customize output style, tone, or formatting for brand consistency
Multi-Tenancy
Create customer-specific adapters for SaaS deployments
Architecture
NemoCustomizer consists of several components:
- API Service: REST API for managing customization jobs
- Training Jobs: Kubernetes Jobs/VolcanoJobs for model training
- Model Storage: PVC for base models and training artifacts
- Database: PostgreSQL for job metadata and state
- Datastore Integration: Fetch training datasets
- Entitystore Integration: Store trained adapters
Configuration
Complete Example
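The pieces described below can be wired together in a single values file. The sketch that follows is illustrative: the key names are assumptions inferred from the fields documented in the next section, so check the chart's default values for the exact schema.

```yaml
# Illustrative values.yaml sketch for NemoCustomizer -- key names are
# assumptions; consult the chart's defaults for the exact schema.
customizer:
  scheduler: volcano                # or "runai"
  database:                         # PostgreSQL for job metadata
    host: postgres.nemo.svc.cluster.local
    port: 5432
    databaseName: customizer
    existingSecret: customizer-pg-credentials
  entitystoreURL: http://nemo-entitystore:8000   # adapter uploads
  datastoreURL: http://nemo-datastore:3000       # dataset downloads
  mlflowTrackingURL: http://mlflow:5000          # experiment tracking
  wandb:
    existingSecret: wandb-api-key
  modelsStorage:                    # shared base-model cache
    storageClass: nvme-rwx
    size: 500Gi
    accessMode: ReadWriteMany       # needed for multi-node jobs
  workspaceStorage:                 # per-job workspace template
    storageClass: nvme
    size: 100Gi
```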
Key Configuration Fields
- Scheduler: Training job scheduler. Options: `volcano` or `runai`.
- Database: PostgreSQL connection configuration for storing customization job metadata.
- Entitystore URL: NemoEntitystore service URL for uploading trained adapters.
- Datastore URL: NemoDatastore service URL for fetching training datasets.
- MLflow URL: MLflow tracking server URL for experiment tracking.
- Weights & Biases: Configuration for experiment tracking and visualization.
- Model PVC: Persistent volume for caching base models. Shared across training jobs.
- Workspace: Per-job workspace configuration. Automatically created for each training job.
- Training environment: NCCL/networking environment variables for multi-node training.
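For the NCCL/networking variables, a minimal sketch follows. The NCCL variable names are standard NCCL settings, but the surrounding `trainingEnv` key is an assumption; adapt it to your chart's schema and your cluster's network interface.

```yaml
# Hypothetical trainingEnv section for multi-node jobs; the interface
# name (eth0) must match your nodes' high-bandwidth NIC.
trainingEnv:
  - name: NCCL_SOCKET_IFNAME
    value: eth0
  - name: NCCL_IB_DISABLE
    value: "0"       # keep InfiniBand enabled where available
  - name: NCCL_DEBUG
    value: WARN      # raise to INFO when debugging collectives
```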
Integration with Services
NemoDatastore Integration
NemoCustomizer fetches training datasets from NemoDatastore.
NemoEntitystore Integration
Trained adapters are automatically uploaded to NemoEntitystore.
MLflow Tracking
Training metrics are logged to MLflow.
Weights & Biases
Weights & Biases can be used for advanced experiment tracking.
Training Job Schedulers
Volcano Scheduler
Volcano is the default scheduler and provides Kubernetes-native gang scheduling.
Run.AI Scheduler
Use the Run.AI scheduler for Run.AI-managed clusters.
Storage Configuration
Model PVC
Shared storage for base models. Use `ReadWriteMany` if running multi-node training jobs on different nodes.
Workspace PVC
Per-job workspace (automatically created).
API Usage
Create Customization Job
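A hedged sketch of submitting a job against the API service. The service URL, endpoint path, and request fields below are assumptions based on a typical LoRA fine-tuning job; check the API reference for the exact schema of your NemoCustomizer version.

```python
import json
import urllib.request

# Assumed in-cluster service URL; adjust to your deployment.
CUSTOMIZER_URL = "http://nemo-customizer.nemo.svc.cluster.local:8000"

# Illustrative request body -- field names may differ in your version.
payload = {
    "config": "meta/llama-3.1-8b-instruct",    # base model to fine-tune
    "dataset": {"name": "support-tickets-v1"},  # dataset in NemoDatastore
    "hyperparameters": {
        "finetuning_type": "lora",
        "epochs": 3,
        "batch_size": 8,
        "learning_rate": 1e-4,
        "lora": {"adapter_dim": 16},            # small rank limits GPU memory
    },
}

req = urllib.request.Request(
    f"{CUSTOMIZER_URL}/v1/customization/jobs",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would submit the job; the response
# includes a job id used to poll status later.
print(req.get_full_url())
```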
Monitor Job Status
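Once a job is created, its progress can be polled. Again a sketch: the status path and the job id are placeholders, not confirmed API details.

```python
import urllib.request

CUSTOMIZER_URL = "http://nemo-customizer.nemo.svc.cluster.local:8000"  # assumed URL
job_id = "cust-example-123"  # placeholder; use the id returned by the create call

# Status endpoint path is an assumption -- check your API reference.
status_req = urllib.request.Request(
    f"{CUSTOMIZER_URL}/v1/customization/jobs/{job_id}/status"
)
# with urllib.request.urlopen(status_req) as resp:
#     status = json.load(resp)   # e.g. {"status": "running", ...}
print(status_req.get_full_url())
```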
List Customizations
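Listing existing jobs would look similar; the query parameter name below is illustrative and may differ from the actual pagination scheme.

```python
import urllib.parse
import urllib.request

CUSTOMIZER_URL = "http://nemo-customizer.nemo.svc.cluster.local:8000"  # assumed URL

# "page_size" is an assumed parameter name for pagination.
params = urllib.parse.urlencode({"page_size": 20})
list_req = urllib.request.Request(
    f"{CUSTOMIZER_URL}/v1/customization/jobs?{params}"
)
# with urllib.request.urlopen(list_req) as resp:
#     jobs = json.load(resp)
print(list_req.get_full_url())
```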
Best Practices
Resource Allocation
- Use dedicated GPU nodes for training jobs
- Configure appropriate node selectors and tolerations
- Set reasonable timeout values to prevent stuck jobs
- Monitor GPU utilization and adjust batch sizes
Data Management
- Use fast storage (NVMe/SSD) for model PVCs
- Pre-download large models to avoid repeated downloads
- Clean up workspace PVCs after job completion
- Version your datasets in NemoDatastore
Experiment Tracking
- Use descriptive names for customization jobs
- Tag experiments with metadata (model, dataset, purpose)
- Monitor training metrics in W&B or MLflow
- Keep records of successful hyperparameter configurations
Production Deployment
- Run multiple replicas of the API service for HA
- Use PostgreSQL with backups for metadata
- Configure OpenTelemetry for observability
- Implement proper secret rotation for credentials
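The HA and observability points above might translate into overrides like the following; both key names are assumptions, shown only to indicate where such settings typically live in a Helm values file.

```yaml
# Illustrative HA/observability overrides (key names assumed):
replicaCount: 3                      # multiple API service replicas
otelExporter:
  endpoint: http://otel-collector:4317   # OTLP gRPC collector
```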
Troubleshooting
Training Job Fails to Start
Check:
- GPU node availability and labels
- PVC creation and mounting
- NGC pull secrets are valid
- Volcano/Run.AI scheduler is running
Out of Memory Errors
Solutions:
- Reduce batch size in hyperparameters
- Increase GPU memory by using larger instance types
- Enable gradient checkpointing
- Use smaller LoRA rank values
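A memory-conscious retry of a failed job might adjust the hyperparameters as follows; the field names mirror the illustrative (assumed) job schema above, not a confirmed API.

```yaml
# Memory-conscious hyperparameters for retrying an OOM'd job
# (field names are illustrative):
hyperparameters:
  batch_size: 4          # halved from the failing run
  finetuning_type: lora
  lora:
    adapter_dim: 8       # smaller rank -> fewer trainable parameters
```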
Slow Training Performance
Optimize:
- Use NVMe storage for model PVC
- Configure NCCL settings for your network
- Check for CPU bottlenecks in data loading
- Use mixed precision training (automatic in NeMo)
Next Steps
Deploy Entitystore
Set up adapter storage and serving
Configure NIM
Enable dynamic LoRA loading in NIM
Setup Evaluation
Evaluate your fine-tuned models
API Reference
Detailed API documentation