Prometheus Metrics Integration
The operator exposes Prometheus metrics on port 8080 for monitoring operator health and custom resource status.
Operator Metrics
The operator automatically exposes the following metrics:
NIMService Status Metrics
Metric Name: nimService_status_total
Tracks the total number of NIMService instances by status:
- Ready: Service is running and ready
- NotReady: Service is running but not ready
- Pending: Service is being created
- Failed: Service has failed
- Unknown: Status cannot be determined
NIMCache Status Metrics
Metric Name: nimCache_status_total
Tracks the total number of NIMCache instances by status:
- Ready: Cache is ready
- NotReady: Cache is not ready
- InProgress: Caching in progress
- Failed: Caching failed
- PVCCreated: PVC created
- Pending: Waiting to start
- Started: Caching started
- Unknown: Status cannot be determined
NIMPipeline Status Metrics
Metric Name: nimPipeline_status_total
Tracks the total number of NIMPipeline instances by status:
- Ready: Pipeline is ready
- NotReady: Pipeline is not ready
- Failed: Pipeline has failed
- Unknown: Status cannot be determined
NeMo Service Metrics
The operator also tracks metrics for NeMo services:
- nemo_datastore_status_total: NeMo DataStore status
- nemo_evaluator_status_total: NeMo Evaluator status
- nemo_entitystore_status_total: NeMo EntityStore status
- nemo_customizer_status_total: NeMo Customizer status
- nemo_guardrail_status_total: NeMo Guardrail status
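The status metrics listed above can be summarized with a few lines of Python once the metrics text has been scraped. The following is a minimal sketch, not the operator's own tooling: it assumes the metrics are served in the standard Prometheus text exposition format and that the status value is carried in a label named `status` (an assumption; check the real output and adjust). The sample text is hypothetical.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 3
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# HELP nimService_status_total Hypothetical help text
# TYPE nimService_status_total gauge
nimService_status_total{status="Ready"} 3
nimService_status_total{status="Failed"} 1
nimCache_status_total{status="Ready"} 2
"""


def status_counts(metrics_text, metric_name, label="status"):
    """Sum the samples of `metric_name` per value of `label`.

    The label name "status" is an assumption, not confirmed by the docs.
    """
    counts = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        counts[labels.get(label, "Unknown")] += float(match.group("value"))
    return dict(counts)


if __name__ == "__main__":
    print(status_counts(SAMPLE, "nimService_status_total"))
```

The same function works for `nimCache_status_total`, `nimPipeline_status_total`, and the `nemo_*_status_total` metrics by changing the `metric_name` argument.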
Accessing Metrics
The operator exposes metrics via a ClusterIP service.
ServiceMonitor Configuration
The operator supports Prometheus Operator’s ServiceMonitor for automatic metrics discovery.
Enabling ServiceMonitor for NIMService
Add a serviceMonitor block to your NIMService spec, using the fields listed under ServiceMonitor Fields.
The serviceMonitor configuration requires the Prometheus Operator to be installed in your cluster.
ServiceMonitor Fields
| Field | Type | Description |
|---|---|---|
| enabled | boolean | Enable or disable ServiceMonitor creation |
| additionalLabels | map[string]string | Additional labels for ServiceMonitor selection |
| annotations | map[string]string | Annotations to add to the ServiceMonitor |
| interval | duration | Scrape interval (e.g., 30s, 1m) |
| scrapeTimeout | duration | Scrape timeout (e.g., 10s) |
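To make the field table concrete, here is a minimal Python sketch that assembles a serviceMonitor block and checks the two duration fields before it is written into a manifest. The helper names, the simplified duration pattern, and the `release: prometheus` label are illustrative assumptions; where exactly the block nests inside the NIMService spec is not shown here and should be taken from the operator's CRD. The timeout check reflects the Prometheus rule that a scrape timeout may not exceed the scrape interval.

```python
import json
import re

# Simplified Prometheus-style duration: one or more <number><unit> groups (30s, 1m, 1h30m).
DURATION_RE = re.compile(r"(?:\d+(?:ms|s|m|h|d|w|y))+")
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(duration):
    """Convert a duration string such as "1m30s" to seconds."""
    if not DURATION_RE.fullmatch(duration):
        raise ValueError(f"not a valid duration: {duration!r}")
    parts = re.findall(r"(\d+)(ms|s|m|h|d|w|y)", duration)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def service_monitor_block(enabled=True, interval="30s", scrape_timeout="10s",
                          additional_labels=None, annotations=None):
    """Build a dict with the serviceMonitor fields from the table (hypothetical helper)."""
    if to_seconds(scrape_timeout) > to_seconds(interval):
        raise ValueError("scrapeTimeout must not exceed interval")
    return {
        "enabled": enabled,
        "additionalLabels": dict(additional_labels or {}),
        "annotations": dict(annotations or {}),
        "interval": interval,
        "scrapeTimeout": scrape_timeout,
    }


if __name__ == "__main__":
    # The label below is a hypothetical selector label, not a required value.
    block = service_monitor_block(interval="1m", additional_labels={"release": "prometheus"})
    print(json.dumps({"serviceMonitor": block}, indent=2))
```

The printed JSON is also valid YAML, so it can be pasted into a manifest and adjusted by hand.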
Example ServiceMonitor
When enabled, the operator creates a ServiceMonitor resource for the NIMService.
OpenTelemetry Support
NeMo services (Customizer, Evaluator, etc.) support OpenTelemetry for distributed tracing and observability.
Configuring OpenTelemetry
Deploy OpenTelemetry Collector
Deploy an OpenTelemetry Collector in your cluster to receive traces, metrics, and logs.
OpenTelemetry Configuration Options
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable OpenTelemetry instrumentation |
| exporterOtlpEndpoint | string | - | OTLP collector endpoint URL |
| logLevel | string | INFO | Log level (INFO, DEBUG) |
| exporterConfig.tracesExporter | string | otlp | Traces exporter (otlp, console, none) |
| exporterConfig.metricsExporter | string | otlp | Metrics exporter (otlp, console, none) |
| exporterConfig.logsExporter | string | otlp | Logs exporter (otlp, console, none) |
| excludedUrls | []string | ["health"] | URLs to exclude from tracing |
| disableLogging | boolean | false | Disable Python logging auto-instrumentation |
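The options in the table can be captured in a small Python helper that starts from the documented defaults and rejects values outside the listed choices. This is a sketch under assumptions: the function name and the collector URL are hypothetical, and the key under which the resulting block sits in a NeMo service spec is not specified here.

```python
import json

ALLOWED_EXPORTERS = {"otlp", "console", "none"}
ALLOWED_LOG_LEVELS = {"INFO", "DEBUG"}


def otel_config(endpoint, log_level="INFO", traces_exporter="otlp",
                metrics_exporter="otlp", logs_exporter="otlp",
                excluded_urls=("health",), disable_logging=False):
    """Build an enabled OpenTelemetry block using the defaults from the table."""
    if not endpoint:
        raise ValueError("exporterOtlpEndpoint has no default and must be set")
    if log_level not in ALLOWED_LOG_LEVELS:
        raise ValueError(f"logLevel must be one of {sorted(ALLOWED_LOG_LEVELS)}")
    exporters = {
        "tracesExporter": traces_exporter,
        "metricsExporter": metrics_exporter,
        "logsExporter": logs_exporter,
    }
    for field, value in exporters.items():
        if value not in ALLOWED_EXPORTERS:
            raise ValueError(f"{field} must be one of {sorted(ALLOWED_EXPORTERS)}")
    return {
        "enabled": True,  # the documented default is false, so enable explicitly
        "exporterOtlpEndpoint": endpoint,
        "logLevel": log_level,
        "exporterConfig": exporters,
        "excludedUrls": list(excluded_urls),
        "disableLogging": disable_logging,
    }


if __name__ == "__main__":
    # Hypothetical in-cluster collector address; replace with your own.
    cfg = otel_config("http://otel-collector.example:4317", traces_exporter="console")
    print(json.dumps(cfg, indent=2))
```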
Monitoring Resource Status
Checking Status Conditions
All NIM resources expose status conditions that indicate their current state.
Understanding Status States
NIMService:
- Ready: Service is running with all replicas available
- NotReady: Service is running but not all replicas are ready
- Pending: Service is being created or waiting for resources
- Failed: Service deployment has failed
Monitoring Commands
List All Resources with Status
Watch Resource Status Changes
Check Detailed Conditions
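Once a resource has been exported as JSON, its conditions can be inspected programmatically. The sketch below assumes the resource follows the usual Kubernetes convention of a `status.conditions` list whose entries carry `type`, `status`, `reason`, and `message`, and that the overall state is stored in `status.state`; both are assumptions to verify against a real object. The sample resource is hypothetical.

```python
import json

# Hypothetical resource, shaped like the JSON form of a Kubernetes custom resource.
SAMPLE_RESOURCE = json.loads("""
{
  "kind": "NIMService",
  "metadata": {"name": "example-nim", "namespace": "nim"},
  "status": {
    "state": "NotReady",
    "conditions": [
      {"type": "Ready", "status": "False", "reason": "NotReady",
       "message": "waiting for replicas"},
      {"type": "Failed", "status": "False", "reason": "", "message": ""}
    ]
  }
}
""")


def summarize_status(resource):
    """Return (state, conditions) where conditions is a list of
    (type, status, reason, message) tuples."""
    status = resource.get("status", {})
    state = status.get("state", "Unknown")
    conditions = [
        (c.get("type", ""), c.get("status", ""), c.get("reason", ""), c.get("message", ""))
        for c in status.get("conditions", [])
    ]
    return state, conditions


if __name__ == "__main__":
    state, conditions = summarize_status(SAMPLE_RESOURCE)
    print(f"state: {state}")
    for ctype, cstatus, reason, message in conditions:
        print(f"  {ctype}={cstatus} reason={reason!r} message={message!r}")
```

A resource with no status yet is reported as `Unknown` with an empty condition list, which matches the Unknown state used by the status metrics.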
Grafana Dashboards
You can create Grafana dashboards to visualize operator metrics.
Example Dashboard Queries
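As a starting point for dashboard panels, the following Python sketch generates PromQL expressions for the status metrics named earlier. It is illustrative only: the queries assume the status value is exposed in a label named `status`, which should be confirmed against the operator's actual metrics output.

```python
STATUS_METRICS = [
    "nimService_status_total",
    "nimCache_status_total",
    "nimPipeline_status_total",
]


def count_by_status(metric, label="status"):
    """PromQL for a panel that shows one series per status value."""
    return f"sum by ({label}) ({metric})"


def count_in_state(metric, state, label="status"):
    """PromQL for a single-stat panel counting resources in one state."""
    return f'sum({metric}{{{label}="{state}"}})'


if __name__ == "__main__":
    for metric in STATUS_METRICS:
        print(count_by_status(metric))
        print(count_in_state(metric, "Failed"))
```

The generated strings can be pasted into a Grafana panel's query editor with a Prometheus data source selected.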
Health Checks
The operator exposes health check endpoints.
Best Practices
Set Appropriate Intervals
Configure scrape intervals based on your monitoring needs. Start with 30s and adjust based on load.
Use Label Selectors
Add labels to ServiceMonitors for easy filtering and grouping in Prometheus.
Monitor Cache Operations
Track NIMCache status to ensure model caching completes successfully.
Set Up Alerts
Create Prometheus alerts for critical status changes (e.g., Failed state).
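The alerting advice above can be sketched as a generated Prometheus rule group. This is one possible shape, not a rule shipped with the operator: the alert names, the 5m hold time, the severity label, and the `status` label in the expression are all assumptions to adapt. The output is JSON, which Prometheus can load as a rule file because JSON is valid YAML.

```python
import json


def failed_alert_rule(metric, alert_name, label="status",
                      for_duration="5m", severity="critical"):
    """Build one alerting rule that fires while any resource is in the Failed state."""
    return {
        "alert": alert_name,
        "expr": f'sum({metric}{{{label}="Failed"}}) > 0',
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {
            "summary": f"{metric} reports resources in the Failed state",
        },
    }


def rule_group(name, rules):
    """Wrap rules in the top-level structure of a Prometheus rule file."""
    return {"groups": [{"name": name, "rules": list(rules)}]}


if __name__ == "__main__":
    # Alert and group names below are hypothetical.
    rules = [
        failed_alert_rule("nimService_status_total", "NIMServiceFailed"),
        failed_alert_rule("nimCache_status_total", "NIMCacheFailed"),
    ]
    print(json.dumps(rule_group("nim-operator-status", rules), indent=2))
```

When the Prometheus Operator manages your rules, the same `groups` list would go under the `spec` of a PrometheusRule resource rather than into a standalone file.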
Next Steps
- Troubleshooting: Learn how to diagnose and fix common issues
- Best Practices: Optimize your NIM deployments for production