The NVIDIA NIM Operator provides comprehensive monitoring and observability features to help you track the health and performance of your NIM deployments.

Prometheus Metrics Integration

The operator exposes Prometheus metrics on port 8080 for monitoring operator health and custom resource status.

Operator Metrics

The operator automatically exposes the following metrics:
Metric Name: nimService_status_total
Tracks the total number of NIMService instances by status:
  • Ready - Service is running and ready
  • NotReady - Service is running but not ready
  • Pending - Service is being created
  • Failed - Service has failed
  • Unknown - Status cannot be determined
Example query:
nimService_status_total{status="Ready"}
Metric Name: nimCache_status_total
Tracks the total number of NIMCache instances by status:
  • Ready - Cache is ready
  • NotReady - Cache is not ready
  • InProgress - Caching in progress
  • Failed - Caching failed
  • PVCCreated - PVC created
  • Pending - Waiting to start
  • Started - Caching started
  • Unknown - Status cannot be determined
Example query:
nimCache_status_total{status="Ready"}
Metric Name: nimPipeline_status_total
Tracks the total number of NIMPipeline instances by status:
  • Ready - Pipeline is ready
  • NotReady - Pipeline is not ready
  • Failed - Pipeline has failed
  • Unknown - Status cannot be determined
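Following the pattern of the example queries for the other metrics, a query counting ready pipelines would look like:

```
nimPipeline_status_total{status="Ready"}
```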
The operator also tracks metrics for NeMo services:
  • nemo_datastore_status_total - NeMo DataStore status
  • nemo_evaluator_status_total - NeMo Evaluator status
  • nemo_entitystore_status_total - NeMo EntityStore status
  • nemo_customizer_status_total - NeMo Customizer status
  • nemo_guardrail_status_total - NeMo Guardrail status
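The NeMo metrics follow the same status-label convention, so a query for ready NemoCustomizer instances, for example, would look like:

```
nemo_customizer_status_total{status="Ready"}
```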

Accessing Metrics

The operator exposes metrics via a ClusterIP service:
kubectl get service k8s-nim-operator-metrics-service -n nim-operator
To access metrics locally:
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl http://localhost:8080/metrics
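The endpoint returns metrics in the standard Prometheus text exposition format. As a sketch of how to pick out the operator's status counters, the snippet below filters a hypothetical excerpt of a /metrics response (the sample values are illustrative, not real operator output):

```shell
# Hypothetical excerpt of a /metrics response; real output will differ.
sample='# HELP nimService_status_total Number of NIMService instances by status
# TYPE nimService_status_total gauge
nimService_status_total{status="Ready"} 2
nimService_status_total{status="Pending"} 1
go_goroutines 42'

# Keep only the NIMService status counters, dropping comment lines
# (which start with "#") and unrelated runtime metrics.
echo "$sample" | grep '^nimService_status_total'
```

In a live cluster you would pipe the `curl` output above through the same `grep` instead of the sample string.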

ServiceMonitor Configuration

The operator supports Prometheus Operator’s ServiceMonitor for automatic metrics discovery.

Enabling ServiceMonitor for NIMService

Add the following to your NIMService spec:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  # ... other configuration ...
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        prometheus: kube-prometheus
      annotations:
        prometheus.io/scrape: "true"
      interval: 30s
      scrapeTimeout: 10s
The serviceMonitor configuration requires the Prometheus Operator to be installed in your cluster.

ServiceMonitor Fields

  • enabled (boolean): Enable or disable ServiceMonitor creation
  • additionalLabels (map[string]string): Additional labels for ServiceMonitor selection
  • annotations (map[string]string): Annotations to add to the ServiceMonitor
  • interval (duration): Scrape interval (e.g., 30s, 1m)
  • scrapeTimeout (duration): Scrape timeout (e.g., 10s)

Example ServiceMonitor

The operator creates a ServiceMonitor like this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: meta-llama3-8b-instruct
  namespace: default
  labels:
    app.kubernetes.io/name: meta-llama3-8b-instruct
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
  endpoints:
  - port: metrics
    interval: 30s
    scrapeTimeout: 10s

OpenTelemetry Support

NeMo services (Customizer, Evaluator, etc.) support OpenTelemetry for distributed tracing and observability.

Configuring OpenTelemetry

Step 1: Deploy an OpenTelemetry Collector

Install the OpenTelemetry Operator, then use it to deploy a Collector in your cluster to receive traces, metrics, and logs.
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
Step 2: Configure NeMo Service with OpenTelemetry

Add OpenTelemetry configuration to your NeMo service:
apiVersion: apps.nvidia.com/v1alpha1
kind: NemoCustomizer
metadata:
  name: my-customizer
spec:
  # ... other configuration ...
  otel:
    enabled: true
    exporterOtlpEndpoint: http://otel-collector:4317
    logLevel: INFO
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp
    excludedUrls:
      - health
      - metrics
    disableLogging: false
Step 3: Verify OpenTelemetry Configuration

Check that the environment variables are set correctly:
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].env}' | jq .
Look for variables like:
  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_TRACES_EXPORTER
  • OTEL_METRICS_EXPORTER
  • OTEL_LOGS_EXPORTER
  • OTEL_LOG_LEVEL

OpenTelemetry Configuration Options

  • enabled (boolean, default false): Enable OpenTelemetry instrumentation
  • exporterOtlpEndpoint (string, no default): OTLP collector endpoint URL
  • logLevel (string, default INFO): Log level (INFO, DEBUG)
  • exporterConfig.tracesExporter (string, default otlp): Traces exporter (otlp, console, none)
  • exporterConfig.metricsExporter (string, default otlp): Metrics exporter (otlp, console, none)
  • exporterConfig.logsExporter (string, default otlp): Logs exporter (otlp, console, none)
  • excludedUrls ([]string, default ["health"]): URLs to exclude from tracing
  • disableLogging (boolean, default false): Disable Python logging auto-instrumentation
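For local debugging without a collector, a minimal variant of the configuration (shown here as an assumed sketch against the same NemoCustomizer resource) can route everything to the console exporter:

```yaml
otel:
  enabled: true
  logLevel: DEBUG
  exporterConfig:
    tracesExporter: console   # print spans to stdout instead of shipping them
    metricsExporter: console
    logsExporter: console
```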

Monitoring Resource Status

Checking Status Conditions

All NIM resources expose status conditions that indicate their current state.
kubectl get nimservice meta-llama3-8b-instruct -o jsonpath='{.status}' | jq .

Understanding Status States

  • Ready: Service is running with all replicas available
  • NotReady: Service is running but not all replicas are ready
  • Pending: Service is being created or waiting for resources
  • Failed: Service deployment has failed

Monitoring Commands

# List all NIMServices with their status
kubectl get nimservice -A -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
STATUS:.status.state,\
REPLICAS:.status.availableReplicas

# List all NIMCaches with their status
kubectl get nimcache -A -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
STATUS:.status.state,\
PVC:.status.pvc
# Watch NIMService status
kubectl get nimservice -w

# Watch with custom columns
kubectl get nimservice -w -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.state,\
REPLICAS:.status.availableReplicas
# Get detailed conditions for a NIMService
kubectl get nimservice meta-llama3-8b-instruct \
  -o jsonpath='{.status.conditions}' | jq .

# Check specific condition
kubectl get nimservice meta-llama3-8b-instruct \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq .
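To illustrate what the condition filter above selects, here is the same selection applied with jq to a hypothetical conditions payload (the condition values are made up for illustration):

```shell
# Hypothetical .status.conditions payload from a NIMService.
conditions='[{"type":"Ready","status":"True","reason":"DeploymentReady"},
             {"type":"Failed","status":"False","reason":""}]'

# Select the Ready condition and print its status field,
# mirroring what the kubectl jsonpath filter returns.
echo "$conditions" | jq -r '.[] | select(.type=="Ready") | .status'
```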

Grafana Dashboards

You can create Grafana dashboards to visualize operator metrics.

Example Dashboard Queries

# Total NIMServices by status
sum by (status) (nimService_status_total)

# NIMService availability rate
sum(nimService_status_total{status="Ready"}) / sum(nimService_status_total{status!="Unknown"})

# NIMCache success rate
sum(nimCache_status_total{status="Ready"}) / sum(nimCache_status_total{status=~"Ready|Failed"})

# Active NIMServices
nimService_status_total{status!="Unknown"}
Ensure your Prometheus instance has the appropriate scrape configuration to collect metrics from the operator’s metrics service.
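As a sanity check on the availability-rate expression above: with hypothetical counts of 3 Ready and 1 NotReady services (and none Unknown), the ratio works out as follows:

```shell
# Hypothetical counts: 3 Ready out of 4 non-Unknown services.
ready=3
not_ready=1

# Same arithmetic as sum(Ready) / sum(non-Unknown) in the PromQL above.
awk -v r="$ready" -v n="$not_ready" 'BEGIN { printf "%.2f\n", r / (r + n) }'
# prints 0.75
```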

Health Checks

The operator exposes health check endpoints:
# Health probe endpoint
kubectl port-forward -n nim-operator deployment/k8s-nim-operator-controller-manager 8081:8081
curl http://localhost:8081/healthz
curl http://localhost:8081/readyz

Best Practices

Set Appropriate Intervals

Configure scrape intervals based on your monitoring needs. Start with 30s and adjust based on load.

Use Label Selectors

Add labels to ServiceMonitors for easy filtering and grouping in Prometheus.

Monitor Cache Operations

Track NIMCache status to ensure model caching completes successfully.

Set Up Alerts

Create Prometheus alerts for critical status changes (e.g., Failed state).
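As a starting point, a PrometheusRule along these lines could fire whenever any NIMService reports the Failed state. The rule name, `for` duration, and severity label are illustrative assumptions, not values prescribed by the operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-operator-alerts
spec:
  groups:
  - name: nim-operator
    rules:
    - alert: NIMServiceFailed
      # Fires if any NIMService has been in the Failed state for 5 minutes.
      expr: nimService_status_total{status="Failed"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "One or more NIMService instances are in the Failed state"
```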

Next Steps

Troubleshooting

Learn how to diagnose and fix common issues

Best Practices

Optimize your NIM deployments for production
