The NVIDIA NIM Operator provides comprehensive monitoring and observability features to help you track the health and performance of your NIM deployments.

Prometheus Metrics Integration

The operator exposes Prometheus metrics on port 8080 for monitoring operator health and custom resource status.

Operator Metrics

The operator automatically exposes the following metrics:
Metric Name: nimService_status_total
Tracks the total number of NIMService instances by status:
  • Ready - Service is running and ready
  • NotReady - Service is running but not ready
  • Pending - Service is being created
  • Failed - Service has failed
  • Unknown - Status cannot be determined
Example query:
nimService_status_total{status="Ready"}
Metric Name: nimCache_status_total
Tracks the total number of NIMCache instances by status:
  • Ready - Cache is ready
  • NotReady - Cache is not ready
  • InProgress - Caching in progress
  • Failed - Caching failed
  • PVCCreated - PVC created
  • Pending - Waiting to start
  • Started - Caching started
  • Unknown - Status cannot be determined
Example query:
nimCache_status_total{status="Ready"}
Metric Name: nimPipeline_status_total
Tracks the total number of NIMPipeline instances by status:
  • Ready - Pipeline is ready
  • NotReady - Pipeline is not ready
  • Failed - Pipeline has failed
  • Unknown - Status cannot be determined
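Following the pattern of the example queries for the other metrics, a query counting ready pipelines would look like:

```
nimPipeline_status_total{status="Ready"}
```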
The operator also tracks metrics for NeMo services:
  • nemo_datastore_status_total - NeMo DataStore status
  • nemo_evaluator_status_total - NeMo Evaluator status
  • nemo_entitystore_status_total - NeMo EntityStore status
  • nemo_customizer_status_total - NeMo Customizer status
  • nemo_guardrail_status_total - NeMo Guardrail status
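The NeMo metrics follow the same status-label convention, so a query for ready NemoCustomizer instances, for example, would look like:

```
nemo_customizer_status_total{status="Ready"}
```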

Accessing Metrics

The operator exposes metrics via a ClusterIP service:
kubectl get service k8s-nim-operator-metrics-service -n nim-operator
To access metrics locally:
kubectl port-forward -n nim-operator svc/k8s-nim-operator-metrics-service 8080:8080
curl http://localhost:8080/metrics
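The endpoint returns metrics in the standard Prometheus text exposition format. As a sketch of how to pick out the operator's status counters, the snippet below filters a hypothetical excerpt of a /metrics response (the sample values are illustrative, not real operator output):

```shell
# Hypothetical excerpt of a /metrics response; real output will differ.
sample='# HELP nimService_status_total Number of NIMService instances by status
# TYPE nimService_status_total gauge
nimService_status_total{status="Ready"} 2
nimService_status_total{status="Pending"} 1
go_goroutines 42'

# Keep only the NIMService status counters, dropping comment lines
# (which start with "#") and unrelated runtime metrics.
echo "$sample" | grep '^nimService_status_total'
```

In a live cluster you would pipe the `curl` output above through the same `grep` instead of the sample string.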

ServiceMonitor Configuration

The operator supports Prometheus Operator’s ServiceMonitor for automatic metrics discovery.

Enabling ServiceMonitor for NIMService

Add the following to your NIMService spec:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  # ... other configuration ...
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        prometheus: kube-prometheus
      annotations:
        prometheus.io/scrape: "true"
      interval: 30s
      scrapeTimeout: 10s
The serviceMonitor configuration requires the Prometheus Operator to be installed in your cluster.

ServiceMonitor Fields

  • enabled (boolean): Enable or disable ServiceMonitor creation
  • additionalLabels (map[string]string): Additional labels for ServiceMonitor selection
  • annotations (map[string]string): Annotations to add to the ServiceMonitor
  • interval (duration): Scrape interval (e.g., 30s, 1m)
  • scrapeTimeout (duration): Scrape timeout (e.g., 10s)

Example ServiceMonitor

The operator creates a ServiceMonitor like this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: meta-llama3-8b-instruct
  namespace: default
  labels:
    app.kubernetes.io/name: meta-llama3-8b-instruct
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: meta-llama3-8b-instruct
  endpoints:
  - port: metrics
    interval: 30s
    scrapeTimeout: 10s

OpenTelemetry Support

NeMo services (Customizer, Evaluator, etc.) support OpenTelemetry for distributed tracing and observability.

Configuring OpenTelemetry

Step 1: Deploy an OpenTelemetry Collector

Install the OpenTelemetry Operator, then use it to deploy a Collector in your cluster to receive traces, metrics, and logs.
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
Step 2: Configure NeMo Service with OpenTelemetry

Add OpenTelemetry configuration to your NeMo service:
apiVersion: apps.nvidia.com/v1alpha1
kind: NemoCustomizer
metadata:
  name: my-customizer
spec:
  # ... other configuration ...
  otel:
    enabled: true
    exporterOtlpEndpoint: http://otel-collector:4317
    logLevel: INFO
    exporterConfig:
      tracesExporter: otlp
      metricsExporter: otlp
      logsExporter: otlp
    excludedUrls:
      - health
      - metrics
    disableLogging: false
Step 3: Verify OpenTelemetry Configuration

Check that the environment variables are set correctly:
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].env}' | jq .
Look for variables like:
  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_TRACES_EXPORTER
  • OTEL_METRICS_EXPORTER
  • OTEL_LOGS_EXPORTER
  • OTEL_LOG_LEVEL

OpenTelemetry Configuration Options

  • enabled (boolean, default false): Enable OpenTelemetry instrumentation
  • exporterOtlpEndpoint (string, no default): OTLP collector endpoint URL
  • logLevel (string, default INFO): Log level (INFO, DEBUG)
  • exporterConfig.tracesExporter (string, default otlp): Traces exporter (otlp, console, none)
  • exporterConfig.metricsExporter (string, default otlp): Metrics exporter (otlp, console, none)
  • exporterConfig.logsExporter (string, default otlp): Logs exporter (otlp, console, none)
  • excludedUrls ([]string, default ["health"]): URLs to exclude from tracing
  • disableLogging (boolean, default false): Disable Python logging auto-instrumentation
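For local debugging without a collector, a minimal variant of the configuration (shown here as an assumed sketch against the same NemoCustomizer resource) can route everything to the console exporter:

```yaml
otel:
  enabled: true
  logLevel: DEBUG
  exporterConfig:
    tracesExporter: console   # print spans to stdout instead of shipping them
    metricsExporter: console
    logsExporter: console
```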

Monitoring Resource Status

Checking Status Conditions

All NIM resources expose status conditions that indicate their current state.
kubectl get nimservice meta-llama3-8b-instruct -o jsonpath='{.status}' | jq .

Understanding Status States

  • Ready: Service is running with all replicas available
  • NotReady: Service is running but not all replicas are ready
  • Pending: Service is being created or waiting for resources
  • Failed: Service deployment has failed

Monitoring Commands

# List all NIMServices with their status
kubectl get nimservice -A -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
STATUS:.status.state,\
REPLICAS:.status.availableReplicas

# List all NIMCaches with their status
kubectl get nimcache -A -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
STATUS:.status.state,\
PVC:.status.pvc
# Watch NIMService status
kubectl get nimservice -w

# Watch with custom columns
kubectl get nimservice -w -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.state,\
REPLICAS:.status.availableReplicas
# Get detailed conditions for a NIMService
kubectl get nimservice meta-llama3-8b-instruct \
  -o jsonpath='{.status.conditions}' | jq .

# Check specific condition
kubectl get nimservice meta-llama3-8b-instruct \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq .
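To illustrate what the condition filter above selects, here is the same selection applied with jq to a hypothetical conditions payload (the condition values are made up for illustration):

```shell
# Hypothetical .status.conditions payload from a NIMService.
conditions='[{"type":"Ready","status":"True","reason":"DeploymentReady"},
             {"type":"Failed","status":"False","reason":""}]'

# Select the Ready condition and print its status field,
# mirroring what the kubectl jsonpath filter returns.
echo "$conditions" | jq -r '.[] | select(.type=="Ready") | .status'
```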

Grafana Dashboards

You can create Grafana dashboards to visualize operator metrics.

Example Dashboard Queries

# Total NIMServices by status
sum by (status) (nimService_status_total)

# NIMService availability rate
sum(nimService_status_total{status="Ready"}) / sum(nimService_status_total{status!="Unknown"})

# NIMCache success rate
sum(nimCache_status_total{status="Ready"}) / sum(nimCache_status_total{status=~"Ready|Failed"})

# Active NIMServices
nimService_status_total{status!="Unknown"}
Ensure your Prometheus instance has the appropriate scrape configuration to collect metrics from the operator’s metrics service.
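As a sanity check on the availability-rate expression above: with hypothetical counts of 3 Ready and 1 NotReady services (and none Unknown), the ratio works out as follows:

```shell
# Hypothetical counts: 3 Ready out of 4 non-Unknown services.
ready=3
not_ready=1

# Same arithmetic as sum(Ready) / sum(non-Unknown) in the PromQL above.
awk -v r="$ready" -v n="$not_ready" 'BEGIN { printf "%.2f\n", r / (r + n) }'
# prints 0.75
```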

Health Checks

The operator exposes health check endpoints:
# Health probe endpoint
kubectl port-forward -n nim-operator deployment/k8s-nim-operator-controller-manager 8081:8081
curl http://localhost:8081/healthz
curl http://localhost:8081/readyz

Best Practices

Set Appropriate Intervals

Configure scrape intervals based on your monitoring needs. Start with 30s and adjust based on load.

Use Label Selectors

Add labels to ServiceMonitors for easy filtering and grouping in Prometheus.

Monitor Cache Operations

Track NIMCache status to ensure model caching completes successfully.

Set Up Alerts

Create Prometheus alerts for critical status changes (e.g., Failed state).
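As a starting point, a PrometheusRule along these lines could fire whenever any NIMService reports the Failed state. The rule name, `for` duration, and severity label are illustrative assumptions, not values prescribed by the operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-operator-alerts
spec:
  groups:
  - name: nim-operator
    rules:
    - alert: NIMServiceFailed
      # Fires if any NIMService has been in the Failed state for 5 minutes.
      expr: nimService_status_total{status="Failed"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "One or more NIMService instances are in the Failed state"
```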

Next Steps

Troubleshooting

Learn how to diagnose and fix common issues

Best Practices

Optimize your NIM deployments for production
