Overview
Charts support Prometheus Operator monitoring through two custom resources:

- ServiceMonitor: Configures Prometheus scraping
- PrometheusRule: Defines alerting rules
ServiceMonitor
Basic Configuration
Enable ServiceMonitor to automatically configure Prometheus to scrape your application:

```yaml
serviceMonitor:
  enabled: true
```
Complete Configuration
```yaml
serviceMonitor:
  enabled: true
  interval: 30s
  timeout: 10s
```
Configuration Options
- Scraping
- Namespace
- Labels
- Relabeling
```yaml
serviceMonitor:
  enabled: true
  # How often to scrape
  interval: 30s
  # Scrape timeout
  timeout: 10s
  # Metrics endpoint path
  telemetryPath: /metrics
```
- Use 30s interval for most applications
- Set timeout < interval
- Ensure metrics endpoint is ready before pod becomes ready
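The "timeout < interval" rule can be checked mechanically before deploying. A minimal Python sketch (both helpers are hypothetical, not part of any chart tooling; compound durations like `1m30s` are not handled):

```python
import re

def parse_duration(d: str) -> float:
    """Parse a simple Prometheus-style duration ("30s", "10m") into seconds."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", d)
    if not m:
        raise ValueError(f"unrecognized duration: {d}")
    return float(m.group(1)) * {"ms": 0.001, "s": 1, "m": 60, "h": 3600}[m.group(2)]

def validate_scrape_settings(interval: str, timeout: str) -> None:
    # Prometheus rejects a scrape config whose timeout exceeds its interval.
    if parse_duration(timeout) >= parse_duration(interval):
        raise ValueError(f"timeout {timeout} must be shorter than interval {interval}")

validate_scrape_settings("30s", "10s")  # the values from the example above pass
```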
```yaml
serviceMonitor:
  enabled: true
  # Deploy ServiceMonitor to monitoring namespace
  namespace: monitoring
```
- Centralized monitoring resources
- Easier RBAC management
- Matches Prometheus Operator namespace selector
```yaml
serviceMonitor:
  enabled: true
  labels:
    prometheus: kube-prometheus
    release: prometheus-operator
    team: platform
```
These labels must match the Prometheus CR's `serviceMonitorSelector`:

```yaml
# Your Prometheus CR should have:
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: kube-prometheus
```
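The selection logic itself is a simple subset check: every key/value pair in the Prometheus CR's `matchLabels` must appear verbatim on the ServiceMonitor. A sketch of that check (illustrative only, not Operator code; `matchExpressions` are not modeled):

```python
def matches_selector(monitor_labels: dict, match_labels: dict) -> bool:
    """True if every matchLabels pair appears verbatim in the object's labels."""
    return all(monitor_labels.get(k) == v for k, v in match_labels.items())

selector = {"prometheus": "kube-prometheus"}
# Extra labels are fine; a missing key or different value is not.
assert matches_selector({"prometheus": "kube-prometheus", "team": "platform"}, selector)
assert not matches_selector({"release": "prometheus-operator"}, selector)
```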
```yaml
serviceMonitor:
  enabled: true
  # Add/modify labels from Kubernetes metadata
  relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      targetLabel: node
  # Filter or modify metrics (the two rules below are shown together for
  # illustration; the keep rule alone would already discard go_* metrics)
  metricRelabelings:
    # Drop go runtime metrics
    - sourceLabels: [__name__]
      regex: 'go_.*'
      action: drop
    # Keep only specific metrics
    - sourceLabels: [__name__]
      regex: '(http_requests_total|http_request_duration_seconds)'
      action: keep
  # Transfer Service labels to targets
  targetLabels:
    - app
    - version
    - environment
```
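The keep/drop mechanics can be simulated in a few lines. This is an illustrative sketch, not Prometheus's implementation: it handles only the `keep` and `drop` actions, joins source labels with the default `;` separator, and anchors the regex the way Prometheus does:

```python
import re

def survives_relabeling(labels: dict, configs: list) -> bool:
    """Return True if a sample survives a list of keep/drop relabeling rules."""
    for cfg in configs:
        joined = ";".join(labels.get(name, "") for name in cfg["sourceLabels"])
        matched = re.fullmatch(cfg["regex"], joined) is not None
        if cfg["action"] == "drop" and matched:
            return False
        if cfg["action"] == "keep" and not matched:
            return False
    return True

configs = [
    {"sourceLabels": ["__name__"], "regex": "go_.*", "action": "drop"},
    {"sourceLabels": ["__name__"],
     "regex": "(http_requests_total|http_request_duration_seconds)", "action": "keep"},
]
assert not survives_relabeling({"__name__": "go_goroutines"}, configs)
assert survives_relabeling({"__name__": "http_requests_total"}, configs)
```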
Template Reference
The ServiceMonitor template (from the nginx chart), `templates/servicemonitor.yaml`:
```yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ template "nginx.fullname" . }}
  {{- if .Values.serviceMonitor.namespace }}
  namespace: {{ .Values.serviceMonitor.namespace }}
  {{- end }}
  {{- with .Values.serviceMonitor.labels }}
  labels:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  endpoints:
  - port: http
    {{- if .Values.serviceMonitor.interval }}
    interval: {{ .Values.serviceMonitor.interval }}
    {{- end }}
    {{- if .Values.serviceMonitor.telemetryPath }}
    path: {{ .Values.serviceMonitor.telemetryPath }}
    {{- end }}
    {{- if .Values.serviceMonitor.timeout }}
    scrapeTimeout: {{ .Values.serviceMonitor.timeout }}
    {{- end }}
    {{- if .Values.serviceMonitor.metricRelabelings }}
    metricRelabelings:
    {{- toYaml .Values.serviceMonitor.metricRelabelings | nindent 4 }}
    {{- end }}
    {{- if .Values.serviceMonitor.relabelings }}
    relabelings:
    {{- toYaml .Values.serviceMonitor.relabelings | nindent 4 }}
    {{- end }}
  jobLabel: {{ template "nginx.fullname" . }}
  namespaceSelector:
    matchNames:
    - {{ .Release.Namespace }}
  selector:
    matchLabels:
      {{- include "nginx.selectorLabels" . | nindent 6 }}
  {{- if .Values.serviceMonitor.targetLabels }}
  targetLabels:
  {{- range .Values.serviceMonitor.targetLabels }}
  - {{ . }}
  {{- end }}
  {{- end }}
{{- end }}
```
PrometheusRule
Basic Configuration
```yaml
prometheusRule:
  enabled: true
  namespace: monitoring  # Optional
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: HighErrorRate
      expr: |
        rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }} requests per second"
```
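What the `expr` measures can be made concrete: `rate()` is roughly the per-second increase of the counter over the window. A rough Python illustration (real `rate()` also extrapolates to the window boundaries and handles counter resets):

```python
def approx_rate(samples):
    """samples: (timestamp_seconds, counter_value) pairs within the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Ten 5xx responses over 100 seconds -> 0.1 errors/sec, above the 0.05 threshold.
samples = [(0, 1000), (100, 1010)]
assert approx_rate(samples) == 0.1
```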
Alert Rule Examples
```yaml
prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: ApplicationDown
      expr: up{job="my-app"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Application {{ $labels.instance }} is down"
        description: "{{ $labels.job }} has been down for more than 2 minutes"
```
Multiple Alert Rules
```yaml
prometheusRule:
  enabled: true
  namespace: monitoring
  additionalLabels:
    prometheus: kube-prometheus
    team: platform
  rules:
    # Critical alerts
    - alert: ServiceDown
      expr: up{job="my-service"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.instance }} is down"
        runbook_url: "https://wiki.example.com/runbooks/service-down"
    # Warning alerts
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 1.0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "99th percentile latency is {{ $value }}s"
    - alert: LowThroughput
      expr: |
        sum(rate(http_requests_total[5m])) < 10
      for: 15m
      labels:
        severity: info
      annotations:
        summary: "Low request throughput"
        description: "Only {{ $value }} requests per second"
```
Severity Levels
- Critical
- Warning
- Info
```yaml
rules:
  - alert: DatabaseDown
    expr: up{job="database"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Database is down - immediate action required"
```
- Service outages
- Data loss risks
- Security breaches
- Immediate response needed
```yaml
rules:
  - alert: HighMemoryUsage
    expr: memory_usage > 0.85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage is high - investigate soon"
```
- Degraded performance
- Resource saturation
- Potential future issues
- Investigation needed
```yaml
rules:
  - alert: NewDeployment
    expr: changes(deployment_version[5m]) > 0
    labels:
      severity: info
    annotations:
      summary: "New deployment detected"
```
- Informational events
- Capacity planning
- Trending data
- Optional notifications
Alert Best Practices
Use Meaningful Alert Names
```yaml
# Bad - vague
- alert: Problem
  expr: metric > threshold

# Good - specific
- alert: HighAPIErrorRate
  expr: rate(api_errors_total[5m]) > 10
```
Include Context in Annotations
```yaml
annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: |
    Error rate: {{ $value | humanize }} errors/sec
    Instance: {{ $labels.instance }}
    Namespace: {{ $labels.namespace }}
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/xyz"
```
Use 'for' to Avoid Flapping
```yaml
# Bad - alerts on every scrape
- alert: HighCPU
  expr: cpu_usage > 0.8

# Good - sustained high CPU
- alert: HighCPU
  expr: cpu_usage > 0.8
  for: 10m  # Must be true for 10 minutes
```
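A toy evaluation loop shows the effect (an assumption-laden sketch, not Prometheus internals): the alert fires only once the expression has been continuously true for the whole `for` duration, so brief spikes never fire.

```python
def fire_times(evaluations, for_seconds, eval_interval):
    """evaluations: one boolean per evaluation cycle; returns firing timestamps."""
    firing, pending_since = [], None
    for i, true_now in enumerate(evaluations):
        t = i * eval_interval
        if not true_now:
            pending_since = None  # condition cleared: reset the pending state
            continue
        if pending_since is None:
            pending_since = t
        if t - pending_since >= for_seconds:
            firing.append(t)
    return firing

# A flapping condition never accumulates enough continuous time to fire;
# a sustained condition fires once the 'for' window has elapsed.
assert fire_times([True, False] * 10, 60, 30) == []
assert fire_times([True] * 10, 60, 30) == [60, 90, 120, 150, 180, 210, 240, 270]
```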
Alert on Symptoms, Not Causes
```yaml
# Bad - internal metric
- alert: HighGoroutineCount
  expr: go_goroutines > 10000

# Good - user-facing impact
- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 1.0
```
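`histogram_quantile` itself is straightforward to sketch: find the cumulative bucket containing the target rank and linearly interpolate within it. A simplified Python version (the real PromQL function also handles NaN, non-monotonic buckets, and the `+Inf` edge cases more carefully):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, last bound is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, prev_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # fall back to the last finite bound
            # Linear interpolation inside the bucket containing the rank.
            return lower_bound + (upper_bound - lower_bound) * (rank - prev_count) / (count - prev_count)
        lower_bound, prev_count = upper_bound, count
    return float("nan")

buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
assert histogram_quantile(0.5, buckets) == 0.1
assert abs(histogram_quantile(0.99, buckets) - 1.0) < 1e-9
```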
Template Reference
The PrometheusRule template (from the nginx chart), `templates/prometheusrule.yaml`:
```yaml
{{- if .Values.prometheusRule.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ template "nginx.fullname" . }}
  {{- with .Values.prometheusRule.namespace }}
  namespace: {{ . }}
  {{- end }}
  labels:
    {{- include "nginx.labels" . | nindent 4 }}
    {{- with .Values.prometheusRule.additionalLabels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
spec:
  {{- with .Values.prometheusRule.rules }}
  groups:
    - name: {{ template "nginx.name" $ }}
      rules: {{ tpl (toYaml .) $ | nindent 8 }}
  {{- end }}
{{- end }}
```
Common Metrics Patterns
Web Application Metrics
```yaml
serviceMonitor:
  enabled: true
  interval: 15s

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    # Request rate
    - alert: LowRequestRate
      expr: |
        sum(rate(http_requests_total[5m])) < 1
      for: 10m
      labels:
        severity: warning
    # Error rate
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        ) > 0.01
      for: 5m
      labels:
        severity: critical
    # Latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 0.5
      for: 10m
      labels:
        severity: warning
```
Exporter Metrics
```yaml
# Example from prometheus-memcached-exporter
serviceMonitor:
  enabled: true
  interval: 30s
  telemetryPath: /metrics

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: MemcachedDown
      expr: up{job="memcached-exporter"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Memcached instance {{ $labels.instance }} down"
    - alert: MemcachedHighEvictionRate
      expr: |
        rate(memcached_items_evicted_total[5m]) > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High eviction rate on {{ $labels.instance }}"
        description: "Evicting {{ $value }} items per second"
    - alert: MemcachedLowHitRate
      expr: |
        (
          rate(memcached_commands_total{command="get",status="hit"}[5m])
          /
          rate(memcached_commands_total{command="get"}[5m])
        ) < 0.8
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Low cache hit rate"
        description: "Hit rate is {{ $value | humanizePercentage }}"
```
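For reference, the ratio in the hit-rate expression and the `humanizePercentage` formatting behave roughly like this (assumed behavior; Prometheus's exact significant-digit handling may differ):

```python
def hit_rate(hits_per_sec: float, gets_per_sec: float) -> float:
    """Ratio of cache hits to total gets, both as per-second rates."""
    return hits_per_sec / gets_per_sec

def humanize_percentage(v: float) -> str:
    # Approximation of humanizePercentage: multiply by 100, ~4 significant digits.
    return f"{v * 100:.4g}%"

rate = hit_rate(72.0, 100.0)
assert rate < 0.8  # below the threshold, so the alert would become pending
assert humanize_percentage(rate) == "72%"
```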
Database Metrics
```yaml
serviceMonitor:
  enabled: true
  interval: 30s

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: DatabaseConnectionPoolExhausted
      expr: |
        (
          sum(db_connections_active)
          /
          sum(db_connections_max)
        ) > 0.9
      for: 5m
      labels:
        severity: critical
    - alert: SlowQueries
      expr: |
        rate(db_query_duration_seconds_sum[5m])
        /
        rate(db_query_duration_seconds_count[5m]) > 1.0
      for: 10m
      labels:
        severity: warning
    - alert: DatabaseReplicationLag
      expr: db_replication_lag_seconds > 30
      for: 5m
      labels:
        severity: warning
```
Complete Production Example
`production-monitoring.yaml`:
```yaml
# ServiceMonitor configuration
serviceMonitor:
  enabled: true
  namespace: monitoring
  interval: 30s
  timeout: 10s
  telemetryPath: /metrics
  labels:
    prometheus: kube-prometheus
    release: prometheus-operator
    team: platform
  relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace
  targetLabels:
    - app
    - version

# PrometheusRule configuration
prometheusRule:
  enabled: true
  namespace: monitoring
  additionalLabels:
    prometheus: kube-prometheus
    team: platform
  rules:
    # Service availability
    - alert: ServiceDown
      expr: up{job="my-service"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.instance }} is down"
        description: "{{ $labels.job }} has been unavailable for 2 minutes"
        runbook_url: "https://wiki.example.com/runbooks/service-down"
    # Error rate
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
        ) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.job }}"
        description: "Error rate is {{ $value | humanizePercentage }}"
    # Latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 1.0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High request latency detected"
        description: "99th percentile is {{ $value }}s"
    # Resource usage
    - alert: HighMemoryUsage
      expr: |
        (
          container_memory_working_set_bytes{container="my-app"}
          /
          container_spec_memory_limit_bytes{container="my-app"}
        ) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory on {{ $labels.pod }}"
        description: "Memory usage at {{ $value | humanizePercentage }}"
    - alert: HighCPUUsage
      expr: |
        rate(container_cpu_usage_seconds_total{container="my-app"}[5m]) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU on {{ $labels.pod }}"
        description: "CPU usage at {{ $value | humanize }} cores"
```
Verification
Check ServiceMonitor
```bash
# List ServiceMonitors
kubectl get servicemonitor -A

# Describe ServiceMonitor
kubectl describe servicemonitor <name> -n <namespace>

# Check if Prometheus discovered the target
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets
```
Check PrometheusRule
```bash
# List PrometheusRules
kubectl get prometheusrule -A

# Describe PrometheusRule
kubectl describe prometheusrule <name> -n <namespace>

# Check if rules loaded in Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/rules
```
Test Alerts
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/alerts to see alert status

# Port-forward to Alertmanager
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
# Open http://localhost:9093 to see firing alerts
```
Troubleshooting
ServiceMonitor Not Discovered
Check labels match Prometheus selector:
```bash
# Get Prometheus serviceMonitorSelector
kubectl get prometheus -n monitoring -o yaml | grep -A 5 serviceMonitorSelector

# Ensure ServiceMonitor has matching labels
kubectl get servicemonitor <name> -n <namespace> -o yaml | grep -A 10 labels
```
No Metrics Scraped
Verify service and endpoints:
```bash
# Check service exists
kubectl get svc <service-name>

# Check endpoints
kubectl get endpoints <service-name>

# Test metrics endpoint
kubectl run curl --image=curlimages/curl -it --rm -- curl http://<service>:<port>/metrics
```
Rules Not Loading
Check PrometheusRule syntax:
```bash
# Get the PrometheusRule
kubectl get prometheusrule <name> -n <namespace> -o yaml

# Check Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
```
Alerts Not Firing
Test the expression:
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090 and test the alert expression
# Check the 'for' duration hasn't prevented the alert from firing
```
Next Steps
- Values Configuration: learn about values.yaml basics
- Customization: advanced customization techniques
- Ingress: configure external access
- Chart Reference: complete observability guide