
Overview

Charts support Prometheus Operator monitoring through two custom resources:
  • ServiceMonitor: Configures Prometheus scraping
  • PrometheusRule: Defines alerting rules
Both are optional and disabled by default.
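
Both kinds are CRDs installed by the Prometheus Operator; if the operator is not present in the cluster, enabling either option makes the chart install fail on the unknown resource kind. A quick way to check that the CRDs exist:

```shell
# Verify the Prometheus Operator CRDs are installed
kubectl get crd servicemonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com
```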

ServiceMonitor

Basic Configuration

Enable ServiceMonitor to automatically configure Prometheus to scrape your application:
serviceMonitor:
  enabled: true
This creates a ServiceMonitor resource that tells Prometheus how to scrape metrics from your service.
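
With just `enabled: true`, the rendered resource looks roughly like the following (names here are illustrative, derived from the template shown later on this page):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-release-nginx
spec:
  endpoints:
  - port: http
  jobLabel: my-release-nginx
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx
      app.kubernetes.io/instance: my-release
```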

Complete Configuration

serviceMonitor:
  enabled: true
  interval: 30s
  timeout: 10s

Configuration Options

serviceMonitor:
  enabled: true
  # How often to scrape
  interval: 30s
  # Scrape timeout
  timeout: 10s
  # Metrics endpoint path
  telemetryPath: /metrics
Best Practices:
  • Use 30s interval for most applications
  • Set timeout < interval
  • Ensure metrics endpoint is ready before pod becomes ready
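
The last point can be enforced with a readiness probe on the metrics path, so Prometheus never scrapes a pod that is not yet serving. A sketch, assuming the container exposes metrics on a hypothetical port 8080:

```yaml
readinessProbe:
  httpGet:
    path: /metrics
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```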

Template Reference

The ServiceMonitor template (from the nginx chart):
templates/servicemonitor.yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ template "nginx.fullname" . }}
  {{- if .Values.serviceMonitor.namespace }}
  namespace: {{ .Values.serviceMonitor.namespace }}
  {{- end }}
  {{- with .Values.serviceMonitor.labels }}
  labels:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  endpoints:
  - port: http
    {{- if .Values.serviceMonitor.interval }}
    interval: {{ .Values.serviceMonitor.interval }}
    {{- end }}
    {{- if .Values.serviceMonitor.telemetryPath }}
    path: {{ .Values.serviceMonitor.telemetryPath }}
    {{- end }}
    {{- if .Values.serviceMonitor.timeout }}
    scrapeTimeout: {{ .Values.serviceMonitor.timeout }}
    {{- end }}
    {{- if .Values.serviceMonitor.metricRelabelings }}
    metricRelabelings:
      {{- toYaml .Values.serviceMonitor.metricRelabelings | nindent 4 }}
    {{- end }}
    {{- if .Values.serviceMonitor.relabelings }}
    relabelings:
      {{- toYaml .Values.serviceMonitor.relabelings | nindent 4 }}
    {{- end }}
  jobLabel: {{ template "nginx.fullname" . }}
  namespaceSelector:
    matchNames:
    - {{ .Release.Namespace }}
  selector:
    matchLabels:
      {{- include "nginx.selectorLabels" . | nindent 6 }}
  {{- if .Values.serviceMonitor.targetLabels }}
  targetLabels:
    {{- range .Values.serviceMonitor.targetLabels }}
    - {{ . }}
    {{- end }}
  {{- end }}
{{- end }}

PrometheusRule

Basic Configuration

prometheusRule:
  enabled: true
  namespace: monitoring  # Optional
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: HighErrorRate
      expr: |
        rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }} requests per second"

Alert Rule Examples

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: ApplicationDown
      expr: up{job="my-app"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Application {{ $labels.instance }} is down"
        description: "{{ $labels.job }} has been down for more than 2 minutes"

Multiple Alert Rules

prometheusRule:
  enabled: true
  namespace: monitoring
  additionalLabels:
    prometheus: kube-prometheus
    team: platform
  rules:
    # Critical alerts
    - alert: ServiceDown
      expr: up{job="my-service"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.instance }} is down"
        runbook_url: "https://wiki.example.com/runbooks/service-down"
    
    # Warning alerts
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, 
          rate(http_request_duration_seconds_bucket[5m])
        ) > 1.0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "99th percentile latency is {{ $value }}s"
    
    - alert: LowThroughput
      expr: |
        sum(rate(http_requests_total[5m])) < 10
      for: 15m
      labels:
        severity: info
      annotations:
        summary: "Low request throughput"
        description: "Only {{ $value }} requests per second"

Severity Levels

rules:
  - alert: DatabaseDown
    expr: up{job="database"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Database is down - immediate action required"
Use critical for:
  • Service outages
  • Data loss risks
  • Security breaches
  • Immediate response needed
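
The lower severities follow the same shape; a sketch using hypothetical metrics (`memory_usage_ratio` is illustrative, the `kube_deployment_*` metrics come from kube-state-metrics):

```yaml
rules:
  - alert: HighMemoryUsage
    expr: memory_usage_ratio > 0.85
    for: 10m
    labels:
      severity: warning   # Needs attention soon, not immediately
    annotations:
      summary: "Memory usage is trending high"

  - alert: DeploymentReplicasChanged
    expr: kube_deployment_status_replicas != kube_deployment_spec_replicas
    for: 15m
    labels:
      severity: info      # Informational; may require no action
```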

Alert Best Practices

Use descriptive alert names:
# Bad - vague
- alert: Problem
  expr: metric > threshold

# Good - specific
- alert: HighAPIErrorRate
  expr: rate(api_errors_total[5m]) > 10

Include actionable annotations:
annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: |
    Error rate: {{ $value | humanize }} errors/sec
    Instance: {{ $labels.instance }}
    Namespace: {{ $labels.namespace }}
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/xyz"

Add a for duration to avoid flapping:
# Bad - alerts on every scrape
- alert: HighCPU
  expr: cpu_usage > 0.8

# Good - sustained high CPU
- alert: HighCPU
  expr: cpu_usage > 0.8
  for: 10m  # Must be true for 10 minutes

Alert on symptoms, not causes:
# Bad - internal metric
- alert: HighGoroutineCount
  expr: go_goroutines > 10000

# Good - user-facing impact
- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 1.0

Template Reference

The PrometheusRule template (from the nginx chart):
templates/prometheusrule.yaml
{{- if .Values.prometheusRule.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ template "nginx.fullname" . }}
  {{- with .Values.prometheusRule.namespace }}
  namespace: {{ . }}
  {{- end }}
  labels:
    {{- include "nginx.labels" . | nindent 4 }}
    {{- with .Values.prometheusRule.additionalLabels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
spec:
  {{- with .Values.prometheusRule.rules }}
  groups:
    - name: {{ template "nginx.name" $ }}
      rules: {{ tpl (toYaml .) $ | nindent 8 }}
  {{- end }}
{{- end }}
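
Note that this template pipes the rules through Helm's tpl function, so Prometheus template expressions such as {{ $value }} and {{ $labels.instance }} in values.yaml are evaluated by Helm first. For charts that render rules this way, escape the expressions so they reach Prometheus intact; one common approach:

```yaml
annotations:
  summary: 'Instance {{ "{{ $labels.instance }}" }} is down'
  description: 'Error rate is {{ "{{ $value }}" }} requests per second'
```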

Common Metrics Patterns

Web Application Metrics

serviceMonitor:
  enabled: true
  interval: 15s

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    # Request rate
    - alert: LowRequestRate
      expr: |
        sum(rate(http_requests_total[5m])) < 1
      for: 10m
      labels:
        severity: warning
    
    # Error rate
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        ) > 0.01
      for: 5m
      labels:
        severity: critical
    
    # Latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 0.5
      for: 10m
      labels:
        severity: warning

Exporter Metrics

# Example from prometheus-memcached-exporter
serviceMonitor:
  enabled: true
  interval: 30s
  telemetryPath: /metrics

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: MemcachedDown
      expr: up{job="memcached-exporter"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Memcached instance {{ $labels.instance }} down"
    
    - alert: MemcachedHighEvictionRate
      expr: |
        rate(memcached_items_evicted_total[5m]) > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High eviction rate on {{ $labels.instance }}"
        description: "Evicting {{ $value }} items per second"
    
    - alert: MemcachedLowHitRate
      expr: |
        (
          rate(memcached_commands_total{command="get",status="hit"}[5m])
          /
          rate(memcached_commands_total{command="get"}[5m])
        ) < 0.8
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Low cache hit rate"
        description: "Hit rate is {{ $value | humanizePercentage }}"

Database Metrics

serviceMonitor:
  enabled: true
  interval: 30s

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: DatabaseConnectionPoolExhausted
      expr: |
        (
          sum(db_connections_active)
          /
          sum(db_connections_max)
        ) > 0.9
      for: 5m
      labels:
        severity: critical
    
    - alert: SlowQueries
      expr: |
        rate(db_query_duration_seconds_sum[5m])
        /
        rate(db_query_duration_seconds_count[5m]) > 1.0
      for: 10m
      labels:
        severity: warning
    
    - alert: DatabaseReplicationLag
      expr: db_replication_lag_seconds > 30
      for: 5m
      labels:
        severity: warning

Complete Production Example

production-monitoring.yaml
# ServiceMonitor configuration
serviceMonitor:
  enabled: true
  namespace: monitoring
  interval: 30s
  timeout: 10s
  telemetryPath: /metrics
  labels:
    prometheus: kube-prometheus
    release: prometheus-operator
    team: platform
  relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace
  targetLabels:
    - app
    - version

# PrometheusRule configuration  
prometheusRule:
  enabled: true
  namespace: monitoring
  additionalLabels:
    prometheus: kube-prometheus
    team: platform
  rules:
    # Service availability
    - alert: ServiceDown
      expr: up{job="my-service"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.instance }} is down"
        description: "{{ $labels.job }} has been unavailable for 2 minutes"
        runbook_url: "https://wiki.example.com/runbooks/service-down"
    
    # Error rate
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
        ) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.job }}"
        description: "Error rate is {{ $value | humanizePercentage }}"
    
    # Latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 1.0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High request latency detected"
        description: "99th percentile is {{ $value }}s"
    
    # Resource usage
    - alert: HighMemoryUsage
      expr: |
        (
          container_memory_working_set_bytes{container="my-app"}
          /
          container_spec_memory_limit_bytes{container="my-app"}
        ) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory on {{ $labels.pod }}"
        description: "Memory usage at {{ $value | humanizePercentage }}"
    
    - alert: HighCPUUsage
      expr: |
        rate(container_cpu_usage_seconds_total{container="my-app"}[5m]) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU on {{ $labels.pod }}"
        description: "CPU usage at {{ $value | humanize }} cores"

Verification

Check ServiceMonitor

# List ServiceMonitors
kubectl get servicemonitor -A

# Describe ServiceMonitor
kubectl describe servicemonitor <name> -n <namespace>

# Check if Prometheus discovered the target
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets

Check PrometheusRule

# List PrometheusRules
kubectl get prometheusrule -A

# Describe PrometheusRule
kubectl describe prometheusrule <name> -n <namespace>

# Check if rules loaded in Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/rules

Test Alerts

# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Open http://localhost:9090/alerts to see alert status

# Port-forward to Alertmanager
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093

# Open http://localhost:9093 to see firing alerts
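
To confirm routing end to end, you can push a synthetic alert straight into Alertmanager via its v2 API (assuming the port-forward above is running):

```shell
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "info"}}]'
```

The alert should appear in the Alertmanager UI within a few seconds and follow your configured routes.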

Troubleshooting

Check labels match Prometheus selector:
# Get Prometheus serviceMonitorSelector
kubectl get prometheus -n monitoring -o yaml | grep -A 5 serviceMonitorSelector

# Ensure ServiceMonitor has matching labels
kubectl get servicemonitor <name> -n <namespace> -o yaml | grep -A 10 labels
Verify service and endpoints:
# Check service exists
kubectl get svc <service-name>

# Check endpoints
kubectl get endpoints <service-name>

# Test metrics endpoint
kubectl run curl --image=curlimages/curl -it --rm -- curl http://<service>:<port>/metrics
Check PrometheusRule syntax:
# Get the PrometheusRule
kubectl get prometheusrule <name> -n <namespace> -o yaml

# Check Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
Test the expression:
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Open http://localhost:9090 and run the alert expression in the query UI
# Remember an alert stays "pending" until its 'for' duration has elapsed

Next Steps

Values Configuration

Learn about values.yaml basics

Customization

Advanced customization techniques

Ingress

Configure external access

Chart Reference

Complete observability guide
