Overview
Charts support Prometheus Operator monitoring through two custom resources:

- ServiceMonitor: Configures Prometheus scraping
- PrometheusRule: Defines alerting rules
ServiceMonitor
Basic Configuration
Enable ServiceMonitor to automatically configure Prometheus to scrape your application:

```yaml
serviceMonitor:
  enabled: true
```
Complete Configuration
```yaml
serviceMonitor:
  enabled: true
  interval: 30s
  timeout: 10s
```
Configuration Options
- Scraping
- Namespace
- Labels
- Relabeling
```yaml
serviceMonitor:
  enabled: true
  # How often to scrape
  interval: 30s
  # Scrape timeout
  timeout: 10s
  # Metrics endpoint path
  telemetryPath: /metrics
```
- Use 30s interval for most applications
- Set timeout < interval
- Ensure metrics endpoint is ready before pod becomes ready
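The "timeout < interval" rule can be checked mechanically before deploying. A minimal Python sketch (both helpers are hypothetical, not part of any chart tooling; compound durations like `1m30s` are not handled):

```python
import re

def parse_duration(d: str) -> float:
    """Parse a simple Prometheus-style duration ("30s", "10m") into seconds."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", d)
    if not m:
        raise ValueError(f"unrecognized duration: {d}")
    return float(m.group(1)) * {"ms": 0.001, "s": 1, "m": 60, "h": 3600}[m.group(2)]

def validate_scrape_settings(interval: str, timeout: str) -> None:
    # Prometheus rejects a scrape config whose timeout exceeds its interval.
    if parse_duration(timeout) >= parse_duration(interval):
        raise ValueError(f"timeout {timeout} must be shorter than interval {interval}")

validate_scrape_settings("30s", "10s")  # the values from the example above pass
```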
```yaml
serviceMonitor:
  enabled: true
  # Deploy ServiceMonitor to monitoring namespace
  namespace: monitoring
```
- Centralized monitoring resources
- Easier RBAC management
- Matches Prometheus Operator namespace selector
```yaml
serviceMonitor:
  enabled: true
  labels:
    prometheus: kube-prometheus
    release: prometheus-operator
    team: platform
```
These labels must match the Prometheus CR's `serviceMonitorSelector`:

```yaml
# Your Prometheus CR should have:
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: kube-prometheus
```
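The selection logic itself is a simple subset check: every key/value pair in the Prometheus CR's `matchLabels` must appear verbatim on the ServiceMonitor. A sketch of that check (illustrative only, not Operator code; `matchExpressions` are not modeled):

```python
def matches_selector(monitor_labels: dict, match_labels: dict) -> bool:
    """True if every matchLabels pair appears verbatim in the object's labels."""
    return all(monitor_labels.get(k) == v for k, v in match_labels.items())

selector = {"prometheus": "kube-prometheus"}
# Extra labels are fine; a missing key or different value is not.
assert matches_selector({"prometheus": "kube-prometheus", "team": "platform"}, selector)
assert not matches_selector({"release": "prometheus-operator"}, selector)
```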
```yaml
serviceMonitor:
  enabled: true
  # Add/modify labels from Kubernetes metadata
  relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      targetLabel: node
  # Filter or modify metrics (the two rules below are shown together for
  # illustration; the keep rule alone would already discard go_* metrics)
  metricRelabelings:
    # Drop go runtime metrics
    - sourceLabels: [__name__]
      regex: 'go_.*'
      action: drop
    # Keep only specific metrics
    - sourceLabels: [__name__]
      regex: '(http_requests_total|http_request_duration_seconds)'
      action: keep
  # Transfer Service labels to targets
  targetLabels:
    - app
    - version
    - environment
```
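The keep/drop mechanics can be simulated in a few lines. This is an illustrative sketch, not Prometheus's implementation: it handles only the `keep` and `drop` actions, joins source labels with the default `;` separator, and anchors the regex the way Prometheus does:

```python
import re

def survives_relabeling(labels: dict, configs: list) -> bool:
    """Return True if a sample survives a list of keep/drop relabeling rules."""
    for cfg in configs:
        joined = ";".join(labels.get(name, "") for name in cfg["sourceLabels"])
        matched = re.fullmatch(cfg["regex"], joined) is not None
        if cfg["action"] == "drop" and matched:
            return False
        if cfg["action"] == "keep" and not matched:
            return False
    return True

configs = [
    {"sourceLabels": ["__name__"], "regex": "go_.*", "action": "drop"},
    {"sourceLabels": ["__name__"],
     "regex": "(http_requests_total|http_request_duration_seconds)", "action": "keep"},
]
assert not survives_relabeling({"__name__": "go_goroutines"}, configs)
assert survives_relabeling({"__name__": "http_requests_total"}, configs)
```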
Template Reference
The ServiceMonitor template (from the nginx chart), `templates/servicemonitor.yaml`:
```yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ template "nginx.fullname" . }}
  {{- if .Values.serviceMonitor.namespace }}
  namespace: {{ .Values.serviceMonitor.namespace }}
  {{- end }}
  {{- with .Values.serviceMonitor.labels }}
  labels:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  endpoints:
  - port: http
    {{- if .Values.serviceMonitor.interval }}
    interval: {{ .Values.serviceMonitor.interval }}
    {{- end }}
    {{- if .Values.serviceMonitor.telemetryPath }}
    path: {{ .Values.serviceMonitor.telemetryPath }}
    {{- end }}
    {{- if .Values.serviceMonitor.timeout }}
    scrapeTimeout: {{ .Values.serviceMonitor.timeout }}
    {{- end }}
    {{- if .Values.serviceMonitor.metricRelabelings }}
    metricRelabelings:
    {{- toYaml .Values.serviceMonitor.metricRelabelings | nindent 4 }}
    {{- end }}
    {{- if .Values.serviceMonitor.relabelings }}
    relabelings:
    {{- toYaml .Values.serviceMonitor.relabelings | nindent 4 }}
    {{- end }}
  jobLabel: {{ template "nginx.fullname" . }}
  namespaceSelector:
    matchNames:
    - {{ .Release.Namespace }}
  selector:
    matchLabels:
      {{- include "nginx.selectorLabels" . | nindent 6 }}
  {{- if .Values.serviceMonitor.targetLabels }}
  targetLabels:
  {{- range .Values.serviceMonitor.targetLabels }}
  - {{ . }}
  {{- end }}
  {{- end }}
{{- end }}
```
PrometheusRule
Basic Configuration
```yaml
prometheusRule:
  enabled: true
  namespace: monitoring  # Optional
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: HighErrorRate
      expr: |
        rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }} requests per second"
```
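What the `expr` measures can be made concrete: `rate()` is roughly the per-second increase of the counter over the window. A rough Python illustration (real `rate()` also extrapolates to the window boundaries and handles counter resets):

```python
def approx_rate(samples):
    """samples: (timestamp_seconds, counter_value) pairs within the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Ten 5xx responses over 100 seconds -> 0.1 errors/sec, above the 0.05 threshold.
samples = [(0, 1000), (100, 1010)]
assert approx_rate(samples) == 0.1
```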
Alert Rule Examples
```yaml
prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: ApplicationDown
      expr: up{job="my-app"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Application {{ $labels.instance }} is down"
        description: "{{ $labels.job }} has been down for more than 2 minutes"
```
Multiple Alert Rules
```yaml
prometheusRule:
  enabled: true
  namespace: monitoring
  additionalLabels:
    prometheus: kube-prometheus
    team: platform
  rules:
    # Critical alerts
    - alert: ServiceDown
      expr: up{job="my-service"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.instance }} is down"
        runbook_url: "https://wiki.example.com/runbooks/service-down"
    # Warning alerts
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 1.0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "99th percentile latency is {{ $value }}s"
    - alert: LowThroughput
      expr: |
        sum(rate(http_requests_total[5m])) < 10
      for: 15m
      labels:
        severity: info
      annotations:
        summary: "Low request throughput"
        description: "Only {{ $value }} requests per second"
```
Severity Levels
- Critical
- Warning
- Info
```yaml
rules:
  - alert: DatabaseDown
    expr: up{job="database"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Database is down - immediate action required"
```
- Service outages
- Data loss risks
- Security breaches
- Immediate response needed
```yaml
rules:
  - alert: HighMemoryUsage
    expr: memory_usage > 0.85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage is high - investigate soon"
```
- Degraded performance
- Resource saturation
- Potential future issues
- Investigation needed
```yaml
rules:
  - alert: NewDeployment
    expr: changes(deployment_version[5m]) > 0
    labels:
      severity: info
    annotations:
      summary: "New deployment detected"
```
- Informational events
- Capacity planning
- Trending data
- Optional notifications
Alert Best Practices
Use Meaningful Alert Names
```yaml
# Bad - vague
- alert: Problem
  expr: metric > threshold

# Good - specific
- alert: HighAPIErrorRate
  expr: rate(api_errors_total[5m]) > 10
```
Include Context in Annotations
```yaml
annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: |
    Error rate: {{ $value | humanize }} errors/sec
    Instance: {{ $labels.instance }}
    Namespace: {{ $labels.namespace }}
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/xyz"
```
Use 'for' to Avoid Flapping
```yaml
# Bad - alerts on every scrape
- alert: HighCPU
  expr: cpu_usage > 0.8

# Good - sustained high CPU
- alert: HighCPU
  expr: cpu_usage > 0.8
  for: 10m  # Must be true for 10 minutes
```
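A toy evaluation loop shows the effect (an assumption-laden sketch, not Prometheus internals): the alert fires only once the expression has been continuously true for the whole `for` duration, so brief spikes never fire.

```python
def fire_times(evaluations, for_seconds, eval_interval):
    """evaluations: one boolean per evaluation cycle; returns firing timestamps."""
    firing, pending_since = [], None
    for i, true_now in enumerate(evaluations):
        t = i * eval_interval
        if not true_now:
            pending_since = None  # condition cleared: reset the pending state
            continue
        if pending_since is None:
            pending_since = t
        if t - pending_since >= for_seconds:
            firing.append(t)
    return firing

# A flapping condition never accumulates enough continuous time to fire;
# a sustained condition fires once the 'for' window has elapsed.
assert fire_times([True, False] * 10, 60, 30) == []
assert fire_times([True] * 10, 60, 30) == [60, 90, 120, 150, 180, 210, 240, 270]
```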
Alert on Symptoms, Not Causes
```yaml
# Bad - internal metric
- alert: HighGoroutineCount
  expr: go_goroutines > 10000

# Good - user-facing impact
- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 1.0
```
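`histogram_quantile` itself is straightforward to sketch: find the cumulative bucket containing the target rank and linearly interpolate within it. A simplified Python version (the real PromQL function also handles NaN, non-monotonic buckets, and the `+Inf` edge cases more carefully):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, last bound is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, prev_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # fall back to the last finite bound
            # Linear interpolation inside the bucket containing the rank.
            return lower_bound + (upper_bound - lower_bound) * (rank - prev_count) / (count - prev_count)
        lower_bound, prev_count = upper_bound, count
    return float("nan")

buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
assert histogram_quantile(0.5, buckets) == 0.1
assert abs(histogram_quantile(0.99, buckets) - 1.0) < 1e-9
```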
Template Reference
The PrometheusRule template (from the nginx chart), `templates/prometheusrule.yaml`:
```yaml
{{- if .Values.prometheusRule.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ template "nginx.fullname" . }}
  {{- with .Values.prometheusRule.namespace }}
  namespace: {{ . }}
  {{- end }}
  labels:
    {{- include "nginx.labels" . | nindent 4 }}
    {{- with .Values.prometheusRule.additionalLabels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
spec:
  {{- with .Values.prometheusRule.rules }}
  groups:
    - name: {{ template "nginx.name" $ }}
      rules: {{ tpl (toYaml .) $ | nindent 8 }}
  {{- end }}
{{- end }}
```
Common Metrics Patterns
Web Application Metrics
```yaml
serviceMonitor:
  enabled: true
  interval: 15s

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    # Request rate
    - alert: LowRequestRate
      expr: |
        sum(rate(http_requests_total[5m])) < 1
      for: 10m
      labels:
        severity: warning
    # Error rate
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        ) > 0.01
      for: 5m
      labels:
        severity: critical
    # Latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 0.5
      for: 10m
      labels:
        severity: warning
```
Exporter Metrics
```yaml
# Example from prometheus-memcached-exporter
serviceMonitor:
  enabled: true
  interval: 30s
  telemetryPath: /metrics

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: MemcachedDown
      expr: up{job="memcached-exporter"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Memcached instance {{ $labels.instance }} down"
    - alert: MemcachedHighEvictionRate
      expr: |
        rate(memcached_items_evicted_total[5m]) > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High eviction rate on {{ $labels.instance }}"
        description: "Evicting {{ $value }} items per second"
    - alert: MemcachedLowHitRate
      expr: |
        (
          rate(memcached_commands_total{command="get",status="hit"}[5m])
          /
          rate(memcached_commands_total{command="get"}[5m])
        ) < 0.8
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Low cache hit rate"
        description: "Hit rate is {{ $value | humanizePercentage }}"
```
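For reference, the ratio in the hit-rate expression and the `humanizePercentage` formatting behave roughly like this (assumed behavior; Prometheus's exact significant-digit handling may differ):

```python
def hit_rate(hits_per_sec: float, gets_per_sec: float) -> float:
    """Ratio of cache hits to total gets, both as per-second rates."""
    return hits_per_sec / gets_per_sec

def humanize_percentage(v: float) -> str:
    # Approximation of humanizePercentage: multiply by 100, ~4 significant digits.
    return f"{v * 100:.4g}%"

rate = hit_rate(72.0, 100.0)
assert rate < 0.8  # below the threshold, so the alert would become pending
assert humanize_percentage(rate) == "72%"
```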
Database Metrics
```yaml
serviceMonitor:
  enabled: true
  interval: 30s

prometheusRule:
  enabled: true
  additionalLabels:
    prometheus: kube-prometheus
  rules:
    - alert: DatabaseConnectionPoolExhausted
      expr: |
        (
          sum(db_connections_active)
          /
          sum(db_connections_max)
        ) > 0.9
      for: 5m
      labels:
        severity: critical
    - alert: SlowQueries
      expr: |
        rate(db_query_duration_seconds_sum[5m])
        /
        rate(db_query_duration_seconds_count[5m]) > 1.0
      for: 10m
      labels:
        severity: warning
    - alert: DatabaseReplicationLag
      expr: db_replication_lag_seconds > 30
      for: 5m
      labels:
        severity: warning
```
Complete Production Example
`production-monitoring.yaml`:
```yaml
# ServiceMonitor configuration
serviceMonitor:
  enabled: true
  namespace: monitoring
  interval: 30s
  timeout: 10s
  telemetryPath: /metrics
  labels:
    prometheus: kube-prometheus
    release: prometheus-operator
    team: platform
  relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace
  targetLabels:
    - app
    - version

# PrometheusRule configuration
prometheusRule:
  enabled: true
  namespace: monitoring
  additionalLabels:
    prometheus: kube-prometheus
    team: platform
  rules:
    # Service availability
    - alert: ServiceDown
      expr: up{job="my-service"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.instance }} is down"
        description: "{{ $labels.job }} has been unavailable for 2 minutes"
        runbook_url: "https://wiki.example.com/runbooks/service-down"
    # Error rate
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
        ) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.job }}"
        description: "Error rate is {{ $value | humanizePercentage }}"
    # Latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 1.0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High request latency detected"
        description: "99th percentile is {{ $value }}s"
    # Resource usage
    - alert: HighMemoryUsage
      expr: |
        (
          container_memory_working_set_bytes{container="my-app"}
          /
          container_spec_memory_limit_bytes{container="my-app"}
        ) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory on {{ $labels.pod }}"
        description: "Memory usage at {{ $value | humanizePercentage }}"
    - alert: HighCPUUsage
      expr: |
        rate(container_cpu_usage_seconds_total{container="my-app"}[5m]) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU on {{ $labels.pod }}"
        description: "CPU usage at {{ $value | humanize }} cores"
```
Verification
Check ServiceMonitor
```bash
# List ServiceMonitors
kubectl get servicemonitor -A

# Describe ServiceMonitor
kubectl describe servicemonitor <name> -n <namespace>

# Check if Prometheus discovered the target
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets
```
Check PrometheusRule
```bash
# List PrometheusRules
kubectl get prometheusrule -A

# Describe PrometheusRule
kubectl describe prometheusrule <name> -n <namespace>

# Check if rules loaded in Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/rules
```
Test Alerts
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/alerts to see alert status

# Port-forward to Alertmanager
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
# Open http://localhost:9093 to see firing alerts
```
Troubleshooting
ServiceMonitor Not Discovered
Check labels match Prometheus selector:
```bash
# Get Prometheus serviceMonitorSelector
kubectl get prometheus -n monitoring -o yaml | grep -A 5 serviceMonitorSelector

# Ensure ServiceMonitor has matching labels
kubectl get servicemonitor <name> -n <namespace> -o yaml | grep -A 10 labels
```
No Metrics Scraped
Verify service and endpoints:
```bash
# Check service exists
kubectl get svc <service-name>

# Check endpoints
kubectl get endpoints <service-name>

# Test metrics endpoint
kubectl run curl --image=curlimages/curl -it --rm -- curl http://<service>:<port>/metrics
```
Rules Not Loading
Check PrometheusRule syntax:
```bash
# Get the PrometheusRule
kubectl get prometheusrule <name> -n <namespace> -o yaml

# Check Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
```
Alerts Not Firing
Test the expression:
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090 and test the alert expression
# Check the 'for' duration hasn't prevented the alert from firing
```
Next Steps
- Values Configuration: learn about values.yaml basics
- Customization: advanced customization techniques
- Ingress: configure external access
- Chart Reference: complete observability guide