Grafana
Grafana is the leading open-source platform for metrics visualization and monitoring. Combined with Prometheus, it provides powerful monitoring for Kubernetes clusters and ML applications.
Why Grafana and Prometheus?
This stack is the de facto standard for Kubernetes monitoring:
- Prometheus: Time-series database for metrics collection
- Grafana: Visualization layer with rich dashboards and alerting
- kube-prometheus-stack: Batteries-included Helm chart with:
  - Prometheus Operator
  - Node exporters
  - Pre-built Kubernetes dashboards
  - AlertManager for notifications
While SigNoz focuses on traces and application observability, Grafana excels at system metrics, resource monitoring, and long-term trend analysis.
Architecture
- Node Exporter: Collects hardware and OS metrics from nodes
- Kube State Metrics: Exposes Kubernetes object state as metrics
- Prometheus: Scrapes and stores metrics
- Grafana: Queries Prometheus and renders dashboards
- AlertManager: Handles alerts and notifications
Prerequisites
- Kubernetes cluster (kind, minikube, or cloud-based)
- kubectl configured
- helm 3.x installed
- At least 2 GB RAM available for monitoring components
Installation
Step 1: Add Prometheus Community Helm Repository
Step 2: Install kube-prometheus-stack
The chart installs the following components:
- Prometheus Operator
- Prometheus server
- Grafana
- AlertManager
- Node exporters
- Kube state metrics
- Pre-configured dashboards and alerts
Step 3: Verify Installation
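Steps 1 to 3 map to a few Helm and kubectl invocations. The sketch below assembles them in Python so they can be reviewed before anything runs. The release name `monitoring` and the `default` namespace are assumptions inferred from the service names used later in this guide, and `install_commands` and `run_all` are hypothetical helpers.

```python
import shlex
import subprocess

# Assumed values: release name and namespace are inferred from the service
# names used later in this guide, not mandated by the chart.
RELEASE = "monitoring"
NAMESPACE = "default"
REPO_URL = "https://prometheus-community.github.io/helm-charts"


def install_commands(release=RELEASE, namespace=NAMESPACE):
    """Return the commands for steps 1-3 as argument lists."""
    return [
        # Step 1: register the chart repository and refresh the index.
        ["helm", "repo", "add", "prometheus-community", REPO_URL],
        ["helm", "repo", "update"],
        # Step 2: install the chart under the chosen release name.
        ["helm", "install", release,
         "prometheus-community/kube-prometheus-stack",
         "--namespace", namespace],
        # Step 3: list the pods labelled with the release name.
        ["kubectl", "get", "pods", "--namespace", namespace,
         "-l", f"release={release}"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(install_commands())  # dry run: prints, does not execute
```

Calling `run_all(install_commands(), dry_run=False)` would execute the commands in order and stop at the first failure.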
Accessing Grafana
Get Admin Credentials
The default credentials are:
- Username: admin
- Password: prom-operator
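If the password was overridden at install time, the live value can be read from the Grafana Secret. The sketch below decodes the output of `kubectl get secret monitoring-grafana -o json`. The Secret name assumes a release called `monitoring`, and `decode_grafana_secret` is a hypothetical helper.

```python
import base64
import json


def decode_grafana_secret(secret_json):
    """Decode admin credentials from `kubectl get secret ... -o json` output."""
    data = json.loads(secret_json)["data"]
    # Kubernetes stores Secret values base64-encoded.
    user = base64.b64decode(data["admin-user"]).decode("utf-8")
    password = base64.b64decode(data["admin-password"]).decode("utf-8")
    return user, password


if __name__ == "__main__":
    # Stand-in for real kubectl output, built from the documented defaults.
    sample = json.dumps({"data": {
        "admin-user": base64.b64encode(b"admin").decode("ascii"),
        "admin-password": base64.b64encode(b"prom-operator").decode("ascii"),
    }})
    print(decode_grafana_secret(sample))
```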
Port Forward Grafana
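A port-forward from local port 3000 to the Grafana Service makes the UI reachable at the address used in the login steps. The sketch below records that command and adds a small reachability check against Grafana's `/api/health` endpoint. The Service name `monitoring-grafana` assumes a release called `monitoring`, and `grafana_is_healthy` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed name: the Grafana Service created by a release called "monitoring".
# Run this command in a separate terminal; it blocks while forwarding.
PORT_FORWARD = ["kubectl", "port-forward", "svc/monitoring-grafana", "3000:80"]


def grafana_is_healthy(base_url="http://localhost:3000", timeout=5):
    """Return True if Grafana's /api/health reports a working database."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return json.load(resp).get("database") == "ok"
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
```

Once the port-forward is running, `grafana_is_healthy()` should return `True` before you attempt to log in.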
First Login
- Navigate to http://localhost:3000
- Log in with admin/prom-operator
- (Optional) Change the password in profile settings
Pre-built Dashboards
The kube-prometheus-stack includes excellent dashboards out of the box.
Kubernetes Cluster Monitoring
Navigate to Dashboards → Browse to find the dashboards described below.
Kubernetes / Compute Resources / Cluster
Overview of cluster-wide resource usage:
- CPU usage and requests
- Memory usage and requests
- Network I/O
- Pod count
Kubernetes / Compute Resources / Namespace
Resource usage broken down by namespace:
- CPU and memory per namespace
- Pod counts
- Network traffic
Kubernetes / Compute Resources / Pod
Individual pod metrics:
- CPU usage per container
- Memory usage per container
- Restart counts
- Network usage
Node Exporter / Nodes
Hardware-level metrics for each node:
- CPU usage (system, user, idle)
- Memory usage (used, cached, buffered)
- Disk I/O and usage
- Network traffic
- System load average
Creating Custom Dashboards
Exposing Metrics from Your Application
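Most Python services would use the official `prometheus_client` package for this. To stay dependency-free, the sketch below swaps in a hand-rolled counter served in the Prometheus text exposition format, which is what a scrape of `/metrics` returns. The metric name `ml_requests_total`, its labels, and port 8000 are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A labelled, monotonically increasing counter (thread-safe)."""

    def __init__(self, name, help_text, label_names):
        self.name, self.help_text, self.label_names = name, help_text, label_names
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, *label_values, amount=1):
        if len(label_values) != len(self.label_names):
            raise ValueError("one value is required per label name")
        with self._lock:
            self._values[label_values] = self._values.get(label_values, 0) + amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            for values, count in sorted(self._values.items()):
                # Label values are assumed to contain no quotes, backslashes,
                # or newlines; real client libraries escape them.
                labels = ",".join(f'{k}="{v}"' for k, v in zip(self.label_names, values))
                lines.append(f"{self.name}{{{labels}}} {count}")
        return "\n".join(lines) + "\n"


# Hypothetical metric: predictions served, by model and outcome.
REQUESTS = Counter("ml_requests_total", "Prediction requests served.", ("model", "status"))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The application would call something like `REQUESTS.inc("resnet50", "ok")` after each prediction and run `serve()` in a background thread.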
Your Python application must first expose its metrics over HTTP, conventionally at a /metrics endpoint, so that Prometheus has something to scrape.
Configure Prometheus to Scrape Your App
Create a ServiceMonitor to tell Prometheus about your app. Prometheus will then automatically discover and scrape any service matching its selector.
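As an illustration, the sketch below builds a ServiceMonitor manifest as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The names `ml-api`, the `app` label, and the port name `metrics` are hypothetical; the `release: monitoring` label assumes the release name used in this guide. The target Service needs a port with the same name.

```python
import json


def service_monitor(name, app_label, port_name="metrics", release="monitoring"):
    """Build a ServiceMonitor manifest as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            # The chart's Prometheus selects ServiceMonitors by this label.
            "labels": {"release": release},
        },
        "spec": {
            # Match Services that carry this label.
            "selector": {"matchLabels": {"app": app_label}},
            # Scrape the named Service port every 30 seconds.
            "endpoints": [{"port": port_name, "path": "/metrics", "interval": "30s"}],
        },
    }


if __name__ == "__main__":
    # Pipe the output into `kubectl apply -f -` to create the resource.
    print(json.dumps(service_monitor("ml-api", "ml-api"), indent=2))
```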
Build a Custom Dashboard
- In Grafana, click + → Dashboard
- Click Add visualization
- Select Prometheus as the data source
- Enter a PromQL query
Example Queries
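The queries below are a sketch of what typical panels might use. The metric names `ml_requests_total` and `ml_request_duration_seconds_bucket` and their labels are hypothetical, so substitute whatever your application exports. `instant_query_url` is a hypothetical helper that builds a URL for Prometheus's instant-query HTTP API, which is handy for trying a query outside Grafana.

```python
from urllib.parse import urlencode

# Hypothetical metric names; substitute the ones your application exports.
PANEL_QUERIES = {
    # Requests per second, averaged over five minutes.
    "request_rate": "sum(rate(ml_requests_total[5m]))",
    # Share of requests that failed.
    "error_ratio": (
        'sum(rate(ml_requests_total{status="error"}[5m]))'
        " / sum(rate(ml_requests_total[5m]))"
    ),
    # 95th percentile latency from histogram buckets.
    "latency_p95": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(ml_request_duration_seconds_bucket[5m])))"
    ),
    # Request rate broken down by model.
    "requests_by_model": "sum by (model) (rate(ml_requests_total[5m]))",
}


def instant_query_url(promql, base_url="http://localhost:9090"):
    """Build a URL for Prometheus's instant-query HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
```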
Dashboard Example: ML Model Monitoring
Create a dashboard with these panels:
- Request Rate: Line graph of requests per second
- Error Rate: Percentage of failed requests
- Latency: p50, p95, p99 percentiles
- Model Distribution: Pie chart of requests by model
- Resource Usage: CPU and memory consumption
- Pod Health: Current pod count and restart rate
Prometheus Query Language (PromQL)
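PromQL is the query language for Prometheus. Key concepts:
Instant Vectors
An instant vector is a set of time series with a single sample each, taken at the query's evaluation time. A bare selector such as up returns one.
Range Vectors
A range vector holds a window of samples per series. Appending a duration to a selector, as in up[5m], returns the last five minutes of samples.
Aggregation
Aggregation operators such as sum, avg, max, and count combine series, optionally grouped by label. For example, sum by (namespace) (kube_pod_info) counts pods per namespace.
Rate Function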
The rate() function is essential for working with counters: it returns the per-second average rate of increase over a range vector and compensates for counter resets.
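To make the arithmetic concrete, here is a simplified sketch of what rate() computes over one window of counter samples. It handles counter resets but, unlike Prometheus, does not extrapolate to the window edges, so results can differ slightly from the real function. `simple_rate` is a hypothetical helper.

```python
def simple_rate(samples):
    """Per-second rate of increase for (timestamp_seconds, value) counter samples.

    Simplified: unlike Prometheus, no extrapolation to the window edges.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. pod restart): count from zero.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


if __name__ == "__main__":
    # Counter scraped every 30 s; the process restarts before the last scrape.
    window = [(0, 100), (30, 160), (60, 220), (90, 20)]
    print(simple_rate(window))
```

In the sample window the counter grows by 60, 60, and then 20 after the reset, so the increase is 140 over 90 seconds.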
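Combining Queries
Binary operators combine two query results, matching series on their labels. Dividing a rate of failed requests by a rate of all requests, for instance, yields an error ratio.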
Alerting
Define Alert Rules
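A minimal sketch of an alert rule, built as a Python dict that can be dumped to JSON and applied with kubectl. The metric `ml_requests_total`, the 5% threshold, and the names `ml-api-alerts` and `HighErrorRate` are hypothetical; the `release: monitoring` label assumes the release name used in this guide.

```python
import json


def error_rate_rule(release="monitoring"):
    """Build a PrometheusRule with one alert on a hypothetical error ratio."""
    expr = (
        'sum(rate(ml_requests_total{status="error"}[5m]))'
        " / sum(rate(ml_requests_total[5m])) > 0.05"
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "ml-api-alerts", "labels": {"release": release}},
        "spec": {"groups": [{
            "name": "ml-api",
            "rules": [{
                "alert": "HighErrorRate",
                "expr": expr,
                "for": "10m",  # condition must hold this long before firing
                "labels": {"severity": "warning"},
                "annotations": {"summary": "More than 5% of requests are failing"},
            }],
        }]},
    }


if __name__ == "__main__":
    print(json.dumps(error_rate_rule(), indent=2))
```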
Alert rules are defined in a PrometheusRule resource, which the Prometheus Operator loads into Prometheus.
Configure AlertManager
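The sketch below shows one possible AlertManager routing configuration as a Python dict, wrapped under the chart's `alertmanager.config` values key. Receiver names and the webhook URL are hypothetical, and `undefined_receivers` is a hypothetical sanity check, not part of AlertManager.

```python
import json

# Hypothetical receiver names and webhook URL.
ALERTMANAGER_CONFIG = {
    "route": {
        "receiver": "default",
        "group_by": ["alertname", "namespace"],
        "routes": [
            # Send critical alerts to the on-call receiver.
            {"matchers": ['severity="critical"'], "receiver": "oncall"},
        ],
    },
    "receivers": [
        {"name": "default"},
        {"name": "oncall",
         "webhook_configs": [{"url": "http://alert-gateway.example.internal/hook"}]},
    ],
}


def helm_values(config=ALERTMANAGER_CONFIG):
    """Wrap the config under the chart's alertmanager.config key."""
    return {"alertmanager": {"config": config}}


def undefined_receivers(config):
    """Return receiver names used in routes but never defined."""
    defined = {r["name"] for r in config["receivers"]}
    route = config["route"]
    used = {route["receiver"]} | {r["receiver"] for r in route.get("routes", [])}
    return used - defined


if __name__ == "__main__":
    print(json.dumps(helm_values(), indent=2))
```

Checking for undefined receivers before upgrading the release is worthwhile because AlertManager rejects a configuration that routes to a receiver it does not know.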
Edit the AlertManager configuration to route alerts to notification receivers such as email, Slack, or a webhook.
Advanced Features
Service Level Objectives (SLOs)
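An SLO implies an error budget, and the arithmetic is worth seeing once. The sketch below computes the budget for an availability target and shows a recording rule that could feed it. The rule name `ml_api:availability:ratio_rate5m` and the metric `ml_requests_total` are hypothetical.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of full unavailability allowed by an availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, good, total):
    """Fraction of the error budget still unspent (negative once exceeded)."""
    allowed_failures = (1 - slo_target) * total
    return 1 - (total - good) / allowed_failures


# Hypothetical recording rule precomputing the success ratio behind the SLO.
AVAILABILITY_RULE = {
    "record": "ml_api:availability:ratio_rate5m",
    "expr": (
        'sum(rate(ml_requests_total{status!="error"}[5m]))'
        " / sum(rate(ml_requests_total[5m]))"
    ),
}
```

A 99.9% target over 30 days allows roughly 43.2 minutes of downtime; with a 99% target, 50 failures out of 10,000 requests spend about half the budget.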
SLOs are defined using recording rules, which precompute expressions such as the ratio of successful to total requests.
Grafana Annotations
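Annotations can be created from a deployment pipeline through Grafana's HTTP API. The sketch below builds, without sending, a POST to `/api/annotations`. It assumes a service account token with permission to write annotations; `annotation_request` and the tag values are hypothetical.

```python
import json
import time
import urllib.request


def annotation_request(text, tags, token, base_url="http://localhost:3000", when=None):
    """Build (but do not send) a POST to Grafana's annotations API."""
    payload = {
        # Grafana expects epoch milliseconds.
        "time": int((when if when is not None else time.time()) * 1000),
        "tags": list(tags),
        "text": text,
    }
    return urllib.request.Request(
        f"{base_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it; a deploy script might call it with text like "Deployed model v2" and tags such as `["deploy", "ml-api"]`.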
Annotations mark events such as deployments or incidents on dashboard graphs.
Troubleshooting
No data in dashboards
- Check that the Prometheus pods are running.
- Verify Prometheus is scraping targets: with Prometheus port-forwarded to local port 9090, visit http://localhost:9090/targets
- Check for errors in the Prometheus logs.
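The targets page has a JSON equivalent at `/api/v1/targets`, which is easier to filter when there are many targets. The sketch below lists the targets that are not healthy from that response body; `unhealthy_targets` is a hypothetical helper and the sample data is made up.

```python
import json


def unhealthy_targets(targets_json):
    """List (job, scrape URL, last error) for targets that are not up.

    Expects the body returned by Prometheus's /api/v1/targets endpoint.
    """
    active = json.loads(targets_json)["data"]["activeTargets"]
    return [
        (t["labels"].get("job", "?"), t["scrapeUrl"], t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


if __name__ == "__main__":
    # Made-up response with one healthy and one failing target.
    sample = json.dumps({"status": "success", "data": {"activeTargets": [
        {"labels": {"job": "node-exporter"}, "health": "up",
         "scrapeUrl": "http://10.0.0.1:9100/metrics", "lastError": ""},
        {"labels": {"job": "ml-api"}, "health": "down",
         "scrapeUrl": "http://10.0.0.2:8000/metrics",
         "lastError": "connection refused"},
    ]}})
    for job, url, error in unhealthy_targets(sample):
        print(job, url, error)
```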
Grafana can't connect to Prometheus
- Check the data source configuration in Grafana:
  - Go to Configuration → Data Sources
  - Click Prometheus
  - Click the Test button
- Verify the URL is correct (usually http://monitoring-kube-prometheus-prometheus.default:9090)
- Check network policies aren’t blocking access
Custom metrics not appearing
- Verify the ServiceMonitor is created.
- Check the ServiceMonitor has the correct labels: it must have the release: monitoring label.
- Check Prometheus discovered the target: visit http://localhost:9090/targets and search for your service
Cleanup
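To uninstall the monitoring stack, remove the Helm release with helm uninstall, passing the release name used at install time. The chart's CRDs are not removed automatically and must be deleted separately if they are no longer needed.
Best Practices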
Label Cardinality
Avoid high-cardinality labels (e.g., user IDs) as they explode metric storage. Use labels for finite sets like model names or environments.
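The effect is multiplicative: the worst-case number of time series for a metric is the product of the number of distinct values of each label. A small sketch with made-up cardinalities (`series_count` is a hypothetical helper):

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Made-up cardinalities for illustration.
bounded = {"model": 5, "environment": 3, "status": 2}
with_user_id = {**bounded, "user_id": 100_000}

if __name__ == "__main__":
    print(series_count(bounded))       # 5 * 3 * 2
    print(series_count(with_user_id))  # the same, times 100,000 users
```

Adding a single `user_id` label turns 30 possible series into up to 3,000,000.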
Scrape Intervals
Balance freshness against overhead. 30s is a good default; use 10s only for critical metrics.
Metric Retention
Prometheus retains data for 15 days by default, though the Helm chart may set its own value. Increase retention for long-term trend analysis, but monitor storage usage.
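A rough disk estimate helps when choosing a retention period. The sketch below applies the sizing rule from the Prometheus documentation (retention time × ingested samples per second × bytes per sample), assuming about 2 bytes per sample, which is the upper end of the 1 to 2 bytes the documentation cites. The series count in the example is made up, and `estimated_disk_bytes` is a hypothetical helper.

```python
def estimated_disk_bytes(active_series, scrape_interval_s, retention_days,
                         bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # 100,000 active series scraped every 30 s and kept for 15 days.
    size = estimated_disk_bytes(100_000, 30, 15)
    print(f"{size / 1e9:.2f} GB")
```

For these inputs the estimate is about 8.64 GB. Doubling retention or halving the scrape interval doubles it.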
Alert Fatigue
Only alert on actionable issues. Use severity levels (critical, warning, info) and smart routing.
Next Steps
Data Monitoring
Learn about ML-specific monitoring with Evidently and Seldon for drift detection