Overview
Penn Labs uses Grafana to monitor infrastructure and application metrics. Grafana is connected to Prometheus as a data source and displays real-time metrics through several dashboards.
Access: https://grafana.pennlabs.org
Authentication: GitHub OAuth (requires membership in the pennlabs organization)
Dashboard Architecture
Grafana is deployed in the Kubernetes cluster using Helm.
Configuration (terraform/helm/grafana.yaml):
- Persistence: Enabled with 10Gi StatefulSet
- Plugins: grafana-piechart-panel
- Ingress: Available at grafana.pennlabs.org with TLS
- Data Source: Prometheus server in monitoring namespace
Available Dashboards
Penn Labs maintains several custom dashboards in the grafana-dashboards/ directory.
1. Traefik Dashboard
Purpose: Monitor Traefik 1.7 ingress controller instances
File: grafana-dashboards/traefik.json
Metrics Shown:
- Request rate and response times
- HTTP status code breakdown (2xx, 4xx, 5xx)
- Backend health and availability
- Request duration percentiles
- Active connections
Use Cases:
- Diagnose slow API responses
- Identify traffic spikes
- Monitor ingress health
- Track error rates by endpoint
2. Pod Dashboard
Purpose: Monitor the status of all pods across clusters
File: grafana-dashboards/pod-dashboard.json
Metrics Shown:
- Pod status (Running, Pending, Failed)
- Container restarts
- Resource usage (CPU, Memory)
- Pod age and uptime
- Node distribution
Use Cases:
- Quick overview of cluster health
- Identify pods with issues
- Monitor deployment rollouts
- Check resource utilization
- Detect crash loops
3. Pod Alerting Dashboard
Purpose: Alert when pods exceed normal conditions
File: grafana-dashboards/pod-alerting-dashboard.json
Why It Exists:
Grafana currently does not allow a variable data source inside alert queries, so this dedicated dashboard defines the alerts for pod metrics separately.
Alerts Configured:
- High CPU usage
- High memory usage
- Excessive container restarts
- Pod crash loops
- Pods stuck in Pending state
4. Cert Manager Dashboard
Purpose: Monitor TLS certificate status and expiration
File: grafana-dashboards/cert-manager.json
Metrics Shown:
- Certificate expiration dates
- Certificate renewal status
- ACME challenge success/failure
- Certificate issuance time
- Ready vs Not Ready certificates
Use Cases:
- Prevent certificate expiration
- Monitor cert-manager health
- Troubleshoot certificate issues
- Track certificate renewal process
5. Node Exporter Dashboard
Purpose: Monitor Kubernetes node hardware and OS metrics
Source: Grafana community dashboard 1860 (revision 19)
Metrics Shown:
- CPU usage and load average
- Memory and swap usage
- Disk I/O and space
- Network traffic
- System uptime
Use Cases:
- Monitor node health
- Identify resource bottlenecks
- Plan capacity upgrades
- Diagnose performance issues
Dashboard Configuration
Dashboards are configured in terraform/production-cluster.tf and terraform/helm/grafana.yaml.
To update an existing dashboard:
- Edit the JSON file in grafana-dashboards/
- Commit and push to the master branch
- Grafana will reload the dashboard automatically
Data Source Configuration
Grafana is connected to Prometheus as its primary data source:
- Namespace: monitoring
- Service: prometheus-server
- Port: 80 (HTTP)
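The namespace, service, and port above combine into an in-cluster address through standard Kubernetes service DNS. The following is a minimal sketch of that naming rule, assuming the default `cluster.local` cluster domain; the function name is illustrative and not part of the Penn Labs configuration.

```python
def prometheus_url(service: str = "prometheus-server",
                   namespace: str = "monitoring",
                   port: int = 80,
                   cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster HTTP URL for a Kubernetes service.

    cluster_domain is an assumption: "cluster.local" is the Kubernetes
    default, but a cluster can be configured with a different domain.
    """
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    # Port 80 is the HTTP default, so it can be left out of the URL.
    return f"http://{host}" if port == 80 else f"http://{host}:{port}"


print(prometheus_url())
# http://prometheus-server.monitoring.svc.cluster.local
```

From inside the monitoring namespace, the short form `http://prometheus-server` resolves to the same service.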
Alerting Configuration
Grafana sends alerts to Slack when thresholds are exceeded. The Slack integration is configured in terraform/helm/grafana.yaml.
The grafana Kubernetes secret holds:
- ADMIN_USER: Grafana admin username
- ADMIN_PASSWORD: Grafana admin password
- SLACK_NOTIFICATION_URL: Slack webhook for alerts
Using the Dashboards
Accessing Grafana
- Navigate to https://grafana.pennlabs.org
- Click “Sign in with GitHub”
- Authorize the Grafana app
- You’ll be logged in if you’re a member of the pennlabs organization
Navigating Dashboards
Home Screen:
- Click the Grafana logo (top left) to see all dashboards
- Use search to find specific dashboards
- Star frequently used dashboards for quick access
Time Range:
- Use the time picker (top right) to adjust time window
- Common ranges: Last 5m, 15m, 1h, 6h, 24h
- Use “Refresh” dropdown to auto-refresh
Variables:
- Select specific namespace, pod, or service
- Variables filter the displayed metrics
Common Workflows
Checking Deployment Health
- Open Pod Dashboard
- Look for pods with status not “Running”
- Check container restart counts
- Review resource usage for the application
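The checks above can be sketched as a small filter over pod records. This is only an illustration of the triage logic: the record shape (`name`, `phase`, `restarts`) and the `max_restarts` threshold are hypothetical and do not come from the Pod Dashboard itself.

```python
def pods_needing_attention(pods, max_restarts=3):
    """Return (pod name, reason) pairs for pods that look unhealthy.

    pods is a list of dicts with hypothetical keys "name", "phase",
    and "restarts". max_restarts is an illustrative threshold.
    """
    flagged = []
    for pod in pods:
        reasons = []
        if pod["phase"] != "Running":
            reasons.append(f"phase is {pod['phase']}")
        if pod["restarts"] > max_restarts:
            reasons.append(f"{pod['restarts']} restarts")
        if reasons:
            flagged.append((pod["name"], ", ".join(reasons)))
    return flagged


example = [
    {"name": "api-1", "phase": "Running", "restarts": 0},
    {"name": "api-2", "phase": "Pending", "restarts": 0},
    {"name": "worker-1", "phase": "Running", "restarts": 12},
]
print(pods_needing_attention(example))
```

Pods from completed jobs report the phase Succeeded, so this simple filter flags them as well; a real check would probably exclude them.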
Investigating Slow Responses
- Open Traefik Dashboard
- Check request duration percentiles
- Look for increased latency in specific backends
- Correlate with pod resource usage
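When reading the percentile panels, it helps to know how a percentile is estimated from histogram buckets. The sketch below is a simplified version of the linear interpolation that Prometheus's `histogram_quantile` function performs; the bucket boundaries and counts are made-up values, not real Traefik data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs
    whose last upper bound is math.inf. Observations are assumed to be
    spread evenly inside each bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket, so the best
                # available answer is the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets: (upper bound in seconds, count).
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, sample), 3))
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the buckets are spaced around the latencies of interest.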
Monitoring Certificate Expiration
- Open Cert Manager Dashboard
- Check “Days Until Expiration” panel
- Verify certificates are renewing automatically
- Investigate any certificates in “Not Ready” state
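A "days until expiration" value is just the difference between an expiry timestamp and the current time. The sketch below shows that arithmetic, assuming the expiry is available as a Unix timestamp in seconds (cert-manager exposes one per certificate in its `certmanager_certificate_expiration_timestamp_seconds` metric). The 14-day warning threshold is a hypothetical value, not one taken from the dashboard.

```python
SECONDS_PER_DAY = 86_400


def days_until_expiration(expiry_ts, now_ts):
    """Days between now and the certificate expiry (negative if expired)."""
    return (expiry_ts - now_ts) / SECONDS_PER_DAY


def needs_attention(expiry_ts, now_ts, warn_days=14.0):
    """True when the certificate expires in fewer than warn_days days.

    warn_days is an illustrative threshold chosen for this example.
    """
    return days_until_expiration(expiry_ts, now_ts) < warn_days


now = 1_700_000_000  # fixed timestamp so the example is reproducible
print(days_until_expiration(now + 30 * SECONDS_PER_DAY, now))  # 30.0
print(needs_attention(now + 5 * SECONDS_PER_DAY, now))         # True
```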
Troubleshooting Node Issues
- Open Node Exporter Dashboard
- Select the problematic node
- Check CPU, memory, and disk usage
- Review network traffic and errors
Creating Custom Dashboards
In the UI (Temporary)
- Click “+” → “Dashboard”
- Add panels with queries
- Save the dashboard
In Code (Recommended)
- Export the dashboard:
  - Click dashboard settings (gear icon)
  - Click “JSON Model”
  - Copy the JSON
- Save the JSON to the repository
- Add the dashboard to the Grafana config in terraform/helm/grafana.yaml
- Apply the change
- Commit to Git
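Before committing an exported dashboard, it can help to confirm that the copied JSON parses and to normalize it so diffs stay small. The sketch below is one possible way to do that, not part of the Penn Labs tooling. It assumes the export contains a top-level `title`, and it clears the instance-specific numeric `id`, which is a common convention for dashboards loaded from files.

```python
import json


def prepare_dashboard(exported_json):
    """Validate and normalize a dashboard exported from the JSON Model view.

    Raises ValueError if the text is not valid JSON or has no title.
    """
    dashboard = json.loads(exported_json)  # JSONDecodeError is a ValueError
    if not isinstance(dashboard, dict) or "title" not in dashboard:
        raise ValueError("not a dashboard export: missing 'title'")
    # Assumption: the numeric id is specific to one Grafana instance,
    # so it is cleared before the file is committed.
    dashboard["id"] = None
    # Stable key order and indentation keep later Git diffs readable.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


print(prepare_dashboard('{"id": 42, "title": "Pods", "panels": []}'))
```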
Troubleshooting
Dashboard Not Loading
Symptom: Dashboard shows “No data” or doesn’t load
Solutions:
- Check that Prometheus is running
- Verify the data source connection:
  - Go to Configuration → Data Sources
  - Click “Prometheus”
  - Click the “Test” button
- Check that the time range is appropriate (not too far in the past or future)
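One way to confirm that Prometheus is answering queries is to run the `up` query against its HTTP API and look for targets that report 0. The sketch below builds the query URL and interprets a response in the format returned by `/api/v1/query`. The base URL and the sample payload are assumptions for illustration, and the code does not make a network call itself.

```python
import urllib.parse


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def down_targets(response):
    """Return "job/instance" for every `up` sample that is not 1."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown')}")
    down = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) != 1.0:
            labels = sample["metric"]
            down.append(f"{labels.get('job', '?')}/{labels.get('instance', '?')}")
    return down


# Hypothetical response body, shaped like a real instant-query result.
sample_response = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"job": "traefik", "instance": "10.0.0.1:8080"},
         "value": [1700000000, "1"]},
        {"metric": {"job": "node", "instance": "10.0.0.2:9100"},
         "value": [1700000000, "0"]},
    ]},
}
print(query_url("http://localhost:9090", "up"))
print(down_targets(sample_response))
```

An empty result list usually means Prometheus is reachable but scraping nothing, which is a different problem from Prometheus being down.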
Alerts Not Firing
Symptom: No Slack notifications despite a threshold being exceeded
Solutions:
- Check that alert rules are defined in the dashboard
- Verify that the Slack webhook URL is correct
- Test the notification channel in the Grafana UI
- Check the Grafana logs for errors
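To rule out the webhook itself, a test message can be posted to it independently of Grafana. The sketch below builds such a request with the standard library, assuming the webhook URL is exported in a `SLACK_NOTIFICATION_URL` environment variable (matching the secret key name). Slack incoming webhooks accept a JSON body with a `text` field. The message text is arbitrary, and running the script with the variable set posts a real message to the channel.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, message):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    url = os.environ.get("SLACK_NOTIFICATION_URL")
    if url:
        request = build_slack_request(url, "Test message: webhook check")
        with urllib.request.urlopen(request, timeout=10) as response:
            print(response.status)  # 200 means Slack accepted the message
    else:
        print("SLACK_NOTIFICATION_URL is not set; nothing was sent")
```

If this succeeds while Grafana alerts still do not arrive, the problem is more likely in the alert rules or the notification channel than in the webhook.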
Can’t Access Grafana
Symptom: GitHub login doesn’t work
Solutions:
- Verify you’re a member of the pennlabs GitHub organization
- Check the GitHub OAuth app configuration
- Ensure you’ve authorized the Grafana app
- Try incognito/private browsing mode
Dashboard Changes Not Appearing
Symptom: Updated dashboard JSON not showing in Grafana
Solutions:
- Verify the file is committed to the master branch
- Check that the file URL is accessible (test in a browser)
- Restart the Grafana pod
- Clear the browser cache