Overview
Penn Labs uses Grafana for alerting on infrastructure and application metrics. Alerts are triggered when specific thresholds are exceeded, and notifications are sent to Slack.
Alert Architecture
The alerting system consists of three components: Prometheus, Grafana, and Slack.
Prometheus
Prometheus scrapes metrics from Kubernetes and applications. The deployment is defined in terraform/modules/base_cluster/monitoring.tf and configured via terraform/helm/prometheus.yaml:
- AlertManager: Disabled (using Grafana alerts instead)
- Server Version: v2.13.1
- Persistent Volume: 8Gi for storing metrics history
- Namespace: monitoring
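The settings above would correspond to Helm values roughly like the following sketch. This is an assumption based on the prometheus community Helm chart's value schema; the actual contents of terraform/helm/prometheus.yaml may differ.

```yaml
# Sketch only; the real values live in terraform/helm/prometheus.yaml.
alertmanager:
  enabled: false          # alerting is handled by Grafana instead
server:
  image:
    tag: v2.13.1          # Prometheus server version
  persistentVolume:
    size: 8Gi             # volume for metrics history
```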
Grafana Alerts
Grafana evaluates alert rules defined in dashboards and triggers notifications.
Alert Evaluation:
- Queries the Prometheus data source
- Evaluates conditions (thresholds, rate of change, etc.)
- Sends notifications when conditions are met
- Respects notification policies (frequency, grouping)
A dedicated dashboard (pod-alerting-dashboard.json) is used for alerts because Grafana doesn't support variable data sources in alert rules.
Slack Notifications
When alerts fire, Grafana sends notifications to Slack. Configuration (terraform/helm/grafana.yaml):
The grafana Kubernetes secret must contain:
- SLACK_NOTIFICATION_URL - Incoming webhook URL from Slack
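As a rough sketch, the Grafana Helm values could provision the Slack channel like this. The structure follows Grafana's legacy notification-channel provisioning format as exposed by the Grafana Helm chart; the exact contents of terraform/helm/grafana.yaml are assumptions.

```yaml
# Sketch; real config is in terraform/helm/grafana.yaml.
notifiers:
  notifiers.yaml:
    notifiers:
      - name: Slack
        type: slack
        uid: slack
        is_default: true
        settings:
          # Interpolated from the grafana Kubernetes secret at startup
          url: ${SLACK_NOTIFICATION_URL}
```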
Alert Types
Pod Alerts
Alerts for pod health and resource usage:
High CPU Usage:
- Trigger: Pod using >90% CPU for 5+ minutes
- Severity: Warning
- Action: Check if pod needs more resources or is in a loop
High Memory Usage:
- Trigger: Pod using >90% memory for 5+ minutes
- Severity: Warning
- Action: Check for memory leaks or increase limits
Frequent Restarts:
- Trigger: Pod restarted >3 times in 1 hour
- Severity: Critical
- Action: Check logs for crash cause, may need rollback
CrashLoopBackOff:
- Trigger: Pod in CrashLoopBackOff state
- Severity: Critical
- Action: Investigate immediately, service may be down
Pod Pending:
- Trigger: Pod stuck in Pending state for >5 minutes
- Severity: Warning
- Action: Check node resources, may need to scale cluster
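As illustration, PromQL conditions behind alerts like these might look as follows. The metric names assume kube-state-metrics and cAdvisor are being scraped; the actual queries live in pod-alerting-dashboard.json and may differ.

```promql
# High CPU: usage as a fraction of the container CPU limit
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  / sum(kube_pod_container_resource_limits_cpu_cores) by (pod) > 0.9

# Frequent restarts: more than 3 restarts in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3

# CrashLoopBackOff: pod currently waiting in the backoff state
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
```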
Certificate Alerts
Alerts for TLS certificate status:
Certificate Expiring Soon:
- Trigger: Certificate expires in less than 30 days
- Severity: Warning
- Action: Check cert-manager is functioning, may need manual renewal
Certificate Not Ready:
- Trigger: Certificate in NotReady state for >10 minutes
- Severity: Critical
- Action: Check cert-manager logs, ACME challenge may be failing
Ingress Alerts
Alerts for the Traefik ingress controller:
High Error Rate:
- Trigger: >5% of requests returning 5xx errors
- Severity: Critical
- Action: Check application health, may need rollback
High Latency:
- Trigger: p99 response time >2 seconds
- Severity: Warning
- Action: Investigate slow queries or external service issues
Backend Down:
- Trigger: Traefik backend has 0 healthy instances
- Severity: Critical
- Action: Service is down, investigate immediately
Node Alerts
Alerts for Kubernetes node health:
High Node CPU:
- Trigger: Node CPU >85% for 10+ minutes
- Severity: Warning
- Action: May need to scale cluster or optimize pods
High Node Memory:
- Trigger: Node memory >85% for 10+ minutes
- Severity: Warning
- Action: May need to scale cluster or adjust pod limits
Node Disk Filling:
- Trigger: Node disk >80% full
- Severity: Critical
- Action: Clean up old images/logs or increase disk size
Node Not Ready:
- Trigger: Node in NotReady state
- Severity: Critical
- Action: Check node health, may need to replace node
Configuring Alerts
In Grafana Dashboard
Alerts are defined in dashboard panels:
1. Edit the panel:
   - Click panel title → Edit
2. Add an alert rule:
   - Click the "Alert" tab
   - Click "Create Alert"
3. Define conditions:
4. Set notification:
   - Choose "Slack" as the notification channel
   - Add a custom message if needed
5. Test the alert:
   - Click "Test Rule" to verify the query works
   - Save the dashboard
Alert Rule Example
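One way such a rule could be expressed in Grafana's legacy dashboard-alert JSON is sketched below. All values are illustrative; the actual rule lives in pod-alerting-dashboard.json.

```json
{
  "alert": {
    "name": "High CPU Usage",
    "frequency": "1m",
    "for": "5m",
    "conditions": [
      {
        "query": { "params": ["A", "5m", "now"] },
        "reducer": { "type": "avg" },
        "evaluator": { "type": "gt", "params": [0.9] },
        "operator": { "type": "and" }
      }
    ],
    "noDataState": "no_data",
    "executionErrorState": "alerting",
    "notifications": [{ "uid": "slack" }],
    "message": "Pod CPU above 90% for 5+ minutes; check for loops or raise limits"
  }
}
```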
High CPU Alert:
Notification Channels
Slack Channel
The primary notification channel is Slack.
Setup:
1. Create a Slack app:
   - Go to https://api.slack.com/apps
   - Create a new app for the workspace
   - Enable "Incoming Webhooks"
   - Add a webhook to the desired channel (e.g., #alerts)
2. Copy the webhook URL:
3. Store it in Vault:
4. Wait for sync:
   - vault-secret-sync will update the grafana Kubernetes secret
5. Restart Grafana:
6. Test the channel:
   - Go to Grafana → Alerting → Notification channels
   - Click "Slack"
   - Click the "Test" button
   - Check the Slack channel for the test message
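Storing the URL in Vault and verifying the sync might look like the following. The Vault path secret/grafana is a guess; use whatever path vault-secret-sync is configured to watch.

```shell
# Store the webhook URL in Vault (path is hypothetical)
vault kv put secret/grafana \
  SLACK_NOTIFICATION_URL="https://hooks.slack.com/services/..."

# After vault-secret-sync runs, confirm the key reached the cluster
kubectl get secret grafana -o jsonpath='{.data.SLACK_NOTIFICATION_URL}' | base64 -d
```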
Adding Additional Channels
You can add more notification channels in Grafana, such as email.
Alert Best Practices
DO:
- ✓ Set appropriate thresholds (not too sensitive)
- ✓ Use evaluation periods to avoid flapping (e.g., 5m average)
- ✓ Include helpful context in alert messages
- ✓ Test alerts before deploying
- ✓ Group related alerts together
- ✓ Set different severity levels
- ✓ Document what each alert means and how to respond
DON’T:
- ✗ Alert on every small spike (alert fatigue)
- ✗ Use instant queries (too noisy)
- ✗ Forget to set no-data and error states
- ✗ Create alerts without testing them
- ✗ Ignore alerts (defeats the purpose)
- ✗ Send all alerts to everyone
- ✗ Use alerts for non-actionable information
Managing Alert Fatigue
Adjust Thresholds
If alerts are too noisy:
- Review alert frequency in Grafana
- Increase threshold or evaluation period
- Consider if alert is actionable
- Remove or adjust as needed
Notification Policies
Frequency: Control how often alerts are sent:
- Send reminders: Disabled by default
- State changes only: Only alert when the state changes
Grouping:
- Combine alerts for the same service
- Send a digest instead of individual messages
Silences:
- Use Grafana's silence feature
- Set a time range for the silence
- Add a comment explaining why
Alert Severity
Use different channels for different severities:
Critical:
- Immediate attention required
- Send to on-call engineer
- May page 24/7
Warning:
- Issue developing, not immediate
- Send to team channel
- Review during business hours
Info:
- Informational only
- Log or send to a separate channel
- Review periodically
Responding to Alerts
Alert Response Process
1. Acknowledge:
   - React to the Slack message (eyes emoji)
   - Indicates someone is looking into it
2. Investigate:
   - Open the Grafana dashboard
   - Check recent deployments
   - Review logs in Datadog or with kubectl
   - Determine the root cause
3. Mitigate:
   - Roll back if needed
   - Scale resources if it's a capacity issue
   - Fix configuration
   - Apply a hotfix
4. Resolve:
   - Verify metrics return to normal
   - Update the team on the resolution
   - Document in the incident log
5. Follow up:
   - Create an issue for the root cause
   - Adjust the alert threshold if needed
   - Improve monitoring if a gap is identified
Common Alert Scenarios
Scenario 1: High CPU Alert
- Open the Pod Dashboard in Grafana
- Identify which pod(s) triggered alert
- Check pod logs for unusual activity
- Check recent deployments (possible regression)
- Scale up if legitimate traffic spike
- Rollback if caused by bad deployment
Scenario 2: Certificate Alert
- Open the Cert Manager Dashboard
- Check certificate status and expiry date
- Verify cert-manager is running:
- Check certificate resource:
- If not auto-renewing, see Certificate Renewal
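Verifying cert-manager and inspecting the certificate resource might be done like this. The cert-manager namespace and the resource names are assumptions; substitute your own.

```shell
# Verify cert-manager pods are running
kubectl get pods -n cert-manager

# Inspect the certificate resource and its status conditions
kubectl describe certificate <cert-name> -n <namespace>
```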
Scenario 3: High Error Rate
- Open the Traefik Dashboard
- Identify which backend/service is erroring
- Check application logs:
- Check database connectivity
- Rollback if caused by recent deployment
- Scale up if capacity issue
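Checking application logs might look like the following. Deployment and namespace names are placeholders.

```shell
# Recent errors from the failing backend
kubectl logs deployment/<app> -n <namespace> --since=15m | grep -i error

# Cross-check Traefik's own access logs for 5xx responses
kubectl logs deployment/traefik -n <traefik-namespace> --since=15m
```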
Monitoring the Monitoring
Ensure your monitoring stack is healthy:
Prometheus Health
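A basic liveness check for the Prometheus server might look like this, assuming the default chart service name in the monitoring namespace:

```shell
kubectl get pods -n monitoring
kubectl port-forward -n monitoring svc/prometheus-server 9090:80 &
curl -s http://localhost:9090/-/healthy   # expect an HTTP 200 "healthy" response
```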
Grafana Health
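Similarly for Grafana (service name assumed):

```shell
kubectl port-forward -n monitoring svc/grafana 3000:80 &
curl -s http://localhost:3000/api/health   # JSON including "database": "ok" when healthy
```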
Alert Evaluation
In Grafana:
- Go to Alerting → Alert Rules
- Check “Last Evaluation” times
- Ensure alerts are evaluating regularly
- Review alert state history
To test an alert end to end:
- Temporarily lower the threshold
- Wait for alert to fire
- Verify Slack notification received
- Restore original threshold
Terraform Configuration Reference
Prometheus Setup
Module: terraform/modules/base_cluster/monitoring.tf
Grafana Setup
File: terraform/production-cluster.tf
Variables are declared in terraform/variables.tf and set in your local .env file.
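The variable wiring might resemble the following; the variable name is hypothetical and the real declaration in terraform/variables.tf may differ.

```hcl
# Hypothetical declaration in terraform/variables.tf
variable "grafana_slack_url" {
  type        = string
  description = "Slack incoming webhook URL for Grafana alerts"
}
```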
Troubleshooting
Alerts Not Firing
Symptom: The alert condition is met but no notification is sent.
Solutions:
- Check the alert is enabled in the dashboard
- Verify notification channel is configured
- Test notification channel manually
- Check Grafana logs for errors:
- Verify Slack webhook URL is correct
Duplicate Alerts
Symptom: The same alert fires multiple times.
Solutions:
- Check if the alert is defined in multiple dashboards
- Adjust evaluation frequency
- Set "send reminders" to false in the notifier config
- Use alert grouping
Missing Metrics
Symptom: The alert shows "No data".
Solutions:
- Check Prometheus is scraping the target:
  - Port-forward: kubectl port-forward -n monitoring svc/prometheus-server 9090:80
  - Visit http://localhost:9090/targets
- Verify query syntax is correct
- Check time range is appropriate
- Ensure pods have the correct labels for scraping
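For the last point, pods are commonly selected for scraping via annotations like these. This assumes the Helm chart's default kubernetes-pods scrape job is in use; the port value is an example.

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"   # opt the pod into scraping
    prometheus.io/port: "8000"     # metrics port (example)
    prometheus.io/path: "/metrics" # metrics path (default)
```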
Slack Notifications Not Received
Symptom: The alert fires but no Slack message arrives.
Solutions:
- Verify the webhook URL is correct:
- Test webhook URL manually:
- Check Slack app is still installed
- Verify channel still exists
- Check Grafana logs for HTTP errors
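A manual webhook test might look like this; substitute the real URL from the grafana secret.

```shell
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text": "Grafana alerting test"}' \
  "$SLACK_NOTIFICATION_URL"
# Slack incoming webhooks respond with the plain-text body "ok" on success
```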