
Overview

Penn Labs uses Grafana for alerting on infrastructure and application metrics. Alerts fire when specific thresholds are exceeded, and notifications are sent to Slack.

Alert Architecture

The alerting system consists of three components:
Prometheus (Metrics) → Grafana (Alerts) → Slack (Notifications)

Prometheus

Prometheus scrapes metrics from Kubernetes and applications.
Deployment (terraform/modules/base_cluster/monitoring.tf):
resource "helm_release" "prometheus" {
  name       = "prometheus"
  repository = "https://charts.helm.sh/stable"
  chart      = "prometheus"
  version    = "11.2.3"
  namespace  = kubernetes_namespace.monitoring.metadata[0].name
  
  values = var.prometheus_values
}
Configuration (terraform/helm/prometheus.yaml):
  • AlertManager: Disabled (using Grafana alerts instead)
  • Server Version: v2.13.1
  • Persistent Volume: 8Gi for storing metrics history
  • Namespace: monitoring
Why AlertManager is Disabled: Grafana provides a more user-friendly interface for managing alerts and integrates directly with our dashboards, so we use Grafana’s alerting instead of Prometheus AlertManager.
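The configuration bullets above map roughly onto these chart values (a sketch assuming the stable/prometheus chart’s value names — verify against the actual terraform/helm/prometheus.yaml):

```yaml
# Sketch of terraform/helm/prometheus.yaml; key names follow the
# stable/prometheus chart conventions and may differ from the real file.
alertmanager:
  enabled: false          # Grafana handles alerting instead
server:
  image:
    tag: v2.13.1          # Prometheus server version
  persistentVolume:
    size: 8Gi             # metrics history storage
```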

Grafana Alerts

Grafana evaluates alert rules defined in dashboards and triggers notifications.
Alert Evaluation:
  • Queries Prometheus data source
  • Evaluates conditions (thresholds, rate of change, etc.)
  • Sends notifications when conditions are met
  • Respects notification policies (frequency, grouping)
Pod Alerting Dashboard: A dedicated dashboard (pod-alerting-dashboard.json) is used for alerts because Grafana doesn’t support variable datasources in alert rules.

Slack Notifications

When alerts fire, Grafana sends notifications to Slack.
Configuration (terraform/helm/grafana.yaml):
notifiers:
  notifiers.yaml:
    notifiers:
      - name: Slack
        type: slack
        uid: slack
        org_id: 1
        is_default: true
        send_reminder: false
        settings:
          url: ${SLACK_NOTIFICATION_URL}
Slack Webhook URL: The webhook URL is stored in Vault and loaded as an environment variable:
module "vault" {
  source              = "./modules/vault"
  GF_SLACK_URL        = var.GF_SLACK_URL
  # ... other config
}
The grafana secret must contain:
  • SLACK_NOTIFICATION_URL - Incoming webhook URL from Slack

Alert Types

Pod Alerts

Alerts for pod health and resource usage:
High CPU Usage:
  • Trigger: Pod using >90% CPU for 5+ minutes
  • Severity: Warning
  • Action: Check if pod needs more resources or is in a loop
High Memory Usage:
  • Trigger: Pod using >90% memory for 5+ minutes
  • Severity: Warning
  • Action: Check for memory leaks or increase limits
Excessive Restarts:
  • Trigger: Pod restarted >3 times in 1 hour
  • Severity: Critical
  • Action: Check logs for crash cause, may need rollback
Crash Loop:
  • Trigger: Pod in CrashLoopBackOff state
  • Severity: Critical
  • Action: Investigate immediately, service may be down
Pending Pods:
  • Trigger: Pod stuck in Pending state for >5 minutes
  • Severity: Warning
  • Action: Check node resources, may need to scale cluster
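Queries like the following could back the pod alerts above (sketches assuming kube-state-metrics is scraped by Prometheus; exact metric names may vary by version):

```promql
# Excessive restarts: more than 3 container restarts in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3

# Pending pods: pod reported in the Pending phase
kube_pod_status_phase{phase="Pending"} == 1
```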

Certificate Alerts

Alerts for TLS certificate status:
Certificate Expiring Soon:
  • Trigger: Certificate expires in less than 30 days
  • Severity: Warning
  • Action: Check cert-manager is functioning, may need manual renewal
Certificate Not Ready:
  • Trigger: Certificate in NotReady state for >10 minutes
  • Severity: Critical
  • Action: Check cert-manager logs, ACME challenge may be failing
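An expiry check like this could drive the first alert (a sketch assuming Prometheus scrapes cert-manager’s metrics endpoint; the metric prefix differs across cert-manager versions):

```promql
# Certificates with less than 30 days until expiry
certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600
```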

Ingress Alerts

Alerts for Traefik ingress controller:
High Error Rate:
  • Trigger: >5% requests returning 5xx errors
  • Severity: Critical
  • Action: Check application health, may need rollback
High Latency:
  • Trigger: p99 response time >2 seconds
  • Severity: Warning
  • Action: Investigate slow queries or external service issues
Backend Down:
  • Trigger: Traefik backend has 0 healthy instances
  • Severity: Critical
  • Action: Service is down, investigate immediately
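The error-rate trigger could be expressed roughly as follows (a sketch using Traefik 1.x metric names; Traefik 2.x renames these to traefik_service_*):

```promql
# Fraction of requests returning 5xx over the last 5 minutes
sum(rate(traefik_backend_requests_total{code=~"5.."}[5m]))
  / sum(rate(traefik_backend_requests_total[5m])) > 0.05
```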

Node Alerts

Alerts for Kubernetes node health:
High Node CPU:
  • Trigger: Node CPU >85% for 10+ minutes
  • Severity: Warning
  • Action: May need to scale cluster or optimize pods
High Node Memory:
  • Trigger: Node memory >85% for 10+ minutes
  • Severity: Warning
  • Action: May need to scale cluster or adjust pod limits
Disk Space Low:
  • Trigger: Node disk >80% full
  • Severity: Critical
  • Action: Clean up old images/logs or increase disk size
Node Not Ready:
  • Trigger: Node in NotReady state
  • Severity: Critical
  • Action: Check node health, may need to replace node
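The CPU and disk triggers above could be sketched like this (assuming node-exporter metrics are available; filesystem label values depend on the node OS):

```promql
# Node CPU above 85%, averaged over 10 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 85

# Node disk more than 80% full (excluding ephemeral filesystems)
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
```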

Configuring Alerts

In Grafana Dashboard

Alerts are defined in dashboard panels:
  1. Edit the panel:
    • Click panel title → Edit
  2. Add alert rule:
    • Click “Alert” tab
    • Click “Create Alert”
  3. Define conditions:
    WHEN avg() OF query(A, 5m, now) IS ABOVE 90
    
  4. Set notification:
    • Choose “Slack” as notification channel
    • Add custom message if needed
  5. Test the alert:
    • Click “Test Rule” to verify query works
  6. Save dashboard

Alert Rule Example

High CPU Alert:
{
  "alert": {
    "name": "High Pod CPU Usage",
    "message": "Pod {{ $labels.pod }} is using {{ $value }}% CPU",
    "frequency": "1m",
    "handler": 1,
    "conditions": [
      {
        "type": "query",
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {
          "type": "avg"
        },
        "evaluator": {
          "type": "gt",
          "params": [90]
        }
      }
    ],
    "noDataState": "no_data",
    "executionErrorState": "alerting"
  }
}
Query:
sum(rate(container_cpu_usage_seconds_total{pod!=""}[5m])) by (pod) * 100

Notification Channels

Slack Channel

The primary notification channel is Slack. Setup:
  1. Create Slack App:
    • Go to https://api.slack.com/apps
    • Create new app for workspace
    • Enable “Incoming Webhooks”
    • Add webhook to desired channel (e.g., #alerts)
  2. Copy Webhook URL:
    https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX
    
  3. Store in Vault:
    vault kv put secrets/production/default/grafana \
      SLACK_NOTIFICATION_URL="https://hooks.slack.com/services/..."
    
  4. Wait for sync:
    • vault-secret-sync will update the grafana Kubernetes secret
  5. Restart Grafana:
    kubectl rollout restart deployment/grafana
    
Testing Notifications:
  1. Go to Grafana → Alerting → Notification channels
  2. Click “Slack”
  3. Click “Test” button
  4. Check Slack channel for test message

Adding Additional Channels

You can add more notification channels in Grafana:
Email:
notifiers:
  notifiers.yaml:
    notifiers:
      - name: Email
        type: email
        uid: email
        settings:
          addresses: [email protected]
PagerDuty:
notifiers:
  notifiers.yaml:
    notifiers:
      - name: PagerDuty
        type: pagerduty
        uid: pagerduty
        settings:
          integrationKey: ${PAGERDUTY_KEY}
Webhook:
notifiers:
  notifiers.yaml:
    notifiers:
      - name: Custom Webhook
        type: webhook
        uid: webhook
        settings:
          url: https://example.com/alerts
          httpMethod: POST

Alert Best Practices

DO:

  • ✓ Set appropriate thresholds (not too sensitive)
  • ✓ Use evaluation periods to avoid flapping (e.g., 5m average)
  • ✓ Include helpful context in alert messages
  • ✓ Test alerts before deploying
  • ✓ Group related alerts together
  • ✓ Set different severity levels
  • ✓ Document what each alert means and how to respond

DON’T:

  • ✗ Alert on every small spike (alert fatigue)
  • ✗ Use instant queries (too noisy)
  • ✗ Forget to set no-data and error states
  • ✗ Create alerts without testing them
  • ✗ Ignore alerts (defeats the purpose)
  • ✗ Send all alerts to everyone
  • ✗ Use alerts for non-actionable information

Managing Alert Fatigue

Adjust Thresholds

If alerts are too noisy:
  1. Review alert frequency in Grafana
  2. Increase threshold or evaluation period
  3. Consider if alert is actionable
  4. Remove or adjust as needed

Notification Policies

Frequency: Control how often alerts are sent:
  • Send reminders: Disabled by default
  • State changes only: Only alert when state changes
Grouping: Group similar alerts together:
  • Combine alerts for same service
  • Send digest instead of individual messages
Silence Periods: Temporarily silence alerts during maintenance:
  • Use Grafana’s silence feature
  • Set time range for silence
  • Add comment explaining why

Alert Severity

Use different channels for different severities:
Critical:
  • Immediate attention required
  • Send to on-call engineer
  • May page 24/7
Warning:
  • Issue developing, not immediate
  • Send to team channel
  • Review during business hours
Info:
  • Informational only
  • Log or send to separate channel
  • Review periodically

Responding to Alerts

Alert Response Process

  1. Acknowledge:
    • React to Slack message (eyes emoji)
    • Indicates someone is looking into it
  2. Investigate:
    • Open Grafana dashboard
    • Check recent deployments
    • Review logs in Datadog or kubectl
    • Determine root cause
  3. Mitigate:
    • Rollback if needed
    • Scale resources if capacity issue
    • Fix configuration
    • Apply hotfix
  4. Resolve:
    • Verify metrics return to normal
    • Update team on resolution
    • Document in incident log
  5. Follow-up:
    • Create issue for root cause
    • Adjust alert threshold if needed
    • Improve monitoring if gap identified

Common Alert Scenarios

Scenario 1: High CPU Alert
  1. Open Pod Dashboard in Grafana
  2. Identify which pod(s) triggered alert
  3. Check pod logs for unusual activity
  4. Check recent deployments (possible regression)
  5. Scale up if legitimate traffic spike
  6. Rollback if caused by bad deployment
Scenario 2: Certificate Expiring
  1. Open Cert Manager Dashboard
  2. Check certificate status and expiry date
  3. Verify cert-manager is running:
    kubectl get pods -n cert-manager
    
  4. Check certificate resource:
    kubectl describe certificate <cert-name>
    
  5. If not auto-renewing, see Certificate Renewal
Scenario 3: High Error Rate
  1. Open Traefik Dashboard
  2. Identify which backend/service is erroring
  3. Check application logs:
    kubectl logs -l app=<app-name>
    
  4. Check database connectivity
  5. Rollback if caused by recent deployment
  6. Scale up if capacity issue

Monitoring the Monitoring

Ensure your monitoring stack is healthy:

Prometheus Health

# Check Prometheus is running
kubectl get pods -n monitoring -l app=prometheus

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Visit http://localhost:9090/targets

Grafana Health

# Check Grafana is running
kubectl get pods -l app=grafana

# Check logs for errors
kubectl logs -l app=grafana

# Test data source connection in UI

Alert Evaluation

In Grafana:
  1. Go to Alerting → Alert Rules
  2. Check “Last Evaluation” times
  3. Ensure alerts are evaluating regularly
  4. Review alert state history
Test Alerts: Periodically test that alerts fire:
  1. Temporarily lower threshold
  2. Wait for alert to fire
  3. Verify Slack notification received
  4. Restore original threshold

Terraform Configuration Reference

Prometheus Setup

Module: terraform/modules/base_cluster/monitoring.tf
resource "kubernetes_namespace" "monitoring" {
  metadata {
    name = "monitoring"
  }
}

resource "helm_release" "prometheus" {
  name       = "prometheus"
  repository = "https://charts.helm.sh/stable"
  chart      = "prometheus"
  version    = "11.2.3"
  namespace  = kubernetes_namespace.monitoring.metadata[0].name
  
  values = var.prometheus_values
}

Grafana Setup

File: terraform/production-cluster.tf
resource "helm_release" "grafana" {
  name       = "grafana"
  repository = "https://charts.helm.sh/stable"
  chart      = "grafana"
  version    = "5.1.4"
  
  values = [file("helm/grafana.yaml")]
}
Secrets:
module "vault" {
  source              = "./modules/vault"
  GF_GH_CLIENT_ID     = var.GF_GH_CLIENT_ID
  GF_GH_CLIENT_SECRET = var.GF_GH_CLIENT_SECRET
  GF_SLACK_URL        = var.GF_SLACK_URL
  # ...
}
These variables must be defined in terraform/variables.tf and set in your local .env file.

Troubleshooting

Alerts Not Firing

Symptom: Alert condition is met but no notification is sent.
Solutions:
  1. Check alert is enabled in dashboard
  2. Verify notification channel is configured
  3. Test notification channel manually
  4. Check Grafana logs for errors:
    kubectl logs -l app=grafana | grep -i alert
    
  5. Verify Slack webhook URL is correct

Duplicate Alerts

Symptom: Same alert firing multiple times.
Solutions:
  1. Check if alert is defined in multiple dashboards
  2. Adjust evaluation frequency
  3. Set send_reminder: false in the notifier config
  4. Use alert grouping

Missing Metrics

Symptom: Alert shows “No data”.
Solutions:
  1. Check that Prometheus is scraping the target
  2. Verify query syntax is correct
  3. Check time range is appropriate
  4. Ensure pods have the correct labels for scraping
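A quick way to spot scrape failures is to query for down targets (the up series is built into Prometheus for every scrape target):

```promql
# Targets Prometheus knows about but cannot currently scrape
up == 0
```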

Slack Notifications Not Received

Symptom: Alert fires but no Slack message is received.
Solutions:
  1. Verify webhook URL is correct:
    kubectl get secret grafana -o jsonpath='{.data.SLACK_NOTIFICATION_URL}' | base64 -d
    
  2. Test webhook URL manually:
    curl -X POST <webhook-url> -H 'Content-Type: application/json' -d '{"text":"Test"}'
    
  3. Check Slack app is still installed
  4. Verify channel still exists
  5. Check Grafana logs for HTTP errors

Additional Resources
