Overview
Penn Labs uses Grafana for alerting on infrastructure and application metrics. Alerts are triggered when specific thresholds are exceeded, and notifications are sent to Slack.
Alert Architecture
The alerting system consists of three components: Prometheus, Grafana, and Slack.
Prometheus
Prometheus scrapes metrics from Kubernetes and applications. The deployment is defined in terraform/modules/base_cluster/monitoring.tf and configured via terraform/helm/prometheus.yaml:
- AlertManager: Disabled (using Grafana alerts instead)
- Server Version: v2.13.1
- Persistent Volume: 8Gi for storing metrics history
- Namespace: monitoring
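The settings above would correspond to Helm values roughly like the following sketch. This is an assumption based on the prometheus community Helm chart's value schema; the actual contents of terraform/helm/prometheus.yaml may differ.

```yaml
# Sketch only; the real values live in terraform/helm/prometheus.yaml.
alertmanager:
  enabled: false          # alerting is handled by Grafana instead
server:
  image:
    tag: v2.13.1          # Prometheus server version
  persistentVolume:
    size: 8Gi             # volume for metrics history
```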
Grafana Alerts
Grafana evaluates alert rules defined in dashboards and triggers notifications.
Alert Evaluation:
- Queries the Prometheus data source
- Evaluates conditions (thresholds, rate of change, etc.)
- Sends notifications when conditions are met
- Respects notification policies (frequency, grouping)
A dedicated dashboard (pod-alerting-dashboard.json) is used for alerts because Grafana doesn't support variable data sources in alert rules.
Slack Notifications
When alerts fire, Grafana sends notifications to Slack. Configuration (terraform/helm/grafana.yaml):
The grafana Kubernetes secret must contain:
- SLACK_NOTIFICATION_URL - Incoming webhook URL from Slack
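As a rough sketch, the Grafana Helm values could provision the Slack channel like this. The structure follows Grafana's legacy notification-channel provisioning format as exposed by the Grafana Helm chart; the exact contents of terraform/helm/grafana.yaml are assumptions.

```yaml
# Sketch; real config is in terraform/helm/grafana.yaml.
notifiers:
  notifiers.yaml:
    notifiers:
      - name: Slack
        type: slack
        uid: slack
        is_default: true
        settings:
          # Interpolated from the grafana Kubernetes secret at startup
          url: ${SLACK_NOTIFICATION_URL}
```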
Alert Types
Pod Alerts
Alerts for pod health and resource usage:
High CPU Usage:
- Trigger: Pod using >90% CPU for 5+ minutes
- Severity: Warning
- Action: Check if pod needs more resources or is in a loop
High Memory Usage:
- Trigger: Pod using >90% memory for 5+ minutes
- Severity: Warning
- Action: Check for memory leaks or increase limits
Frequent Restarts:
- Trigger: Pod restarted >3 times in 1 hour
- Severity: Critical
- Action: Check logs for crash cause, may need rollback
CrashLoopBackOff:
- Trigger: Pod in CrashLoopBackOff state
- Severity: Critical
- Action: Investigate immediately, service may be down
Pod Pending:
- Trigger: Pod stuck in Pending state for >5 minutes
- Severity: Warning
- Action: Check node resources, may need to scale cluster
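As illustration, PromQL conditions behind alerts like these might look as follows. The metric names assume kube-state-metrics and cAdvisor are being scraped; the actual queries live in pod-alerting-dashboard.json and may differ.

```promql
# High CPU: usage as a fraction of the container CPU limit
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  / sum(kube_pod_container_resource_limits_cpu_cores) by (pod) > 0.9

# Frequent restarts: more than 3 restarts in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3

# CrashLoopBackOff: pod currently waiting in the backoff state
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
```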
Certificate Alerts
Alerts for TLS certificate status:
Certificate Expiring Soon:
- Trigger: Certificate expires in less than 30 days
- Severity: Warning
- Action: Check cert-manager is functioning, may need manual renewal
Certificate Not Ready:
- Trigger: Certificate in NotReady state for >10 minutes
- Severity: Critical
- Action: Check cert-manager logs, ACME challenge may be failing
Ingress Alerts
Alerts for the Traefik ingress controller:
High Error Rate:
- Trigger: >5% of requests returning 5xx errors
- Severity: Critical
- Action: Check application health, may need rollback
High Latency:
- Trigger: p99 response time >2 seconds
- Severity: Warning
- Action: Investigate slow queries or external service issues
Backend Down:
- Trigger: Traefik backend has 0 healthy instances
- Severity: Critical
- Action: Service is down, investigate immediately
Node Alerts
Alerts for Kubernetes node health:
High Node CPU:
- Trigger: Node CPU >85% for 10+ minutes
- Severity: Warning
- Action: May need to scale cluster or optimize pods
High Node Memory:
- Trigger: Node memory >85% for 10+ minutes
- Severity: Warning
- Action: May need to scale cluster or adjust pod limits
Node Disk Filling:
- Trigger: Node disk >80% full
- Severity: Critical
- Action: Clean up old images/logs or increase disk size
Node Not Ready:
- Trigger: Node in NotReady state
- Severity: Critical
- Action: Check node health, may need to replace node
Configuring Alerts
In Grafana Dashboard
Alerts are defined in dashboard panels:
1. Edit the panel:
   - Click panel title → Edit
2. Add an alert rule:
   - Click the "Alert" tab
   - Click "Create Alert"
3. Define conditions:
4. Set notification:
   - Choose "Slack" as the notification channel
   - Add a custom message if needed
5. Test the alert:
   - Click "Test Rule" to verify the query works
   - Save the dashboard
Alert Rule Example
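One way such a rule could be expressed in Grafana's legacy dashboard-alert JSON is sketched below. All values are illustrative; the actual rule lives in pod-alerting-dashboard.json.

```json
{
  "alert": {
    "name": "High CPU Usage",
    "frequency": "1m",
    "for": "5m",
    "conditions": [
      {
        "query": { "params": ["A", "5m", "now"] },
        "reducer": { "type": "avg" },
        "evaluator": { "type": "gt", "params": [0.9] },
        "operator": { "type": "and" }
      }
    ],
    "noDataState": "no_data",
    "executionErrorState": "alerting",
    "notifications": [{ "uid": "slack" }],
    "message": "Pod CPU above 90% for 5+ minutes; check for loops or raise limits"
  }
}
```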
High CPU Alert:
Notification Channels
Slack Channel
The primary notification channel is Slack.
Setup:
1. Create a Slack app:
   - Go to https://api.slack.com/apps
   - Create a new app for the workspace
   - Enable "Incoming Webhooks"
   - Add a webhook to the desired channel (e.g., #alerts)
2. Copy the webhook URL:
3. Store it in Vault:
4. Wait for sync:
   - vault-secret-sync will update the grafana Kubernetes secret
5. Restart Grafana:
6. Test the channel:
   - Go to Grafana → Alerting → Notification channels
   - Click "Slack"
   - Click the "Test" button
   - Check the Slack channel for the test message
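Storing the URL in Vault and verifying the sync might look like the following. The Vault path secret/grafana is a guess; use whatever path vault-secret-sync is configured to watch.

```shell
# Store the webhook URL in Vault (path is hypothetical)
vault kv put secret/grafana \
  SLACK_NOTIFICATION_URL="https://hooks.slack.com/services/..."

# After vault-secret-sync runs, confirm the key reached the cluster
kubectl get secret grafana -o jsonpath='{.data.SLACK_NOTIFICATION_URL}' | base64 -d
```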
Adding Additional Channels
You can add more notification channels in Grafana, such as email.
Alert Best Practices
DO:
- ✓ Set appropriate thresholds (not too sensitive)
- ✓ Use evaluation periods to avoid flapping (e.g., 5m average)
- ✓ Include helpful context in alert messages
- ✓ Test alerts before deploying
- ✓ Group related alerts together
- ✓ Set different severity levels
- ✓ Document what each alert means and how to respond
DON’T:
- ✗ Alert on every small spike (alert fatigue)
- ✗ Use instant queries (too noisy)
- ✗ Forget to set no-data and error states
- ✗ Create alerts without testing them
- ✗ Ignore alerts (defeats the purpose)
- ✗ Send all alerts to everyone
- ✗ Use alerts for non-actionable information
Managing Alert Fatigue
Adjust Thresholds
If alerts are too noisy:
- Review alert frequency in Grafana
- Increase threshold or evaluation period
- Consider if alert is actionable
- Remove or adjust as needed
Notification Policies
Frequency: Control how often alerts are sent:
- Send reminders: Disabled by default
- State changes only: Only alert when the state changes
Grouping:
- Combine alerts for the same service
- Send a digest instead of individual messages
Silences:
- Use Grafana's silence feature
- Set a time range for the silence
- Add a comment explaining why
Alert Severity
Use different channels for different severities:
Critical:
- Immediate attention required
- Send to on-call engineer
- May page 24/7
Warning:
- Issue developing, not immediate
- Send to team channel
- Review during business hours
Info:
- Informational only
- Log or send to a separate channel
- Review periodically
Responding to Alerts
Alert Response Process
1. Acknowledge:
   - React to the Slack message (eyes emoji)
   - Indicates someone is looking into it
2. Investigate:
   - Open the Grafana dashboard
   - Check recent deployments
   - Review logs in Datadog or with kubectl
   - Determine the root cause
3. Mitigate:
   - Roll back if needed
   - Scale resources if it's a capacity issue
   - Fix configuration
   - Apply a hotfix
4. Resolve:
   - Verify metrics return to normal
   - Update the team on the resolution
   - Document in the incident log
5. Follow up:
   - Create an issue for the root cause
   - Adjust the alert threshold if needed
   - Improve monitoring if a gap is identified
Common Alert Scenarios
Scenario 1: High CPU Alert
- Open the Pod Dashboard in Grafana
- Identify which pod(s) triggered alert
- Check pod logs for unusual activity
- Check recent deployments (possible regression)
- Scale up if legitimate traffic spike
- Rollback if caused by bad deployment
Scenario 2: Certificate Alert
- Open the Cert Manager Dashboard
- Check certificate status and expiry date
- Verify cert-manager is running:
- Check certificate resource:
- If not auto-renewing, see Certificate Renewal
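Verifying cert-manager and inspecting the certificate resource might be done like this. The cert-manager namespace and the resource names are assumptions; substitute your own.

```shell
# Verify cert-manager pods are running
kubectl get pods -n cert-manager

# Inspect the certificate resource and its status conditions
kubectl describe certificate <cert-name> -n <namespace>
```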
Scenario 3: High Error Rate
- Open the Traefik Dashboard
- Identify which backend/service is erroring
- Check application logs:
- Check database connectivity
- Rollback if caused by recent deployment
- Scale up if capacity issue
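Checking application logs might look like the following. Deployment and namespace names are placeholders.

```shell
# Recent errors from the failing backend
kubectl logs deployment/<app> -n <namespace> --since=15m | grep -i error

# Cross-check Traefik's own access logs for 5xx responses
kubectl logs deployment/traefik -n <traefik-namespace> --since=15m
```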
Monitoring the Monitoring
Ensure your monitoring stack is healthy:
Prometheus Health
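A basic liveness check for the Prometheus server might look like this, assuming the default chart service name in the monitoring namespace:

```shell
kubectl get pods -n monitoring
kubectl port-forward -n monitoring svc/prometheus-server 9090:80 &
curl -s http://localhost:9090/-/healthy   # expect an HTTP 200 "healthy" response
```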
Grafana Health
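Similarly for Grafana (service name assumed):

```shell
kubectl port-forward -n monitoring svc/grafana 3000:80 &
curl -s http://localhost:3000/api/health   # JSON including "database": "ok" when healthy
```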
Alert Evaluation
In Grafana:
- Go to Alerting → Alert Rules
- Check “Last Evaluation” times
- Ensure alerts are evaluating regularly
- Review alert state history
To test an alert end to end:
- Temporarily lower the threshold
- Wait for alert to fire
- Verify Slack notification received
- Restore original threshold
Terraform Configuration Reference
Prometheus Setup
Module: terraform/modules/base_cluster/monitoring.tf
Grafana Setup
File: terraform/production-cluster.tf
Variables are declared in terraform/variables.tf and set in your local .env file.
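The variable wiring might resemble the following; the variable name is hypothetical and the real declaration in terraform/variables.tf may differ.

```hcl
# Hypothetical declaration in terraform/variables.tf
variable "grafana_slack_url" {
  type        = string
  description = "Slack incoming webhook URL for Grafana alerts"
}
```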
Troubleshooting
Alerts Not Firing
Symptom: The alert condition is met but no notification is sent.
Solutions:
- Check the alert is enabled in the dashboard
- Verify notification channel is configured
- Test notification channel manually
- Check Grafana logs for errors:
- Verify Slack webhook URL is correct
Duplicate Alerts
Symptom: The same alert fires multiple times.
Solutions:
- Check if the alert is defined in multiple dashboards
- Adjust evaluation frequency
- Set "send reminders" to false in the notifier config
- Use alert grouping
Missing Metrics
Symptom: The alert shows "No data".
Solutions:
- Check Prometheus is scraping the target:
  - Port-forward: kubectl port-forward -n monitoring svc/prometheus-server 9090:80
  - Visit http://localhost:9090/targets
- Verify query syntax is correct
- Check time range is appropriate
- Ensure pods have the correct labels for scraping
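For the last point, pods are commonly selected for scraping via annotations like these. This assumes the Helm chart's default kubernetes-pods scrape job is in use; the port value is an example.

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"   # opt the pod into scraping
    prometheus.io/port: "8000"     # metrics port (example)
    prometheus.io/path: "/metrics" # metrics path (default)
```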
Slack Notifications Not Received
Symptom: The alert fires but no Slack message arrives.
Solutions:
- Verify the webhook URL is correct:
- Test webhook URL manually:
- Check Slack app is still installed
- Verify channel still exists
- Check Grafana logs for HTTP errors
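A manual webhook test might look like this; substitute the real URL from the grafana secret.

```shell
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text": "Grafana alerting test"}' \
  "$SLACK_NOTIFICATION_URL"
# Slack incoming webhooks respond with the plain-text body "ok" on success
```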