
Overview

PagerDuty integration allows you to escalate critical CronJob failures to your on-call engineers. This is essential for:
  • 24/7 monitoring of business-critical jobs
  • Automatic escalation if alerts aren’t acknowledged
  • Integration with on-call schedules and rotations
  • Incident tracking and post-mortems

Quick Start

Step 1: Get PagerDuty Routing Key

  1. Log in to your PagerDuty account
  2. Go to Services > Service Directory
  3. Select or create a service (e.g., “CronJob Failures”)
  4. Go to Integrations tab
  5. Click Add Integration
  6. Select Events API v2
  7. Copy the Integration Key (routing key)

Step 2: Create Kubernetes Secret

Store the routing key in a secret:
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<your-integration-key>

Step 3: Create PagerDuty AlertChannel

kubectl apply -f - <<EOF
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-critical
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
EOF

Step 4: Reference in CronJobMonitor

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-critical
        severities: [critical]  # Only page for critical alerts

Basic PagerDuty AlertChannel

Here’s the example from the repository:
alertchannels/pagerduty.yaml
# PagerDuty AlertChannel
# Sends critical alerts to PagerDuty for on-call escalation
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-critical
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical

Configuration Options

pagerduty.routingKeySecretRef (object, required)
Reference to a Kubernetes Secret containing the PagerDuty integration key.

pagerduty.severity (string)
Default PagerDuty severity level: critical, error, warning, or info. Typically set to critical for on-call escalation.
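
The severity value maps onto PagerDuty's Events API v2 severity field, which your PagerDuty service's urgency rules can use for routing. As a sketch, a channel intended for lower-urgency notifications might set severity: error instead (the channel and secret names here are hypothetical):

```yaml
# Sketch: a lower-urgency PagerDuty channel
# (pagerduty-low-urgency and its secret name are hypothetical examples)
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-low-urgency
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-low-urgency-key
      namespace: cronjob-guardian
      key: routing-key
    severity: error
```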

Multi-Team PagerDuty Setup

Create separate PagerDuty services and AlertChannels for different teams:
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-dba
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-dba-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical

Creating Team-Specific Secrets

# DBA team
kubectl create secret generic pagerduty-dba-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<dba-integration-key>

# Platform team
kubectl create secret generic pagerduty-platform-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<platform-integration-key>

# Data engineering team
kubectl create secret generic pagerduty-data-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<data-integration-key>
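
Each secret then backs its own AlertChannel. For example, the platform team's channel follows the same shape as pagerduty-dba above, pointing at the pagerduty-platform-key secret:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-platform
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-platform-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
```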

Critical Job Monitoring with PagerDuty

Escalate only critical failures to PagerDuty, while sending all alerts to Slack:
Database Backups
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 100  # Backups must never fail
    maxDuration: 1h
  alerting:
    channelRefs:
      # Page on-call DBA for critical issues
      - name: pagerduty-dba
        severities: [critical]
      
      # Also send to Slack for visibility
      - name: slack-dba
        severities: [critical, warning]
    
    # Treat all backup issues as critical
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
      slaBreached: critical

When to Use PagerDuty

Use PagerDuty for:
  • Critical backups: Data loss prevention
  • Revenue-impacting jobs: Payment processing, billing
  • Compliance-critical jobs: Audit logs, regulatory reports
  • Customer-facing jobs: Email delivery, notifications
Don’t page for:
  • Development environments: Use Slack instead
  • Non-critical reports: Warnings are sufficient
  • Flaky jobs: Fix the root cause first

PagerDuty + Slack Routing

Best practice: Route critical alerts to PagerDuty AND Slack for team awareness.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: production-critical
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
      environment: production
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99
  alerting:
    channelRefs:
      # Pages on-call engineer
      - name: pagerduty-oncall
        severities: [critical]
      
      # Notifies #incidents channel
      - name: slack-incidents
        severities: [critical]
      
      # Notifies #ops-alerts for all issues
      - name: slack-ops
        severities: [critical, warning]
Alert flow:
  1. Critical job failure occurs
  2. PagerDuty pages the on-call engineer
  3. Slack #incidents notifies the team
  4. Slack #ops-alerts provides visibility
  5. Engineer acknowledges in PagerDuty
  6. Team collaborates in Slack thread

Incident Details in PagerDuty

PagerDuty incidents created by CronJob Guardian include:
  • Title: CronJob Failed: production/daily-backup
  • Description: Job details, exit code, error message
  • Details: Kubernetes events, pod logs, suggested fixes
  • Links: Dashboard URL, namespace, CronJob manifest

Customizing Incident Content

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: detailed-pages
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    
    # Include rich context in PagerDuty incidents
    includeContext:
      logs: true
      logLines: 200        # More logs for debugging
      events: true         # Kubernetes events
      podStatus: true      # Exit codes, container statuses
      suggestedFixes: true # Automated remediation suggestions

Alert Delay for PagerDuty

Avoid paging for transient failures by adding a delay:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: with-retry-buffer
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    
    # Wait 5 minutes before paging
    # If job retries and succeeds, page is cancelled
    alertDelay: 5m

How Alert Delay Works

  1. Job fails at 08:00:00
  2. Guardian waits until 08:05:00 before alerting
  3. If job retries and succeeds by 08:04:00, alert is cancelled
  4. If still failed at 08:05:00, PagerDuty incident is created
Set alertDelay to slightly longer than your CronJob’s backoffLimit retry window. For example:
  • CronJob with backoffLimit: 3 typically retries for ~3-5 minutes
  • Set alertDelay: 5m to allow retries to complete
  • Only page if retries are exhausted
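
For reference, backoffLimit lives on the Job template inside the CronJob spec. A sketch (the job name and image are hypothetical) of a CronJob whose retry window fits comfortably inside a 5m alertDelay:

```yaml
# Sketch: CronJob retries finish before a 5m alertDelay expires
apiVersion: batch/v1
kind: CronJob
metadata:
  name: payments-reconcile   # hypothetical example
  namespace: production
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 3        # pods retry with increasing (~10s, 20s, 40s) delays
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: reconcile
              image: example/reconcile:latest  # hypothetical image
```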

Suppressing Duplicate Pages

Prevent repeat pages for the same failure:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: no-duplicate-pages
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    
    # Don't page again for 1 hour
    suppressDuplicatesFor: 1h
This prevents:
  • Paging every 5 minutes for the same failed job
  • Alert fatigue from recurring issues
  • Overwhelming on-call engineers during incidents
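
alertDelay and suppressDuplicatesFor complement each other: the delay absorbs transient failures before the first page, and suppression caps how often a persistent failure can page afterwards. A combined alerting block:

```yaml
alerting:
  channelRefs:
    - name: pagerduty-oncall
      severities: [critical]
  alertDelay: 5m             # absorb retries before paging at all
  suppressDuplicatesFor: 1h  # then at most one page per hour per issue
```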

Testing PagerDuty Integration

Step 1: Verify the AlertChannel is ready

kubectl get alertchannel pagerduty-critical
kubectl describe alertchannel pagerduty-critical
Look for:
Status:
  Conditions:
    - Type: Ready
      Status: True

Step 2: Create a test failing job

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-page
  namespace: production
  labels:
    tier: critical
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: test
              image: busybox
              command: ["sh", "-c", "echo 'Testing PagerDuty' && exit 1"]
          restartPolicy: Never
EOF

Step 3: Wait for job failure and page

# Watch job execution
kubectl get jobs -n production -w
After the job fails, check PagerDuty for a new incident.
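
If you would rather not wait for the next scheduled run, kubectl can start a one-off Job from the CronJob immediately (the Job name here is arbitrary):

```shell
# Start an immediate run of the test CronJob instead of waiting for the schedule
kubectl create job test-page-now \
  --from=cronjob/test-page \
  --namespace production
```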

Step 4: Verify the incident in PagerDuty

  1. Log in to PagerDuty
  2. Go to Incidents
  3. Look for incident titled CronJob Failed: production/test-page
  4. Verify incident details include logs and context

Step 5: Acknowledge and resolve

  1. Acknowledge the incident in PagerDuty
  2. Delete the test CronJob:
kubectl delete cronjob test-page -n production
  3. Resolve the incident in PagerDuty

Troubleshooting

Check routing key secret:
kubectl get secret pagerduty-key -n cronjob-guardian -o jsonpath='{.data.routing-key}' | base64 -d
Verify this matches your PagerDuty integration key.

Check AlertChannel status:
kubectl describe alertchannel pagerduty-critical
Check controller logs:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller-manager | grep -i pagerduty
The integration key may be invalid or revoked.
  1. Go to PagerDuty > Services > Your Service > Integrations
  2. Verify the Events API v2 integration exists
  3. Regenerate the integration key if needed
  4. Update the secret:
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<new-key> \
  --dry-run=client -o yaml | kubectl apply -f -
Use alert delays:
alerting:
  alertDelay: 5m  # Wait before paging
Suppress duplicates:
alerting:
  suppressDuplicatesFor: 1h
Only page for critical:
channelRefs:
  - name: pagerduty-oncall
    severities: [critical]  # No warnings
Use severity overrides:
severityOverrides:
  jobFailed: critical
  slaBreached: warning  # Don't page for SLA, only Slack
Ensure includeContext is configured:
alerting:
  includeContext:
    logs: true
    logLines: 200
    events: true
    podStatus: true
    suggestedFixes: true
Verify pods are still running when the alert fires (logs may be unavailable if pods are deleted quickly).

Best Practices

Critical Only

Only page for critical severity. Send warnings to Slack to avoid alert fatigue.

Use Alert Delays

Set alertDelay: 5m to allow job retries before paging on-call engineers.

Combine with Slack

Always route to both PagerDuty (for escalation) and Slack (for team visibility).

Suppress Duplicates

Use suppressDuplicatesFor: 1h to prevent repeat pages for the same issue.

Include Rich Context

Enable logs, events, and suggested fixes to help on-call engineers debug faster.

Test Integration

Regularly test with a failing CronJob to ensure pages reach the right people.

Example: Database Backup with PagerDuty

Complete real-world example:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 100
    maxDuration: 1h
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]
      - name: slack-dba
        severities: [critical, warning]
    
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
    
    alertDelay: 2m  # Allow brief retries
    suppressDuplicatesFor: 30m
    
    includeContext:
      logs: true
      logLines: 150
      events: true
      suggestedFixes: true
    
    suggestedFixPatterns:
      - name: disk-full
        match:
          logPattern: "No space left on device|disk full"
        suggestion: "Backup storage is full. Check PVC usage: kubectl get pvc -n {{.Namespace}}"
        priority: 150

Next Steps

Slack Alerts

Set up Slack notifications

Webhook Alerts

Integrate with custom systems

Advanced Monitoring

Configure SLA tracking and maintenance windows

Alert Channels Reference

Complete API documentation
