## Overview
PagerDuty integration allows you to escalate critical CronJob failures to your on-call engineers. This is essential for:
- 24/7 monitoring of business-critical jobs
- Automatic escalation if alerts aren’t acknowledged
- Integration with on-call schedules and rotation
- Incident tracking and post-mortems
## Quick Start

### Get PagerDuty Routing Key

1. Log in to your PagerDuty account
2. Go to Services > Service Directory
3. Select or create a service (e.g., "CronJob Failures")
4. Go to the Integrations tab
5. Click Add Integration
6. Select Events API v2
7. Copy the Integration Key (this is your routing key)
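Before wiring the key into Kubernetes, you can sanity-check it directly against PagerDuty's Events API v2 (the endpoint this integration type uses). This is a sketch; replace `<your-integration-key>`, and note that a valid key will open a real incident on the service, which you should resolve afterwards:

```shell
# Send a test trigger event with the routing key (opens a real incident!)
curl -sS https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "<your-integration-key>",
    "event_action": "trigger",
    "payload": {
      "summary": "Routing key smoke test",
      "source": "cronjob-guardian-docs",
      "severity": "critical"
    }
  }'
# A valid key returns a JSON body with "status": "success"
```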
### Create Kubernetes Secret

Store the routing key in a Secret:

```bash
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<your-integration-key>
```
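To confirm the Secret landed as expected, you can read the stored key back and compare it with the Integration Key in PagerDuty (assumes the Secret name and namespace used above):

```shell
# Decode the routing key stored in the Secret to verify it matches PagerDuty
kubectl get secret pagerduty-key -n cronjob-guardian \
  -o jsonpath='{.data.routing-key}' | base64 -d; echo
```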
### Create PagerDuty AlertChannel

```bash
kubectl apply -f - <<EOF
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-critical
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
EOF
```
### Reference in CronJobMonitor

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-critical
        severities: [critical] # Only page for critical alerts
```
Here’s the example from the repository:

`alertchannels/pagerduty.yaml`

```yaml
# PagerDuty AlertChannel
# Sends critical alerts to PagerDuty for on-call escalation
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-critical
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
```
## Configuration Options

| Field | Description |
| --- | --- |
| `routingKeySecretRef` | Reference to a Kubernetes Secret containing the PagerDuty integration key |
| `routingKeySecretRef.namespace` | Namespace where the Secret exists |
| `routingKeySecretRef.key` | Key within the Secret (usually `routing-key`) |
| `severity` | Default PagerDuty severity level: `critical`, `error`, `warning`, or `info`. Typically set to `critical` for on-call escalation. |
Create separate PagerDuty services and AlertChannels for different teams:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-dba
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-dba-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
```
### Creating Team-Specific Secrets

```bash
# DBA team
kubectl create secret generic pagerduty-dba-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<dba-integration-key>

# Platform team
kubectl create secret generic pagerduty-platform-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<platform-integration-key>

# Data engineering team
kubectl create secret generic pagerduty-data-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<data-integration-key>
```
Escalate only critical failures to PagerDuty, while sending all alerts to Slack:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 100 # Backups must never fail
    maxDuration: 1h
  alerting:
    channelRefs:
      # Page on-call DBA for critical issues
      - name: pagerduty-dba
        severities: [critical]
      # Also send to Slack for visibility
      - name: slack-dba
        severities: [critical, warning]
    # Treat all backup issues as critical
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
      slaBreached: critical
```
Use PagerDuty for:
- Critical backups: Data loss prevention
- Revenue-impacting jobs: Payment processing, billing
- Compliance-critical jobs: Audit logs, regulatory reports
- Customer-facing jobs: Email delivery, notifications
Don’t page for:
- Development environments: Use Slack instead
- Non-critical reports: Warnings are sufficient
- Flaky jobs: Fix the root cause first
Best practice: Route critical alerts to PagerDuty AND Slack for team awareness.
```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: production-critical
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
      environment: production
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99
  alerting:
    channelRefs:
      # Pages on-call engineer
      - name: pagerduty-oncall
        severities: [critical]
      # Notifies #incidents channel
      - name: slack-incidents
        severities: [critical]
      # Notifies #ops-alerts for all issues
      - name: slack-ops
        severities: [critical, warning]
```
Alert flow:

1. Critical job failure occurs
2. PagerDuty pages the on-call engineer
3. Slack #incidents notifies the team
4. Slack #ops-alerts provides visibility
5. Engineer acknowledges in PagerDuty
6. Team collaborates in Slack thread
PagerDuty incidents created by CronJob Guardian include:

- Title: `CronJob Failed: production/daily-backup`
- Description: Job details, exit code, error message
- Details: Kubernetes events, pod logs, suggested fixes
- Links: Dashboard URL, namespace, CronJob manifest
### Customizing Incident Content

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: detailed-pages
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    # Include rich context in PagerDuty incidents
    includeContext:
      logs: true
      logLines: 200 # More logs for debugging
      events: true # Kubernetes events
      podStatus: true # Exit codes, container statuses
      suggestedFixes: true # Automated remediation suggestions
```
Avoid paging for transient failures by adding a delay:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: with-retry-buffer
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    # Wait 5 minutes before paging
    # If the job retries and succeeds, the page is cancelled
    alertDelay: 5m
```
### How Alert Delay Works

1. Job fails at 08:00:00
2. Guardian waits until 08:05:00 before alerting
3. If the job retries and succeeds by 08:04:00, the alert is cancelled
4. If the job is still failed at 08:05:00, a PagerDuty incident is created

Set `alertDelay` slightly longer than your CronJob's `backoffLimit` retry window. For example:

- A CronJob with `backoffLimit: 3` typically retries for ~3-5 minutes
- Set `alertDelay: 5m` to allow retries to complete
- Only page if retries are exhausted
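Note that `backoffLimit` lives on the CronJob's job template, not on the monitor. A sketch of the pairing, assuming the 5-minute delay above (the job name and command are hypothetical):

```yaml
# CronJob side: up to 3 retries before the Job is marked failed
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-sync # hypothetical job name
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      backoffLimit: 3 # retries typically complete within ~3-5 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sync
              image: busybox
              command: ["sh", "-c", "run-sync"] # placeholder command
```

With `alertDelay: 5m` on the monitor, all three retries get a chance to succeed before anyone is paged.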
### Suppressing Duplicate Pages

Prevent repeat pages for the same failure:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: no-duplicate-pages
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    # Don't page again for 1 hour
    suppressDuplicatesFor: 1h
```
This prevents:
- Paging every 5 minutes for the same failed job
- Alert fatigue from recurring issues
- Overwhelming on-call engineers during incidents
### Verify AlertChannel is ready

```bash
kubectl get alertchannel pagerduty-critical
kubectl describe alertchannel pagerduty-critical
```

Look for:

```yaml
Status:
  Conditions:
    - Type: Ready
      Status: True
```
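For scripting (e.g., in CI), the Ready condition can be extracted directly with a jsonpath query; this assumes the AlertChannel name used above:

```shell
# Prints "True" when the channel is ready to send pages
kubectl get alertchannel pagerduty-critical \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```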
### Create a test failing job

```bash
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-page
  namespace: production
  labels:
    tier: critical
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: test
              image: busybox
              command: ["sh", "-c", "echo 'Testing PagerDuty' && exit 1"]
          restartPolicy: Never
EOF
```
### Wait for job failure and page

```bash
# Watch job execution
kubectl get jobs -n production -w
```

After the job fails, check PagerDuty for a new incident.

### Verify incident in PagerDuty
1. Log in to PagerDuty
2. Go to Incidents
3. Look for an incident titled `CronJob Failed: production/test-page`
4. Verify incident details include logs and context

### Acknowledge and resolve

1. Acknowledge the incident in PagerDuty
2. Delete the test CronJob:

   ```bash
   kubectl delete cronjob test-page -n production
   ```

3. Resolve the incident in PagerDuty
## Troubleshooting

### No incidents created in PagerDuty

Check that the AlertChannel reports `Ready` (see the verification steps above) and that the referenced Secret exists in the `cronjob-guardian` namespace with the expected `routing-key` entry.

### PagerDuty returns 403 or 404

This typically indicates an invalid or revoked routing key. Re-copy the Integration Key from the service's Events API v2 integration and recreate the Secret.

### Too many pages

Use alert delays:

```yaml
alerting:
  alertDelay: 5m # Wait before paging
```

Suppress duplicates:

```yaml
alerting:
  suppressDuplicatesFor: 1h
```

Only page for critical:

```yaml
channelRefs:
  - name: pagerduty-oncall
    severities: [critical] # No warnings
```

Use severity overrides:

```yaml
severityOverrides:
  jobFailed: critical
  slaBreached: warning # Don't page for SLA breaches, only Slack
```
### Missing context in incidents

Ensure `includeContext` is configured:

```yaml
alerting:
  includeContext:
    logs: true
    logLines: 200
    events: true
    podStatus: true
    suggestedFixes: true
```

Verify pods are still running when the alert fires (logs may be unavailable if pods are deleted quickly).
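If logs are missing because pods disappear before the alert fires, one standard Kubernetes lever is the Job's `ttlSecondsAfterFinished`: keep finished Jobs (and their pods) around longer than your `alertDelay`. A sketch, assuming a 5-minute delay (the job name and command are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: critical-job # hypothetical
spec:
  schedule: "*/30 * * * *"
  # Keep recent Jobs visible for inspection
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      # Retain the Job and its pods for 10 minutes after it finishes,
      # comfortably past a 5m alertDelay so logs can be captured
      ttlSecondsAfterFinished: 600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: busybox
              command: ["sh", "-c", "do-work"] # placeholder
```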
## Best Practices

- **Critical only**: Only page for critical severity. Send warnings to Slack to avoid alert fatigue.
- **Use alert delays**: Set `alertDelay: 5m` to allow job retries before paging on-call engineers.
- **Combine with Slack**: Always route to both PagerDuty (for escalation) and Slack (for team visibility).
- **Suppress duplicates**: Use `suppressDuplicatesFor: 1h` to prevent repeat pages for the same issue.
- **Include rich context**: Enable logs, events, and suggested fixes to help on-call engineers debug faster.
- **Test the integration**: Regularly test with a failing CronJob to ensure pages reach the right people.
Complete real-world example:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 100
    maxDuration: 1h
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]
      - name: slack-dba
        severities: [critical, warning]
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
    alertDelay: 2m # Allow brief retries
    suppressDuplicatesFor: 30m
    includeContext:
      logs: true
      logLines: 150
      events: true
      suggestedFixes: true
    suggestedFixPatterns:
      - name: disk-full
        match:
          logPattern: "No space left on device|disk full"
        suggestion: "Backup storage is full. Check PVC usage: kubectl get pvc -n {{.Namespace}}"
        priority: 150
```
## Next Steps

- **Slack Alerts**: Set up Slack notifications
- **Webhook Alerts**: Integrate with custom systems
- **Advanced Monitoring**: Configure SLA tracking and maintenance windows
- **Alert Channels Reference**: Complete API documentation