## Overview
PagerDuty integration allows you to escalate critical CronJob failures to your on-call engineers. This is essential for:
- 24/7 monitoring of business-critical jobs
- Automatic escalation if alerts aren’t acknowledged
- Integration with on-call schedules and rotation
- Incident tracking and post-mortems
## Quick Start

### Get PagerDuty Routing Key

1. Log in to your PagerDuty account
2. Go to Services > Service Directory
3. Select or create a service (e.g., "CronJob Failures")
4. Go to the Integrations tab
5. Click Add Integration
6. Select Events API v2
7. Copy the Integration Key (this is your routing key)
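Before wiring the key into Kubernetes, you can sanity-check it directly against PagerDuty's Events API v2 (the endpoint this integration type uses). This is a sketch; replace `<your-integration-key>`, and note that a valid key will open a real incident on the service, which you should resolve afterwards:

```shell
# Send a test trigger event with the routing key (opens a real incident!)
curl -sS https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "<your-integration-key>",
    "event_action": "trigger",
    "payload": {
      "summary": "Routing key smoke test",
      "source": "cronjob-guardian-docs",
      "severity": "critical"
    }
  }'
# A valid key returns a JSON body with "status": "success"
```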
### Create Kubernetes Secret

Store the routing key in a Secret:

```bash
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<your-integration-key>
```
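To confirm the Secret landed as expected, you can read the stored key back and compare it with the Integration Key in PagerDuty (assumes the Secret name and namespace used above):

```shell
# Decode the routing key stored in the Secret to verify it matches PagerDuty
kubectl get secret pagerduty-key -n cronjob-guardian \
  -o jsonpath='{.data.routing-key}' | base64 -d; echo
```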
### Create PagerDuty AlertChannel

```bash
kubectl apply -f - <<EOF
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-critical
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
EOF
```
### Reference in CronJobMonitor

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-critical
        severities: [critical] # Only page for critical alerts
```
Here’s the example from the repository:

`alertchannels/pagerduty.yaml`

```yaml
# PagerDuty AlertChannel
# Sends critical alerts to PagerDuty for on-call escalation
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-critical
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
```
## Configuration Options

| Field | Description |
| --- | --- |
| `routingKeySecretRef` | Reference to a Kubernetes Secret containing the PagerDuty integration key |
| `routingKeySecretRef.namespace` | Namespace where the Secret exists |
| `routingKeySecretRef.key` | Key within the Secret (usually `routing-key`) |
| `severity` | Default PagerDuty severity level: `critical`, `error`, `warning`, or `info`. Typically set to `critical` for on-call escalation. |
Create separate PagerDuty services and AlertChannels for different teams:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-dba
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-dba-key
      namespace: cronjob-guardian
      key: routing-key
    severity: critical
```
### Creating Team-Specific Secrets

```bash
# DBA team
kubectl create secret generic pagerduty-dba-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<dba-integration-key>

# Platform team
kubectl create secret generic pagerduty-platform-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<platform-integration-key>

# Data engineering team
kubectl create secret generic pagerduty-data-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<data-integration-key>
```
Escalate only critical failures to PagerDuty, while sending all alerts to Slack:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 100 # Backups must never fail
    maxDuration: 1h
  alerting:
    channelRefs:
      # Page on-call DBA for critical issues
      - name: pagerduty-dba
        severities: [critical]
      # Also send to Slack for visibility
      - name: slack-dba
        severities: [critical, warning]
    # Treat all backup issues as critical
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
      slaBreached: critical
```
Use PagerDuty for:
- Critical backups: Data loss prevention
- Revenue-impacting jobs: Payment processing, billing
- Compliance-critical jobs: Audit logs, regulatory reports
- Customer-facing jobs: Email delivery, notifications
Don’t page for:
- Development environments: Use Slack instead
- Non-critical reports: Warnings are sufficient
- Flaky jobs: Fix the root cause first
Best practice: Route critical alerts to PagerDuty AND Slack for team awareness.
```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: production-critical
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
      environment: production
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99
  alerting:
    channelRefs:
      # Pages on-call engineer
      - name: pagerduty-oncall
        severities: [critical]
      # Notifies #incidents channel
      - name: slack-incidents
        severities: [critical]
      # Notifies #ops-alerts for all issues
      - name: slack-ops
        severities: [critical, warning]
```
Alert flow:

1. Critical job failure occurs
2. PagerDuty pages the on-call engineer
3. Slack #incidents notifies the team
4. Slack #ops-alerts provides visibility
5. Engineer acknowledges in PagerDuty
6. Team collaborates in Slack thread
PagerDuty incidents created by CronJob Guardian include:

- Title: `CronJob Failed: production/daily-backup`
- Description: Job details, exit code, error message
- Details: Kubernetes events, pod logs, suggested fixes
- Links: Dashboard URL, namespace, CronJob manifest
### Customizing Incident Content

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: detailed-pages
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    # Include rich context in PagerDuty incidents
    includeContext:
      logs: true
      logLines: 200 # More logs for debugging
      events: true # Kubernetes events
      podStatus: true # Exit codes, container statuses
      suggestedFixes: true # Automated remediation suggestions
```
Avoid paging for transient failures by adding a delay:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: with-retry-buffer
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    # Wait 5 minutes before paging
    # If the job retries and succeeds, the page is cancelled
    alertDelay: 5m
```
### How Alert Delay Works

1. Job fails at 08:00:00
2. Guardian waits until 08:05:00 before alerting
3. If the job retries and succeeds by 08:04:00, the alert is cancelled
4. If the job is still failed at 08:05:00, a PagerDuty incident is created

Set `alertDelay` slightly longer than your CronJob's `backoffLimit` retry window. For example:

- A CronJob with `backoffLimit: 3` typically retries for ~3-5 minutes
- Set `alertDelay: 5m` to allow retries to complete
- Only page if retries are exhausted
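Note that `backoffLimit` lives on the CronJob's job template, not on the monitor. A sketch of the pairing, assuming the 5-minute delay above (the job name and command are hypothetical):

```yaml
# CronJob side: up to 3 retries before the Job is marked failed
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-sync # hypothetical job name
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      backoffLimit: 3 # retries typically complete within ~3-5 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sync
              image: busybox
              command: ["sh", "-c", "run-sync"] # placeholder command
```

With `alertDelay: 5m` on the monitor, all three retries get a chance to succeed before anyone is paged.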
### Suppressing Duplicate Pages

Prevent repeat pages for the same failure:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: no-duplicate-pages
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
    # Don't page again for 1 hour
    suppressDuplicatesFor: 1h
```
This prevents:
- Paging every 5 minutes for the same failed job
- Alert fatigue from recurring issues
- Overwhelming on-call engineers during incidents
### Verify AlertChannel is ready

```bash
kubectl get alertchannel pagerduty-critical
kubectl describe alertchannel pagerduty-critical
```

Look for:

```yaml
Status:
  Conditions:
    - Type: Ready
      Status: True
```
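For scripting (e.g., in CI), the Ready condition can be extracted directly with a jsonpath query; this assumes the AlertChannel name used above:

```shell
# Prints "True" when the channel is ready to send pages
kubectl get alertchannel pagerduty-critical \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```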
### Create a test failing job

```bash
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-page
  namespace: production
  labels:
    tier: critical
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: test
              image: busybox
              command: ["sh", "-c", "echo 'Testing PagerDuty' && exit 1"]
          restartPolicy: Never
EOF
```
### Wait for job failure and page

```bash
# Watch job execution
kubectl get jobs -n production -w
```

After the job fails, check PagerDuty for a new incident.

### Verify incident in PagerDuty
1. Log in to PagerDuty
2. Go to Incidents
3. Look for an incident titled `CronJob Failed: production/test-page`
4. Verify incident details include logs and context

### Acknowledge and resolve

1. Acknowledge the incident in PagerDuty
2. Delete the test CronJob:

   ```bash
   kubectl delete cronjob test-page -n production
   ```

3. Resolve the incident in PagerDuty
## Troubleshooting

### No incidents created in PagerDuty

Check that the AlertChannel reports `Ready` (see the verification steps above) and that the referenced Secret exists in the `cronjob-guardian` namespace with the expected `routing-key` entry.

### PagerDuty returns 403 or 404

This typically indicates an invalid or revoked routing key. Re-copy the Integration Key from the service's Events API v2 integration and recreate the Secret.

### Too many pages

Use alert delays:

```yaml
alerting:
  alertDelay: 5m # Wait before paging
```

Suppress duplicates:

```yaml
alerting:
  suppressDuplicatesFor: 1h
```

Only page for critical:

```yaml
channelRefs:
  - name: pagerduty-oncall
    severities: [critical] # No warnings
```

Use severity overrides:

```yaml
severityOverrides:
  jobFailed: critical
  slaBreached: warning # Don't page for SLA breaches, only Slack
```
### Missing context in incidents

Ensure `includeContext` is configured:

```yaml
alerting:
  includeContext:
    logs: true
    logLines: 200
    events: true
    podStatus: true
    suggestedFixes: true
```

Verify pods are still running when the alert fires (logs may be unavailable if pods are deleted quickly).
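If logs are missing because pods disappear before the alert fires, one standard Kubernetes lever is the Job's `ttlSecondsAfterFinished`: keep finished Jobs (and their pods) around longer than your `alertDelay`. A sketch, assuming a 5-minute delay (the job name and command are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: critical-job # hypothetical
spec:
  schedule: "*/30 * * * *"
  # Keep recent Jobs visible for inspection
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      # Retain the Job and its pods for 10 minutes after it finishes,
      # comfortably past a 5m alertDelay so logs can be captured
      ttlSecondsAfterFinished: 600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: busybox
              command: ["sh", "-c", "do-work"] # placeholder
```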
## Best Practices

- **Critical only**: Only page for critical severity. Send warnings to Slack to avoid alert fatigue.
- **Use alert delays**: Set `alertDelay: 5m` to allow job retries before paging on-call engineers.
- **Combine with Slack**: Always route to both PagerDuty (for escalation) and Slack (for team visibility).
- **Suppress duplicates**: Use `suppressDuplicatesFor: 1h` to prevent repeat pages for the same issue.
- **Include rich context**: Enable logs, events, and suggested fixes to help on-call engineers debug faster.
- **Test the integration**: Regularly test with a failing CronJob to ensure pages reach the right people.
Complete real-world example:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 100
    maxDuration: 1h
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]
      - name: slack-dba
        severities: [critical, warning]
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
    alertDelay: 2m # Allow brief retries
    suppressDuplicatesFor: 30m
    includeContext:
      logs: true
      logLines: 150
      events: true
      suggestedFixes: true
    suggestedFixPatterns:
      - name: disk-full
        match:
          logPattern: "No space left on device|disk full"
        suggestion: "Backup storage is full. Check PVC usage: kubectl get pvc -n {{.Namespace}}"
        priority: 150
```
## Next Steps

- **Slack Alerts**: Set up Slack notifications
- **Webhook Alerts**: Integrate with custom systems
- **Advanced Monitoring**: Configure SLA tracking and maintenance windows
- **Alert Channels Reference**: Complete API documentation