Overview
A dead-man’s switch is a safety mechanism that triggers an alert when an expected event doesn’t happen. For CronJobs, this means alerting when a job fails to run on its schedule.
Traditional monitoring waits for failures. Dead-man’s switch monitoring detects the absence of success, catching:
- Misconfigured schedules
- Resource quota exhaustion preventing job creation
- Controller or cluster issues
- Accidentally suspended CronJobs
How It Works
The dead-man’s switch analyzer checks if enough time has elapsed since the last successful job execution:
Calculate Expected Interval
Determine how often the job should run, either from:
- Fixed interval: maxTimeSinceLastSuccess (e.g., 25h for daily jobs)
- Auto-detected: Parse cron schedule and add buffer
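The core check described above can be sketched in a few lines. This is an illustrative sketch only, not the analyzer's actual code: the function name `is_overdue` and its parameters are hypothetical, and it assumes the expected interval has already been resolved to a duration.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_success: datetime, now: datetime,
               expected_interval: timedelta) -> bool:
    """Return True when no success has been seen within expected_interval."""
    return now - last_success > expected_interval


# A daily job with a 25h window (24h schedule + 1h buffer).
window = timedelta(hours=25)
now = datetime(2024, 1, 3, 2, 0, tzinfo=timezone.utc)

recent = datetime(2024, 1, 2, 2, 30, tzinfo=timezone.utc)  # 23.5h before now
stale = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)    # 26h before now

print(is_overdue(recent, now, window))  # False
print(is_overdue(stale, now, window))   # True
```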
Configuration
Fixed Interval
Specify a fixed time window:
maxTimeSinceLastSuccess: Alert if no successful execution within this duration.
Choosing the right value:
- For daily jobs (0 0 * * *): Use 25h (24h schedule + 1h buffer)
- For hourly jobs (0 * * * *): Use 75m (60m schedule + 15m buffer)
- For weekly jobs (0 0 * * 0): Use 169h (168h schedule + 1h buffer)
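The "schedule period plus buffer" arithmetic behind these recommendations can be checked with a small helper. The sketch below assumes durations are written as digit-and-unit strings such as 25h or 75m; `parse_duration` and `window_for` are hypothetical names for illustration, not part of the tool.

```python
import re
from datetime import timedelta

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '25h', '75m', or '1h30m' into a timedelta."""
    if not re.fullmatch(r"(\d+[hms])+", text):
        raise ValueError(f"unsupported duration: {text!r}")
    seconds = sum(int(n) * _UNIT_SECONDS[u]
                  for n, u in re.findall(r"(\d+)([hms])", text))
    return timedelta(seconds=seconds)


def window_for(schedule_period: str, buffer: str) -> timedelta:
    """Alert window = how often the job runs + slack for slow or late runs."""
    return parse_duration(schedule_period) + parse_duration(buffer)


print(window_for("24h", "1h") == parse_duration("25h"))    # True (daily)
print(window_for("60m", "15m") == parse_duration("75m"))   # True (hourly)
print(window_for("168h", "1h") == parse_duration("169h"))  # True (weekly)
```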
Auto-Detection from Schedule
Automatically parse the cron schedule:
- Enable auto-detection from the CronJob’s schedule field.
- Extra time added to the detected interval. For a daily job (0 0 * * *), the detected interval is 24h. With a 1h buffer, the total expected interval is 25h.
- Number of missed schedules before alerting. Set to 2 to allow one missed run (useful for flaky jobs).
Schedule Parsing
The analyzer parses cron expressions using the standard 5-field format (minute, hour, day of month, month, day of week).
Caching
Schedule parsing is expensive. The analyzer uses an LRU cache to avoid repeatedly parsing the same schedule expression (e.g., 0 0 * * *).
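To illustrate the parsing-plus-caching idea, here is a minimal sketch of a 5-field parser wrapped in Python's `functools.lru_cache`. It is an assumption-laden stand-in, not the analyzer's parser: `parse_schedule` and `_expand` are hypothetical names, the cache size of 256 is arbitrary, and only numeric fields with `*`, ranges, lists, and steps are handled (no month or weekday names, no `7` for Sunday).

```python
from functools import lru_cache

# (min, max) for minute, hour, day-of-month, month, day-of-week
_RANGES = ((0, 59), (0, 23), (1, 31), (1, 12), (0, 6))


def _expand(field: str, lo: int, hi: int) -> frozenset:
    """Expand one cron field into the set of values it matches."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-")
            start, end = int(a), int(b)
        else:
            start = int(base)
            end = hi if step else start
        if not (lo <= start <= end <= hi) or step_n < 1:
            raise ValueError(f"field out of range: {field!r}")
        values.update(range(start, end + 1, step_n))
    return frozenset(values)


@lru_cache(maxsize=256)
def parse_schedule(expr: str) -> tuple:
    """Parse a 5-field cron expression; results are cached per expression."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expr!r}")
    return tuple(_expand(f, lo, hi) for f, (lo, hi) in zip(fields, _RANGES))


daily = parse_schedule("0 0 * * *")
parse_schedule("0 0 * * *")                 # second call is served from the cache
print(sorted(daily[0]), sorted(daily[1]))   # [0] [0]
print(parse_schedule.cache_info().hits)     # 1
```

Keying the cache on the raw expression string keeps lookups cheap when many CronJobs share the same schedule.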
Alert Behavior
When the dead-man’s switch triggers:
Alert Message
The alert message includes:
- Time elapsed since last run
- Number of missed schedules
- Configured threshold
- Expected interval
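A message carrying these four values could be assembled roughly as follows. The wording of the real alert is not shown in this document, so the format string is invented for illustration; `missed_schedules`, `hours`, and `build_alert_message` are hypothetical helpers, and counting missed schedules as whole intervals elapsed is an assumption.

```python
from datetime import timedelta


def missed_schedules(elapsed: timedelta, interval: timedelta) -> int:
    """Whole expected intervals that fit into the elapsed time."""
    return elapsed // interval


def hours(d: timedelta) -> str:
    """Render a duration as a compact hour count, e.g. '25h'."""
    return f"{d.total_seconds() / 3600:g}h"


def build_alert_message(name: str, elapsed: timedelta,
                        interval: timedelta, threshold: int) -> str:
    missed = missed_schedules(elapsed, interval)
    return (f"CronJob {name}: {hours(elapsed)} since last run, "
            f"{missed} missed schedule(s) (threshold {threshold}), "
            f"expected interval {hours(interval)}")


msg = build_alert_message("nightly-backup", timedelta(hours=50),
                          timedelta(hours=25), threshold=2)
print(msg)
# CronJob nightly-backup: 50h since last run, 2 missed schedule(s) (threshold 2), expected interval 25h
```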
Alert Severity
Default severity is critical. Override via:
Duplicate Suppression
Once triggered, the alert remains active until:
- A job succeeds (clears the alert)
- The alert is manually cleared
- The monitor is deleted
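The suppression rule amounts to a small piece of per-monitor state: notify on the transition into the alerting state, stay quiet while it persists, and reset on success or manual clear. The class below is a hypothetical sketch of that behavior, not the tool's implementation.

```python
class DeadManAlertState:
    """Tracks whether an alert is already active for one monitor."""

    def __init__(self) -> None:
        self.active = False

    def on_check(self, overdue: bool) -> bool:
        """Return True only when a new notification should be sent."""
        if overdue and not self.active:
            self.active = True
            return True
        return False  # either healthy, or already alerting (suppressed)

    def on_job_success(self) -> None:
        self.active = False  # a successful job clears the alert

    def clear(self) -> None:
        self.active = False  # manual clear


state = DeadManAlertState()
print(state.on_check(overdue=True))   # True  (first trigger notifies)
print(state.on_check(overdue=True))   # False (duplicate suppressed)
state.on_job_success()
print(state.on_check(overdue=True))   # True  (fires again after a clear)
```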
Edge Cases
No Execution History
For newly created CronJobs with no execution history, the analyzer waits for the expected interval to elapse from creation time before alerting.
Suspended CronJobs
If suspendedHandling.pauseMonitoring is true (default), dead-man’s switch checks are skipped for suspended CronJobs to avoid false alarms.
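As a rough sketch of that gate: the CronJob's own `spec.suspend` field (a standard Kubernetes field) is combined with the monitor's pauseMonitoring setting. The function name `should_run_check` and the dict-shaped CronJob are assumptions made for illustration.

```python
def should_run_check(cronjob: dict, pause_monitoring: bool = True) -> bool:
    """Skip the dead-man's switch check for suspended CronJobs when paused."""
    suspended = bool(cronjob.get("spec", {}).get("suspend", False))
    return not (suspended and pause_monitoring)


suspended_job = {"spec": {"schedule": "0 0 * * *", "suspend": True}}
running_job = {"spec": {"schedule": "0 0 * * *"}}

print(should_run_check(suspended_job))                          # False
print(should_run_check(suspended_job, pause_monitoring=False))  # True
print(should_run_check(running_job))                            # True
```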
Timezone Handling
The analyzer uses the CronJob’s spec.timeZone field (Kubernetes 1.25+) when parsing schedules.
Examples
Critical Daily Backup
Flexible Reporting Jobs
High-Frequency Jobs
Monitoring the Monitor
Expose dead-man’s switch metrics via Prometheus:
Troubleshooting
False alarms for irregular schedules
Problem: Auto-detection calculates the wrong interval for schedules like 0 0 1 * * (monthly).
Solution: Use a fixed maxTimeSinceLastSuccess instead.
Alerts during planned maintenance
Problem: Dead-man’s switch triggers during scheduled downtime.
Solution: Configure maintenance windows.
Immediate alert after CronJob creation
Problem: Alert fires immediately for a brand new CronJob.
Solution: The analyzer waits for expectedInterval to elapse from creation time before alerting. If you see immediate alerts, check:
- Is the CronJob actually running? (kubectl get jobs)
- Is the schedule valid? (kubectl describe cronjob)
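The creation-time grace period can be pictured as falling back to the CronJob's creation timestamp whenever there is no successful run to measure from. This is a sketch under that assumption; `should_alert` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_alert(now: datetime, created_at: datetime,
                 last_success: Optional[datetime],
                 expected_interval: timedelta) -> bool:
    """Measure from the last success, or from creation if there is none."""
    reference = last_success if last_success is not None else created_at
    return now - reference > expected_interval


created = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
window = timedelta(hours=25)

# 12h after creation, no runs yet: still inside the grace period.
print(should_alert(created + timedelta(hours=12), created, None, window))  # False
# 26h after creation, still no success: the switch fires.
print(should_alert(created + timedelta(hours=26), created, None, window))  # True
```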
Next Steps
SLA Tracking
Monitor success rates and detect regressions
Alert Configuration
Customize alert behavior and routing
Suggested Fixes
Automatically suggest remediation actions