Overview
Beyond basic failure detection, CronJob Guardian provides advanced features for production reliability:
SLA Tracking: Monitor success rates over time
Duration Regression: Detect when jobs start taking longer
Maintenance Windows: Suppress alerts during planned maintenance
Suspended Handling: Manage monitoring of paused CronJobs
Custom Fix Suggestions: Provide automated remediation guidance
Database Backup Monitoring
Critical backup jobs require strict SLA enforcement and fast detection of issues.
monitors/database-backups.yaml
```yaml
# Database Backup Monitoring
# Monitors critical backup jobs with strict SLA requirements
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h  # Daily backups with 1h buffer
  sla:
    enabled: true
    minSuccessRate: 100  # Backups must never fail
    maxDuration: 1h      # Alert if backup takes too long
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
  # Custom fix suggestion for backup failures
  suggestedFixPatterns:
    - name: disk-full
      match:
        logPattern: "No space left on device|disk full"
      suggestion: "Backup storage is full. Check PVC usage: kubectl get pvc -n {{.Namespace}}"
      priority: 150
```
What This Does
Enforces 100% success rate (any failure triggers an alert)
Alerts if backups take longer than 1 hour
Pages the DBA team immediately on any issue
Provides custom troubleshooting for disk-full errors
Setup Instructions
Label backup CronJobs
```shell
kubectl label cronjob postgres-backup type=backup -n databases
kubectl label cronjob mysql-backup type=backup -n databases
```
Create PagerDuty alert channel
```shell
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<dba-team-routing-key>

kubectl apply -f - <<EOF
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-dba
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
EOF
```
Apply the monitor
```shell
kubectl apply -f database-backups.yaml
```
Verify SLA tracking
Check the monitor status for SLA metrics:

```shell
kubectl get cronjobmonitor database-backups -n databases -o yaml
```

Look for:

```yaml
status:
  sla:
    currentSuccessRate: 100.0
    recentExecutions: 24
    failures: 0
```
A 100% SLA requirement means any failure triggers an alert. This is appropriate for critical backups but may be too strict for other workloads. For most jobs, 95-99% is more realistic.
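The success-rate check behind these thresholds is simple arithmetic. A minimal sketch of the math it implies (function names here are illustrative, not Guardian's internals):

```python
def success_rate(results):
    """Percentage of successful runs in the SLA window."""
    if not results:
        return 100.0
    return 100.0 * sum(results) / len(results)

def sla_breached(results, min_success_rate):
    """True when the window's success rate falls below the SLA floor."""
    return success_rate(results) < min_success_rate

# 24 daily runs with a single failure: ~95.8% success.
# Breaches a 100% SLA, but would pass a 95% one.
window = [True] * 23 + [False]
rate = success_rate(window)
```

This is why a 100% floor pages on the very first failure, while 95-99% tolerates occasional flakes.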
ETL Pipeline with Duration Regression
Data pipelines should not only succeed but also run in a predictable time window. Detect performance degradation early.
monitors/data-pipeline.yaml
```yaml
# Data Pipeline Monitoring
# Tracks ETL performance with duration regression detection
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: etl-pipeline
  namespace: data-eng
spec:
  selector:
    matchLabels:
      pipeline: etl
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      buffer: 30m
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
    # ETL jobs have duration SLAs
    maxDuration: 2h
    durationRegressionThreshold: 25  # Alert if P95 increases by 25%
    durationBaselineWindowDays: 14
  alerting:
    channelRefs:
      - name: slack-data-eng
    # Include logs to debug ETL failures
    includeContext:
      logs: true
      logLines: 200
```
What This Does
Auto-detects expected run intervals from CronJob schedules
Alerts if jobs take longer than 2 hours (hard limit)
Detects duration regression if P95 duration increases by 25% compared to the last 14 days
Includes 200 lines of logs in alerts for debugging
Duration Regression Detection
This is powerful for catching performance degradation:
Guardian calculates a baseline P95 duration over the last 14 days
Compares recent runs to the baseline
Alerts if the P95 duration increases by more than 25%
Example:
Baseline P95: 45 minutes
Recent P95: 60 minutes
Increase: 33% (exceeds 25% threshold)
Result: Alert triggered
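The arithmetic in that example can be sketched in a few lines (a nearest-rank P95 is assumed here; Guardian's exact percentile method is not specified):

```python
def p95(durations):
    """Nearest-rank 95th-percentile of run durations (minutes)."""
    ordered = sorted(durations)
    rank = max(1, round(0.95 * len(ordered)))
    return ordered[rank - 1]

def regression_pct(baseline, recent):
    """Percentage increase of the recent P95 over the baseline P95."""
    return 100.0 * (recent - baseline) / baseline

# Baseline P95 of 45 min vs recent P95 of 60 min:
increase = regression_pct(45, 60)  # ~33.3%
alert = increase > 25              # exceeds the 25% threshold
```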
Tune durationRegressionThreshold based on your workload variability:
Stable workloads: 15-25% (detect small changes)
Variable workloads: 50-75% (avoid false positives)
Data-driven: Start at 50%, reduce as you understand variance
Setup Instructions
Label ETL CronJobs
```shell
kubectl label cronjob daily-etl pipeline=etl -n data-eng
kubectl label cronjob hourly-sync pipeline=etl -n data-eng
```
Apply the monitor
```shell
kubectl apply -f data-pipeline.yaml
```
Wait for baseline
Duration regression requires historical data. Wait at least 14 days (the baseline window) before regression detection activates. Check baseline status:

```shell
kubectl get cronjobmonitor etl-pipeline -n data-eng -o jsonpath='{.status.sla.durationBaseline}'
```
Test regression detection
Simulate a slow job:

```yaml
# Add a sleep to your job temporarily
command: ["/bin/sh", "-c", "sleep 3600 && /etl.sh"]
```
After the job runs, check for a regression alert in your Slack channel.
Financial Reports with Maintenance Windows
Suppress alerts during planned downtime like month-end processing or system upgrades.
monitors/financial-reports.yaml
```yaml
# Financial Report Monitoring
# Monitors business-critical reports with maintenance windows
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: financial-reports
  namespace: finance
spec:
  selector:
    matchNames:
      - daily-revenue-report
      - weekly-summary
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  maintenanceWindows:
    - name: quarter-end
      schedule: "0 0 1 1,4,7,10 *"  # First day of each quarter
      duration: 24h
      suppressAlerts: true
  alerting:
    channelRefs:
      - name: slack-finance
```
What This Does
Monitors two specific reports by name
Defines a quarterly maintenance window on Jan 1, Apr 1, Jul 1, Oct 1
Suppresses alerts for 24 hours during quarter-end processing
Automatically resumes monitoring after the window
Maintenance Window Examples
Weekly Maintenance
```yaml
maintenanceWindows:
  - name: weekly-maintenance
    schedule: "0 2 * * 0"  # Every Sunday at 2 AM
    duration: 4h
    timezone: America/New_York
    suppressAlerts: true
```
Maintenance window schedules use the same cron syntax as CronJobs. Use crontab.guru to verify your expressions.
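Once the cron expression has produced a window's start time, the suppression check itself reduces to an interval test. A sketch with hypothetical names (Guardian's actual evaluation is not shown in this doc):

```python
from datetime import datetime, timedelta

def in_maintenance_window(window_start, duration, now):
    """True while `now` falls inside a window that began at
    `window_start` and lasts `duration`; alerts are suppressed then."""
    return window_start <= now < window_start + duration

# Quarter-end window: 24h starting Jan 1 at midnight.
start = datetime(2025, 1, 1, 0, 0)
in_maintenance_window(start, timedelta(hours=24), datetime(2025, 1, 1, 13, 0))  # True
in_maintenance_window(start, timedelta(hours=24), datetime(2025, 1, 2, 1, 0))   # False
```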
Suspended CronJob Handling
Control how monitoring behaves when CronJobs are manually suspended.
```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: with-suspend-handling
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  suspendedHandling:
    pauseMonitoring: true      # Pause monitoring when CronJob is suspended
    alertIfSuspendedFor: 168h  # Alert if suspended for more than 7 days
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-ops
```
What This Does
When a CronJob is suspended (.spec.suspend: true), monitoring pauses
No dead-man’s switch alerts while suspended
If the CronJob remains suspended for 7 days, send a reminder alert
Monitoring automatically resumes when unsuspended
Use Cases
Short-term suspension: Pause a job for debugging without triggering alerts
Long-term reminder: Detect forgotten suspended jobs
Planned downtime: Suspend jobs during migrations without alert noise
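The decision logic this describes is small. A hedged sketch (function name and return values are hypothetical, not Guardian's API):

```python
from datetime import datetime, timedelta, timezone

def suspended_action(suspended_since, now, alert_after=timedelta(hours=168)):
    """What monitoring should do for a (possibly) suspended CronJob:
    run normally, stay quiet, or remind about a forgotten suspension."""
    if suspended_since is None:
        return "monitor"   # not suspended: normal dead-man's switch checks
    if now - suspended_since >= alert_after:
        return "remind"    # suspended past the threshold: send a reminder
    return "pause"         # recently suspended: suppress alerts

now = datetime(2025, 6, 10, tzinfo=timezone.utc)
suspended_action(None, now)                     # "monitor"
suspended_action(now - timedelta(days=2), now)  # "pause"
suspended_action(now - timedelta(days=8), now)  # "remind"
```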
Full-Featured Configuration
A comprehensive example demonstrating all available options:
monitors/full-featured.yaml (excerpt)
```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: full-featured
  namespace: production
spec:
  selector:
    matchExpressions:
      - key: tier
        operator: In
        values: [critical, high]
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      buffer: 1h
      missedScheduleThreshold: 2  # Alert after 2 missed schedules
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
    maxDuration: 30m
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14
  suspendedHandling:
    pauseMonitoring: true
    alertIfSuspendedFor: 168h
  maintenanceWindows:
    - name: weekly-maintenance
      schedule: "0 2 * * 0"
      duration: 4h
      timezone: America/New_York
      suppressAlerts: true
  alerting:
    enabled: true
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
      - name: slack-ops
        severities: [critical, warning]
    severityOverrides:
      jobFailed: critical
      slaBreached: warning
      missedSchedule: warning
      deadManTriggered: critical
      durationRegression: warning
    suppressDuplicatesFor: 1h
    alertDelay: 5m
    includeContext:
      logs: true
      logLines: 100
      events: true
      podStatus: true
      suggestedFixes: true
  suggestedFixPatterns:
    - name: custom-oom
      match:
        exitCode: 137
      suggestion: "Container was OOM killed. Consider increasing memory limits for {{.Namespace}}/{{.Name}}"
      priority: 150
  dataRetention:
    retentionDays: 60
    onCronJobDeletion: purge-after-days
    purgeAfterDays: 7
    storeLogs: true
    logRetentionDays: 30
```
Key Features Explained
```yaml
deadManSwitch:
  autoFromSchedule:
    enabled: true
    buffer: 1h
    missedScheduleThreshold: 2
```
Instead of hardcoding maxTimeSinceLastSuccess, Guardian calculates it from the CronJob’s schedule:
Schedule: 0 */6 * * * (every 6 hours)
Expected interval: 6 hours
With buffer: 7 hours
Alert after: 2 missed schedules = 14 hours
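The bullets above are consistent with counting each missed run as one interval plus the buffer; under that assumption (not confirmed as Guardian's exact formula), the arithmetic is:

```python
from datetime import timedelta

def dead_man_deadline(interval, buffer, missed_threshold):
    """Time since last success before the dead-man's switch fires,
    counting each missed schedule as interval + buffer."""
    return missed_threshold * (interval + buffer)

# Every 6 hours, 1h buffer, alert after 2 missed schedules -> 14h.
deadline = dead_man_deadline(timedelta(hours=6), timedelta(hours=1), 2)
```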
```yaml
severityOverrides:
  jobFailed: critical          # Job failures are critical
  slaBreached: warning         # SLA breach is a warning
  durationRegression: warning
```
Customize alert severity per alert type. Default severities:
jobFailed: critical
deadManTriggered: critical
slaBreached: warning
missedSchedule: warning
durationRegression: warning
Alert Delays and Deduplication
```yaml
alerting:
  alertDelay: 5m             # Wait 5 minutes before sending
  suppressDuplicatesFor: 1h  # Don't resend the same alert for 1 hour
```
Alert Delay : Waits 5 minutes before sending. If the issue resolves (e.g., job retries and succeeds), the alert is cancelled.
Suppress Duplicates : Prevents alert fatigue by not resending the same alert multiple times.
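Deduplication boils down to remembering when each alert key last fired and dropping repeats inside the suppression window. A minimal sketch (class and method names are illustrative, not Guardian's internals):

```python
from datetime import datetime, timedelta

class AlertGate:
    """Suppress duplicate alerts within a rolling window."""

    def __init__(self, suppress_for=timedelta(hours=1)):
        self.suppress_for = suppress_for
        self.last_sent = {}  # alert key -> time it last fired

    def should_send(self, key, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.suppress_for:
            return False  # duplicate inside the window: drop it
        self.last_sent[key] = now
        return True

gate = AlertGate()
t0 = datetime(2025, 1, 1, 12, 0)
gate.should_send("jobFailed/etl", t0)                          # True
gate.should_send("jobFailed/etl", t0 + timedelta(minutes=30))  # False
gate.should_send("jobFailed/etl", t0 + timedelta(hours=2))     # True
```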
```yaml
includeContext:
  logs: true
  logLines: 100
  events: true
  podStatus: true
  suggestedFixes: true
```
Alerts include:
Last 100 lines of pod logs
Kubernetes events related to the job
Pod status (exit codes, reasons)
AI-generated fix suggestions based on error patterns
```yaml
suggestedFixPatterns:
  - name: custom-oom
    match:
      exitCode: 137
    suggestion: "Container was OOM killed. Increase memory limits."
    priority: 150
```
Define custom troubleshooting advice based on:
Exit codes
Log patterns (regex)
Error messages
Higher priority patterns (>100) override built-in suggestions.
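Selection can be sketched as filter-then-max over priorities (structure inferred from the fields above, not Guardian's actual matching code):

```python
import re

def best_fix(patterns, exit_code=None, logs=""):
    """Pick the highest-priority pattern whose match criteria all hit."""
    candidates = []
    for p in patterns:
        match = p["match"]
        if "exitCode" in match and match["exitCode"] != exit_code:
            continue
        if "logPattern" in match and not re.search(match["logPattern"], logs):
            continue
        candidates.append(p)
    return max(candidates, key=lambda p: p["priority"], default=None)

patterns = [
    {"name": "custom-oom", "match": {"exitCode": 137},
     "suggestion": "Increase memory limits.", "priority": 150},
    {"name": "disk-full", "match": {"logPattern": "No space left on device"},
     "suggestion": "Check PVC usage.", "priority": 150},
]
best_fix(patterns, exit_code=137)  # matches custom-oom
```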
```yaml
dataRetention:
  retentionDays: 60
  onCronJobDeletion: purge-after-days
  purgeAfterDays: 7
```
Keep execution history for 60 days
When a CronJob is deleted, retain data for 7 more days before purging
Useful for post-mortem analysis of deleted jobs
Common Patterns
High-Availability Jobs
```yaml
sla:
  enabled: true
  minSuccessRate: 99.9
  windowDays: 30  # Longer window for accurate percentage
  maxDuration: 15m
alerting:
  channelRefs:
    - name: pagerduty-critical
      severities: [critical]
  alertDelay: 0  # No delay, page immediately
```
Variable-Duration Jobs
```yaml
sla:
  enabled: true
  # Don't set maxDuration for jobs with variable runtime
  durationRegressionThreshold: 100  # Alert if duration doubles
  durationBaselineWindowDays: 30    # Longer baseline for stability
```
Development Environment
```yaml
sla:
  enabled: true
  minSuccessRate: 80  # More lenient
  windowDays: 7
alerting:
  channelRefs:
    - name: slack-dev
      severities: [critical]  # Only critical alerts
  suppressDuplicatesFor: 6h   # Reduce noise
```
Next Steps
Slack Alerts: Configure Slack alert channels
PagerDuty Alerts: Set up on-call escalation
Webhook Alerts: Integrate with custom systems
Monitor Reference: Complete API documentation