Introduction
CronJob Guardian is a Kubernetes operator that provides comprehensive monitoring, SLA tracking, and intelligent alerting for CronJobs. It acts as a vigilant watchdog for your scheduled workloads, detecting failures, performance regressions, and missed schedules before they impact your business.
How It Works
CronJob Guardian operates through a reconciliation loop that continuously monitors your CronJobs.
Core Components
CronJobMonitor Controller
Reconciles CronJobMonitor resources, discovers matching CronJobs, and orchestrates monitoring workflows.
Job Handler
Watches Job completions, records execution history, and extracts failure context (logs, events, exit codes).
SLA Analyzer
Calculates success rates, duration percentiles (P50/P95/P99), and detects performance regressions.
Alert Dispatcher
Routes alerts to configured channels with deduplication, rate limiting, and delayed dispatch support.
Reconciliation Flow
The controller reconciles every 30 seconds and on CronJob changes:
Selector Evaluation
The controller evaluates the monitor’s selector to find matching CronJobs. Selectors support:
- Label matching (matchLabels, matchExpressions)
- Explicit name lists (matchNames)
- Multi-namespace monitoring (namespaces, namespaceSelector, allNamespaces)
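As a rough illustration of selector evaluation, the sketch below matches CronJobs against matchLabels and matchNames only. The Selector and CronJobRef types are hypothetical simplifications, and the assumption that both fields are ANDed when set may not reflect how the operator actually combines them.

```go
package main

import "fmt"

// Selector is a simplified, hypothetical stand-in for the monitor's selector.
// Only matchLabels and matchNames are modelled; matchExpressions and the
// namespace fields are left out to keep the sketch short.
type Selector struct {
	MatchLabels map[string]string
	MatchNames  []string
}

// CronJobRef holds the parts of a CronJob this sketch needs.
type CronJobRef struct {
	Name   string
	Labels map[string]string
}

// Matches reports whether cj satisfies every constraint set on the selector.
// Assumption: when both fields are set they are ANDed together.
func (s Selector) Matches(cj CronJobRef) bool {
	for key, want := range s.MatchLabels {
		if got, ok := cj.Labels[key]; !ok || got != want {
			return false
		}
	}
	if len(s.MatchNames) > 0 {
		found := false
		for _, name := range s.MatchNames {
			if name == cj.Name {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	sel := Selector{MatchLabels: map[string]string{"tier": "critical"}}
	jobs := []CronJobRef{
		{Name: "backup", Labels: map[string]string{"tier": "critical"}},
		{Name: "cleanup", Labels: map[string]string{"tier": "low"}},
	}
	for _, cj := range jobs {
		fmt.Println(cj.Name, sel.Matches(cj))
	}
}
```

An empty selector matches everything in this sketch; whether the operator treats an empty selector that way is not stated here.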
Execution History Lookup
For each matched CronJob, the controller queries the database for recent execution history including:
- Last successful execution time and duration
- Last failed execution with exit code and reason
- Success rate over the configured window (default: 7 days)
- Duration percentiles for performance tracking
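The success rate and duration percentiles listed above could be derived from execution records along these lines. This is a minimal sketch: the Execution type is hypothetical, and the nearest-rank percentile method is an assumption, since the method the analyzer uses is not specified here.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Execution is a hypothetical, trimmed-down execution record.
type Execution struct {
	Succeeded       bool
	DurationSeconds float64
}

// SuccessRate returns the percentage of successful executions (0 when empty).
func SuccessRate(execs []Execution) float64 {
	if len(execs) == 0 {
		return 0
	}
	ok := 0
	for _, e := range execs {
		if e.Succeeded {
			ok++
		}
	}
	return 100 * float64(ok) / float64(len(execs))
}

// Percentile returns the p-th percentile (0 < p <= 100) of the durations
// using the nearest-rank method. Assumption: other methods, such as linear
// interpolation, would give slightly different values.
func Percentile(execs []Execution, p float64) float64 {
	if len(execs) == 0 {
		return 0
	}
	durations := make([]float64, len(execs))
	for i, e := range execs {
		durations[i] = e.DurationSeconds
	}
	sort.Float64s(durations)
	rank := int(math.Ceil(p / 100 * float64(len(durations))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(durations) {
		rank = len(durations)
	}
	return durations[rank-1]
}

func main() {
	execs := []Execution{
		{Succeeded: true, DurationSeconds: 10},
		{Succeeded: true, DurationSeconds: 20},
		{Succeeded: false, DurationSeconds: 30},
		{Succeeded: true, DurationSeconds: 40},
	}
	fmt.Printf("success rate: %.1f%%\n", SuccessRate(execs))
	fmt.Printf("P50=%.0fs P95=%.0fs P99=%.0fs\n",
		Percentile(execs, 50), Percentile(execs, 95), Percentile(execs, 99))
}
```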
Alert Condition Checks
The analyzer checks multiple conditions in parallel:
- Dead-Man’s Switch: Has a job not run within the expected interval?
- Job Failures: Did the last execution fail?
- SLA Violations: Is success rate below threshold? Did duration exceed max?
- Performance Regression: Has P95 duration increased significantly?
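The four checks above can be sketched as a single evaluation function. Everything here is illustrative: the Snapshot and Thresholds types, the alert names, and the percentage-based regression rule are assumptions, and the sketch runs the checks sequentially rather than in parallel to stay short.

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot is a hypothetical summary of one CronJob's recent history.
type Snapshot struct {
	SinceLastSuccess time.Duration
	LastRunFailed    bool
	SuccessRate      float64 // percent over the configured window
	LastDuration     time.Duration
	BaselineP95      time.Duration
	CurrentP95       time.Duration
}

// Thresholds holds hypothetical limits for each check.
type Thresholds struct {
	MaxTimeSinceSuccess time.Duration
	MinSuccessRate      float64
	MaxDuration         time.Duration
	RegressionPercent   float64 // allowed P95 increase over baseline
}

// Evaluate returns the names of all conditions that currently fire.
func Evaluate(s Snapshot, t Thresholds) []string {
	var alerts []string
	// Dead-man's switch: no success within the expected interval.
	if s.SinceLastSuccess > t.MaxTimeSinceSuccess {
		alerts = append(alerts, "DeadManTriggered")
	}
	// Job failure: the most recent execution failed.
	if s.LastRunFailed {
		alerts = append(alerts, "JobFailed")
	}
	// SLA violation: success rate too low or run time too long.
	if s.SuccessRate < t.MinSuccessRate || s.LastDuration > t.MaxDuration {
		alerts = append(alerts, "SLAViolation")
	}
	// Performance regression: P95 grew by more than the allowed percentage.
	if s.BaselineP95 > 0 {
		increase := 100 * float64(s.CurrentP95-s.BaselineP95) / float64(s.BaselineP95)
		if increase > t.RegressionPercent {
			alerts = append(alerts, "PerformanceRegression")
		}
	}
	return alerts
}

func main() {
	limits := Thresholds{
		MaxTimeSinceSuccess: 2 * time.Hour,
		MinSuccessRate:      95,
		MaxDuration:         2 * time.Minute,
		RegressionPercent:   50,
	}
	snap := Snapshot{
		SinceLastSuccess: 3 * time.Hour,
		LastRunFailed:    true,
		SuccessRate:      90,
		LastDuration:     time.Minute,
		BaselineP95:      100 * time.Second,
		CurrentP95:       200 * time.Second,
	}
	fmt.Println(Evaluate(snap, limits))
}
```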
Alert Dispatch
When conditions trigger, alerts are dispatched through the configured channels:
- Duplicate suppression prevents alert storms (default: 1 hour window)
- Alert delay allows transient issues to self-resolve
- Suggested fixes provide actionable remediation steps
Data Flow
Execution Recording
When a Job completes, the Job Handler:
- Identifies the parent CronJob via owner references
- Extracts execution metadata:
  - Start and completion times
  - Success/failure status
  - Exit code and termination reason
- Gathers contextual data:
  - Pod logs (configurable lines, default: 50)
  - Kubernetes events related to the pod
  - Container statuses and resource usage
- Generates suggested fixes by matching failure patterns:
  - OOM kills → increase memory limits
  - Connection timeouts → check network policies
  - Exit code 143 → graceful shutdown timeout
- Stores the execution record in the database for historical analysis
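The pattern-to-fix mapping in the list above might look something like the following. The Failure type, the log substring, and the wording of the suggestions are illustrative assumptions; the only facts relied on are that Kubernetes reports OOM kills with the termination reason OOMKilled and that exit code 143 corresponds to SIGTERM (128 + 15).

```go
package main

import (
	"fmt"
	"strings"
)

// Failure is a hypothetical summary of a failed execution.
type Failure struct {
	ExitCode int
	Reason   string // container termination reason, e.g. "OOMKilled"
	Logs     string // tail of the pod logs
}

// SuggestFix returns a remediation hint for known failure patterns,
// or an empty string when nothing matches.
func SuggestFix(f Failure) string {
	switch {
	case f.Reason == "OOMKilled":
		return "increase memory limits"
	case strings.Contains(strings.ToLower(f.Logs), "connection timed out"):
		// Assumed log pattern; real matching would likely cover more variants.
		return "check network policies"
	case f.ExitCode == 143:
		return "review the graceful shutdown timeout"
	default:
		return ""
	}
}

func main() {
	failures := []Failure{
		{ExitCode: 137, Reason: "OOMKilled"},
		{ExitCode: 1, Reason: "Error", Logs: "dial tcp 10.0.0.5:5432: connection timed out"},
		{ExitCode: 143, Reason: "Error"},
	}
	for _, f := range failures {
		fmt.Printf("exit=%d reason=%s -> %q\n", f.ExitCode, f.Reason, SuggestFix(f))
	}
}
```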
Metrics Calculation
The SLA Analyzer computes metrics on demand during reconciliation.
Resource Relationships
A single CronJobMonitor can watch multiple CronJobs, even across namespaces. This enables centralized monitoring policies for related workloads.
Controller Architecture
The controller follows Kubernetes operator best practices:
Finalizers
CronJobMonitors use finalizers (guardian.illenium.net/finalizer) to ensure graceful cleanup:
- Clear all pending alerts for the monitor
- Optionally purge execution history based on dataRetention.onCronJobDeletion
Status Updates
Status updates use optimistic concurrency control with retry logic:
- The controller detects mid-reconcile spec changes via generation tracking
- Conflicts trigger immediate requeues to recompute with fresh data
- Status reflects the observed generation to indicate reconciliation state
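To make the conflict handling concrete, here is a self-contained sketch that simulates it with an in-memory store instead of a real API server. The Monitor and Store types and the in-loop retry are assumptions for illustration; the text above describes conflicts as triggering a requeue, which this sketch approximates by re-reading and recomputing inside a loop.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Monitor is a hypothetical, minimal stand-in for a CronJobMonitor object.
type Monitor struct {
	ResourceVersion    int // bumped by the store on every write
	Generation         int // bumped on spec changes
	ObservedGeneration int // status field written by the controller
}

// ErrConflict signals that a write was based on a stale read.
var ErrConflict = errors.New("conflict: object was modified")

// Store mimics an API server that rejects writes based on stale reads.
type Store struct {
	mu  sync.Mutex
	obj Monitor
}

func NewStore(m Monitor) *Store { return &Store{obj: m} }

// Get returns a copy of the current object.
func (s *Store) Get() Monitor {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

// UpdateStatus succeeds only if m was read at the current resource version.
func (s *Store) UpdateStatus(m Monitor) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if m.ResourceVersion != s.obj.ResourceVersion {
		return ErrConflict
	}
	s.obj.ObservedGeneration = m.ObservedGeneration
	s.obj.ResourceVersion++
	return nil
}

// EditSpec simulates a user changing the spec.
func (s *Store) EditSpec() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.obj.Generation++
	s.obj.ResourceVersion++
}

// UpdateStatusWithRetry re-reads the object and recomputes status on conflict.
func UpdateStatusWithRetry(s *Store, attempts int, compute func(*Monitor)) error {
	for i := 0; i < attempts; i++ {
		current := s.Get()
		compute(&current)
		err := s.UpdateStatus(current)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return ErrConflict
}

func main() {
	store := NewStore(Monitor{ResourceVersion: 1, Generation: 1})
	calls := 0
	err := UpdateStatusWithRetry(store, 3, func(m *Monitor) {
		calls++
		if calls == 1 {
			store.EditSpec() // the spec changes mid-reconcile
		}
		m.ObservedGeneration = m.Generation
	})
	fmt.Println(err, calls, store.Get().ObservedGeneration)
}
```

In the example run, the first attempt is rejected because the spec changed underneath it, and the second attempt records the newer generation as observed.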
Watch Triggers
The controller reconciles on:
- CronJobMonitor changes: Spec updates (generation changes)
- CronJob changes: Creation, updates, deletions of matched CronJobs
- Periodic requeues: Every 30 seconds for continuous monitoring
Performance Considerations
Database Queries
The controller executes one query per CronJob per reconciliation to fetch:
- Last execution (for failure detection)
- Last successful execution (for dead-man’s switch)
- Metrics over the configured window (for SLA tracking)
Queries are indexed on (cronjob_namespace, cronjob_name, start_time) for efficiency.
Schedule Parsing Cache
Cron schedule parsing is expensive. The analyzer maintains an LRU cache (max 1000 entries) of parsed schedules to avoid repeated parsing.
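A minimal sketch of such a cache, built on the standard library's container/list, is shown below. The ScheduleCache type and its Parses counter are hypothetical, the parse function is a placeholder that merely splits the expression into fields rather than a real cron parser, and the sketch is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
	"strings"
)

// entry is what each list element stores.
type entry struct {
	key   string
	value []string
}

// ScheduleCache is a hypothetical LRU cache keyed by cron expression.
type ScheduleCache struct {
	max    int
	order  *list.List // front = most recently used
	items  map[string]*list.Element
	parse  func(string) []string
	Parses int // number of cache misses, exposed for illustration
}

func NewScheduleCache(max int, parse func(string) []string) *ScheduleCache {
	return &ScheduleCache{
		max:   max,
		order: list.New(),
		items: make(map[string]*list.Element),
		parse: parse,
	}
}

// Get returns the parsed schedule, parsing and caching it on a miss and
// evicting the least recently used entry when the cache is full.
func (c *ScheduleCache) Get(expr string) []string {
	if el, ok := c.items[expr]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).value
	}
	parsed := c.parse(expr)
	c.Parses++
	c.items[expr] = c.order.PushFront(&entry{key: expr, value: parsed})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	return parsed
}

func main() {
	// strings.Fields stands in for a real cron parser here.
	cache := NewScheduleCache(1000, strings.Fields)
	cache.Get("*/5 * * * *")
	cache.Get("*/5 * * * *") // served from the cache
	cache.Get("0 2 * * *")
	fmt.Println("parses:", cache.Parses)
}
```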
Alert Deduplication
The dispatcher maintains in-memory maps of sent alerts with periodic cleanup:
- sentAlerts: Tracks last sent time for duplicate suppression
- activeAlerts: Stores active alert details for error signature comparison
- Cleanup runs hourly to remove alerts older than 24 hours
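The duplicate-suppression part of this bookkeeping could be sketched as follows. Only the sentAlerts name and the 1 hour and 24 hour figures come from the text above; the Dispatcher type, its method names, and the alert key format are assumptions, and activeAlerts is omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Dispatcher is a hypothetical, stripped-down alert dispatcher.
type Dispatcher struct {
	mu         sync.Mutex
	sentAlerts map[string]time.Time // alert key -> last sent time
	suppress   time.Duration        // duplicate suppression window
}

func NewDispatcher(suppress time.Duration) *Dispatcher {
	return &Dispatcher{sentAlerts: make(map[string]time.Time), suppress: suppress}
}

// LastSent returns when the alert with this key was last sent, if ever.
func (d *Dispatcher) LastSent(key string) (time.Time, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	t, ok := d.sentAlerts[key]
	return t, ok
}

// ShouldSend reports whether the alert may go out at time now and, if so,
// records now as its last sent time.
func (d *Dispatcher) ShouldSend(key string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if last, ok := d.sentAlerts[key]; ok && now.Sub(last) < d.suppress {
		return false
	}
	d.sentAlerts[key] = now
	return true
}

// Cleanup drops entries older than maxAge and returns how many were removed.
func (d *Dispatcher) Cleanup(now time.Time, maxAge time.Duration) int {
	d.mu.Lock()
	defer d.mu.Unlock()
	removed := 0
	for key, sentAt := range d.sentAlerts {
		if now.Sub(sentAt) > maxAge {
			delete(d.sentAlerts, key)
			removed++
		}
	}
	return removed
}

func main() {
	d := NewDispatcher(time.Hour)
	start := time.Now()
	key := "prod/backup/JobFailed" // assumed key format
	fmt.Println(d.ShouldSend(key, start))
	fmt.Println(d.ShouldSend(key, start.Add(10*time.Minute)))
	fmt.Println(d.ShouldSend(key, start.Add(2*time.Hour)))
	fmt.Println(d.Cleanup(start.Add(27*time.Hour), 24*time.Hour))
}
```

Passing the current time in as an argument, rather than calling time.Now inside the methods, keeps the suppression window easy to exercise in tests.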
Next Steps
CronJobMonitor CRD
Deep dive into all CronJobMonitor fields and configuration options
AlertChannel CRD
Learn how to configure Slack, PagerDuty, email, and webhook alerts
Dead-Man's Switch
Understand how missed schedule detection works
SLA Tracking
Explore success rate tracking and regression detection