Overview
The CronJobMonitor is the primary custom resource for defining monitoring policies. It specifies:
- Which CronJobs to monitor (via selectors)
- What conditions trigger alerts (dead-man’s switch, SLA thresholds)
- Where to send alerts (via channel references)
- How to handle data retention and lifecycle
Basic Example
Spec Fields
selector
Defines which CronJobs to monitor. An empty selector matches all CronJobs in the monitor’s namespace.
matchLabels
Selects CronJobs with exact label matches. All specified labels must match (logical AND).
matchExpressions
Selects CronJobs using label selector expressions. Supported operators: In, NotIn, Exists, DoesNotExist.
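The selector rules can be illustrated with a small Python sketch. It assumes the selector follows standard Kubernetes label-selector semantics (matchLabels and matchExpressions are both ANDed together, and NotIn also matches when the key is absent); the function names are hypothetical and are not part of the controller.

```python
def expression_matches(labels, expr):
    """Evaluate one matchExpressions entry against a dict of labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # Assumption: as in Kubernetes, a missing key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def selector_matches(labels, selector):
    """True when every matchLabels pair and every expression holds (AND)."""
    exact = selector.get("matchLabels", {})
    expressions = selector.get("matchExpressions", [])
    return all(labels.get(k) == v for k, v in exact.items()) and all(
        expression_matches(labels, e) for e in expressions
    )


labels = {"tier": "critical", "team": "data"}
selector = {
    "matchLabels": {"tier": "critical"},
    "matchExpressions": [
        {"key": "team", "operator": "In", "values": ["data", "ml"]},
        {"key": "deprecated", "operator": "DoesNotExist"},
    ],
}
print(selector_matches(labels, selector))  # True
print(selector_matches(labels, {}))        # True: an empty selector matches everything
```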
matchNames
Explicitly lists CronJob names to monitor. Only valid when watching a single namespace. Combines with label selectors using AND logic.
namespaces
Specifies namespaces to watch. The monitor watches CronJobs matching the label selectors in any of these namespaces.
namespaceSelector
Selects namespaces dynamically by labels. For example, a selector on monitoring=enabled makes the monitor watch every namespace that carries that label.
allNamespaces
Enables cluster-wide monitoring. Watches all CronJobs matching label selectors across all namespaces. Takes precedence over namespaces and namespaceSelector.
deadManSwitch
Configures dead-man’s switch alerting when jobs don’t run on time.
enabled
Enables or disables dead-man’s switch monitoring. Default: true
maxTimeSinceLastSuccess
Fixed time window for detecting missed runs. Use this for jobs with predictable schedules. For a daily job at midnight, set it to 25h (24h + 1h buffer).
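The fixed-window check amounts to comparing the time since the last successful run against the configured window. The Python sketch below illustrates the idea only; `parse_duration` and `is_overdue` are hypothetical helpers, and the duration syntax is assumed to be a simple Go-style string such as 25h or 1h30m.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Parse a simple duration such as '25h' or '1h30m' (assumed syntax)."""
    parts = re.findall(r"(\d+)([hms])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def is_overdue(last_success, window, now):
    """True when more than `window` has passed since the last success."""
    return now - last_success > window


window = parse_duration("25h")  # 24h schedule interval + 1h buffer
last_success = datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc)

# 24h25m after the last success: still inside the window.
print(is_overdue(last_success, window, datetime(2024, 1, 2, 0, 30, tzinfo=timezone.utc)))  # False
# 25h25m after the last success: the run was missed.
print(is_overdue(last_success, window, datetime(2024, 1, 2, 1, 30, tzinfo=timezone.utc)))  # True
```

The one-hour buffer keeps a run that starts or finishes a little late from raising a false alarm.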
autoFromSchedule
Auto-calculates the expected interval from the CronJob’s schedule. The controller parses the cron expression (e.g., 0 0 * * *) and calculates the interval between runs. The buffer accounts for execution time and scheduling delays.
sla
Configures SLA tracking for success rates and execution durations. Its fields cover:
- Enabling SLA tracking and alerting.
- Minimum acceptable success rate, as a percentage (0-100).
- Rolling window size, in days, for calculating the success rate.
- Maximum duration: alert if any job execution exceeds it.
- Window size for calculating the baseline duration used in regression detection.
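As an illustration of the success-rate and maximum-duration checks, the Python sketch below evaluates a list of past executions against a rolling window. The record layout, the `check_sla` function, and its parameter names are all hypothetical; the controller's actual calculation, and its baseline-based regression detection, are not shown here.

```python
from datetime import datetime, timedelta, timezone


def check_sla(executions, now, min_success_rate, window_days, max_duration):
    """Return human-readable SLA violations for executions in the window."""
    cutoff = now - timedelta(days=window_days)
    recent = [e for e in executions if e["finished_at"] >= cutoff]
    violations = []
    if recent:
        rate = 100.0 * sum(e["succeeded"] for e in recent) / len(recent)
        if rate < min_success_rate:
            violations.append(
                f"success rate {rate:.1f}% is below the {min_success_rate}% minimum"
            )
    for e in recent:
        if e["duration"] > max_duration:
            violations.append(
                f"execution finished at {e['finished_at']:%Y-%m-%d} exceeded the duration limit"
            )
    return violations


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
# One execution per day for ten days: the run 3 days ago failed,
# and the run 1 day ago took 45 minutes.
executions = [
    {
        "finished_at": now - timedelta(days=i),
        "succeeded": i != 3,
        "duration": timedelta(minutes=45 if i == 1 else 10),
    }
    for i in range(10)
]
for violation in check_sla(executions, now, 95, 7, timedelta(minutes=30)):
    print(violation)
```

With a 7-day window, eight executions are counted and seven succeeded, so the rate is 87.5% and falls below a 95% minimum.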
suspendedHandling
Configures behavior when CronJobs are suspended. Suspended CronJobs don’t create new Jobs. Setting pauseMonitoring: true prevents false alarms from the dead-man’s switch during planned suspensions.
maintenanceWindows
Defines scheduled maintenance periods to suppress alerts. Each window specifies:
- Cron expression for when the window starts.
- How long the maintenance window lasts.
- Timezone for the schedule (e.g., America/New_York, Europe/London).
- Whether to suppress alert dispatch during this window.
alerting
Configures alert channels and behavior.
channelRefs
References to AlertChannel CRs (cluster-scoped). Only critical and warning severities are supported. Info-level notifications are not part of the alerting model.
suppressDuplicatesFor
Time window to suppress duplicate alerts. Duplicate suppression is bypassed if the error signature changes (e.g., OOM → connection timeout).
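One way to picture this behavior is a small in-memory tracker that remembers, per alert, when it was last sent and with which error signature. This is a Python sketch under assumptions: the `DuplicateSuppressor` class, the alert key, and the plain-string signature are hypothetical and do not reflect how the controller actually computes or stores signatures.

```python
from datetime import datetime, timedelta, timezone


class DuplicateSuppressor:
    """Suppress repeats of the same alert inside a time window."""

    def __init__(self, window):
        self.window = window
        self._last = {}  # alert key -> (sent_at, signature)

    def should_send(self, key, signature, now):
        previous = self._last.get(key)
        if previous is not None:
            sent_at, last_signature = previous
            # Same signature inside the window: treat as a duplicate.
            if signature == last_signature and now - sent_at < self.window:
                return False
        # First alert, expired window, or changed signature: send and record.
        self._last[key] = (now, signature)
        return True


suppressor = DuplicateSuppressor(timedelta(hours=1))
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
key = "prod/nightly-backup"

print(suppressor.should_send(key, "OOMKilled", t0))                                     # True
print(suppressor.should_send(key, "OOMKilled", t0 + timedelta(minutes=10)))            # False
print(suppressor.should_send(key, "connection timeout", t0 + timedelta(minutes=20)))   # True
```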
alertDelay
Delays alert dispatch to allow transient issues to resolve. If the job succeeds before the delay expires, the pending alert is cancelled. Useful for flaky jobs.
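The delay-and-cancel behavior can be sketched as a queue of pending alerts that a later success removes. The `PendingAlerts` class below is a hypothetical Python illustration, not the controller's implementation; in particular, it assumes the delay is measured from the first observed failure.

```python
from datetime import datetime, timedelta, timezone


class PendingAlerts:
    """Hold alerts for `delay`; a success before then cancels the alert."""

    def __init__(self, delay):
        self.delay = delay
        self._pending = {}  # job key -> time the failure was first observed

    def on_failure(self, job, now):
        # Keep the earliest failure time if the job fails repeatedly.
        self._pending.setdefault(job, now)

    def on_success(self, job):
        # The job recovered: drop any alert that has not been sent yet.
        self._pending.pop(job, None)

    def due(self, now):
        """Return jobs whose delay has expired and stop tracking them."""
        ready = [job for job, since in self._pending.items() if now - since >= self.delay]
        for job in ready:
            del self._pending[job]
        return ready


alerts = PendingAlerts(timedelta(minutes=5))
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
alerts.on_failure("prod/flaky-sync", t0)
alerts.on_failure("prod/nightly-backup", t0)
alerts.on_success("prod/flaky-sync")          # a retry succeeded within the delay
print(alerts.due(t0 + timedelta(minutes=5)))  # ['prod/nightly-backup']
```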
severityOverrides
Customizes alert severity per alert type. Valid severities: critical, warning.
includeContext
Specifies what context to include in alerts:
- Include pod logs in alerts.
- Number of log lines to include (1-10000).
- Specific container for logs. Default: first container.
- Include Kubernetes events related to the pod.
- Include pod status details (phase, container states).
- Include AI-generated fix suggestions.
suggestedFixPatterns
Custom patterns for suggesting fixes. Available template variables:
{{.Namespace}}, {{.Name}}, {{.ExitCode}}, {{.Reason}}, {{.JobName}}
Match criteria (at least one required):
- exitCode: Exact exit code
- exitCodeRange: Range [min, max], inclusive
- reason: Container termination reason (case-insensitive)
- reasonPattern: Regex pattern for reason
- logPattern: Regex pattern for log content
- eventPattern: Regex pattern for event messages
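To make the match criteria and template variables concrete, here is a Python sketch of how a pattern might be evaluated and its message rendered. Several things are assumptions: that all criteria present in one pattern must match together, that exitCodeRange is a two-element list, and the failure record layout. The `pattern_matches` and `render` functions are hypothetical, and the real controller presumably uses Go templates rather than this regex substitution.

```python
import re


def pattern_matches(pattern, failure):
    """Check every criterion present in the pattern (assumed to be ANDed)."""
    checks = []
    if "exitCode" in pattern:
        checks.append(failure["exit_code"] == pattern["exitCode"])
    if "exitCodeRange" in pattern:
        low, high = pattern["exitCodeRange"]  # inclusive on both ends
        checks.append(low <= failure["exit_code"] <= high)
    if "reason" in pattern:
        checks.append(failure["reason"].lower() == pattern["reason"].lower())
    if "reasonPattern" in pattern:
        checks.append(re.search(pattern["reasonPattern"], failure["reason"]) is not None)
    if "logPattern" in pattern:
        checks.append(re.search(pattern["logPattern"], failure["logs"]) is not None)
    if "eventPattern" in pattern:
        checks.append(any(re.search(pattern["eventPattern"], e) for e in failure["events"]))
    if not checks:
        raise ValueError("at least one match criterion is required")
    return all(checks)


def render(template, values):
    """Fill placeholders such as {{.Name}} from a dict of values."""
    return re.sub(r"\{\{\.(\w+)\}\}", lambda m: str(values[m.group(1)]), template)


failure = {
    "exit_code": 137,
    "reason": "OOMKilled",
    "logs": "fatal: out of memory while loading snapshot",
    "events": ["Container backup was OOMKilled"],
}
pattern = {"exitCode": 137, "reason": "oomkilled"}  # reason compares case-insensitively
values = {"Namespace": "prod", "Name": "nightly-backup", "ExitCode": 137}

if pattern_matches(pattern, failure):
    print(render("Raise the memory limit for {{.Namespace}}/{{.Name}} (exit code {{.ExitCode}})", values))
    # prints: Raise the memory limit for prod/nightly-backup (exit code 137)
```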
dataRetention
Configures data lifecycle management. Its fields cover:
- Days to retain execution history. Overrides the global setting.
- Behavior when a monitored CronJob is deleted: retain (keep execution history), purge (delete immediately), or purge-after-days (delete after purgeAfterDays).
- Behavior when a CronJob is recreated (detected via UID change): retain (keep old history) or reset (delete history from the old UID).
- Store job logs in the database. Overrides the global --storage.log-storage-enabled.
- Days to retain logs. Defaults to retentionDays.
- Maximum log size per execution, in KB. Overrides the global --storage.max-log-size-kb.
Status Fields
The controller updates the status with the observed state:
- Monitor operational state: Initializing, Active, Degraded, or Error.
- Aggregate counts across all monitored CronJobs.
- Per-CronJob status, including metrics and active alerts.
Full Example
See the full-featured example for all configuration options.
Next Steps
- Dead-Man's Switch: Learn how missed schedule detection works
- SLA Tracking: Understand success rate and regression detection
- Alert Channels: Configure Slack, PagerDuty, email, and webhooks
- Examples: Browse real-world monitor configurations