
Overview

The CronJobMonitor is the primary custom resource for defining monitoring policies. It specifies:
  • Which CronJobs to monitor (via selectors)
  • What conditions trigger alerts (dead-man’s switch, SLA thresholds)
  • Where to send alerts (via channel references)
  • How to handle data retention and lifecycle

Basic Example

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99
    windowDays: 7
  alerting:
    channelRefs:
      - name: slack-ops

Spec Fields

selector

Defines which CronJobs to monitor. An empty selector matches all CronJobs in the monitor’s namespace.
// From api/v1alpha1/cronjobmonitor_types.go:38-67
type CronJobSelector struct {
    MatchLabels          map[string]string
    MatchExpressions     []metav1.LabelSelectorRequirement
    MatchNames           []string
    Namespaces           []string
    NamespaceSelector    *metav1.LabelSelector
    AllNamespaces        bool
}
matchLabels

Selects CronJobs with exact label matches:
selector:
  matchLabels:
    tier: critical
    environment: production
All specified labels must match (logical AND).
matchExpressions

Selects CronJobs using label selector expressions:
selector:
  matchExpressions:
    - key: tier
      operator: In
      values: [critical, high]
    - key: team
      operator: Exists
Supported operators: In, NotIn, Exists, DoesNotExist
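The four operators follow standard Kubernetes set-based label-selector semantics. An illustrative sketch (not the operator's actual code):

```python
# Illustrative sketch: how the four label-selector operators evaluate
# against a CronJob's labels (standard Kubernetes semantics).
def matches_expression(labels, key, operator, values=()):
    if operator == "In":            # key's value must be in the set
        return labels.get(key) in values
    if operator == "NotIn":         # value not in set; a missing key also matches
        return labels.get(key) not in values
    if operator == "Exists":        # key is present, any value
        return key in labels
    if operator == "DoesNotExist":  # key is absent
        return key not in labels
    raise ValueError(f"unknown operator: {operator}")

labels = {"tier": "critical", "team": "payments"}
assert matches_expression(labels, "tier", "In", ["critical", "high"])
assert matches_expression(labels, "team", "Exists")
assert not matches_expression(labels, "tier", "NotIn", ["critical"])
assert matches_expression(labels, "region", "DoesNotExist")
```

All expressions in `matchExpressions` must match (logical AND), mirroring `matchLabels`.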
matchNames

Explicitly lists CronJob names to monitor:
selector:
  matchNames:
    - daily-backup
    - weekly-report
Only valid when watching a single namespace. Combines with label selectors using AND logic.
namespaces

Specifies namespaces to watch:
selector:
  namespaces:
    - production
    - staging
The monitor will watch CronJobs matching label selectors in any of these namespaces.
namespaceSelector

Selects namespaces dynamically by labels:
selector:
  namespaceSelector:
    matchLabels:
      monitoring: enabled
The monitor will watch all namespaces with monitoring=enabled label.
allNamespaces

Enables cluster-wide monitoring:
selector:
  allNamespaces: true
  matchLabels:
    critical: "true"
Watches all CronJobs matching label selectors across all namespaces. Takes precedence over namespaces and namespaceSelector.
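The namespace-scope precedence described above can be sketched as follows (an illustration of the documented precedence, not the controller's code):

```python
# Illustrative sketch of namespace-scope resolution for a monitor.
# Precedence: allNamespaces > namespaces > namespaceSelector > the
# monitor's own namespace.
def resolve_scope(monitor_namespace, all_namespaces=False,
                  namespaces=None, namespace_selector=None):
    if all_namespaces:
        return "all"                 # cluster-wide watch
    if namespaces:
        return list(namespaces)      # explicit namespace list
    if namespace_selector:
        return "selected-by-labels"  # namespaces resolved dynamically
    return [monitor_namespace]       # default: the monitor's namespace only

assert resolve_scope("production", all_namespaces=True) == "all"
assert resolve_scope("production",
                     namespaces=["production", "staging"]) == ["production", "staging"]
assert resolve_scope("production") == ["production"]
```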

deadManSwitch

Configures dead-man’s switch alerting when jobs don’t run on time.
// From api/v1alpha1/cronjobmonitor_types.go:69-98
type DeadManSwitchConfig struct {
    Enabled                 *bool
    MaxTimeSinceLastSuccess *metav1.Duration
    AutoFromSchedule        *AutoScheduleConfig
}

type AutoScheduleConfig struct {
    Enabled                 bool
    Buffer                  *metav1.Duration
    MissedScheduleThreshold *int32
}
enabled (bool, default: true)

Enables or disables dead-man's switch monitoring.
deadManSwitch:
  enabled: false  # Disable for this monitor
maxTimeSinceLastSuccess (duration)

Fixed time window for detecting missed runs:
deadManSwitch:
  maxTimeSinceLastSuccess: 25h  # Alert if no success in 25 hours
Use this for jobs with predictable schedules. For a daily job at midnight, set to 25h (24h + 1h buffer).
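The fixed-window check reduces to a single comparison. A minimal sketch (illustrative, not the controller's code):

```python
from datetime import datetime, timedelta

# Illustrative sketch of the fixed-window dead-man's switch check:
# alert when the time since the last successful run exceeds
# maxTimeSinceLastSuccess.
def dead_man_triggered(last_success, now, max_gap):
    return (now - last_success) > max_gap

now = datetime(2026, 3, 4, 12, 0)
# 12h since last success, 25h window -> healthy
assert not dead_man_triggered(datetime(2026, 3, 4, 0, 0), now, timedelta(hours=25))
# 36h since last success, 25h window -> alert
assert dead_man_triggered(datetime(2026, 3, 3, 0, 0), now, timedelta(hours=25))
```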
autoFromSchedule

Auto-calculates the expected interval from the CronJob's schedule:
deadManSwitch:
  autoFromSchedule:
    enabled: true
    buffer: 1h                    # Extra time beyond schedule interval
    missedScheduleThreshold: 2    # Alert after missing 2 runs
The controller parses the cron expression (e.g., 0 0 * * *) and calculates the interval between runs. The buffer accounts for execution time and scheduling delays.
Auto-detection requires a parseable cron schedule. Complex expressions or timezone-specific schedules may produce incorrect intervals.
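One plausible reading of the threshold semantics, sketched below: alert once the gap since the last success exceeds the schedule interval multiplied by the missed-run threshold, plus the buffer. This is an assumption about the formula, not the controller's code:

```python
from datetime import timedelta

# Assumed semantics (illustrative): with autoFromSchedule, alert once
# the gap since the last success exceeds
# missedScheduleThreshold * interval + buffer.
def auto_dead_man_triggered(gap, interval, buffer, missed_threshold):
    return gap > missed_threshold * interval + buffer

interval = timedelta(hours=24)  # derived from "0 0 * * *"
buffer = timedelta(hours=1)
# 30h gap: one run missed, threshold is 2 -> no alert yet
assert not auto_dead_man_triggered(timedelta(hours=30), interval, buffer, 2)
# 50h gap: two runs missed plus buffer exceeded -> alert
assert auto_dead_man_triggered(timedelta(hours=50), interval, buffer, 2)
```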

sla

Configures SLA tracking for success rates and execution durations.
// From api/v1alpha1/cronjobmonitor_types.go:100-131
type SLAConfig struct {
    Enabled                      *bool
    MinSuccessRate               *float64
    WindowDays                   *int32
    MaxDuration                  *metav1.Duration
    DurationRegressionThreshold  *int32
    DurationBaselineWindowDays   *int32
}
enabled (bool, default: true)
Enables SLA tracking and alerting.

minSuccessRate (float64, default: 95.0)
Minimum acceptable success rate percentage (0-100).
sla:
  minSuccessRate: 99  # 99% uptime requirement

windowDays (int32, default: 7)
Rolling window size in days for calculating the success rate.
sla:
  windowDays: 30  # 30-day rolling window

maxDuration (duration)
Alert if any job execution exceeds this duration.
sla:
  maxDuration: 30m  # Alert if job runs longer than 30 minutes

durationRegressionThreshold (int32, default: 50)
Alert if the P95 duration increases by this percentage over the baseline.
sla:
  durationRegressionThreshold: 50  # Alert on 50% slowdown
See SLA Tracking for details.

durationBaselineWindowDays (int32, default: 14)
Window size in days for calculating the baseline duration used in regression detection.
sla:
  durationBaselineWindowDays: 14  # 2-week baseline
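The two SLA checks can be sketched as follows (illustrative only, not the controller's code):

```python
# Illustrative sketch of the two SLA checks: success rate over the
# rolling window, and P95 duration regression against a baseline.
def success_rate(runs):  # runs: list of bools in the rolling window
    return 100.0 * sum(runs) / len(runs)

def duration_regressed(current_p95_s, baseline_p95_s, threshold_pct):
    increase = 100.0 * (current_p95_s - baseline_p95_s) / baseline_p95_s
    return increase > threshold_pct

runs = [True] * 29 + [False]                 # 29/30 successes in the window
assert round(success_rate(runs), 1) == 96.7  # breaches minSuccessRate: 99
assert duration_regressed(1400, 900, 50)     # ~55.6% slower than baseline
assert not duration_regressed(1200, 900, 50) # ~33% slower: under threshold
```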

suspendedHandling

Configures behavior when CronJobs are suspended.
suspendedHandling:
  pauseMonitoring: true           # Stop monitoring when suspended
  alertIfSuspendedFor: 168h       # Alert if suspended > 7 days
Suspended CronJobs don’t create new Jobs. Setting pauseMonitoring: true prevents false alarms from dead-man’s switch during planned suspensions.

maintenanceWindows

Defines scheduled maintenance periods to suppress alerts.
maintenanceWindows:
  - name: weekly-maintenance
    schedule: "0 2 * * 0"         # Every Sunday at 2 AM
    duration: 4h
    timezone: America/New_York
    suppressAlerts: true
schedule (string, required)
Cron expression for when the window starts.

duration (duration, required)
How long the maintenance window lasts.

timezone (string, default: UTC)
Timezone for the schedule (e.g., America/New_York, Europe/London).

suppressAlerts (bool, default: true)
Whether to suppress alert dispatch during this window.

alerting

Configures alert channels and behavior.
// From api/v1alpha1/cronjobmonitor_types.go:164-197
type AlertingConfig struct {
    Enabled                 *bool
    ChannelRefs             []ChannelRef
    IncludeContext          *AlertContext
    SuppressDuplicatesFor   *metav1.Duration
    AlertDelay              *metav1.Duration
    SeverityOverrides       *SeverityOverrides
    SuggestedFixPatterns    []SuggestedFixPattern
}
channelRefs

References to AlertChannel CRs (cluster-scoped):
alerting:
  channelRefs:
    - name: pagerduty-oncall
      severities: [critical]       # Only critical alerts
    - name: slack-ops
      severities: [critical, warning]  # All actionable alerts
Only critical and warning severities are supported. Info-level notifications are not part of the alerting model.
suppressDuplicatesFor (duration)

Time window to suppress duplicate alerts:
alerting:
  suppressDuplicatesFor: 1h  # Don't re-alert within 1 hour
Duplicate suppression is bypassed if the error signature changes (e.g., OOM → connection timeout).
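The suppression-plus-bypass rule can be sketched as (illustrative, not the controller's code; the signature values are hypothetical):

```python
from datetime import datetime, timedelta

# Illustrative sketch of duplicate suppression with the signature
# bypass: re-alert inside the window only if the error signature
# changed (e.g. "oom" -> "timeout"; signature names are hypothetical).
def should_dispatch(last_sent_at, last_signature, now, signature, window):
    if last_sent_at is None:                 # nothing sent yet
        return True
    if signature != last_signature:          # new failure mode: bypass
        return True
    return (now - last_sent_at) > window     # same signature: honor window

now = datetime(2026, 3, 4, 12, 0)
sent = now - timedelta(minutes=30)
assert not should_dispatch(sent, "oom", now, "oom", timedelta(hours=1))
assert should_dispatch(sent, "oom", now, "timeout", timedelta(hours=1))
assert should_dispatch(None, None, now, "oom", timedelta(hours=1))
```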
alertDelay (duration)

Delays alert dispatch to allow transient issues to resolve:
alerting:
  alertDelay: 5m  # Wait 5 minutes before sending
If the job succeeds before the delay expires, the pending alert is cancelled. Useful for flaky jobs.
severityOverrides

Customizes alert severity per alert type:
alerting:
  severityOverrides:
    jobFailed: critical
    slaBreached: warning
    missedSchedule: warning
    deadManTriggered: critical
    durationRegression: warning
Valid severities: critical, warning
includeContext

Specifies what context to include in alerts:
alerting:
  includeContext:
    logs: true
    logLines: 100
    logContainerName: main
    includeInitContainerLogs: false
    events: true
    podStatus: true
    suggestedFixes: true
logs (bool, default: true)
Include pod logs in alerts.

logLines (int32, default: 50)
Number of log lines to include (1-10000).

logContainerName (string)
Specific container to pull logs from. Default: the first container.

events (bool, default: true)
Include Kubernetes events related to the pod.

podStatus (bool, default: true)
Include pod status details (phase, container states).

suggestedFixes (bool, default: true)
Include AI-generated fix suggestions.
suggestedFixPatterns

Custom patterns for suggesting fixes:
alerting:
  suggestedFixPatterns:
    - name: custom-oom
      match:
        exitCode: 137
      suggestion: "Increase memory limits for {{.Namespace}}/{{.Name}}"
      priority: 150  # Higher than built-ins (1-100)
    - name: connection-timeout
      match:
        logPattern: "connection timed out|ETIMEDOUT"
      suggestion: "Check network connectivity to external services"
      priority: 50
Available template variables: {{.Namespace}}, {{.Name}}, {{.ExitCode}}, {{.Reason}}, {{.JobName}}

Match criteria (at least one required):
  • exitCode: Exact exit code
  • exitCodeRange: Range [min, max] inclusive
  • reason: Container termination reason (case-insensitive)
  • reasonPattern: Regex pattern for reason
  • logPattern: Regex pattern for log content
  • eventPattern: Regex pattern for event messages
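Matching and priority ordering can be sketched as follows (illustrative only; the string substitution below is a simplified stand-in for Go text/template rendering):

```python
import re

# Illustrative sketch of suggestedFixPatterns evaluation: a pattern
# matches when all of its criteria match, and the highest-priority
# matching pattern supplies the suggestion. Not the controller's code.
def pattern_matches(pattern, exit_code, logs):
    match = pattern.get("match", {})
    if "exitCode" in match and match["exitCode"] != exit_code:
        return False
    if "logPattern" in match and not re.search(match["logPattern"], logs):
        return False
    return True

def best_suggestion(patterns, exit_code, logs, ctx):
    hits = [p for p in patterns if pattern_matches(p, exit_code, logs)]
    hits.sort(key=lambda p: p["priority"], reverse=True)  # highest wins
    if not hits:
        return None
    text = hits[0]["suggestion"]
    for key, val in ctx.items():  # simplified {{.Name}}-style rendering
        text = text.replace("{{." + key + "}}", val)
    return text

patterns = [
    {"name": "custom-oom", "match": {"exitCode": 137},
     "suggestion": "Increase memory limits for {{.Namespace}}/{{.Name}}",
     "priority": 150},
    {"name": "connection-timeout", "match": {"logPattern": "ETIMEDOUT"},
     "suggestion": "Check network connectivity", "priority": 50},
]
ctx = {"Namespace": "production", "Name": "daily-backup"}
assert best_suggestion(patterns, 137, "", ctx) == \
    "Increase memory limits for production/daily-backup"
assert best_suggestion(patterns, 1, "dial tcp: ETIMEDOUT", ctx) == \
    "Check network connectivity"
```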

dataRetention

Configures data lifecycle management.
dataRetention:
  retentionDays: 60
  onCronJobDeletion: purge-after-days
  purgeAfterDays: 7
  onRecreation: retain
  storeLogs: true
  logRetentionDays: 30
  maxLogSizeKB: 200
  storeEvents: true
retentionDays (int32)
Days to retain execution history. Overrides the global setting.

onCronJobDeletion (string, default: retain)
Behavior when a monitored CronJob is deleted:
  • retain: Keep execution history
  • purge: Delete immediately
  • purge-after-days: Delete after purgeAfterDays

onRecreation (string, default: retain)
Behavior when a CronJob is recreated (detected via UID change):
  • retain: Keep old history
  • reset: Delete history from the old UID

storeLogs (bool)
Store job logs in the database. Overrides the global --storage.log-storage-enabled flag.

logRetentionDays (int32)
Days to retain logs. Defaults to retentionDays.

maxLogSizeKB (int32)
Maximum log size per execution in KB. Overrides the global --storage.max-log-size-kb flag.
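The onCronJobDeletion policy reduces to a small decision function. A sketch (illustrative, not the controller's code):

```python
# Illustrative sketch of the onCronJobDeletion policy: decide what
# happens to stored history when a monitored CronJob is deleted.
def history_action(on_deletion, days_since_deletion, purge_after_days=7):
    if on_deletion == "retain":
        return "keep"
    if on_deletion == "purge":
        return "delete"
    if on_deletion == "purge-after-days":
        return "delete" if days_since_deletion >= purge_after_days else "keep"
    raise ValueError(f"unknown policy: {on_deletion}")

assert history_action("retain", 100) == "keep"
assert history_action("purge", 0) == "delete"
assert history_action("purge-after-days", 3, purge_after_days=7) == "keep"
assert history_action("purge-after-days", 8, purge_after_days=7) == "delete"
```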

Status Fields

The controller updates the status with observed state:
status:
  observedGeneration: 1
  phase: Active
  lastReconcileTime: "2026-03-04T10:30:00Z"
  summary:
    totalCronJobs: 5
    healthy: 3
    warning: 1
    critical: 1
    suspended: 0
    running: 2
    activeAlerts: 2
  cronJobs:
    - name: daily-backup
      namespace: production
      status: healthy
      suspended: false
      lastSuccessfulTime: "2026-03-04T00:00:00Z"
      lastRunDuration: 15m30s
      nextScheduledTime: "2026-03-05T00:00:00Z"
      metrics:
        successRate: 98.5
        totalRuns: 30
        successfulRuns: 29
        failedRuns: 1
        avgDurationSeconds: 920
        p50DurationSeconds: 900
        p95DurationSeconds: 950
        p99DurationSeconds: 980
      activeAlerts: []
phase (string)
Monitor operational state: Initializing, Active, Degraded, Error.

summary (object)
Aggregate counts across all monitored CronJobs.

cronJobs[] (array)
Per-CronJob status including metrics and active alerts.

Full Example

See the full-featured example for all configuration options.

Next Steps

Dead-Man's Switch

Learn how missed schedule detection works

SLA Tracking

Understand success rate and regression detection

Alert Channels

Configure Slack, PagerDuty, email, and webhooks

Examples

Browse real-world monitor configurations
