Overview

CronJobMonitor is a namespaced custom resource that defines monitoring configuration for Kubernetes CronJobs. It enables dead-man’s switch monitoring, SLA tracking, alerting, and data retention policies.

API Group: guardian.illenium.net/v1alpha1
Kind: CronJobMonitor
Scope: Namespaced

Basic Example

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
  alerting:
    enabled: true
    channelRefs:
      - name: slack-ops
        severities: [critical, warning]

Spec Fields

selector

selector
object
Specifies which CronJobs to monitor. An empty selector matches all CronJobs in the monitor’s namespace.
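As a sketch, the fragments below show the selector forms that appear in this page's examples, plus the empty selector described above. The label key and values come from those examples; that the field follows the standard Kubernetes label selector shape is inferred from them rather than stated here.

```yaml
# Match CronJobs labelled tier=critical (as in the Basic Example).
selector:
  matchLabels:
    tier: critical
---
# Match CronJobs whose tier label is critical or high (as in the Complete Example).
selector:
  matchExpressions:
    - key: tier
      operator: In
      values: [critical, high]
---
# Empty selector: matches all CronJobs in the monitor's namespace.
selector: {}
```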

deadManSwitch

deadManSwitch
object
Configures dead-man’s switch behavior to alert if CronJobs don’t execute successfully within expected intervals.
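A minimal fragment, using the values from this page's examples. Judging by the field name, maxTimeSinceLastSuccess is the longest acceptable gap since the last successful run; the remark about daily jobs is an assumption about why 25h was chosen, not something this page states.

```yaml
deadManSwitch:
  enabled: true
  # Assumed reading: alert if no run has succeeded within the last 25 hours.
  # For a job scheduled once a day, this would leave about an hour of slack.
  maxTimeSinceLastSuccess: 25h
```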

sla

sla
object
Configures SLA tracking for success rates and execution durations.
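The fragment below collects the sla fields used in this page's examples. The comments give the most likely reading of each field based on its name; units and exact semantics are assumptions, since this page does not define them.

```yaml
sla:
  enabled: true
  minSuccessRate: 95               # presumably a percentage of successful runs
  windowDays: 7                    # presumably the window over which the rate is computed
  maxDuration: 30m                 # presumably the longest acceptable execution time
  durationRegressionThreshold: 50  # presumably a percentage increase over the baseline
  durationBaselineWindowDays: 14   # presumably the window used to compute the baseline
```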

suspendedHandling

suspendedHandling
object
Configures behavior for suspended CronJobs.
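A short fragment, taken from the Complete Example on this page. The comments describe what the field names suggest and should be read as assumptions.

```yaml
suspendedHandling:
  # Assumed reading: stop evaluating a CronJob while it is suspended.
  pauseMonitoring: true
  # Assumed reading: alert once a CronJob has stayed suspended this long (168h = 7 days).
  alertIfSuspendedFor: 168h
```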

maintenanceWindows

maintenanceWindows
array
Defines scheduled maintenance periods during which alerts can be suppressed. The example below opens a four-hour window at 02:00 every Sunday, America/New_York time.
maintenanceWindows:
  - name: weekly-maintenance
    schedule: "0 2 * * 0"
    duration: 4h
    timezone: America/New_York
    suppressAlerts: true

alerting

alerting
object
Configures alert channels and behavior.
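A trimmed fragment, based on the Complete Example on this page. The channel names pagerduty-oncall and slack-ops come from that example and presumably refer to alert channels defined elsewhere; the comments on timing fields are assumptions drawn from the field names.

```yaml
alerting:
  enabled: true
  channelRefs:
    - name: pagerduty-oncall       # receives critical alerts only
      severities: [critical]
    - name: slack-ops              # receives critical and warning alerts
      severities: [critical, warning]
  severityOverrides:
    jobFailed: critical
    slaBreached: warning
  suppressDuplicatesFor: 1h        # presumably the duplicate-suppression interval
  alertDelay: 5m                   # presumably a wait before an alert is sent
```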

dataRetention

dataRetention
object
Configures data lifecycle management for this monitor’s execution history.
dataRetention:
  retentionDays: 60
  onCronJobDeletion: purge-after-days
  purgeAfterDays: 7
  onRecreation: retain
  storeLogs: true
  logRetentionDays: 30
  maxLogSizeKB: 200
  storeEvents: true

Status Fields

The status subresource provides real-time monitoring state and metrics.
observedGeneration
integer
The generation last processed by the controller.
phase
string
The monitor’s operational state. Values: Initializing, Active, Degraded, Error
lastReconcileTime
timestamp
When the controller last reconciled this monitor.
summary
object
Aggregate counts across all monitored CronJobs.
cronJobs
array
Per-CronJob status information.
conditions
array
Standard Kubernetes condition array. Common condition types:
  • Ready: Monitor is operational and tracking CronJobs
  • Progressing: Monitor is initializing or updating
  • Degraded: Monitor is experiencing issues but operational
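For orientation, a status block for a healthy monitor might look roughly like the sketch below. All values are illustrative, not real controller output. The condition entries use the standard Kubernetes condition fields (type, status, lastTransitionTime). summary and cronJobs are left out because their subfields are not listed on this page.

```yaml
status:
  observedGeneration: 3                      # illustrative value
  phase: Active                              # one of: Initializing, Active, Degraded, Error
  lastReconcileTime: "2024-01-07T02:00:00Z"  # illustrative timestamp
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2024-01-07T01:55:00Z"
    - type: Degraded
      status: "False"
      lastTransitionTime: "2024-01-07T01:55:00Z"
```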

Complete Example

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: full-featured
  namespace: production
spec:
  selector:
    matchExpressions:
      - key: tier
        operator: In
        values: [critical, high]

  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h

  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
    maxDuration: 30m
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14

  suspendedHandling:
    pauseMonitoring: true
    alertIfSuspendedFor: 168h

  maintenanceWindows:
    - name: weekly-maintenance
      schedule: "0 2 * * 0"
      duration: 4h
      timezone: America/New_York
      suppressAlerts: true

  alerting:
    enabled: true
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
      - name: slack-ops
        severities: [critical, warning]
    severityOverrides:
      jobFailed: critical
      slaBreached: warning
      missedSchedule: warning
      deadManTriggered: critical
      durationRegression: warning
    suppressDuplicatesFor: 1h
    alertDelay: 5m
    includeContext:
      logs: true
      logLines: 100
      events: true
      podStatus: true
      suggestedFixes: true
    suggestedFixPatterns:
      - name: custom-oom
        match:
          exitCode: 137
        suggestion: "OOM killed. Increase memory for {{.Namespace}}/{{.Name}}"
        priority: 150

  dataRetention:
    retentionDays: 60
    onCronJobDeletion: purge-after-days
    purgeAfterDays: 7
    onRecreation: retain
    storeLogs: true
    logRetentionDays: 30
    maxLogSizeKB: 200
    storeEvents: true
