CronJobMonitor API Reference

Overview

The CronJobMonitor is a namespaced Custom Resource that defines monitoring configuration for Kubernetes CronJobs. It enables dead-man’s switch monitoring, SLA tracking, alerting, and data retention policies.

API Group: guardian.illenium.net/v1alpha1
Kind: CronJobMonitor
Scope: Namespaced

Basic Example

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
  alerting:
    enabled: true
    channelRefs:
      - name: slack-ops
        severities: [critical, warning]

Spec Fields

selector

object

Specifies which CronJobs to monitor. An empty selector matches all CronJobs in the monitor’s namespace.

matchLabels

map[string]string

Selects CronJobs by exact label matches.

matchLabels:
  tier: critical
  app: backup

matchExpressions

array

Selects CronJobs by label expressions. Each expression is a LabelSelectorRequirement.

matchExpressions:
  - key: tier
    operator: In
    values: [critical, high]
  - key: environment
    operator: NotIn
    values: [dev, test]

Supported operators: In, NotIn, Exists, DoesNotExist

matchNames

array

Explicitly lists CronJob names to monitor. Only valid when watching a single namespace.

matchNames:
  - daily-backup
  - weekly-report

namespaces

array

Explicitly lists namespaces to watch for CronJobs. If empty and namespaceSelector is not set, watches only the monitor’s namespace.

namespaces:
  - production
  - staging

namespaceSelector

object

Selects namespaces by labels. CronJobs in matching namespaces will be monitored. Uses standard LabelSelector format.

namespaceSelector:
  matchLabels:
    env: production

allNamespaces

boolean

default:"false"

Watches CronJobs in all namespaces (except globally ignored ones). Takes precedence over namespaces and namespaceSelector.
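As a sketch of how namespace scoping combines with label selection, the fragment below targets CronJobs labeled tier: critical across the cluster. It assumes allNamespaces sits under selector alongside the other fields in this section, and that the label filter is applied within each watched namespace; the label value is illustrative.

```yaml
spec:
  selector:
    # Label filter, assumed to apply within every watched namespace.
    matchLabels:
      tier: critical
    # Takes precedence over namespaces and namespaceSelector.
    allNamespaces: true
```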

deadManSwitch

object

Configures dead-man’s switch behavior to alert if CronJobs don’t execute successfully within expected intervals.

enabled

boolean

default:"true"

Turns on dead-man’s switch monitoring.

maxTimeSinceLastSuccess

duration

Alerts if no successful execution occurs within this duration. Use for fixed-interval monitoring.
Example: "25h" for daily jobs with a 1-hour buffer.
Mutually exclusive with autoFromSchedule.

autoFromSchedule

object

Auto-calculates expected interval from the CronJob’s schedule. Alternative to maxTimeSinceLastSuccess.

enabled

boolean

required

Turns on auto-detection from cron schedule.

buffer

duration

default:"1h"

Extra time added to the expected interval to account for delays.

missedScheduleThreshold

integer

default:"1"

Number of missed schedules before alerting.
Validation: Minimum value is 1
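A minimal sketch of schedule-derived monitoring, assuming missedScheduleThreshold is nested under autoFromSchedule as the field order above suggests. The values are illustrative. maxTimeSinceLastSuccess is omitted because the two options are mutually exclusive.

```yaml
spec:
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      # Added on top of the interval derived from the cron schedule.
      buffer: 30m
      # Alert once two schedules have been missed.
      missedScheduleThreshold: 2
```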

sla

object

Configures SLA tracking for success rates and execution durations.

enabled

boolean

default:"true"

Turns on SLA tracking and alerting.

minSuccessRate

number

default:"95"

Minimum acceptable success rate percentage. Alerts when success rate falls below this threshold.
Validation: Range 0-100

windowDays

integer

default:"7"

Rolling window in days for success rate calculation.
Validation: Minimum value is 1

maxDuration

duration

Maximum acceptable execution duration. Alerts if any job execution exceeds this duration.
Example: "30m" for 30 minutes

durationRegressionThreshold

integer

default:"50"

Alerts if P95 duration increases by this percentage compared to the baseline.
Validation: Range 1-1000
Example: 50 means alert if P95 increases by 50% or more

durationBaselineWindowDays

integer

default:"14"

Number of days for calculating the baseline duration.
Validation: Minimum value is 1
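To make the thresholds concrete, the fragment below annotates the default values with worked arithmetic. The hourly schedule and the 600-second baseline are hypothetical figures chosen for illustration, not values from the controller.

```yaml
spec:
  sla:
    enabled: true
    # Hypothetical hourly job: 168 runs in 7 days.
    # 8 failures -> 160/168 = 95.2% (no alert);
    # 9 failures -> 159/168 = 94.6% (below 95, alert).
    minSuccessRate: 95
    windowDays: 7
    # With a hypothetical baseline P95 of 600s, a threshold of 50
    # alerts once the current P95 reaches 900s.
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14
```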

suspendedHandling

object

Configures behavior for suspended CronJobs.

pauseMonitoring

boolean

default:"true"

Pauses monitoring and suppresses alerts when the CronJob is suspended.

alertIfSuspendedFor

duration

Alerts if a CronJob remains suspended longer than this duration. Useful for detecting CronJobs that were suspended and forgotten.
Example: "168h" for 7 days
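A short sketch combining both fields. It assumes the suspended-too-long alert still fires while regular monitoring is paused, which is how the two descriptions read together.

```yaml
spec:
  suspendedHandling:
    # Regular alerts are suppressed while the CronJob is suspended.
    pauseMonitoring: true
    # Reminder once the CronJob has stayed suspended for 7 days.
    alertIfSuspendedFor: 168h
```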

maintenanceWindows

array

Defines scheduled maintenance periods during which alerts can be suppressed.

name

string

required

Identifies this maintenance window.

schedule

string

required

Cron expression defining when the window starts.
Example: "0 2 * * 0" for every Sunday at 2 AM

duration

duration

required

How long the maintenance window lasts.
Example: "4h" for 4 hours

timezone

string

default:"UTC"

Timezone for the schedule.
Example: "America/New_York"

suppressAlerts

boolean

default:"true"

Whether to suppress alerts during this window.

maintenanceWindows:
  - name: weekly-maintenance
    schedule: "0 2 * * 0"
    duration: 4h
    timezone: America/New_York
    suppressAlerts: true

alerting

object

Configures alert channels and behavior.

enabled

boolean

default:"true"

Turns on alerting for this monitor.

channelRefs

array

References to cluster-scoped AlertChannel custom resources.

name

string

required

Name of the AlertChannel CR.

severities

array

Alert severities to send to this channel. Valid values: critical, warning. If empty, sends all severities.

channelRefs:
  - name: pagerduty-oncall
    severities: [critical]
  - name: slack-ops
    severities: [critical, warning]

includeContext

object

Specifies what context to include in alert notifications.

logs

boolean

default:"true"

Include pod logs in alerts.

logLines

integer

default:"50"

Number of log lines to include.
Validation: Range 1-10000

logContainerName

string

Specific container name for logs. Defaults to the first container.

includeInitContainerLogs

boolean

default:"false"

Include init container logs.

events

boolean

default:"true"

Include Kubernetes events.

podStatus

boolean

default:"true"

Include pod status details.

suggestedFixes

boolean

default:"true"

Include suggested fixes based on failure patterns.
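The fragment below sketches an includeContext block that pulls more log lines from a named container. The container name worker is hypothetical, and the other values are illustrative.

```yaml
spec:
  alerting:
    includeContext:
      logs: true
      logLines: 200
      # Hypothetical container name; omit to use the first container.
      logContainerName: worker
      includeInitContainerLogs: true
      events: true
      podStatus: false
      suggestedFixes: true
```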

suppressDuplicatesFor

duration

default:"1h"

Prevents re-sending the same alert within this time window.

alertDelay

duration

Delays alert dispatch to allow transient issues to resolve. If the issue resolves (e.g., the next job succeeds) before the delay expires, the alert is cancelled and never sent. Useful for flaky jobs.
Example: "5m" waits 5 minutes before sending failure alerts
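A sketch of the two timing controls used together, with illustrative values. One delays the first notification, the other limits repeats.

```yaml
spec:
  alerting:
    # Hold each alert for 10 minutes; it is cancelled if the issue resolves first.
    alertDelay: 10m
    # Once an alert has been sent, do not re-send the same alert for 4 hours.
    suppressDuplicatesFor: 4h
```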

severityOverrides

object

Customizes severity for specific alert types. Only critical and warning are valid.

missedSchedule

string

Severity for missed schedule alerts.
Validation: Enum critical, warning

jobFailed

string

Severity for job failure alerts.
Validation: Enum critical, warning

slaBreached

string

Severity for SLA breach alerts.
Validation: Enum critical, warning

deadManTriggered

string

Severity for dead-man’s switch alerts.
Validation: Enum critical, warning

durationRegression

string

Severity for duration regression alerts.
Validation: Enum critical, warning

severityOverrides:
  jobFailed: critical
  slaBreached: warning
  deadManTriggered: critical

suggestedFixPatterns

array

Defines custom patterns for suggesting fixes based on failure context. These are merged with built-in patterns, with custom patterns taking priority.

name

string

required

Identifies this pattern. Use built-in names like "oom-killed" to override built-in patterns.

match

object

required

Match criteria - at least one field must be specified.

exitCode

integer

Matches a specific exit code. Example: 137 for OOM killed.

exitCodeRange

object

Matches a range of exit codes [min, max] inclusive.

min

integer

required

Minimum exit code.

max

integer

required

Maximum exit code.

reason

string

Matches container termination reason (exact match, case-insensitive). Example: "OOMKilled"

reasonPattern

string

Matches reason using regex. Example: ".*Killed.*"

logPattern

string

Matches log content using regex. Example: "connection timed out|ETIMEDOUT"

eventPattern

string

Matches event messages using regex.

suggestion

string

required

Fix text to display. Supports Go template variables:

{{.Namespace}} - CronJob namespace
{{.Name}} - CronJob name
{{.ExitCode}} - Container exit code
{{.Reason}} - Termination reason
{{.JobName}} - Job name

priority

integer

default:"0"

Determines evaluation order (higher = checked first). Built-in patterns use priorities 1-100. Use values >100 to override built-ins.

suggestedFixPatterns:
  - name: custom-oom
    match:
      exitCode: 137
    suggestion: "Container OOM killed. Increase memory in {{.Namespace}}/{{.Name}}"
    priority: 150
  - name: connection-timeout
    match:
      logPattern: "connection timed out|ETIMEDOUT"
    suggestion: "Network timeout. Check connectivity to external services."
    priority: 50
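The remaining match fields follow the same shape. The sketch below uses exitCodeRange and eventPattern. The pattern names, priorities, and suggestion texts are illustrative and do not correspond to any built-in pattern.

```yaml
suggestedFixPatterns:
  - name: high-exit-codes
    match:
      # Inclusive range: matches exit codes 128 through 255.
      exitCodeRange:
        min: 128
        max: 255
    suggestion: "Job {{.JobName}} exited with code {{.ExitCode}} ({{.Reason}})."
    priority: 120
  - name: image-pull-failure
    match:
      # Regex applied to Kubernetes event messages.
      eventPattern: "ErrImagePull|ImagePullBackOff"
    suggestion: "Image pull failed for {{.Namespace}}/{{.Name}}. Check the image name and registry credentials."
    priority: 110
```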

dataRetention

object

Configures data lifecycle management for this monitor’s execution history.

retentionDays

integer

Overrides global retention for this monitor’s execution history. If not set, uses the global history-retention.default-days setting.
Validation: Minimum value is 1

onCronJobDeletion

string

Defines behavior when a monitored CronJob is deleted.
Validation: Enum retain, purge, purge-after-days

retain: Keep all historical data
purge: Immediately delete all data
purge-after-days: Wait before purging (requires purgeAfterDays)

purgeAfterDays

integer

Days to wait before purging data. Required when onCronJobDeletion is purge-after-days.
Validation: Minimum value is 0

onRecreation

string

Defines behavior when a CronJob is recreated (detected via UID change).
Validation: Enum retain, reset

retain: Keep old history
reset: Delete history from the old UID

storeLogs

boolean

Enables storing job logs in the database. If not set, uses the global --storage.log-storage-enabled setting.

logRetentionDays

integer

How long to keep stored logs. If not set, uses the same value as retentionDays.
Validation: Minimum value is 1

maxLogSizeKB

integer

Maximum log size to store per execution in KB. If not set, uses the global --storage.max-log-size-kb setting.
Validation: Minimum value is 1

storeEvents

boolean

Enables storing Kubernetes events in the database. If not set, uses the global --storage.event-storage-enabled setting.

dataRetention:
  retentionDays: 60
  onCronJobDeletion: purge-after-days
  purgeAfterDays: 7
  onRecreation: retain
  storeLogs: true
  logRetentionDays: 30
  maxLogSizeKB: 200
  storeEvents: true

Status Fields

The status subresource provides real-time monitoring state and metrics.

observedGeneration

integer

The generation last processed by the controller.

phase

string

The monitor’s operational state.
Values: Initializing, Active, Degraded, Error

lastReconcileTime

timestamp

When the controller last reconciled this monitor.

summary

object

Aggregate counts across all monitored CronJobs.

totalCronJobs

integer

Total number of CronJobs being monitored.

healthy

integer

Number of healthy CronJobs.

warning

integer

Number of CronJobs with warnings.

critical

integer

Number of CronJobs in critical state.

suspended

integer

Number of suspended CronJobs.

running

integer

Number of CronJobs with currently running jobs.

activeAlerts

integer

Total number of active alerts across all CronJobs.

cronJobs

array

Per-CronJob status information.

name

string

CronJob name.

namespace

string

CronJob namespace.

status

string

Health status.
Values: healthy, warning, critical, suspended, unknown

suspended

boolean

Whether the CronJob is suspended.

lastSuccessfulTime

timestamp

When the last Job succeeded.

lastFailedTime

timestamp

When the last Job failed.

lastRunDuration

duration

Duration of the last completed Job.

nextScheduledTime

timestamp

When the next Job will be created.

metrics

object

SLA metrics for this CronJob.

successRate

number

Success rate percentage.

totalRuns

integer

Total number of job runs.

successfulRuns

integer

Number of successful runs.

failedRuns

integer

Number of failed runs.

avgDurationSeconds

number

Average execution duration in seconds.

p50DurationSeconds

number

50th percentile (median) duration in seconds.

p95DurationSeconds

number

95th percentile duration in seconds.

p99DurationSeconds

number

99th percentile duration in seconds.

activeJobs

array

Currently running jobs for this CronJob.

name

string

Job name.

startTime

timestamp

When the job started.

runningDuration

duration

How long the job has been running.

podPhase

string

Current pod phase (Pending, Running, etc.).

podName

string

Name of the pod running the job.

ready

string

Pod readiness (e.g., “1/1”).

activeAlerts

array

Current alerts for this CronJob.

type

string

Alert type (e.g., “JobFailed”, “MissedSchedule”, “SLABreached”).

severity

string

Alert severity (“critical” or “warning”).

message

string

Human-readable alert description.

since

timestamp

When the alert became active.

lastNotified

timestamp

When the alert was last sent to channels.

exitCode

integer

Container exit code (for JobFailed alerts).

reason

string

Failure reason (e.g., “OOMKilled”, “Error”).

suggestedFix

string

Actionable guidance for resolving the alert.

conditions

array

Standard Kubernetes condition array. Common condition types:

Ready: Monitor is operational and tracking CronJobs
Progressing: Monitor is initializing or updating
Degraded: Monitor is experiencing issues but operational
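For orientation, here is a hand-written sketch of how these status fields might appear on a monitor tracking one failing CronJob. Every value is hypothetical, optional fields are omitted, and the nesting of metrics and activeAlerts under each cronJobs item, along with the timestamp format, is assumed from the field order above rather than taken from controller output.

```yaml
status:
  observedGeneration: 3
  phase: Active
  lastReconcileTime: "2024-01-15T02:05:00Z"
  summary:
    totalCronJobs: 1
    healthy: 0
    warning: 0
    critical: 1
    suspended: 0
    running: 0
    activeAlerts: 1
  cronJobs:
    - name: daily-backup
      namespace: production
      status: critical
      suspended: false
      lastFailedTime: "2024-01-15T02:03:10Z"
      metrics:
        # 9 successful runs out of 10.
        successRate: 90
        totalRuns: 10
        successfulRuns: 9
        failedRuns: 1
      activeAlerts:
        - type: JobFailed
          severity: critical
          message: "Job failed with exit code 137"
          since: "2024-01-15T02:03:10Z"
          exitCode: 137
          reason: OOMKilled
  conditions:
    - type: Ready
      status: "True"
```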

Complete Example

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: full-featured
  namespace: production
spec:
  selector:
    matchExpressions:
      - key: tier
        operator: In
        values: [critical, high]

  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h

  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
    maxDuration: 30m
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14

  suspendedHandling:
    pauseMonitoring: true
    alertIfSuspendedFor: 168h

  maintenanceWindows:
    - name: weekly-maintenance
      schedule: "0 2 * * 0"
      duration: 4h
      timezone: America/New_York
      suppressAlerts: true

  alerting:
    enabled: true
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
      - name: slack-ops
        severities: [critical, warning]
    severityOverrides:
      jobFailed: critical
      slaBreached: warning
      missedSchedule: warning
      deadManTriggered: critical
      durationRegression: warning
    suppressDuplicatesFor: 1h
    alertDelay: 5m
    includeContext:
      logs: true
      logLines: 100
      events: true
      podStatus: true
      suggestedFixes: true
    suggestedFixPatterns:
      - name: custom-oom
        match:
          exitCode: 137
        suggestion: "OOM killed. Increase memory for {{.Namespace}}/{{.Name}}"
        priority: 150

  dataRetention:
    retentionDays: 60
    onCronJobDeletion: purge-after-days
    purgeAfterDays: 7
    onRecreation: retain
    storeLogs: true
    logRetentionDays: 30
    maxLogSizeKB: 200
    storeEvents: true
