Overview
Beyond basic failure detection, CronJob Guardian provides advanced features for production reliability:
SLA Tracking: Monitor success rates over time
Duration Regression: Detect when jobs start taking longer
Maintenance Windows: Suppress alerts during planned maintenance
Suspended Handling: Manage monitoring of paused CronJobs
Custom Fix Suggestions: Provide automated remediation guidance
Database Backup Monitoring
Critical backup jobs require strict SLA enforcement and fast detection of issues.
monitors/database-backups.yaml
```yaml
# Database Backup Monitoring
# Monitors critical backup jobs with strict SLA requirements
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h  # Daily backups with 1h buffer
  sla:
    enabled: true
    minSuccessRate: 100  # Backups must never fail
    maxDuration: 1h      # Alert if backup takes too long
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]
    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
  # Custom fix suggestion for backup failures
  suggestedFixPatterns:
    - name: disk-full
      match:
        logPattern: "No space left on device|disk full"
      suggestion: "Backup storage is full. Check PVC usage: kubectl get pvc -n {{.Namespace}}"
      priority: 150
```
What This Does
Enforces 100% success rate (any failure triggers an alert)
Alerts if backups take longer than 1 hour
Pages the DBA team immediately on any issue
Provides custom troubleshooting for disk-full errors
Setup Instructions
Label backup CronJobs
```shell
kubectl label cronjob postgres-backup type=backup -n databases
kubectl label cronjob mysql-backup type=backup -n databases
```
Create PagerDuty alert channel
```shell
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<dba-team-routing-key>

kubectl apply -f - <<EOF
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: pagerduty-dba
spec:
  type: pagerduty
  pagerduty:
    routingKeySecretRef:
      name: pagerduty-key
      namespace: cronjob-guardian
      key: routing-key
EOF
```
Apply the monitor
```shell
kubectl apply -f database-backups.yaml
```
Verify SLA tracking
Check the monitor status for SLA metrics:

```shell
kubectl get cronjobmonitor database-backups -n databases -o yaml
```

Look for:

```yaml
status:
  sla:
    currentSuccessRate: 100.0
    recentExecutions: 24
    failures: 0
```
A 100% SLA requirement means any failure triggers an alert. This is appropriate for critical backups but may be too strict for other workloads. For most jobs, 95-99% is more realistic.
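The success-rate check behind these thresholds is simple arithmetic. A minimal sketch of the math it implies (function names here are illustrative, not Guardian's internals):

```python
def success_rate(results):
    """Percentage of successful runs in the SLA window."""
    if not results:
        return 100.0
    return 100.0 * sum(results) / len(results)

def sla_breached(results, min_success_rate):
    """True when the window's success rate falls below the SLA floor."""
    return success_rate(results) < min_success_rate

# 24 daily runs with a single failure: ~95.8% success.
# Breaches a 100% SLA, but would pass a 95% one.
window = [True] * 23 + [False]
rate = success_rate(window)
```

This is why a 100% floor pages on the very first failure, while 95-99% tolerates occasional flakes.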
ETL Pipeline with Duration Regression
Data pipelines should not only succeed but also run in a predictable time window. Detect performance degradation early.
monitors/data-pipeline.yaml
```yaml
# Data Pipeline Monitoring
# Tracks ETL performance with duration regression detection
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: etl-pipeline
  namespace: data-eng
spec:
  selector:
    matchLabels:
      pipeline: etl
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      buffer: 30m
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
    # ETL jobs have duration SLAs
    maxDuration: 2h
    durationRegressionThreshold: 25  # Alert if P95 increases by 25%
    durationBaselineWindowDays: 14
  alerting:
    channelRefs:
      - name: slack-data-eng
    # Include logs to debug ETL failures
    includeContext:
      logs: true
      logLines: 200
```
What This Does
Auto-detects expected run intervals from CronJob schedules
Alerts if jobs take longer than 2 hours (hard limit)
Detects duration regression if P95 duration increases by 25% compared to the last 14 days
Includes 200 lines of logs in alerts for debugging
Duration Regression Detection
This is powerful for catching performance degradation:
Guardian calculates a baseline P95 duration over the last 14 days
Compares recent runs to the baseline
Alerts if the P95 duration increases by more than 25%
Example:
Baseline P95: 45 minutes
Recent P95: 60 minutes
Increase: 33% (exceeds 25% threshold)
Result: Alert triggered
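The arithmetic in that example can be sketched in a few lines (a nearest-rank P95 is assumed here; Guardian's exact percentile method is not specified):

```python
def p95(durations):
    """Nearest-rank 95th-percentile of run durations (minutes)."""
    ordered = sorted(durations)
    rank = max(1, round(0.95 * len(ordered)))
    return ordered[rank - 1]

def regression_pct(baseline, recent):
    """Percentage increase of the recent P95 over the baseline P95."""
    return 100.0 * (recent - baseline) / baseline

# Baseline P95 of 45 min vs recent P95 of 60 min:
increase = regression_pct(45, 60)  # ~33.3%
alert = increase > 25              # exceeds the 25% threshold
```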
Tune durationRegressionThreshold based on your workload variability:
Stable workloads: 15-25% (detect small changes)
Variable workloads: 50-75% (avoid false positives)
Data-driven: Start at 50%, reduce as you understand variance
Setup Instructions
Label ETL CronJobs
```shell
kubectl label cronjob daily-etl pipeline=etl -n data-eng
kubectl label cronjob hourly-sync pipeline=etl -n data-eng
```
Apply the monitor
```shell
kubectl apply -f data-pipeline.yaml
```
Wait for baseline
Duration regression requires historical data. Wait at least 14 days (the baseline window) before regression detection activates. Check baseline status:

```shell
kubectl get cronjobmonitor etl-pipeline -n data-eng -o jsonpath='{.status.sla.durationBaseline}'
```
Test regression detection
Simulate a slow job:

```yaml
# Add a sleep to your job temporarily
command: ["/bin/sh", "-c", "sleep 3600 && /etl.sh"]
```
After the job runs, check for a regression alert in your Slack channel.
Financial Reports with Maintenance Windows
Suppress alerts during planned downtime like month-end processing or system upgrades.
monitors/financial-reports.yaml
```yaml
# Financial Report Monitoring
# Monitors business-critical reports with maintenance windows
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: financial-reports
  namespace: finance
spec:
  selector:
    matchNames:
      - daily-revenue-report
      - weekly-summary
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  maintenanceWindows:
    - name: quarter-end
      schedule: "0 0 1 1,4,7,10 *"  # First day of each quarter
      duration: 24h
      suppressAlerts: true
  alerting:
    channelRefs:
      - name: slack-finance
```
What This Does
Monitors two specific reports by name
Defines a quarterly maintenance window on Jan 1, Apr 1, Jul 1, Oct 1
Suppresses alerts for 24 hours during quarter-end processing
Automatically resumes monitoring after the window
Maintenance Window Examples
Weekly Maintenance
```yaml
maintenanceWindows:
  - name: weekly-maintenance
    schedule: "0 2 * * 0"  # Every Sunday at 2 AM
    duration: 4h
    timezone: America/New_York
    suppressAlerts: true
```
Maintenance window schedules use the same cron syntax as CronJobs. Use crontab.guru to verify your expressions.
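Once the cron expression has produced a window's start time, the suppression check itself reduces to an interval test. A sketch with hypothetical names (Guardian's actual evaluation is not shown in this doc):

```python
from datetime import datetime, timedelta

def in_maintenance_window(window_start, duration, now):
    """True while `now` falls inside a window that began at
    `window_start` and lasts `duration`; alerts are suppressed then."""
    return window_start <= now < window_start + duration

# Quarter-end window: 24h starting Jan 1 at midnight.
start = datetime(2025, 1, 1, 0, 0)
in_maintenance_window(start, timedelta(hours=24), datetime(2025, 1, 1, 13, 0))  # True
in_maintenance_window(start, timedelta(hours=24), datetime(2025, 1, 2, 1, 0))   # False
```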
Suspended CronJob Handling
Control how monitoring behaves when CronJobs are manually suspended.
```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: with-suspend-handling
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  suspendedHandling:
    pauseMonitoring: true      # Pause monitoring when CronJob is suspended
    alertIfSuspendedFor: 168h  # Alert if suspended for more than 7 days
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-ops
```
What This Does
When a CronJob is suspended (.spec.suspend: true), monitoring pauses
No dead-man’s switch alerts while suspended
If the CronJob remains suspended for 7 days, send a reminder alert
Monitoring automatically resumes when unsuspended
Use Cases
Short-term suspension: Pause a job for debugging without triggering alerts
Long-term reminder: Detect forgotten suspended jobs
Planned downtime: Suspend jobs during migrations without alert noise
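The decision logic this describes is small. A hedged sketch (function name and return values are hypothetical, not Guardian's API):

```python
from datetime import datetime, timedelta, timezone

def suspended_action(suspended_since, now, alert_after=timedelta(hours=168)):
    """What monitoring should do for a (possibly) suspended CronJob:
    run normally, stay quiet, or remind about a forgotten suspension."""
    if suspended_since is None:
        return "monitor"   # not suspended: normal dead-man's switch checks
    if now - suspended_since >= alert_after:
        return "remind"    # suspended past the threshold: send a reminder
    return "pause"         # recently suspended: suppress alerts

now = datetime(2025, 6, 10, tzinfo=timezone.utc)
suspended_action(None, now)                     # "monitor"
suspended_action(now - timedelta(days=2), now)  # "pause"
suspended_action(now - timedelta(days=8), now)  # "remind"
```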
Full-Featured Configuration
A comprehensive example demonstrating all available options:
monitors/full-featured.yaml (excerpt)
```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: full-featured
  namespace: production
spec:
  selector:
    matchExpressions:
      - key: tier
        operator: In
        values: [critical, high]
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      buffer: 1h
      missedScheduleThreshold: 2  # Alert after 2 missed schedules
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
    maxDuration: 30m
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14
  suspendedHandling:
    pauseMonitoring: true
    alertIfSuspendedFor: 168h
  maintenanceWindows:
    - name: weekly-maintenance
      schedule: "0 2 * * 0"
      duration: 4h
      timezone: America/New_York
      suppressAlerts: true
  alerting:
    enabled: true
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
      - name: slack-ops
        severities: [critical, warning]
    severityOverrides:
      jobFailed: critical
      slaBreached: warning
      missedSchedule: warning
      deadManTriggered: critical
      durationRegression: warning
    suppressDuplicatesFor: 1h
    alertDelay: 5m
    includeContext:
      logs: true
      logLines: 100
      events: true
      podStatus: true
      suggestedFixes: true
  suggestedFixPatterns:
    - name: custom-oom
      match:
        exitCode: 137
      suggestion: "Container was OOM killed. Consider increasing memory limits for {{.Namespace}}/{{.Name}}"
      priority: 150
  dataRetention:
    retentionDays: 60
    onCronJobDeletion: purge-after-days
    purgeAfterDays: 7
    storeLogs: true
    logRetentionDays: 30
```
Key Features Explained
```yaml
deadManSwitch:
  autoFromSchedule:
    enabled: true
    buffer: 1h
    missedScheduleThreshold: 2
```
Instead of hardcoding maxTimeSinceLastSuccess, Guardian calculates it from the CronJob’s schedule:
Schedule: 0 */6 * * * (every 6 hours)
Expected interval: 6 hours
With buffer: 7 hours
Alert after: 2 missed schedules = 14 hours
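The bullets above are consistent with counting each missed run as one interval plus the buffer; under that assumption (not confirmed as Guardian's exact formula), the arithmetic is:

```python
from datetime import timedelta

def dead_man_deadline(interval, buffer, missed_threshold):
    """Time since last success before the dead-man's switch fires,
    counting each missed schedule as interval + buffer."""
    return missed_threshold * (interval + buffer)

# Every 6 hours, 1h buffer, alert after 2 missed schedules -> 14h.
deadline = dead_man_deadline(timedelta(hours=6), timedelta(hours=1), 2)
```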
```yaml
severityOverrides:
  jobFailed: critical          # Job failures are critical
  slaBreached: warning         # SLA breach is a warning
  durationRegression: warning
```
Customize alert severity per alert type. Default severities:
jobFailed: critical
deadManTriggered: critical
slaBreached: warning
missedSchedule: warning
durationRegression: warning
Alert Delays and Deduplication
```yaml
alerting:
  alertDelay: 5m             # Wait 5 minutes before sending
  suppressDuplicatesFor: 1h  # Don't resend the same alert for 1 hour
```
Alert Delay : Waits 5 minutes before sending. If the issue resolves (e.g., job retries and succeeds), the alert is cancelled.
Suppress Duplicates : Prevents alert fatigue by not resending the same alert multiple times.
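Deduplication boils down to remembering when each alert key last fired and dropping repeats inside the suppression window. A minimal sketch (class and method names are illustrative, not Guardian's internals):

```python
from datetime import datetime, timedelta

class AlertGate:
    """Suppress duplicate alerts within a rolling window."""

    def __init__(self, suppress_for=timedelta(hours=1)):
        self.suppress_for = suppress_for
        self.last_sent = {}  # alert key -> time it last fired

    def should_send(self, key, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.suppress_for:
            return False  # duplicate inside the window: drop it
        self.last_sent[key] = now
        return True

gate = AlertGate()
t0 = datetime(2025, 1, 1, 12, 0)
gate.should_send("jobFailed/etl", t0)                          # True
gate.should_send("jobFailed/etl", t0 + timedelta(minutes=30))  # False
gate.should_send("jobFailed/etl", t0 + timedelta(hours=2))     # True
```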
```yaml
includeContext:
  logs: true
  logLines: 100
  events: true
  podStatus: true
  suggestedFixes: true
```
Alerts include:
Last 100 lines of pod logs
Kubernetes events related to the job
Pod status (exit codes, reasons)
AI-generated fix suggestions based on error patterns
```yaml
suggestedFixPatterns:
  - name: custom-oom
    match:
      exitCode: 137
    suggestion: "Container was OOM killed. Increase memory limits."
    priority: 150
```
Define custom troubleshooting advice based on:
Exit codes
Log patterns (regex)
Error messages
Higher priority patterns (>100) override built-in suggestions.
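Selection can be sketched as filter-then-max over priorities (structure inferred from the fields above, not Guardian's actual matching code):

```python
import re

def best_fix(patterns, exit_code=None, logs=""):
    """Pick the highest-priority pattern whose match criteria all hit."""
    candidates = []
    for p in patterns:
        match = p["match"]
        if "exitCode" in match and match["exitCode"] != exit_code:
            continue
        if "logPattern" in match and not re.search(match["logPattern"], logs):
            continue
        candidates.append(p)
    return max(candidates, key=lambda p: p["priority"], default=None)

patterns = [
    {"name": "custom-oom", "match": {"exitCode": 137},
     "suggestion": "Increase memory limits.", "priority": 150},
    {"name": "disk-full", "match": {"logPattern": "No space left on device"},
     "suggestion": "Check PVC usage.", "priority": 150},
]
best_fix(patterns, exit_code=137)  # matches custom-oom
```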
```yaml
dataRetention:
  retentionDays: 60
  onCronJobDeletion: purge-after-days
  purgeAfterDays: 7
```
Keep execution history for 60 days
When a CronJob is deleted, retain data for 7 more days before purging
Useful for post-mortem analysis of deleted jobs
Common Patterns
High-Availability Jobs
```yaml
sla:
  enabled: true
  minSuccessRate: 99.9
  windowDays: 30  # Longer window for accurate percentage
  maxDuration: 15m
alerting:
  channelRefs:
    - name: pagerduty-critical
      severities: [critical]
  alertDelay: 0  # No delay, page immediately
```
Variable-Duration Jobs
```yaml
sla:
  enabled: true
  # Don't set maxDuration for jobs with variable runtime
  durationRegressionThreshold: 100  # Alert if duration doubles
  durationBaselineWindowDays: 30    # Longer baseline for stability
```
Development Environment
```yaml
sla:
  enabled: true
  minSuccessRate: 80  # More lenient
  windowDays: 7
alerting:
  channelRefs:
    - name: slack-dev
      severities: [critical]  # Only critical alerts
  suppressDuplicatesFor: 6h   # Reduce noise
```
Next Steps
Slack Alerts: Configure Slack alert channels
PagerDuty Alerts: Set up on-call escalation
Webhook Alerts: Integrate with custom systems
Monitor Reference: Complete API documentation