
What is CronJob Guardian?

CronJob Guardian is a Kubernetes operator that monitors CronJobs with SLA tracking, intelligent alerting, and a built-in dashboard. It ensures your critical scheduled jobs run successfully and alerts you when something goes wrong.

The Problem

CronJobs power critical operations like backups, ETL pipelines, and reports, but Kubernetes provides no built-in monitoring for them. When jobs fail silently or stop running, you often find out only after the damage is done. Common issues that go undetected:
  • Silent failures: Jobs fail but no one knows until data is missing
  • Jobs stop running: Schedule issues or resource constraints prevent execution
  • Performance degradation: Jobs slow down gradually over time
  • Resource leaks: Failed jobs consume cluster resources

How Guardian Helps

CronJob Guardian watches your CronJobs and alerts you when something goes wrong, with rich context to help you diagnose and fix issues quickly.

Key Features

Dead-Man's Switch

Alert when CronJobs don’t run within expected windows. Thresholds are calculated automatically from cron schedules, or you can set custom intervals.
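
To make the idea concrete, here is a minimal Python sketch of a dead-man's-switch check. It is not Guardian's implementation: the helper name `is_overdue`, the fixed-interval assumption, and the one-hour default buffer are illustrative only, and a real check would derive the expected interval from the cron expression.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_success: datetime, now: datetime,
               expected_interval: timedelta,
               buffer: timedelta = timedelta(hours=1)) -> bool:
    """Return True when no success has been seen within interval + buffer.

    Hypothetical helper: assumes the job runs at a fixed interval.
    """
    return now - last_success > expected_interval + buffer


now = datetime(2024, 1, 2, 3, 30, tzinfo=timezone.utc)
daily = timedelta(days=1)

# Last success 24h30m ago: still inside the 25h window.
on_time = is_overdue(datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc), now, daily)
# Last success 27h30m ago: the switch should fire.
late = is_overdue(datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc), now, daily)
print(on_time, late)
```

The buffer matters because a job that starts on schedule still needs time to finish; without it, a daily job would be flagged the moment 24 hours had passed.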

SLA Tracking

Monitor success rates and duration percentiles (P50/P95/P99), and detect regressions. Set minimum success rates and maximum duration thresholds.
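
As a rough illustration of what these numbers mean, the sketch below computes a success rate and nearest-rank percentiles from a small execution history. The sample data, the thresholds, and the choice to compute percentiles over successful runs only are assumptions for the example, not a description of how Guardian calculates them.

```python
import math


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile over an ascending list (one of several conventions)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical execution history: (succeeded, duration in seconds).
runs = [(True, 60), (True, 62), (True, 65), (False, 30), (True, 70),
        (True, 61), (True, 64), (True, 300), (True, 63), (True, 66)]

success_rate = 100 * sum(ok for ok, _ in runs) / len(runs)
durations = sorted(d for ok, d in runs if ok)  # assumption: successful runs only
p50, p95, p99 = (percentile(durations, p) for p in (50, 95, 99))

min_success_rate = 95.0  # illustrative thresholds
max_duration = 120
sla_met = success_rate >= min_success_rate and p95 <= max_duration
print(success_rate, p50, p95, p99, sla_met)
```

Note how a single slow run dominates P95 and P99 when the history is this short; percentile-based thresholds become more meaningful as the number of recorded runs grows.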

Intelligent Alerts

Get rich context with pod logs, Kubernetes events, and suggested fixes. Alerts include everything you need to diagnose the issue.

Multiple Channels

Send alerts to Slack, PagerDuty, webhooks, or email. Route different severities to different channels.
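
Severity-based routing can be pictured as a simple lookup table. The sketch below is purely illustrative: the severity names, the `slack-alerts` channel, and the fallback behaviour are assumptions, and in Guardian itself routing is configured through custom resources rather than code.

```python
# Hypothetical routing table: severity -> channel names.
ROUTES = {
    "critical": ["pagerduty-oncall", "slack-alerts"],
    "warning": ["slack-alerts"],
}
DEFAULT_CHANNELS = ["slack-alerts"]


def channels_for(severity: str) -> list[str]:
    """Return the channels that should receive an alert of this severity."""
    return ROUTES.get(severity, DEFAULT_CHANNELS)


print(channels_for("critical"))
print(channels_for("info"))  # unknown severity falls back to the default
```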

Built-in Dashboard

Feature-rich web UI with charts, heatmaps, execution history, and CSV exports. No external tools required.

Prometheus Metrics

Export metrics in Prometheus format so Guardian plugs into your existing observability stack.

Architecture

CronJob Guardian runs as a single operator pod in your cluster with four main components:
  • Operator: Watches CronJobs and Jobs, tracks execution history, calculates SLA metrics
  • Storage: SQLite (default), PostgreSQL, or MySQL for execution history and metrics
  • Dashboard: Embedded web UI for viewing metrics and execution history
  • Custom Resources: CronJobMonitor and AlertChannel define what to monitor and where to alert

Who Should Use This?

CronJob Guardian is ideal for teams that:
  • Run critical scheduled jobs (backups, ETL, reports)
  • Need to maintain SLA commitments
  • Want to catch failures before customers notice
  • Need visibility into job performance and trends
  • Want centralized monitoring across multiple namespaces or clusters

Example Use Cases

Database Backups

Ensure nightly backups run successfully with 100% success rate monitoring:
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h  # Daily + 1h buffer
  sla:
    enabled: true
    minSuccessRate: 100  # Backups must never fail
  alerting:
    channelRefs:
      - name: pagerduty-oncall

ETL Pipelines

Monitor data pipelines with duration regression detection:
spec:
  selector:
    matchLabels:
      type: etl
  sla:
    enabled: true
    maxDuration: 30m
    durationRegression:
      enabled: true
      percentile: 95
      thresholdPercent: 50  # Alert if P95 increases 50%
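
The arithmetic behind a `thresholdPercent` of 50 can be sketched as follows. This is an assumption-laden illustration, not Guardian's algorithm: the function name `has_regressed` is hypothetical, the strict `>` comparison is a guess, and how the baseline P95 is chosen is left out entirely.

```python
def has_regressed(baseline_p95: float, current_p95: float,
                  threshold_percent: float) -> bool:
    """True when current P95 exceeds the baseline by more than threshold_percent."""
    if baseline_p95 <= 0:
        return False  # no usable baseline yet
    increase = 100 * (current_p95 - baseline_p95) / baseline_p95
    return increase > threshold_percent


# Baseline P95 of 20 minutes, expressed in seconds.
print(has_regressed(1200, 1500, 50))  # 25% slower
print(has_regressed(1200, 1900, 50))  # roughly 58% slower
```

A relative threshold like this catches gradual slowdowns well before a job reaches the hard `maxDuration` limit.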

Financial Reports

Suppress alerts during planned maintenance:
spec:
  selector:
    matchLabels:
      type: report
  maintenanceWindows:
    - name: monthly-maintenance
      schedule: "0 2 1 * *"  # First day of month at 2 AM
      duration: 4h
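
To show how a window like this one would be evaluated, here is a small sketch that hard-codes the "first day of the month at 2 AM" schedule instead of parsing the cron expression. Both helper names are hypothetical, timestamps are naive (no time zone handling), and treating the end of the window as exclusive is an assumption.

```python
from datetime import datetime, timedelta


def in_window(now: datetime, window_start: datetime, duration: timedelta) -> bool:
    """True when now is inside [window_start, window_start + duration)."""
    return window_start <= now < window_start + duration


def monthly_window_start(now: datetime) -> datetime:
    """Most recent 02:00 on the first of a month, mirroring "0 2 1 * *"."""
    start = now.replace(day=1, hour=2, minute=0, second=0, microsecond=0)
    if start > now:
        # Before 02:00 on the 1st: the latest window opened last month.
        prev = start - timedelta(days=1)  # 02:00 on the last day of last month
        start = prev.replace(day=1)
    return start


four_hours = timedelta(hours=4)
for t in (datetime(2024, 3, 1, 3, 0), datetime(2024, 3, 1, 6, 0)):
    print(t, in_window(t, monthly_window_start(t), four_hours))
```

With a four-hour duration the window covers 02:00 up to, but not including, 06:00, so the first timestamp falls inside it and the second does not.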

What’s Next?

Quickstart

Get CronJob Guardian running in 5 minutes

Installation

Detailed installation guide with all configuration options

Core Concepts

Learn about monitors, alert channels, and SLA tracking

Examples

Real-world configuration examples
