
Overview

Cluster-wide monitoring allows platform teams to maintain visibility across all CronJobs in the cluster from a single CronJobMonitor resource. This approach is ideal for:
  • SRE/Platform teams responsible for overall cluster health
  • Monitoring critical infrastructure jobs across all teams
  • Enforcing SLA standards organization-wide

Basic Cluster-Wide Monitor

The simplest cluster-wide configuration monitors all CronJobs with a specific label.
monitors/cluster-wide.yaml
# Monitor all CronJobs cluster-wide
# Watches all namespaces with optional label filtering
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-wide-monitor
  namespace: cronjob-guardian
spec:
  selector:
    # Watch all namespaces
    allNamespaces: true
    # Optionally filter by labels
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: pagerduty-critical
        severities: [critical]
      - name: slack-ops
        severities: [critical, warning]

What This Does

  • Watches every namespace in the cluster
  • Monitors CronJobs with the tier: critical label
  • Routes critical alerts to PagerDuty for on-call escalation
  • Sends all alerts to Slack for team visibility

Setup Instructions

1. Verify RBAC permissions

Ensure the CronJob Guardian controller has cluster-wide permissions:
kubectl auth can-i list cronjobs --all-namespaces --as system:serviceaccount:cronjob-guardian:cronjob-guardian-controller-manager
The command should return yes. The default Helm installation configures these permissions automatically.
2. Create alert channels

Set up both PagerDuty and Slack alert channels:
# Create PagerDuty routing key secret
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<your-routing-key>

# Create Slack webhook secret
kubectl create secret generic slack-webhook \
  --namespace cronjob-guardian \
  --from-literal=url=https://hooks.slack.com/services/...

# Apply alert channels
kubectl apply -f alertchannels/pagerduty.yaml
kubectl apply -f alertchannels/slack.yaml
3. Apply the cluster-wide monitor

kubectl apply -f cluster-wide.yaml
4. Verify discovery

Check which jobs are being monitored:
kubectl describe cronjobmonitor cluster-wide-monitor -n cronjob-guardian
The status shows all discovered jobs:
Status:
  Monitored Jobs:
    - Name: database-backup
      Namespace: production
      Tier: critical
    - Name: etl-pipeline
      Namespace: data-eng
      Tier: critical
  Total: 15

Multi-Tier Cluster Monitoring

Create separate monitors for different criticality tiers with appropriate routing.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-critical
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99.9  # Strict SLA
    windowDays: 7
  alerting:
    channelRefs:
      - name: pagerduty-oncall  # Pages on-call engineer
        severities: [critical]
      - name: slack-incidents
        severities: [critical]

Labeling Standard

Establish a cluster-wide labeling convention:
Standard CronJob Labels
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
  namespace: my-namespace
  labels:
    tier: critical          # Required: critical, high, standard
    team: platform          # Required: owning team
    component: backups      # Required: functional component
    monitoring: enabled     # Optional: explicit opt-in
spec:
  schedule: "0 2 * * *"
  # ...
Document this in your organization’s runbooks and enforce it via admission controllers or CI checks.
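One lightweight way to enforce the convention in CI is a script that checks each CronJob manifest for the required labels before merge. The script below is a minimal sketch: it writes a sample manifest for demonstration (in a real pipeline you would check files from the repository), and the required label names follow the convention above.

```shell
#!/bin/sh
# Hypothetical CI check: fail if a CronJob manifest is missing required labels.
manifest="${1:-cronjob.yaml}"

# Sample manifest for demonstration; in CI, check manifests from the repo instead.
cat > "$manifest" <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
  labels:
    tier: critical
    team: platform
    component: backups
EOF

missing=0
for label in tier team component; do
  # Labels sit two levels deep under metadata.labels (4-space indent).
  if ! grep -qE "^    ${label}:" "$manifest"; then
    echo "missing required label: $label"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "labels OK"
```

For stricter enforcement, the same rule can be expressed as an admission policy so that unlabeled CronJobs are rejected at apply time rather than in CI.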

Production/Non-Production Split

Monitor production and non-production environments separately with different SLA requirements.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: production-cluster
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    namespaceSelector:
      matchLabels:
        environment: production  # Only prod namespaces
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99
    windowDays: 7
  alerting:
    channelRefs:
      - name: pagerduty-prod
        severities: [critical]
      - name: slack-prod-ops
        severities: [critical, warning]
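The setup steps apply a non-production-cluster.yaml as well. A possible counterpart might look like the following; the monitor name, channel name, and looser SLA are illustrative, and it assumes namespaceSelector accepts standard matchExpressions:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: non-production-cluster
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    namespaceSelector:
      matchExpressions:
        - key: environment
          operator: In
          values: [staging, dev]
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 95  # Looser SLA for non-prod
    windowDays: 7
  alerting:
    channelRefs:
      - name: slack-nonprod
        severities: [critical, warning]
```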

Setup

1. Label namespaces by environment

kubectl label namespace production environment=production
kubectl label namespace staging environment=staging
kubectl label namespace dev environment=dev
2. Apply both monitors

kubectl apply -f production-cluster.yaml
kubectl apply -f non-production-cluster.yaml
3. Verify separation

# Check production monitor
kubectl describe cronjobmonitor production-cluster -n cronjob-guardian

# Check non-production monitor
kubectl describe cronjobmonitor non-production-cluster -n cronjob-guardian
Ensure each monitor only sees jobs in the appropriate namespaces.

Excluding System Namespaces

Prevent monitoring of system CronJobs in kube-system, kube-public, etc.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: application-jobs-only
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    # Require opt-in label
    matchLabels:
      monitoring.guardian.illenium.net/enabled: "true"
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-ops

Approach 1: Opt-In with Labels

Only monitor CronJobs that explicitly opt in with a label:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
  labels:
    monitoring.guardian.illenium.net/enabled: "true"  # Explicit opt-in
System jobs without this label won’t be monitored.
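Existing CronJobs can opt in without editing their manifests, for example (job name and namespace are illustrative):

```
kubectl label cronjob my-job -n my-namespace \
  monitoring.guardian.illenium.net/enabled=true
```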

Approach 2: Namespace Label Filter

Only monitor application namespaces:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: app-jobs-only
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    namespaceSelector:
      matchLabels:
        app-namespace: "true"  # Only namespaces with this label
  # ...
Label application namespaces but not system ones:
kubectl label namespace production app-namespace=true
kubectl label namespace staging app-namespace=true
# kube-system remains unlabeled and is ignored

Alert Routing Strategies

Route alerts intelligently based on namespace, team, or component.
Team-Based Routing
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-monitor
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      # Critical alerts to on-call
      - name: pagerduty-oncall
        severities: [critical]
      
      # All alerts to general ops channel
      - name: slack-ops-general
        severities: [critical, warning]
      
      # Platform team alerts
      - name: slack-team-platform
        severities: [critical, warning]
        # Note: Filtering by team label requires custom alert channel logic
Currently, AlertChannels apply to all matched jobs. For per-team routing, create separate CronJobMonitors with namespace or label selectors, or use webhook channels with custom routing logic.

Monitoring at Scale

Performance Considerations

Cluster-wide monitoring can watch thousands of CronJobs. Consider these best practices:
Always filter with labels to avoid monitoring every CronJob:
# Good: Filtered
selector:
  allNamespaces: true
  matchLabels:
    tier: critical

# Risky: Monitors everything
selector:
  allNamespaces: true
  # No filters!
For large-scale monitoring, reduce data retention to manage database size:
dataRetention:
  retentionDays: 30  # Instead of default 90
  storeLogs: false   # Don't store logs for all jobs
  storeEvents: false # Don't store events
Configure rate limiting on AlertChannels to prevent alert storms:
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-ops
spec:
  type: slack
  rateLimiting:
    maxAlertsPerHour: 100
    burstLimit: 10
Prevent alerts for transient failures:
alerting:
  alertDelay: 5m  # Wait 5 minutes before alerting
  suppressDuplicatesFor: 1h
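Putting the fragments above together, a single at-scale monitor might look like the following sketch. It assumes dataRetention and the alerting delay fields sit at these positions in the CronJobMonitor spec:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-at-scale
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      tier: critical       # Always filter at scale
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  dataRetention:
    retentionDays: 30      # Instead of default 90
    storeLogs: false
    storeEvents: false
  alerting:
    alertDelay: 5m         # Skip transient failures
    suppressDuplicatesFor: 1h
    channelRefs:
      - name: slack-ops
```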

Monitoring Sharded by Team

For very large clusters, create one monitor per team:
Platform Team Monitor
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: team-platform
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      team: platform
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-team-platform
Repeat for each team with their own labels and alert channels. This provides:
  • Team-specific SLA requirements
  • Isolated alert channels per team
  • Better performance (smaller job sets per monitor)
  • Team autonomy in configuring their monitoring
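Rather than hand-writing each team's monitor, the per-team manifests can be stamped out from a template. The loop below is a sketch; the team names and Slack channel naming scheme are illustrative:

```shell
#!/bin/sh
# Generate one CronJobMonitor manifest per team from a shared template.
# Apply afterwards with: kubectl apply -f team-<name>.yaml
for team in platform data-eng payments; do
  cat > "team-${team}.yaml" <<EOF
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: team-${team}
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      team: ${team}
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-team-${team}
EOF
done
ls team-*.yaml
```

In larger setups the same idea is usually expressed with Helm or Kustomize overlays so team monitors live in version control alongside the teams' other manifests.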

Next Steps

  • Advanced Features: SLA tracking, regression detection, and maintenance windows
  • Alert Channels: configure PagerDuty, Slack, and other channels
  • Data Retention: manage historical data at scale
  • RBAC Setup: configure cluster-wide permissions
