
Overview

Cluster-wide monitoring allows platform teams to maintain visibility across all CronJobs in the cluster from a single CronJobMonitor resource. This approach is ideal for:
  • SRE/Platform teams responsible for overall cluster health
  • Monitoring critical infrastructure jobs across all teams
  • Enforcing SLA standards organization-wide

Basic Cluster-Wide Monitor

The simplest cluster-wide configuration monitors all CronJobs with a specific label.
monitors/cluster-wide.yaml
# Monitor all CronJobs cluster-wide
# Watches all namespaces with optional label filtering
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-wide-monitor
  namespace: cronjob-guardian
spec:
  selector:
    # Watch all namespaces
    allNamespaces: true
    # Optionally filter by labels
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: pagerduty-critical
        severities: [critical]
      - name: slack-ops
        severities: [critical, warning]

What This Does

  • Watches every namespace in the cluster
  • Monitors CronJobs with the tier: critical label
  • Routes critical alerts to PagerDuty for on-call escalation
  • Sends all alerts to Slack for team visibility

Setup Instructions

1. Verify RBAC permissions

Ensure the CronJob Guardian controller has cluster-wide permissions:
kubectl auth can-i list cronjobs --all-namespaces --as system:serviceaccount:cronjob-guardian:cronjob-guardian-controller-manager
The command should return yes. The default Helm installation configures these permissions automatically.
2. Create alert channels

Set up both PagerDuty and Slack alert channels:
# Create PagerDuty routing key secret
kubectl create secret generic pagerduty-key \
  --namespace cronjob-guardian \
  --from-literal=routing-key=<your-routing-key>

# Create Slack webhook secret
kubectl create secret generic slack-webhook \
  --namespace cronjob-guardian \
  --from-literal=url=https://hooks.slack.com/services/...

# Apply alert channels
kubectl apply -f alertchannels/pagerduty.yaml
kubectl apply -f alertchannels/slack.yaml
3. Apply the cluster-wide monitor

kubectl apply -f cluster-wide.yaml
4. Verify discovery

Check which jobs are being monitored:
kubectl describe cronjobmonitor cluster-wide-monitor -n cronjob-guardian
The status shows all discovered jobs:
Status:
  Monitored Jobs:
    - Name: database-backup
      Namespace: production
      Tier: critical
    - Name: etl-pipeline
      Namespace: data-eng
      Tier: critical
  Total: 15

Multi-Tier Cluster Monitoring

Create separate monitors for different criticality tiers with appropriate routing.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-critical
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99.9  # Strict SLA
    windowDays: 7
  alerting:
    channelRefs:
      - name: pagerduty-oncall  # Pages on-call engineer
        severities: [critical]
      - name: slack-incidents
        severities: [critical]

Labeling Standard

Establish a cluster-wide labeling convention:
Standard CronJob Labels
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
  namespace: my-namespace
  labels:
    tier: critical          # Required: critical, high, standard
    team: platform          # Required: owning team
    component: backups      # Required: functional component
    monitoring: enabled     # Optional: explicit opt-in
spec:
  schedule: "0 2 * * *"
  # ...
Document this in your organization’s runbooks and enforce it via admission controllers or CI checks.
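One lightweight way to enforce the convention in CI is a script that checks each CronJob manifest for the required labels before merge. The script below is a minimal sketch: it writes a sample manifest for demonstration (in a real pipeline you would check files from the repository), and the required label names follow the convention above.

```shell
#!/bin/sh
# Hypothetical CI check: fail if a CronJob manifest is missing required labels.
manifest="${1:-cronjob.yaml}"

# Sample manifest for demonstration; in CI, check manifests from the repo instead.
cat > "$manifest" <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
  labels:
    tier: critical
    team: platform
    component: backups
EOF

missing=0
for label in tier team component; do
  # Labels sit two levels deep under metadata.labels (4-space indent).
  if ! grep -qE "^    ${label}:" "$manifest"; then
    echo "missing required label: $label"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "labels OK"
```

For stricter enforcement, the same rule can be expressed as an admission policy so that unlabeled CronJobs are rejected at apply time rather than in CI.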

Production/Non-Production Split

Monitor production and non-production environments separately with different SLA requirements.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: production-cluster
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    namespaceSelector:
      matchLabels:
        environment: production  # Only prod namespaces
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 99
    windowDays: 7
  alerting:
    channelRefs:
      - name: pagerduty-prod
        severities: [critical]
      - name: slack-prod-ops
        severities: [critical, warning]
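The setup steps apply a non-production-cluster.yaml as well. A possible counterpart might look like the following; the monitor name, channel name, and looser SLA are illustrative, and it assumes namespaceSelector accepts standard matchExpressions:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: non-production-cluster
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    namespaceSelector:
      matchExpressions:
        - key: environment
          operator: In
          values: [staging, dev]
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 95  # Looser SLA for non-prod
    windowDays: 7
  alerting:
    channelRefs:
      - name: slack-nonprod
        severities: [critical, warning]
```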

Setup

1. Label namespaces by environment

kubectl label namespace production environment=production
kubectl label namespace staging environment=staging
kubectl label namespace dev environment=dev
2. Apply both monitors

kubectl apply -f production-cluster.yaml
kubectl apply -f non-production-cluster.yaml
3. Verify separation

# Check production monitor
kubectl describe cronjobmonitor production-cluster -n cronjob-guardian

# Check non-production monitor
kubectl describe cronjobmonitor non-production-cluster -n cronjob-guardian
Ensure each monitor only sees jobs in the appropriate namespaces.

Excluding System Namespaces

Prevent monitoring of system CronJobs in kube-system, kube-public, etc.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: application-jobs-only
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    # Require opt-in label
    matchLabels:
      monitoring.guardian.illenium.net/enabled: "true"
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-ops

Approach 1: Opt-In with Labels

Only monitor CronJobs that explicitly opt in with a label:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
  labels:
    monitoring.guardian.illenium.net/enabled: "true"  # Explicit opt-in
System jobs without this label won’t be monitored.
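Existing CronJobs can opt in without editing their manifests, for example (job name and namespace are illustrative):

```
kubectl label cronjob my-job -n my-namespace \
  monitoring.guardian.illenium.net/enabled=true
```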

Approach 2: Namespace Label Filter

Only monitor application namespaces:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: app-jobs-only
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    namespaceSelector:
      matchLabels:
        app-namespace: "true"  # Only namespaces with this label
  # ...
Label application namespaces but not system ones:
kubectl label namespace production app-namespace=true
kubectl label namespace staging app-namespace=true
# kube-system remains unlabeled and is ignored

Alert Routing Strategies

Route alerts intelligently based on namespace, team, or component.
Team-Based Routing
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-monitor
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      # Critical alerts to on-call
      - name: pagerduty-oncall
        severities: [critical]
      
      # All alerts to general ops channel
      - name: slack-ops-general
        severities: [critical, warning]
      
      # Platform team alerts
      - name: slack-team-platform
        severities: [critical, warning]
        # Note: Filtering by team label requires custom alert channel logic
Currently, AlertChannels apply to all matched jobs. For per-team routing, create separate CronJobMonitors with namespace or label selectors, or use webhook channels with custom routing logic.

Monitoring at Scale

Performance Considerations

Cluster-wide monitoring can watch thousands of CronJobs. Consider these best practices:
Always filter with labels to avoid monitoring every CronJob:
# Good: Filtered
selector:
  allNamespaces: true
  matchLabels:
    tier: critical

# Risky: Monitors everything
selector:
  allNamespaces: true
  # No filters!
For large-scale monitoring, reduce data retention to manage database size:
dataRetention:
  retentionDays: 30  # Instead of default 90
  storeLogs: false   # Don't store logs for all jobs
  storeEvents: false # Don't store events
Configure rate limiting on AlertChannels to prevent alert storms:
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-ops
spec:
  type: slack
  rateLimiting:
    maxAlertsPerHour: 100
    burstLimit: 10
Prevent alerts for transient failures:
alerting:
  alertDelay: 5m  # Wait 5 minutes before alerting
  suppressDuplicatesFor: 1h
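Putting the fragments above together, a single at-scale monitor might look like the following sketch. It assumes dataRetention and the alerting delay fields sit at these positions in the CronJobMonitor spec:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: cluster-at-scale
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      tier: critical       # Always filter at scale
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  dataRetention:
    retentionDays: 30      # Instead of default 90
    storeLogs: false
    storeEvents: false
  alerting:
    alertDelay: 5m         # Skip transient failures
    suppressDuplicatesFor: 1h
    channelRefs:
      - name: slack-ops
```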

Monitoring Sharded by Team

For very large clusters, create one monitor per team:
Platform Team Monitor
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: team-platform
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      team: platform
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-team-platform
Repeat for each team with their own labels and alert channels. This provides:
  • Team-specific SLA requirements
  • Isolated alert channels per team
  • Better performance (smaller job sets per monitor)
  • Team autonomy in configuring their monitoring
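Rather than hand-writing each team's monitor, the per-team manifests can be stamped out from a template. The loop below is a sketch; the team names and Slack channel naming scheme are illustrative:

```shell
#!/bin/sh
# Generate one CronJobMonitor manifest per team from a shared template.
# Apply afterwards with: kubectl apply -f team-<name>.yaml
for team in platform data-eng payments; do
  cat > "team-${team}.yaml" <<EOF
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: team-${team}
  namespace: cronjob-guardian
spec:
  selector:
    allNamespaces: true
    matchLabels:
      team: ${team}
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  alerting:
    channelRefs:
      - name: slack-team-${team}
EOF
done
ls team-*.yaml
```

In larger setups the same idea is usually expressed with Helm or Kustomize overlays so team monitors live in version control alongside the teams' other manifests.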

Next Steps

  • Advanced Features: SLA tracking, regression detection, and maintenance windows
  • Alert Channels: configure PagerDuty, Slack, and other channels
  • Data Retention: manage historical data at scale
  • RBAC Setup: configure cluster-wide permissions
