Troubleshooting

This guide covers common issues you may encounter when using CronJob Guardian and how to resolve them.

Monitor Not Discovering CronJobs

Symptoms

CronJobMonitor shows totalCronJobs: 0 even though CronJobs exist in the namespace.

Possible Causes and Solutions

1. Verify the monitor is in Active phase

kubectl get cronjobmonitor my-monitor -n production
If the phase is not Active, check the monitor’s conditions:
kubectl describe cronjobmonitor my-monitor -n production
2. Check the selector configuration

Ensure your selector matches the CronJob labels:
# List CronJobs with labels
kubectl get cronjob -n production --show-labels

# Check if your selector matches
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 10 selector
3. Verify namespace permissions

For cluster-wide or multi-namespace monitoring, ensure the Guardian service account has proper RBAC permissions:
kubectl auth can-i list cronjobs --as=system:serviceaccount:cronjob-guardian:cronjob-guardian-controller -n production
Should return yes.
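If it returns no, grant the service account read access to CronJobs and Jobs in the target namespace. A minimal sketch, assuming the service account name and namespace from the command above (adjust the resource names to your install):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: guardian-cronjob-reader
  namespace: production
rules:
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: guardian-cronjob-reader
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: guardian-cronjob-reader
subjects:
  - kind: ServiceAccount
    name: cronjob-guardian-controller
    namespace: cronjob-guardian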
4. Check ignored namespaces

Verify the namespace isn’t in the ignored list:
kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 ignored-namespaces
If monitoring fails silently, check the controller logs:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller
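To stop ignoring a namespace, edit the ConfigMap. A sketch of the expected shape, assuming ignored-namespaces holds a comma-separated list (verify the format against your deployed ConfigMap):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cronjob-guardian-config
  namespace: cronjob-guardian
data:
  # Namespaces Guardian skips during discovery; remove "production" here if it appears
  ignored-namespaces: "kube-system,kube-public"
```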

Alerts Not Being Sent

Symptoms

Jobs are failing, but no alerts are received in Slack, PagerDuty, etc.

Diagnosis Steps

1. Check if alerts are active in the monitor status

kubectl describe cronjobmonitor my-monitor -n production | grep -A 10 "Active Alerts"
If alerts are listed here, the monitor is detecting failures.
2. Verify AlertChannel is ready

kubectl get alertchannel -A
Ensure the READY column shows true. If it shows false, check the AlertChannel status:
kubectl describe alertchannel slack-alerts
3. Test the alert channel

Send a test alert:
curl -X POST http://localhost:8080/api/v1/channels/slack-alerts/test
Or via the dashboard: Channels → select channel → Send Test Alert.
4. Check alert channel references in the monitor

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 channelRefs
Ensure the channel name matches the AlertChannel resource name.
5. Verify severity routing

alerting:
  channelRefs:
    - name: slack-alerts
      severities: [critical, warning]  # Ensure this includes the alert severity
If you only route critical but the alert is warning, it won’t be sent to that channel.
6. Check rate limiting

kubectl get alertchannel slack-alerts -o yaml | grep -A 5 rateLimiting
If you’re hitting rate limits, increase them or reduce alert volume.
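When raising limits, keep them above your worst expected alert burst so real incidents aren't throttled. A hypothetical shape for the rateLimiting block (the field names below are assumptions for illustration; check the AlertChannel CRD schema for the real ones):

```yaml
spec:
  rateLimiting:
    maxAlertsPerHour: 60   # assumed field name; set above your peak alert rate
```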

Common Alert Channel Issues

Slack Webhook Invalid

kubectl describe alertchannel slack-alerts
If you see errors like invalid webhook URL or 401 Unauthorized:
  1. Verify the webhook URL is correct
  2. Regenerate the webhook in Slack if necessary
  3. Update the secret:
kubectl delete secret slack-webhook -n cronjob-guardian
kubectl create secret generic slack-webhook \
  --namespace cronjob-guardian \
  --from-literal=url=https://hooks.slack.com/services/YOUR/NEW/URL
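After recreating the secret, confirm the AlertChannel still points at it. A sketch, assuming the channel references the webhook through a secret reference (the spec fields shown are hypothetical; compare with your actual AlertChannel spec):

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts
spec:
  type: slack
  slack:
    webhookSecretRef:   # hypothetical field name
      name: slack-webhook
      key: url
```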

PagerDuty Routing Key Invalid

Check the PagerDuty routing key:
kubectl get secret pagerduty-key -n cronjob-guardian -o jsonpath='{.data.routing-key}' | base64 -d
Ensure it matches your PagerDuty service integration key.

Dead-Man’s Switch Not Triggering

Symptoms

CronJob hasn’t run in days, but no dead-man’s switch alert is sent.

Solutions

1. Verify dead-man's switch is enabled

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 deadManSwitch
Ensure enabled: true.
2. Check the time window

deadManSwitch:
  enabled: true
  maxTimeSinceLastSuccess: 25h  # Must exceed CronJob schedule
If your job runs daily, the window should be > 24h (e.g., 25h with buffer).
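The same rule applies to any schedule: pick a window slightly larger than the longest normal gap between successful runs. For example, for a job on `0 */6 * * *` (every six hours):

```yaml
deadManSwitch:
  enabled: true
  maxTimeSinceLastSuccess: 7h  # 6h schedule + 1h buffer for slow runs
```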
3. Verify the job has succeeded at least once

Dead-man’s switch requires at least one successful run. Check execution history:
curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions
4. Check if the CronJob is suspended

kubectl get cronjob my-job -n production -o jsonpath='{.spec.suspend}'
If true, and suspendedHandling.pauseMonitoring: true, the dead-man’s switch is paused.
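If you want the dead-man's switch to keep firing even while the CronJob is suspended, disable pausing. A sketch using the suspendedHandling setting referenced above (its exact nesting under the monitor spec is an assumption; check your CronJobMonitor schema):

```yaml
suspendedHandling:
  pauseMonitoring: false  # keep dead-man's switch active while suspended
```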

SLA Not Updating

Symptoms

Success rate and duration metrics are stale or show 0%.

Solutions

1. Check if SLA analyzer is running

curl http://localhost:8080/api/v1/health | jq '.analyzerEnabled'
Should return true.
2. Verify execution history is being stored

curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions
If empty, check storage configuration.
3. Check SLA recalculation interval

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep sla-recalculation-interval
Default is 5 minutes. Metrics update on this schedule.
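To recalculate more often, lower the interval in the ConfigMap. A sketch, assuming the value is a duration string like the default (verify the accepted format against your deployed config):

```yaml
data:
  sla-recalculation-interval: "1m"  # default: 5m
```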
4. Verify SLA is enabled for the monitor

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 sla

Storage Issues

Database Connection Errors

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "storage\|database"
Common errors:

SQLite: Permission Denied

Ensure the persistent volume has correct permissions:
# values.yaml
storage:
  type: sqlite
  sqlite:
    persistence:
      enabled: true
      storageClass: standard
      accessModes:
        - ReadWriteOnce
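Permission-denied errors usually mean the container's user cannot write to the mounted volume. A common fix is setting an fsGroup so the volume is group-writable by the controller; a sketch, assuming the chart exposes a pod security context (the values key and group ID are assumptions, so match them to your chart and image):

```yaml
# values.yaml
podSecurityContext:
  fsGroup: 65532  # must match the group the controller process runs as
```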

PostgreSQL: Connection Refused

Verify the PostgreSQL service is reachable:
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h postgres.default.svc.cluster.local -U guardian -d guardian
If connection fails, check:
  • PostgreSQL is running: kubectl get pod -l app=postgresql
  • Service exists: kubectl get svc postgres
  • Credentials are correct in the secret

Execution History Not Stored

1. Check storage health

curl http://localhost:8080/api/v1/admin/storage-stats
Ensure healthy: true.
2. Verify storage backend is configured

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 10 storage
3. Check for write errors in logs

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "failed to.*execution"

High Memory Usage

Symptoms

Guardian controller pod is OOMKilled or uses excessive memory.

Solutions

1. Reduce log storage

Disable log storage or reduce max log size:
# values.yaml
storage:
  logStorageEnabled: false  # Or reduce maxLogSizeKB
  maxLogSizeKB: 100  # Default: 200
2. Shorten retention period

historyRetention:
  defaultDays: 7  # Reduce from 30
3. Reduce monitored CronJobs

If monitoring hundreds of CronJobs, consider splitting into multiple monitors or using more specific selectors.
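For example, instead of one monitor matching every CronJob, scope each monitor to a team or app label (the `team: payments` label here is illustrative):

```yaml
spec:
  selector:
    matchLabels:
      team: payments  # only CronJobs labeled team=payments
```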
4. Increase memory limits

# values.yaml
resources:
  limits:
    memory: 512Mi  # Increase from default
  requests:
    memory: 256Mi

Controller Crashes or Restarts

Diagnosis

# Check pod status
kubectl get pod -n cronjob-guardian

# View recent logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=100

# Check previous logs if pod restarted
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --previous

Common Causes

Panic in Reconciliation Loop

Look for panic stack traces in logs. If you find a bug, report it with:
  • Full stack trace
  • CronJobMonitor YAML that triggered the panic
  • Guardian version
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -A 50 "panic"

Leader Election Issues

If running multiple replicas:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "leader"
Ensure only one pod is the leader at a time.

API Server Not Responding

Symptoms

Dashboard is unreachable or API requests timeout.

Solutions

1. Verify API is enabled

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 api
Ensure enabled: true.
2. Check API service

kubectl get svc -n cronjob-guardian cronjob-guardian-api
Verify endpoints exist:
kubectl get endpoints -n cronjob-guardian cronjob-guardian-api
3. Test API from within cluster

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://cronjob-guardian-api.cronjob-guardian.svc.cluster.local:8080/api/v1/health
4. Check controller logs for API errors

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "api server"

Prometheus Metrics Not Scraped

Verify ServiceMonitor

kubectl get servicemonitor -n cronjob-guardian
If using Prometheus Operator, ensure the ServiceMonitor is created and matches your Prometheus selector:
kubectl get prometheus -A -o yaml | grep serviceMonitorSelector
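If the selector requires a label your ServiceMonitor lacks, add it. A sketch, assuming your Prometheus selects ServiceMonitors labeled `release: prometheus` (substitute whatever your serviceMonitorSelector output shows):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cronjob-guardian
  namespace: cronjob-guardian
  labels:
    release: prometheus  # must match the Prometheus serviceMonitorSelector
```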

Test Metrics Endpoint

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://cronjob-guardian-controller-metrics.cronjob-guardian.svc.cluster.local:8443/metrics

Common Configuration Mistakes

Incorrect Namespace for AlertChannel

AlertChannels are cluster-scoped, so they don’t have a namespace:
# Wrong
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts
  namespace: production  # Remove this

# Correct
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts  # No namespace

Selector Doesn’t Match Any CronJobs

# List CronJobs with labels
kubectl get cronjob -n production --show-labels

# Check if your matchLabels align
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 matchLabels

Wrong Timezone in Maintenance Windows

Use IANA timezone names, not abbreviations:
# Wrong
timezone: PST

# Correct
timezone: America/Los_Angeles
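Before applying a config, you can sanity-check a timezone name against the system tz database, which mirrors the IANA list on most Linux hosts (path assumed to be the standard /usr/share/zoneinfo):

```shell
# Returns 0 if the name exists in the system tz database, 1 otherwise.
is_iana_tz() {
  [ -f "/usr/share/zoneinfo/$1" ]
}

is_iana_tz "America/Los_Angeles" && echo "valid IANA zone"
is_iana_tz "PST" || echo "PST is an abbreviation, not an IANA zone"
```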

Getting Help

If you’re still stuck:

Check Logs

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200

Describe Resources

kubectl describe cronjobmonitor my-monitor -n production
kubectl describe alertchannel slack-alerts

Check Events

kubectl get events -n cronjob-guardian --sort-by='.lastTimestamp'

Report Issues

Open an issue on GitHub with:
  • Guardian version
  • Kubernetes version
  • Relevant logs and configuration

Debugging Checklist

Before reporting an issue, gather this information:
# Guardian version
kubectl get deployment -n cronjob-guardian cronjob-guardian-controller -o jsonpath='{.spec.template.spec.containers[0].image}'

# Kubernetes version
kubectl version

# Monitor status
kubectl describe cronjobmonitor my-monitor -n production

# Controller logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200 > guardian-logs.txt

# Health status
curl http://localhost:8080/api/v1/health

# Storage stats
curl http://localhost:8080/api/v1/admin/storage-stats

Next Steps

Production Setup

Best practices for production deployments

API Reference

Complete REST API documentation
