## Monitor Not Discovering CronJobs

### Symptoms
The `CronJobMonitor` shows `totalCronJobs: 0` even though CronJobs exist in the namespace.
### Possible Causes and Solutions

#### Verify namespace permissions
For cluster-wide or multi-namespace monitoring, ensure the Guardian service account has the proper RBAC permissions; a `kubectl auth can-i` check should return `yes`.
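A typical check, assuming Guardian is installed in a `guardian-system` namespace with a service account named `guardian` (both names are assumptions; adjust to your install):

```shell
# Can the Guardian service account list CronJobs across all namespaces?
kubectl auth can-i list cronjobs.batch \
  --as=system:serviceaccount:guardian-system:guardian \
  --all-namespaces
# Should return: yes
```

If it returns `no`, grant the service account a ClusterRole with `list`/`watch` on `cronjobs.batch`.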
## Alerts Not Being Sent
### Symptoms
Jobs are failing, but no alerts are received in Slack, PagerDuty, etc.

### Diagnosis Steps
#### Check if alerts are active in the monitor status
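For example, dump the monitor and inspect its status block (monitor name and namespace are illustrative; the exact status schema depends on your Guardian version):

```shell
# Look for alerting-related conditions in the status block.
kubectl get cronjobmonitor my-monitor -n my-namespace -o yaml
```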
#### Verify AlertChannel is ready
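Assuming the CRD's plural name is `alertchannels`, list the channels and inspect any that are not ready (the channel name here is illustrative):

```shell
# AlertChannel is cluster-scoped, so no --namespace flag is needed.
kubectl get alertchannels
kubectl describe alertchannel slack-ops
```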
Confirm that the `READY` column shows `true`. If `false`, check the AlertChannel status for error conditions.

#### Test the alert channel
Send a test alert, or do it via the dashboard: Channels → select channel → Send Test Alert.
#### Check alert channel references in the monitor

#### Verify severity routing
If a channel is configured to receive only `critical` alerts but the alert's severity is `warning`, it won't be sent to that channel.

### Common Alert Channel Issues
#### Slack Webhook Invalid
If the logs show `invalid webhook URL` or `401 Unauthorized`:
- Verify the webhook URL is correct
- Regenerate the webhook in Slack if necessary
- Update the secret:
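A sketch of that update, assuming the webhook is stored in a secret named `slack-webhook` under the key `url` in the `guardian-system` namespace (all three names are assumptions):

```shell
kubectl create secret generic slack-webhook \
  --namespace guardian-system \
  --from-literal=url='https://hooks.slack.com/services/REPLACE/ME' \
  --dry-run=client -o yaml | kubectl apply -f -
```

The `--dry-run=client | kubectl apply` pattern updates the secret in place instead of failing because it already exists.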
#### PagerDuty Routing Key Invalid
Check the PagerDuty routing key.

## Dead-Man’s Switch Not Triggering
### Symptoms
A CronJob hasn’t run in days, but no dead-man’s switch alert is sent.

### Solutions
#### Verify the job has succeeded at least once
The dead-man’s switch requires at least one successful run; check the execution history.
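The CronJob’s Job history is the quickest way to confirm this (namespace is illustrative):

```shell
# A successful run shows COMPLETIONS 1/1 on a Job owned by the CronJob.
kubectl get jobs -n my-namespace --sort-by=.status.startTime
```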
## SLA Not Updating

### Symptoms
Success rate and duration metrics are stale or show 0%.

### Solutions
## Storage Issues

### Database Connection Errors

#### SQLite: Permission Denied
Ensure the persistent volume has the correct permissions.

#### PostgreSQL: Connection Refused
Verify that the PostgreSQL service is reachable:

- PostgreSQL is running: `kubectl get pod -l app=postgresql`
- The Service exists: `kubectl get svc postgres`
- The credentials in the secret are correct
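A quick in-cluster probe (the image, service DNS name, database, user, and password below are all assumptions):

```shell
# Run a throwaway psql client; a reachable, healthy server answers SELECT 1.
kubectl run pg-test --rm -it --restart=Never \
  --image=postgres:16 --env=PGPASSWORD=changeme -- \
  psql -h postgres.default.svc.cluster.local -U guardian -d guardian -c 'SELECT 1;'
```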
### Execution History Not Stored

## High Memory Usage

### Symptoms
The Guardian controller pod is OOMKilled or uses excessive memory.

### Solutions
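First confirm actual usage against the configured limit (requires metrics-server; the label and deployment name are assumptions):

```shell
kubectl top pod -n guardian-system -l app=guardian
kubectl get deploy guardian-controller -n guardian-system \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
```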
#### Reduce monitored CronJobs
If monitoring hundreds of CronJobs, consider splitting into multiple monitors or using more specific selectors.
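For example, one monitor scoped to a single team’s label might look like this (the API group/version and field names are hypothetical sketches, not Guardian’s documented schema):

```yaml
apiVersion: guardian.example.com/v1   # hypothetical group/version
kind: CronJobMonitor
metadata:
  name: batch-team-monitor
  namespace: batch
spec:
  selector:
    matchLabels:
      team: batch   # only this team's CronJobs are monitored
```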
## Controller Crashes or Restarts

### Diagnosis
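Typical first steps (deployment name and namespace are assumptions):

```shell
# Restart count and last state (OOMKilled? Error?):
kubectl get pods -n guardian-system -l app=guardian
# Logs from the previous, crashed container instance:
kubectl logs -n guardian-system deploy/guardian-controller --previous
```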
### Common Causes

#### Panic in Reconciliation Loop
Look for panic stack traces in the logs. If you find a bug, report it with:

- Full stack trace
- CronJobMonitor YAML that triggered the panic
- Guardian version
#### Leader Election Issues
If running multiple replicas, verify that leader election is healthy and that exactly one replica holds the leader lease.

## API Server Not Responding
### Symptoms
The dashboard is unreachable or API requests time out.

### Solutions
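Confirm the API pod is running and reachable through its Service (names and port are assumptions):

```shell
kubectl get pods -n guardian-system -l app=guardian
kubectl port-forward -n guardian-system svc/guardian-api 8080:80 &
curl -sv http://localhost:8080/
```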
## Prometheus Metrics Not Scraped

### Verify ServiceMonitor

### Test Metrics Endpoint
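Port-forward to the controller and fetch the endpoint directly (port 8080 is an assumption; adjust to your deployment):

```shell
kubectl port-forward -n guardian-system deploy/guardian-controller 8080:8080 &
curl -s http://localhost:8080/metrics | head
```

If this works but Prometheus still shows no data, the problem is on the scrape-config side, not the controller.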
## Common Configuration Mistakes

### Incorrect Namespace for AlertChannel
AlertChannels are cluster-scoped, so they don’t have a namespace.

### Selector Doesn’t Match Any CronJobs
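Run the monitor’s selector through kubectl to see whether it matches anything (labels are illustrative):

```shell
# If this prints no CronJobs, the monitor's selector matches nothing either.
kubectl get cronjobs -n my-namespace -l team=batch
# Compare against the labels actually present:
kubectl get cronjobs -n my-namespace --show-labels
```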
### Wrong Timezone in Maintenance Windows
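A quick way to check whether a name is a valid IANA timezone is to look it up in the system tzdata (the path below is the conventional Linux/macOS location):

```shell
# Prints "valid" only for real IANA zone names present in the tzdata.
check_tz() { [ -f "/usr/share/zoneinfo/$1" ] && echo valid || echo invalid; }

check_tz "America/New_York"   # valid
check_tz "PST"                # invalid: an abbreviation, not an IANA name
```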
Use IANA timezone names (for example `America/New_York`), not abbreviations (like `EST`).

## Getting Help
If you’re still stuck:

### Check Logs
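Start with the controller logs (deployment name and namespace are assumptions):

```shell
kubectl logs -n guardian-system deploy/guardian-controller --tail=200
```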
### Describe Resources
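`kubectl describe` surfaces status conditions and related events in one view (resource names are illustrative):

```shell
kubectl describe cronjobmonitor my-monitor -n my-namespace
kubectl describe alertchannel slack-ops
```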
### Check Events
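Sort by timestamp so the most recent warnings are easy to spot:

```shell
kubectl get events -n my-namespace --sort-by=.lastTimestamp | tail -20
```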
### Report Issues
Open an issue on GitHub with:
- Guardian version
- Kubernetes version
- Relevant logs and configuration
## Debugging Checklist
Before reporting an issue, gather the relevant versions, logs, configuration, and events.

## Next Steps
- Production Setup: best practices for production deployments
- API Reference: complete REST API documentation