Overview
CronJob Guardian exports Prometheus metrics for monitoring CronJob health, alert delivery, and operator performance. The metrics endpoint is served on port 8443 (HTTPS) by default.
Metrics Endpoint
Default address: https://<pod-ip>:8443/metrics
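For a quick manual check, the endpoint can be requested with any HTTPS client. The sketch below only builds such a request in Python and sends nothing; the pod IP and token are placeholder values, and the bearer-token header is an assumption that applies only when authentication is enabled (see Authentication and Authorization).

```python
import urllib.request


def build_metrics_request(pod_ip, token=None, port=8443):
    """Build a GET request for the metrics endpoint (nothing is sent here)."""
    url = f"https://{pod_ip}:{port}/metrics"
    request = urllib.request.Request(url, method="GET")
    if token:
        # Assumption: the endpoint accepts the token as a standard bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    return request


if __name__ == "__main__":
    # 10.0.0.12 and "example-token" are placeholders for illustration only.
    req = build_metrics_request("10.0.0.12", token="example-token")
    print(req.full_url)
    # To actually send it: urllib.request.urlopen(req, context=ssl_context),
    # where ssl_context trusts the CA that signed the metrics certificate.
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before touching the cluster.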
Configuration:
Exported Metrics
All metrics are defined in internal/metrics/metrics.go.
CronJob Success Rate
Metric: cronjob_guardian_success_rate
Type: Gauge
Description: Success rate of monitored CronJobs (0-100)
Labels:
- namespace - CronJob namespace
- cronjob - CronJob name
- monitor - CronJobMonitor name
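To make the label set concrete, here is a minimal sketch that parses one line of Prometheus text exposition output for this metric. The metric and label names come from this page; the label values and the 98.5 reading are made up, and the parser is a simplified illustration that does not unescape label values.

```python
import re

# Metric name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair: name="value" (escaped quotes inside the value are tolerated).
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


if __name__ == "__main__":
    # Hypothetical sample; only the metric and label names are from the docs.
    line = ('cronjob_guardian_success_rate{namespace="prod",'
            'cronjob="nightly-backup",monitor="backup-monitor"} 98.5')
    print(parse_sample(line))
```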
CronJob Duration
Metric: cronjob_guardian_duration_seconds
Type: Gauge
Description: Duration metrics for monitored CronJobs at different percentiles
Labels:
- namespace - CronJob namespace
- cronjob - CronJob name
- percentile - Duration percentile: avg, p50, p95, p99
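As an illustration of what the four percentile label values represent, the sketch below derives avg, p50, p95, and p99 from a list of run durations using the nearest-rank method. The method and the sample durations are assumptions made for this example; the operator's own calculation in internal/metrics/metrics.go may differ.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def duration_summary(durations):
    """Return one value per `percentile` label: avg, p50, p95, p99."""
    if not durations:
        raise ValueError("need at least one duration")
    return {
        "avg": sum(durations) / len(durations),
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
        "p99": percentile(durations, 99),
    }


if __name__ == "__main__":
    # Hypothetical durations in seconds; one slow run dominates the tail.
    runs = [10, 12, 11, 13, 50, 12, 11, 10, 14, 12]
    print(duration_summary(runs))
```

With only ten runs, a single slow execution sets both p95 and p99, which is worth keeping in mind when alerting on tail percentiles for infrequent CronJobs.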
Alerts Total
Metric: cronjob_guardian_alerts_total
Type: Counter
Description: Total number of alerts successfully sent
Labels:
- namespace - CronJob namespace
- cronjob - CronJob name
- type - Alert type: JobFailed, SLABreached, DeadManSwitch, DurationRegression, JobSuspended
- severity - Alert severity: critical, warning, info
- channel - Alert channel name (e.g., slack-alerts, pagerduty-oncall)
Alerts Failed Total
Metric: cronjob_guardian_alerts_failed_total
Type: Counter
Description: Total number of alerts that failed to send
Labels:
- namespace - CronJob namespace
- cronjob - CronJob name
- type - Alert type
- severity - Alert severity
- channel - Alert channel name
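Because cronjob_guardian_alerts_total counts only successful sends and cronjob_guardian_alerts_failed_total counts only failures, a delivery success ratio has to combine the two. The sketch below shows the arithmetic on two counter readings; the function name and the numbers are illustrative.

```python
def delivery_success_ratio(sent, failed):
    """Fraction of delivery attempts that succeeded, or None with no attempts.

    sent   -- value of cronjob_guardian_alerts_total (successful sends)
    failed -- value of cronjob_guardian_alerts_failed_total (failed sends)
    """
    attempts = sent + failed
    if attempts == 0:
        return None  # no attempts yet, so the ratio is undefined
    return sent / attempts


if __name__ == "__main__":
    # Hypothetical readings for a single channel.
    ratio = delivery_success_ratio(sent=95, failed=5)
    print(f"delivery success: {ratio:.0%}")
```

Returning None rather than 0 or 1 when there were no attempts avoids reporting a channel as failing (or healthy) before it has been used.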
Executions Total
Metric: cronjob_guardian_executions_total
Type: Counter
Description: Total number of job executions recorded
Labels:
- namespace - CronJob namespace
- cronjob - CronJob name
- status - Execution status: success, failure
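Like any Prometheus counter, these values start again from zero when the exporting process restarts, so raw values should not be subtracted naively. PromQL's rate() and increase() handle resets automatically; the sketch below shows a simplified version of the same idea for scripts that read raw scrapes. The sample values are made up.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples, tolerating resets.

    A drop between two samples is treated as a restart from zero, so the
    whole new value counts as increase since the reset.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter reset between the two scrapes
    return total


if __name__ == "__main__":
    # Hypothetical scrapes of cronjob_guardian_executions_total{status="failure"};
    # the drop from 104 to 3 represents an operator restart.
    print(counter_increase([100, 104, 3, 7]))
```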
Active Alerts
Metric: cronjob_guardian_active_alerts
Type: Gauge
Description: Number of currently active (unresolved) alerts
Labels:
- namespace - CronJob namespace
- cronjob - CronJob name
- severity - Alert severity
Controller-Runtime Metrics
The operator also exports standard controller-runtime metrics:
Controller Reconciliation Metrics
- controller_runtime_reconcile_total - Total reconciliations per controller
- controller_runtime_reconcile_errors_total - Failed reconciliations
- controller_runtime_reconcile_time_seconds - Reconciliation duration histogram
Labels:
- controller - Controller name: CronJobMonitor, AlertChannel, JobHandler
- result - Result: success, error, requeue
Workqueue Metrics
- workqueue_depth - Current depth of workqueue
- workqueue_adds_total - Total number of adds to workqueue
- workqueue_queue_duration_seconds - Time spent in queue
- workqueue_work_duration_seconds - Time spent processing items
Labels:
- name - Workqueue name (controller name)
Go Runtime Metrics
- go_goroutines - Number of goroutines
- go_memstats_alloc_bytes - Allocated memory
- go_memstats_heap_inuse_bytes - Heap memory in use
- go_gc_duration_seconds - GC pause duration
ServiceMonitor Configuration
For Prometheus Operator, use a ServiceMonitor resource.
File: config/prometheus/monitor.yaml
TLS Certificate Setup (Production)
For production, use cert-manager to manage metrics TLS certificates:
- Install cert-manager:
- Create Certificate:
- Configure operator:
- Update ServiceMonitor:
Grafana Dashboard
A pre-built Grafana dashboard is available for visualizing CronJob Guardian metrics.
Dashboard Features
Overview Panel:
- Total monitored CronJobs
- Overall success rate
- Active alerts count
- Alert delivery success rate
- Success rate by CronJob (time series)
- Execution count by status (stacked bar)
- Duration percentiles (P50, P95, P99)
- Recent failures table
- Alerts sent by type and severity
- Alert delivery failures by channel
- Active alerts by CronJob
- Alert rate over time
- Controller reconciliation rate
- Reconciliation errors
- Workqueue depth
- Memory and CPU usage
Import Dashboard
You can create a custom Grafana dashboard using the queries in this documentation.
Import steps:
- Open Grafana
- Navigate to Dashboards > Import
- Create a new dashboard or paste JSON content
- Select Prometheus data source
- Add panels using the example queries from this documentation
- Save the dashboard
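Before building a panel around a query, it can help to confirm that the query returns data. One option is the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only constructs the request URLs; the Prometheus base URL is a placeholder, and the PromQL expressions are illustrative queries assembled from the metric names on this page, not the dashboard's own queries.

```python
from urllib.parse import urlencode

# Illustrative PromQL built from the documented metric and label names.
CANDIDATE_QUERIES = {
    "success rate by CronJob":
        "avg by (namespace, cronjob) (cronjob_guardian_success_rate)",
    "executions by status (1h)":
        "sum by (status) (increase(cronjob_guardian_executions_total[1h]))",
    "active alerts by severity":
        "sum by (severity) (cronjob_guardian_active_alerts)",
}


def instant_query_url(prometheus_base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    base = prometheus_base_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # http://prometheus:9090 is a placeholder; substitute your own server.
    for title, promql in CANDIDATE_QUERIES.items():
        print(title, "->", instant_query_url("http://prometheus:9090", promql))
```

Opening one of the printed URLs in a browser, or fetching it with curl, returns JSON whose data.result array is empty when the query matches no series.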
Example Queries
CronJob success rate over time:
Alerting Rules
Recommended Prometheus alerting rules:
Authentication and Authorization
When metrics.secure: true is set, the metrics endpoint requires authentication.
Token Authentication
Prometheus must provide a service account token:
Disable Authentication (Development)
For development clusters, you can disable authentication:
Network Policies
Restrict metrics endpoint access using NetworkPolicies.
File: config/network-policy/allow-metrics-traffic.yaml
Troubleshooting
Metrics endpoint not accessible
Check if metrics are enabled:
Authentication failures
Check RBAC:
Missing metrics
Verify controllers are running:
- Success rates update every 5 minutes (SLA scheduler)
- Execution counts update on job completion
- Alert metrics update when alerts are sent
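Given the update timings above, a metric may be absent simply because nothing has set it yet: labeled Prometheus metrics typically produce no output until the first label combination is recorded. A minimal sketch for checking a saved /metrics response against the metric names on this page follows; the sample payload is made up.

```python
import re

# Metric names documented in the Exported Metrics section.
EXPECTED_METRICS = [
    "cronjob_guardian_success_rate",
    "cronjob_guardian_duration_seconds",
    "cronjob_guardian_alerts_total",
    "cronjob_guardian_alerts_failed_total",
    "cronjob_guardian_executions_total",
    "cronjob_guardian_active_alerts",
]


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return expected metric names that have no sample in the scrape output."""
    present = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" or whitespace character.
        present.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Hypothetical scrape taken before any alert has been sent.
    scrape = (
        "# TYPE cronjob_guardian_executions_total counter\n"
        'cronjob_guardian_executions_total{namespace="prod",'
        'cronjob="nightly-backup",status="success"} 12\n'
    )
    print(missing_metrics(scrape))
```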
High cardinality
Problem: Too many unique label combinations cause high memory usage.
Solution:
- Limit number of monitored CronJobs
- Use namespace selectors to reduce scope
- Aggregate metrics in queries instead of labels
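To see which metrics contribute the most series before changing scope, one option is to count samples per metric name in a saved /metrics response. The sketch below is a rough estimate under that assumption (each sample line is one series); the payload shown is hypothetical.

```python
import re
from collections import Counter


def series_per_metric(exposition_text):
    """Count sample lines (one per label combination) for each metric name."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical scrape with three CronJobs and one active alert series.
    scrape = "\n".join([
        'cronjob_guardian_success_rate{namespace="prod",cronjob="a",monitor="m"} 99',
        'cronjob_guardian_success_rate{namespace="prod",cronjob="b",monitor="m"} 97',
        'cronjob_guardian_success_rate{namespace="dev",cronjob="c",monitor="n"} 80',
        'cronjob_guardian_active_alerts{namespace="dev",cronjob="c",severity="warning"} 1',
    ])
    # Highest series counts first.
    for name, count in series_per_metric(scrape).most_common():
        print(f"{count:5d}  {name}")
```

Metrics labeled by type, severity, and channel multiply per CronJob, so the alert counters are usually the first place to look when the counts grow.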
Best Practices
- Enable ServiceMonitor - Use Prometheus Operator for automatic discovery
- Use TLS certificates - Secure metrics endpoint with cert-manager
- Set up alerting rules - Alert on low success rates and delivery failures
- Monitor operator health - Track reconciliation errors and resource usage
- Create dashboards - Visualize CronJob health and alert trends
- Network policies - Restrict metrics access to monitoring namespace
- Retention policies - Configure appropriate Prometheus retention
- High availability - Monitor from multiple Prometheus instances