Service Level Agreements (SLAs) define the reliability and performance standards your CronJobs are expected to meet. Guardian tracks success rates and execution durations, and automatically detects performance regressions.
SLA Tracking Overview
Guardian tracks these SLA metrics:
Success Rate : Percentage of successful runs over a rolling window
Duration Metrics : P50, P95, P99 execution times
Duration Regression : Automatic detection of performance degradation
Max Duration : Alert if jobs exceed a time threshold
Basic SLA Configuration
Enable SLA tracking with default settings:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: my-monitor
  namespace: production
spec:
  sla:
    enabled: true
    minSuccessRate: 95  # Alert if success rate drops below 95%
    windowDays: 7       # Calculate over last 7 days
Default values:
minSuccessRate: 95%
windowDays: 7 days
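The rolling-window check can be sketched in a few lines of Python. This is an illustration only, not Guardian's implementation: the function names and the `(finished_at, succeeded)` tuple layout are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def success_rate(runs, now, window_days=7):
    """Percentage of successful runs that finished inside the window.

    `runs` is a list of (finished_at, succeeded) tuples (hypothetical layout).
    Returns None when no run falls inside the window.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [ok for finished_at, ok in runs if cutoff <= finished_at <= now]
    if not recent:
        return None
    return 100.0 * sum(recent) / len(recent)


def sla_breached(rate, min_success_rate=95.0):
    """True when a rate exists and has dropped below the threshold."""
    return rate is not None and rate < min_success_rate


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
# One run per day for ten days; the runs one and two days ago failed.
runs = [(now - timedelta(days=d, hours=1), d not in (1, 2)) for d in range(10)]
rate = success_rate(runs, now, window_days=7)
print(f"{rate:.1f}%", sla_breached(rate))
```

Only the seven most recent runs fall inside the 7-day window, so two failures bring the rate to about 71.4%, well under the default 95% threshold.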
Success Rate Thresholds
Set different success rate requirements based on job criticality:
Critical Backups (100% Required)
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  sla:
    enabled: true
    minSuccessRate: 100  # Backups must never fail
    windowDays: 7
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]
Production Jobs (95% Required)
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: prod-reports
  namespace: production
spec:
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
Dev/Test Jobs (Lower Threshold)
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: dev-jobs
  namespace: development
spec:
  sla:
    enabled: true
    minSuccessRate: 80  # More lenient for dev
    windowDays: 7
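When choosing a threshold, it can help to translate the percentage into a number of failures the job can absorb within the window. The helper below is a hypothetical sketch (the name `max_failures` is not part of Guardian) and assumes an alert fires only when the rate drops strictly below `minSuccessRate`.

```python
def max_failures(total_runs, min_success_rate):
    """Largest number of failed runs that keeps the rate at or above the threshold.

    `min_success_rate` is an integer percentage, as in the YAML examples.
    """
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    required_successes = -(-(total_runs * min_success_rate) // 100)
    return total_runs - required_successes


# An hourly job produces 168 runs in a 7-day window; a daily job produces 7.
for label, total in (("hourly", 168), ("daily", 7)):
    for threshold in (100, 95, 80):
        print(f"{label:>6} job, minSuccessRate={threshold}: "
              f"{max_failures(total, threshold)} failures tolerated")
```

Note that for a daily job, a 95% threshold over 7 days tolerates no failures at all, because a single failure already drops the rate to about 85.7%.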
Duration Tracking
Guardian calculates duration percentiles (P50, P95, P99) for all monitored CronJobs.
Viewing Duration Metrics
kubectl get cronjobmonitor my-monitor -o yaml
Status includes duration metrics:
status:
  cronJobs:
    - name: daily-report
      namespace: production
      metrics:
        successRate: 98.5
        totalRuns: 200
        successfulRuns: 197
        failedRuns: 3
        avgDurationSeconds: 125.3
        p50DurationSeconds: 120.0
        p95DurationSeconds: 180.0
        p99DurationSeconds: 210.0
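For readers unfamiliar with percentiles, the sketch below shows one common way to compute them, the nearest-rank method. Which method Guardian actually uses is not stated here, so treat this as an assumption; interpolating methods give slightly different values on small samples.

```python
import math


def percentile(durations, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not durations:
        raise ValueError("no durations recorded")
    ordered = sorted(durations)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical run durations in seconds; one slow outlier.
durations = [100, 110, 120, 130, 300]
for pct in (50, 95, 99):
    print(f"P{pct}: {percentile(durations, pct)}s")
```

With only five samples, the single slow run dominates both P95 and P99, which is why percentiles over a short history should be read with care.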
Maximum Duration Alerts
Alert if any job execution exceeds a time threshold:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: fast-jobs
  namespace: production
spec:
  sla:
    enabled: true
    maxDuration: 30m  # Alert if any run takes longer than 30 minutes
Enable max duration tracking
Set maxDuration in your monitor spec.
Guardian monitors job durations
Every job execution is timed and compared to the threshold.
Receive alerts for slow runs
If a job exceeds maxDuration, you’ll get an alert with the actual duration and suggested fixes.
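The comparison itself is simple once the threshold is converted to seconds. The sketch below assumes `maxDuration` uses Go-style duration strings built from `h`, `m`, and `s` units (as in `30m` and `25h`); the parser and function names are hypothetical and cover only those units.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_PART = re.compile(r"(\d+)([hms])")


def parse_duration(text):
    """Convert strings such as '30m', '25h' or '1h30m' to seconds."""
    parts = _PART.findall(text)
    # Reject anything that is not made up entirely of <number><unit> parts.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def exceeds_max_duration(run_seconds, max_duration):
    """True when a run took strictly longer than the configured limit."""
    return run_seconds > parse_duration(max_duration)


print(exceeds_max_duration(1900, "30m"))  # a run of 31m40s against a 30m limit
```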
Duration Regression Detection
Automatically detect when jobs slow down over time using baseline comparison.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: performance-sensitive
  namespace: production
spec:
  sla:
    enabled: true
    durationRegressionThreshold: 50  # Alert if P95 increases by 50%
    durationBaselineWindowDays: 14   # Compare to last 14 days baseline
How Regression Detection Works
Baseline calculation
Guardian calculates the P95 duration over the last durationBaselineWindowDays (e.g., 14 days).
Current P95 comparison
The current P95 (last 7 days) is compared to the baseline.
Threshold check
If the current P95 is higher than baseline by more than durationRegressionThreshold percent, an alert is triggered.
Example Regression Alert
If your job’s P95 goes from 2 minutes to 3.5 minutes:
Duration Regression Detected: production/data-pipeline
Current P95: 3m30s
Baseline P95: 2m00s
Increase: 75% (exceeds threshold of 50%)
Suggested Action:
Investigate recent changes to the job or increased data volume.
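The arithmetic behind the alert above can be reproduced with a short sketch. The function names are illustrative, and the strict "more than" comparison follows the threshold check described earlier.

```python
def regression_percent(baseline_p95, current_p95):
    """Percentage increase of the current P95 over the baseline P95."""
    if baseline_p95 <= 0:
        raise ValueError("baseline P95 must be positive")
    return (current_p95 - baseline_p95) / baseline_p95 * 100


def is_regression(baseline_p95, current_p95, threshold_percent=50):
    """True when the increase is strictly greater than the threshold."""
    return regression_percent(baseline_p95, current_p95) > threshold_percent


baseline, current = 120, 210  # 2m00s and 3m30s, in seconds
print(regression_percent(baseline, current), is_regression(baseline, current))
```

An increase from 120 to 210 seconds is 75%, which exceeds a threshold of 50 but would not trigger with a threshold of 100.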
Configuring Regression Sensitivity
sla:
  # Pick one value; the alternatives are commented out.
  # Conservative (less sensitive): alert if P95 doubles
  # durationRegressionThreshold: 100
  # Moderate (default): alert if P95 increases 50%
  durationRegressionThreshold: 50
  # Aggressive (very sensitive): alert if P95 increases 25%
  # durationRegressionThreshold: 25
Set durationRegressionThreshold based on your job’s normal variance. If your job duration varies naturally (e.g., processing variable data), use a higher threshold to avoid false positives.
SLA Window Configuration
The windowDays parameter defines the rolling window for calculating success rate and duration metrics.
Short Window (3 Days)
Quick detection of issues:
sla:
  enabled: true
  minSuccessRate: 95
  windowDays: 3  # More sensitive to recent failures
Standard Window (7 Days)
Balanced view:
sla:
  enabled: true
  minSuccessRate: 95
  windowDays: 7  # Default, recommended for most jobs
Long Window (30 Days)
Long-term trends:
sla:
  enabled: true
  minSuccessRate: 90
  windowDays: 30  # For jobs that run infrequently
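To see why shorter windows react faster, consider how a single failure moves the success rate of a job that runs once a day. The helper below is a hypothetical illustration, not part of Guardian.

```python
def rate_after_failures(runs_per_day, window_days, failures):
    """Success rate (%) over a window, given a number of failed runs in it."""
    total = runs_per_day * window_days
    return 100.0 * (total - failures) / total


for window_days in (3, 7, 30):
    rate = rate_after_failures(1, window_days, failures=1)
    print(f"windowDays={window_days:>2}: {rate:.1f}%")
```

One failure costs a daily job roughly 33 points with a 3-day window, 14 points with a 7-day window, and 3 points with a 30-day window, so the same `minSuccessRate` behaves very differently depending on `windowDays` and run frequency.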
Complete SLA Example
Here’s a comprehensive configuration with all SLA features:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: data-pipeline
  namespace: production
spec:
  selector:
    matchLabels:
      pipeline: etl
  # Dead-man's switch (separate from SLA)
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  # SLA configuration
  sla:
    enabled: true
    # Success rate requirement
    minSuccessRate: 95  # Alert if success rate drops below 95%
    windowDays: 7       # Calculate over last 7 days
    # Duration limits
    maxDuration: 30m    # Alert if any run exceeds 30 minutes
    # Regression detection
    durationRegressionThreshold: 50  # Alert if P95 increases by 50%
    durationBaselineWindowDays: 14   # Compare to last 14 days
  # Alerting
  alerting:
    channelRefs:
      - name: slack-data-team
        severities: [critical, warning]
    severityOverrides:
      slaBreached: warning
      durationRegression: warning
Viewing SLA Metrics
Via kubectl
# View monitor status with metrics
kubectl describe cronjobmonitor data-pipeline -n production
# Extract metrics as JSON
kubectl get cronjobmonitor data-pipeline -n production -o json | jq '.status.cronJobs[].metrics'
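If you prefer to post-process the JSON in a script rather than with jq, a minimal Python sketch might look like the following. It assumes the `status.cronJobs[].metrics.successRate` layout shown in the status excerpt earlier; the function name and the `nightly-export` job are made up for the example.

```python
import json


def jobs_below_threshold(monitor_json, min_success_rate=95.0):
    """Return (namespace, name, successRate) for each job under the threshold."""
    monitor = json.loads(monitor_json)
    breaches = []
    for job in monitor.get("status", {}).get("cronJobs", []):
        rate = job.get("metrics", {}).get("successRate")
        if rate is not None and rate < min_success_rate:
            breaches.append((job.get("namespace"), job.get("name"), rate))
    return breaches


# Hypothetical sample shaped like the output of `kubectl get ... -o json`.
sample = json.dumps({
    "status": {
        "cronJobs": [
            {"name": "daily-report", "namespace": "production",
             "metrics": {"successRate": 98.5}},
            {"name": "nightly-export", "namespace": "production",
             "metrics": {"successRate": 89.2}},
        ]
    }
})
print(jobs_below_threshold(sample))
```

In practice the input string would come from the kubectl command above instead of the inline sample.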
Via Dashboard
The web dashboard provides:
Success Rate Charts : Visual trends over time
Duration Heatmaps : P50/P95/P99 over the SLA window
Regression Alerts : Highlighted when regressions are detected
Historical Comparison : Compare current vs baseline metrics
Access at http://localhost:8080 (or your configured API port).
Via API
# Get CronJob details with SLA metrics
curl http://localhost:8080/api/v1/cronjobs/production/daily-report
Response includes:
{
  "name": "daily-report",
  "namespace": "production",
  "metrics": {
    "successRate7d": 98.5,
    "totalRuns7d": 200,
    "successfulRuns7d": 197,
    "failedRuns7d": 3,
    "avgDurationSeconds": 125.3,
    "p50DurationSeconds": 120.0,
    "p95DurationSeconds": 180.0,
    "p99DurationSeconds": 210.0,
    "successRate30d": 97.2
  }
}
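A client could use this response to compare short-term and long-term reliability. The sketch below relies only on the endpoint path and the `successRate7d` and `successRate30d` fields shown above; the function names, the default base URL, and the timeout are assumptions for illustration.

```python
import json
from urllib.request import urlopen


def fetch_cronjob(namespace, name, base_url="http://localhost:8080"):
    """Fetch CronJob details from the API (requires a reachable Guardian API)."""
    url = f"{base_url}/api/v1/cronjobs/{namespace}/{name}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def success_rate_trend(payload):
    """Difference between the 7-day and 30-day success rates, in points.

    A negative value means the job has recently become less reliable.
    """
    metrics = payload["metrics"]
    return round(metrics["successRate7d"] - metrics["successRate30d"], 1)


# Using the example response above instead of a live call:
example = {"metrics": {"successRate7d": 98.5, "successRate30d": 97.2}}
print(success_rate_trend(example))
```

For a live system, pass the result of `fetch_cronjob("production", "daily-report")` to `success_rate_trend` instead of the inline example.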
SLA Alerting
When SLA thresholds are breached, Guardian sends alerts with:
Current Metrics : Success rate, P95 duration, etc.
Threshold Details : What threshold was breached
Historical Context : Comparison to baseline
Suggested Actions : Based on the type of breach
Example SLA Breach Alert
SLA Breached: production/daily-report
Success Rate: 89.2% (threshold: 95%)
Window: Last 7 days
Failed Runs: 13 / 120 total
Recent Failures:
- 2024-01-15 03:00: Exit code 1 (Database connection timeout)
- 2024-01-14 03:00: Exit code 1 (Database connection timeout)
- 2024-01-13 03:00: Exit code 137 (OOMKilled)
Suggested Action:
Investigate database connectivity and review memory usage trends.
Disabling SLA Tracking
To disable SLA tracking for specific monitors:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: no-sla
  namespace: development
spec:
  sla:
    enabled: false  # Disable SLA tracking
Note that omitting the sla section entirely does not disable tracking: it defaults to enabled with the standard thresholds.
Data Retention
SLA metrics are calculated from execution history stored in the database. Configure retention per monitor:
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: my-monitor
spec:
  dataRetention:
    retentionDays: 60  # Keep 60 days of history for SLA calculations
Global retention (in config.yaml):
history-retention:
  default-days: 30
  max-days: 90
If your SLA window is 30 days, ensure retentionDays is at least 30 days (or longer for baseline calculations).
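A quick sanity check for retention settings can be sketched as follows. Whether the baseline window overlaps the current SLA window or precedes it is not specified here, so the helper covers both readings; the function names and the `consecutive` flag are hypothetical.

```python
def min_retention_days(window_days, baseline_window_days=0, consecutive=False):
    """Days of history needed to compute SLA and baseline metrics.

    consecutive=False assumes both windows end at the present (they overlap).
    consecutive=True assumes the baseline window precedes the SLA window,
    which is the more conservative estimate.
    """
    if consecutive:
        return window_days + baseline_window_days
    return max(window_days, baseline_window_days)


def retention_ok(retention_days, window_days, baseline_window_days=0,
                 consecutive=False):
    """True when the configured retention covers the required history."""
    needed = min_retention_days(window_days, baseline_window_days, consecutive)
    return retention_days >= needed


print(retention_ok(60, window_days=30, baseline_window_days=14))
print(retention_ok(30, window_days=30, baseline_window_days=14, consecutive=True))
```

When in doubt, sizing `retentionDays` for the conservative case leaves headroom for baseline calculations.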
Best Practices
Match SLA to Criticality : Use 100% for backups, 95% for production, 80-90% for dev/test.
Set Realistic Duration Limits : Base maxDuration on historical P99, not P50, to avoid false positives.
Use Regression Detection : Enable durationRegressionThreshold to catch gradual performance degradation.
Align Windows : Keep windowDays and durationBaselineWindowDays proportional (e.g., 7 and 14).
Next Steps
Maintenance Windows : Suppress SLA alerts during planned maintenance
Dashboard : Visualize SLA metrics and trends