SLA Configuration

Service Level Agreements (SLAs) define the reliability and performance standards your CronJobs must meet. Guardian tracks success rates and execution durations, and automatically detects performance regressions.

SLA Tracking Overview

Guardian tracks these SLA metrics:

Success Rate: Percentage of successful runs over a rolling window
Duration Metrics: P50, P95, P99 execution times
Duration Regression: Automatic detection of performance degradation
Max Duration: Alert if jobs exceed a time threshold

Basic SLA Configuration

Enable SLA tracking with default settings:

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: my-monitor
  namespace: production
spec:
  sla:
    enabled: true
    minSuccessRate: 95  # Alert if success rate drops below 95%
    windowDays: 7       # Calculate over last 7 days

Default values:

minSuccessRate: 95%
windowDays: 7 days
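
As a rough illustration of what these two settings mean, the sketch below computes a success rate over a rolling window and compares it to the minimum. The Run record and the success_rate and sla_breached helpers are hypothetical names for this example; Guardian's internal calculation may differ, for instance in how it treats runs that are still in progress.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Run:
    """One finished CronJob execution (hypothetical record shape)."""
    finished_at: datetime
    succeeded: bool


def success_rate(runs: List[Run], now: datetime, window_days: int = 7) -> Optional[float]:
    """Percentage of successful runs that finished within the last window_days."""
    cutoff = now - timedelta(days=window_days)
    recent = [r for r in runs if r.finished_at >= cutoff]
    if not recent:
        return None  # no runs in the window, so there is no rate to report
    return 100.0 * sum(r.succeeded for r in recent) / len(recent)


def sla_breached(rate: Optional[float], min_success_rate: float = 95.0) -> bool:
    """True when a rate exists and is below the configured minimum."""
    return rate is not None and rate < min_success_rate
```

Returning None for an empty window keeps "no data" distinct from "0% success", so a job that has not run yet is not reported as a breach in this sketch.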

Success Rate Thresholds

Set different success rate requirements based on job criticality:

Critical Backups (100% Required)

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  sla:
    enabled: true
    minSuccessRate: 100  # Backups must never fail
    windowDays: 7
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]

Production Jobs (95% Required)

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: prod-reports
  namespace: production
spec:
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7

Dev/Test Jobs (Lower Threshold)

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: dev-jobs
  namespace: development
spec:
  sla:
    enabled: true
    minSuccessRate: 80  # More lenient for dev
    windowDays: 7

Duration Tracking

Guardian calculates duration percentiles (P50, P95, P99) for all monitored CronJobs.

Viewing Duration Metrics

kubectl get cronjobmonitor my-monitor -n production -o yaml

Status includes duration metrics:

status:
  cronJobs:
    - name: daily-report
      namespace: production
      metrics:
        successRate: 98.5
        totalRuns: 200
        successfulRuns: 197
        failedRuns: 3
        avgDurationSeconds: 125.3
        p50DurationSeconds: 120.0
        p95DurationSeconds: 180.0
        p99DurationSeconds: 210.0
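
To make the percentile fields concrete, here is a minimal sketch that derives P50, P95, and P99 from a list of run durations using the nearest-rank method. The method and the percentile helper are assumptions for illustration; this page does not say which interpolation Guardian uses, so its numbers may differ slightly.

```python
import math
from typing import List


def percentile(durations: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not durations:
        raise ValueError("need at least one duration")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(durations)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Durations in seconds for 20 hypothetical runs: 101, 102, ..., 120.
samples = [float(100 + i) for i in range(1, 21)]
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)
```

With only a handful of runs in the window, P95 and P99 collapse toward the slowest run, so percentile values are most meaningful for jobs that run often.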

Maximum Duration Alerts

Alert if any job execution exceeds a time threshold:

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: fast-jobs
  namespace: production
spec:
  sla:
    enabled: true
    maxDuration: 30m  # Alert if any run takes longer than 30 minutes

Enable max duration tracking: Set maxDuration in your monitor spec.
Guardian monitors job durations: Every job execution is timed and compared to the threshold.
Receive alerts for slow runs: If a job exceeds maxDuration, you’ll get an alert with the actual duration and suggested fixes.
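
The comparison described above can be sketched as follows. The parser assumes duration strings made of whole-number hour, minute, and second parts (such as 30m or 25h, as used in the examples on this page); parse_duration and exceeds_max_duration are hypothetical helpers, not part of Guardian.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> int:
    """Convert strings such as '30m', '25h', or '1h30m' into seconds."""
    tokens = _TOKEN.findall(text)
    # Reject input with leftover characters, e.g. '30x' or an empty string.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens)


def exceeds_max_duration(run_seconds: float, max_duration: str) -> bool:
    """True when a single run took longer than the configured limit."""
    return run_seconds > parse_duration(max_duration)
```

In this sketch a run that takes exactly the limit does not count as exceeding it; whether Guardian treats the boundary the same way is not stated here.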

Duration Regression Detection

Automatically detect when jobs slow down over time using baseline comparison.

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: performance-sensitive
  namespace: production
spec:
  sla:
    enabled: true
    durationRegressionThreshold: 50  # Alert if P95 increases by 50%
    durationBaselineWindowDays: 14   # Compare to last 14 days baseline

How Regression Detection Works

Baseline calculation: Guardian calculates the P95 duration over the last durationBaselineWindowDays (e.g., 14 days).
Current P95 comparison: The current P95 (over the SLA window, e.g., the last 7 days) is compared to the baseline.
Threshold check: If the current P95 is higher than the baseline by more than durationRegressionThreshold percent, an alert is triggered.
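
The threshold check reduces to a percentage-increase calculation. The sketch below shows that arithmetic; percent_increase and is_regression are hypothetical helper names, and the strict "more than" comparison follows the wording above rather than Guardian's source.

```python
def percent_increase(baseline_p95: float, current_p95: float) -> float:
    """Relative change of the current P95 against the baseline, in percent."""
    if baseline_p95 <= 0:
        raise ValueError("baseline P95 must be positive")
    return (current_p95 - baseline_p95) / baseline_p95 * 100


def is_regression(baseline_p95: float, current_p95: float,
                  threshold_percent: float = 50.0) -> bool:
    """True when the increase is strictly greater than the threshold."""
    return percent_increase(baseline_p95, current_p95) > threshold_percent


# A baseline P95 of 120 seconds and a current P95 of 210 seconds.
increase = percent_increase(120.0, 210.0)
```

For a baseline of 120 seconds and a current P95 of 210 seconds the increase is 75%, which trips a threshold of 50 but not a threshold of 100.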

Example Regression Alert

If your job’s P95 goes from 2 minutes to 3.5 minutes:

Duration Regression Detected: production/data-pipeline

Current P95: 3m30s
Baseline P95: 2m00s
Increase: 75% (exceeds threshold of 50%)

Suggested Action:
Investigate recent changes to the job or increased data volume.

Configuring Regression Sensitivity

sla:
  # Set the key once; uncomment the line that fits your job.

  # Conservative (less sensitive)
  # durationRegressionThreshold: 100  # Alert if P95 doubles

  # Moderate (default)
  durationRegressionThreshold: 50   # Alert if P95 increases 50%

  # Aggressive (very sensitive)
  # durationRegressionThreshold: 25   # Alert if P95 increases 25%

Set durationRegressionThreshold based on your job’s normal variance. If your job duration varies naturally (e.g., processing variable data), use a higher threshold to avoid false positives.

SLA Window Configuration

The windowDays parameter defines the rolling window for calculating success rate and duration metrics.

Short Window (3 Days)

Quick detection of issues:

sla:
  enabled: true
  minSuccessRate: 95
  windowDays: 3  # More sensitive to recent failures

Standard Window (7 Days)

Balanced view:

sla:
  enabled: true
  minSuccessRate: 95
  windowDays: 7  # Default, recommended for most jobs

Long Window (30 Days)

Long-term trends:

sla:
  enabled: true
  minSuccessRate: 90
  windowDays: 30  # For jobs that run infrequently
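
One way to choose a window is to work out how many failures it tolerates before minSuccessRate is breached. The sketch below does that arithmetic; allowed_failures is a hypothetical helper, and it assumes the success rate is simply successful runs divided by total runs in the window.

```python
import math
from fractions import Fraction


def allowed_failures(total_runs: int, min_success_rate: float) -> int:
    """Largest number of failed runs that keeps the success rate at or above the minimum."""
    if total_runs < 0 or not 0 <= min_success_rate <= 100:
        raise ValueError("total_runs must be >= 0 and min_success_rate within 0..100")
    # successes / total >= min / 100  is equivalent to
    # failures <= total * (100 - min) / 100; Fraction avoids float rounding.
    budget = Fraction(total_runs) * (100 - Fraction(str(min_success_rate))) / 100
    return math.floor(budget)


daily_7d = allowed_failures(7, 95)     # daily job, 7-day window
hourly_7d = allowed_failures(168, 95)  # hourly job, 7-day window
daily_30d = allowed_failures(30, 90)   # daily job, 30-day window
```

A daily job with a 7-day window at 95% tolerates no failures at all, since a single failure gives 6 of 7 runs (about 85.7%). That is one reason a longer window suits jobs that run infrequently.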

Complete SLA Example

Here’s a comprehensive configuration with all SLA features:

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: data-pipeline
  namespace: production
spec:
  selector:
    matchLabels:
      pipeline: etl
  
  # Dead-man's switch (separate from SLA)
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  
  # SLA configuration
  sla:
    enabled: true
    
    # Success rate requirement
    minSuccessRate: 95          # Alert if success rate drops below 95%
    windowDays: 7               # Calculate over last 7 days
    
    # Duration limits
    maxDuration: 30m            # Alert if any run exceeds 30 minutes
    
    # Regression detection
    durationRegressionThreshold: 50    # Alert if P95 increases by 50%
    durationBaselineWindowDays: 14     # Compare to last 14 days
  
  # Alerting
  alerting:
    channelRefs:
      - name: slack-data-team
        severities: [critical, warning]
    severityOverrides:
      slaBreached: warning
      durationRegression: warning

Viewing SLA Metrics

Via kubectl

# View monitor status with metrics
kubectl describe cronjobmonitor data-pipeline -n production

# Extract metrics as JSON
kubectl get cronjobmonitor data-pipeline -n production -o json | jq '.status.cronJobs[].metrics'

Via Dashboard

The web dashboard provides:

Success Rate Charts: Visual trends over time
Duration Heatmaps: P50/P95/P99 over the SLA window
Regression Alerts: Highlighted when regressions are detected
Historical Comparison: Compare current vs baseline metrics

Access at http://localhost:8080 (or your configured API port).

Via API

# Get CronJob details with SLA metrics
curl http://localhost:8080/api/v1/cronjobs/production/daily-report

Response includes:

{
  "name": "daily-report",
  "namespace": "production",
  "metrics": {
    "successRate7d": 98.5,
    "totalRuns7d": 200,
    "successfulRuns7d": 197,
    "failedRuns7d": 3,
    "avgDurationSeconds": 125.3,
    "p50DurationSeconds": 120.0,
    "p95DurationSeconds": 180.0,
    "p99DurationSeconds": 210.0,
    "successRate30d": 97.2
  }
}
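
A script consuming this response might check the 7-day success rate against a minimum, as in the sketch below. The field names come from the sample response above; summarize_sla is a hypothetical helper, and the HTTP request itself is left out so the example stays self-contained.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
  "name": "daily-report",
  "namespace": "production",
  "metrics": {"successRate7d": 98.5, "totalRuns7d": 200, "failedRuns7d": 3}
}
"""


def summarize_sla(response_body: str, min_success_rate: float = 95.0) -> dict:
    """Report whether the 7-day success rate in a response meets the minimum."""
    data = json.loads(response_body)
    rate = data["metrics"]["successRate7d"]
    return {
        "cronjob": f"{data['namespace']}/{data['name']}",
        "successRate7d": rate,
        "meetsSla": rate >= min_success_rate,
    }


summary = summarize_sla(SAMPLE_RESPONSE)
```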

SLA Alerting

When SLA thresholds are breached, Guardian sends alerts with:

Current Metrics: Success rate, P95 duration, etc.
Threshold Details: What threshold was breached
Historical Context: Comparison to baseline
Suggested Actions: Based on the type of breach

Example SLA Breach Alert

SLA Breached: production/daily-report

Success Rate: 89.2% (threshold: 95%)
Window: Last 7 days
Failed Runs: 13 / 120 total

Recent Failures:
- 2024-01-15 03:00: Exit code 1 (Database connection timeout)
- 2024-01-14 03:00: Exit code 1 (Database connection timeout)
- 2024-01-13 03:00: Exit code 137 (OOMKilled)

Suggested Action:
Investigate database connectivity and review memory usage trends.

Disabling SLA Tracking

To disable SLA tracking for specific monitors:

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: no-sla
  namespace: development
spec:
  sla:
    enabled: false  # Disable SLA tracking

Omitting the sla section entirely does not disable tracking: it defaults to enabled with the standard thresholds.

Data Retention

SLA metrics are calculated from execution history stored in the database. Configure retention:

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: my-monitor
spec:
  dataRetention:
    retentionDays: 60  # Keep 60 days of history for SLA calculations

Global retention (in config.yaml):

history-retention:
  default-days: 30
  max-days: 90

If your SLA window is 30 days, ensure retentionDays is at least 30 days (or longer for baseline calculations).
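
A quick sanity check for this rule is sketched below. It assumes the history needed is the longer of windowDays and durationBaselineWindowDays; if Guardian computes the baseline over a period that precedes the current window, more history would be required. retention_shortfall is a hypothetical helper.

```python
def retention_shortfall(retention_days: int, window_days: int = 7,
                        baseline_window_days: int = 0) -> int:
    """Days of history missing for the longest configured window (0 means enough)."""
    needed = max(window_days, baseline_window_days)
    return max(0, needed - retention_days)
```

For example, retention_shortfall(14, 30) reports 16 missing days, while retention_shortfall(60, 30, 14) reports none.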

Best Practices

Match SLA to Criticality: Use 100% for backups, 95% for production, 80-90% for dev/test.
Set Realistic Duration Limits: Base maxDuration on historical P99, not P50, to avoid false positives.
Use Regression Detection: Enable durationRegressionThreshold to catch gradual performance degradation.
Align Windows: Keep windowDays and durationBaselineWindowDays proportional (e.g., 7 and 14).

Next Steps

Maintenance Windows

Dashboard
