
Overview

Hot reload enables configuration changes to take effect without restarting pods. The client libraries automatically detect when values files are updated and reload them with validation. Changes propagate end to end in about one to two minutes, with full observability into timing and delays.

Reload lifecycle

The complete reload cycle from code commit to active values:
1. Update values.yaml in sentry-options-automator
   └─> Commit and merge to main

2. CI validates and generates ConfigMaps
   └─> sentry-options-cli write (generates values.json)

3. CD applies ConfigMaps to clusters
   └─> kubectl apply -f sentry-options-{namespace}.yaml

4. Kubelet syncs ConfigMap to pod volumes
   └─> ~1-2 minutes (kubelet sync period)
   └─> Updates /etc/sentry-options/values/{namespace}/values.json

5. Client library detects file change
   └─> ~5 seconds (polling interval)
   └─> Triggers reload and validation

6. New values active in application
   └─> Total latency: ~1-2 minutes
No pod restart is required. The application continues running with updated configuration.

File watching implementation

Polling strategy

The client libraries use polling instead of filesystem events (inotify/FSEvents):
  • Polling interval: 5 seconds (POLLING_DELAY constant in lib.rs:30)
  • Mechanism: Check modification time of all values.json files
  • Reload trigger: Any file mtime change triggers full reload
Why polling?
  • Works reliably with Kubernetes ConfigMaps (which don’t trigger inotify events)
  • Compatible with NFS and other virtual filesystems
  • Simple implementation with predictable behavior
  • 5 seconds is acceptable overhead given ~1-2 min ConfigMap propagation
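The polling strategy above can be sketched in Python (latest_mtime and poll_once are hypothetical helper names; the real implementation is the Rust watcher described below):

```python
from pathlib import Path
from typing import Callable, Optional

POLLING_DELAY = 5  # seconds, mirroring the Rust constant

def latest_mtime(values_dir: Path) -> Optional[float]:
    """Newest modification time across all {namespace}/values.json files."""
    mtimes = [
        (child / "values.json").stat().st_mtime
        for child in values_dir.iterdir()
        if (child / "values.json").is_file()
    ]
    return max(mtimes) if mtimes else None

def poll_once(
    values_dir: Path,
    last_seen: Optional[float],
    reload_cb: Callable[[], None],
) -> Optional[float]:
    """One polling step: trigger a full reload if any file's mtime advanced."""
    current = latest_mtime(values_dir)
    if current is not None and (last_seen is None or current > last_seen):
        reload_cb()
        return current
    return last_seen
```

In production this step runs in a loop, sleeping POLLING_DELAY seconds between iterations.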

Watcher thread

The ValuesWatcher struct (lib.rs:483-672) runs a background thread:
pub struct ValuesWatcher {
    stop_signal: Arc<AtomicBool>,
    thread: Option<JoinHandle<()>>,
}
Lifecycle:
  1. Created during Options::new() initialization
  2. Spawns background thread named "sentry-options-watcher"
  3. Thread polls every 5 seconds until stopped
  4. Automatically stopped when ValuesWatcher is dropped
Thread safety:
  • Values stored in Arc<RwLock<ValuesByNamespace>>
  • Multiple readers can access values concurrently
  • Writer (reload thread) gets exclusive access during updates
  • If thread panics, it’s caught and logged (lib.rs:509-515)
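The same lifecycle can be sketched in Python with a stoppable daemon thread (a hypothetical analog of the Rust ValuesWatcher; reload_fn and the 5-second default mirror the behavior described above):

```python
import threading

class ValuesWatcher:
    """Stoppable background poller sketching the Rust watcher lifecycle."""

    def __init__(self, reload_fn, interval: float = 5.0):
        self._stop = threading.Event()
        self._reload_fn = reload_fn
        self._thread = threading.Thread(
            target=self._run,
            args=(interval,),
            name="sentry-options-watcher",
            daemon=True,
        )
        self._thread.start()

    def _run(self, interval: float) -> None:
        # Event.wait doubles as an interruptible sleep: it returns True
        # (and ends the loop) as soon as stop() sets the event.
        while not self._stop.wait(interval):
            self._reload_fn()

    def stop(self) -> None:
        """Equivalent of Drop in the Rust version: signal, then join."""
        self._stop.set()
        self._thread.join()
```

Using an Event rather than a bare sleep lets stop() interrupt the watcher immediately instead of waiting out a full polling interval.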

Modification time tracking

The watcher tracks the most recent modification time across all namespace values.json files:
// From lib.rs:548-583 (simplified)
fn get_mtime(values_dir: &Path) -> Option<SystemTime> {
    let mut latest_mtime: Option<SystemTime> = None;
    
    for entry in fs::read_dir(values_dir).ok()?.flatten() {
        let values_file = entry.path().join("values.json");
        if let Ok(metadata) = fs::metadata(&values_file) {
            if let Ok(mtime) = metadata.modified() {
                if latest_mtime.map_or(true, |latest| mtime > latest) {
                    latest_mtime = Some(mtime);
                }
            }
        }
    }
    
    latest_mtime
}
If any values.json file has a newer mtime than previously seen, all namespaces are reloaded.

Reload behavior

Atomic updates

When a change is detected:
  1. Load all values from disk into new HashMap
  2. Validate all namespaces against schemas
  3. If all valid: Replace entire values map atomically
  4. If any invalid: Keep old values entirely, log error
// From lib.rs:586-610 (simplified)
fn reload_values(
    values_path: &Path,
    registry: &SchemaRegistry,
    values: &Arc<RwLock<ValuesByNamespace>>,
) {
    match registry.load_values_json(values_path) {
        Ok((new_values, generated_at_by_namespace)) => {
            // Atomic update: acquire write lock and replace entire map
            let mut guard = values.write().unwrap();
            *guard = new_values;
            
            emit_reload_spans(...);
        }
        Err(e) => {
            // Keep old values on error
            eprintln!("Failed to reload values: {}", e);
        }
    }
}
If any namespace fails validation, all old values are kept. This prevents partial updates with inconsistent state.

Validation during reload

All values are re-validated against schemas on every reload:
  1. Read all {namespace}/values.json files
  2. Parse JSON and extract options object
  3. Validate each namespace’s options against its schema
  4. If validation passes for all namespaces, update values
  5. If any validation fails, keep old values and log error
This ensures the application never sees invalid configuration, even if a bad ConfigMap is deployed.
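A sketch of this all-or-nothing behavior in Python, with a toy per-key type check standing in for real JSON Schema validation (reload_all and the schemas mapping are hypothetical):

```python
import json
from pathlib import Path

def reload_all(values_dir: Path, schemas: dict, current: dict) -> dict:
    """All-or-nothing reload: return the new values map only if every
    namespace validates; otherwise return `current` unchanged.

    `schemas` maps namespace -> {option name -> expected Python type};
    this toy check stands in for real schema validation.
    """
    new_values = {}
    for ns_dir in values_dir.iterdir():
        values_file = ns_dir / "values.json"
        if not values_file.is_file():
            continue
        ns = ns_dir.name
        options = json.loads(values_file.read_text())["options"]
        for key, value in options.items():
            expected = schemas.get(ns, {}).get(key)
            if expected is not None and not isinstance(value, expected):
                # One invalid namespace rejects the whole reload.
                print(f"Failed to reload values: {ns}.{key} "
                      f"is not {expected.__name__}")
                return current
        new_values[ns] = options
    return new_values
```

Note that validation happens before any value is swapped in, so a bad namespace never produces a half-updated map.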

Reader/writer synchronization

The values map uses RwLock for concurrent access:
  • Readers (application code calling opts.get()): Multiple concurrent reads allowed
  • Writer (reload thread): Exclusive write access blocks all readers
  • Poisoning handled: If lock is poisoned (writer panicked), readers still work (lib.rs:79)
Potential starvation: With a steady stream of readers, the writer may wait to acquire the lock. In practice, this is rare because reads are fast.
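In Python, an equivalent guarantee can be had without a reader lock by swapping a whole-map reference atomically (a hypothetical sketch, not the client library's actual internals; rebinding a reference is atomic in CPython, so readers see either the old map or the new one, never a mix):

```python
import threading

class OptionsStore:
    """Readers take a snapshot reference; the reload thread swaps the
    entire dict at once, so readers never observe a partial update."""

    def __init__(self, initial: dict):
        self._values = initial
        self._write_lock = threading.Lock()  # serializes writers only

    def get(self, namespace: str, key: str):
        snapshot = self._values  # atomic read, no reader lock needed
        return snapshot[namespace][key]

    def replace(self, new_values: dict) -> None:
        with self._write_lock:
            self._values = new_values  # atomic reference swap
```

Because readers never block, this variant also sidesteps the writer-starvation concern entirely.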

Propagation timing

ConfigMap propagation delay

Kubernetes takes time to sync ConfigMap updates to pods:
  • Kubelet sync period: Default ~1 minute (configurable)
  • Cache TTL: Additional delay from kubelet cache
  • Total typical delay: 1-2 minutes from kubectl apply to file update
This is a Kubernetes limitation, not specific to sentry-options.

Total end-to-end latency

ConfigMap update
  ↓  ~1-2 minutes (kubelet sync)
File modified on disk
  ↓  ~0-5 seconds (polling interval)
Change detected
  ↓  ~10-50 ms (reload + validation)
New values active

Total: ~1-2 minutes

Observability metrics

The reload process emits detailed timing metrics via Sentry transactions.

Observability

Sentry integration

The validation library includes a dedicated Sentry Hub for tracking reloads (lib.rs:42-63):
const SENTRY_OPTIONS_DSN: &str = 
    "https://[email protected]/4510750163927040";

static SENTRY_HUB: OnceLock<Arc<sentry::Hub>> = OnceLock::new();
Key points:
  • Completely isolated from host application’s Sentry setup
  • Uses separate DSN and client configuration
  • 100% sample rate for all reload transactions
  • Disabled in tests (empty DSN)

Reload transactions

One transaction emitted per namespace on each reload (lib.rs:612-644):
fn emit_reload_spans(
    namespaces: &[String],
    reload_duration: Duration,
    generated_at_by_namespace: &HashMap<String, String>,
) {
    let hub = get_sentry_hub();
    let applied_at = Utc::now();
    let reload_duration_ms = reload_duration.as_millis() as u64;
    
    for namespace in namespaces {
        let transaction = hub.start_transaction(
            TransactionContext::new(namespace, "sentry_options.reload")
        );
        
        transaction.set_data("reload_duration_ms", reload_duration_ms.into());
        transaction.set_data("applied_at", applied_at.to_rfc3339().into());
        if let Some(ts) = generated_at_by_namespace.get(namespace) {
            transaction.set_data("generated_at", ts.clone().into());
            if let Ok(generated_time) = DateTime::parse_from_rfc3339(ts) {
                let delay_secs = (applied_at - generated_time.with_timezone(&Utc))
                    .num_milliseconds() as f64 / 1000.0;
                transaction.set_data("propagation_delay_secs", delay_secs.into());
            }
        }
        
        transaction.finish();
    }
}

Available metrics

Metric                  Description                           Source
reload_duration_ms      Time to load and validate values      Measured in reload function
generated_at            When ConfigMap was generated          From values.json metadata
applied_at              When application loaded values        Current timestamp
propagation_delay_secs  Time from generation to application   Calculated: applied_at - generated_at

Transaction name: Namespace (e.g., "seer", "relay")
Transaction type: "sentry_options.reload"
Sample rate: 100% (all reloads tracked)

Propagation delay tracking

The generated_at timestamp is embedded in values.json by the CLI:
{
  "options": { ... },
  "generated_at": "2024-01-21T18:30:00.123456+00:00"
}
The client library calculates the delay:
if let Ok(generated_time) = DateTime::parse_from_rfc3339(ts) {
    let delay_secs = (applied_at - generated_time.with_timezone(&Utc))
        .num_milliseconds() as f64 / 1000.0;
    transaction.set_data("propagation_delay_secs", delay_secs.into());
}
This reveals:
  • How long ConfigMaps took to propagate to pods
  • Bottlenecks in the deployment pipeline
  • Whether kubelet sync is slower than expected
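The same calculation in Python, for reference (propagation_delay_secs here is a hypothetical helper; datetime.fromisoformat parses the RFC 3339 offset format shown above):

```python
from datetime import datetime

def propagation_delay_secs(generated_at: str, applied_at: datetime) -> float:
    """Seconds between ConfigMap generation and the application loading
    the values. `generated_at` is the RFC 3339 timestamp embedded in
    values.json; `applied_at` must be timezone-aware."""
    generated = datetime.fromisoformat(generated_at)
    return (applied_at - generated).total_seconds()
```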

Error handling

Missing values directory

If the values directory doesn’t exist:
  • Watcher creation: Logs warning but continues (lib.rs:496-498)
  • Polling: Returns None for mtime, no reload triggered
  • Application: Uses schema defaults for all options
Set SENTRY_OPTIONS_SUPPRESS_MISSING_DIR=1 to suppress warnings.

Validation failures

If reloaded values fail validation:
  1. Error logged to stderr:
    Failed to reload values from /etc/sentry-options/values: 
    Value error for seer:
        feature.rate-limit "invalid" is not of type "integer"
    
  2. Old values retained - application continues with previous configuration
  3. No Sentry transaction emitted (only on successful reload)
This prevents bad ConfigMaps from breaking running applications.

Thread panics

If the watcher thread panics:
let result = panic::catch_unwind(AssertUnwindSafe(|| {
    Self::run(thread_signal, thread_path, thread_registry, thread_values);
}));
if let Err(e) = result {
    eprintln!("Watcher thread panicked with: {:?}", e);
}
  • Panic is caught and logged
  • Thread terminates
  • No new reloads occur
  • Application continues with last known values
There is no automatic restart mechanism for the watcher thread. If it panics and dies, the application continues running but won’t pick up new configuration until restarted.
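A Python analog of this failure mode (a hypothetical sketch, mirroring the Rust catch_unwind behavior): the exception is logged, the loop exits, and nothing restarts it.

```python
import threading

def run_watcher(poll_fn, stop: threading.Event, interval: float = 5.0) -> None:
    """Watcher body with the same failure mode as the Rust thread: an
    unexpected exception is logged and the loop exits for good."""
    try:
        while not stop.wait(interval):
            poll_fn()
    except Exception as e:  # analogous to catch_unwind
        print(f"Watcher thread panicked with: {e!r}")
    # The thread returns here; the application keeps its last known
    # values, and no new reloads occur until the process restarts.
```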

Lock poisoning

If the reload thread panics while holding the write lock:
let values_guard = self.values.read()
    .unwrap_or_else(|poisoned| poisoned.into_inner());
Readers recover from poisoned locks and continue accessing values. This prevents a panic in the reload thread from breaking the entire application.

Testing hot reload

Local testing

To test hot reload locally:
  1. Create test values:
    mkdir -p sentry-options/values/test
    cat > sentry-options/values/test/values.json << 'EOF'
    {
      "options": {
        "feature.enabled": false
      }
    }
    EOF
    
  2. Start your application:
    import time

    from sentry_options import init, options
    init()
    opts = options('test')
    
    while True:
        print(f"enabled: {opts.get('feature.enabled')}")
        time.sleep(1)
    
  3. Modify values while running:
    # In another terminal
    cat > sentry-options/values/test/values.json << 'EOF'
    {
      "options": {
        "feature.enabled": true
      }
    }
    EOF
    
  4. Observe reload:
    • Within 5 seconds, application prints enabled: True
    • No restart required

Reload timing tests

The test suite includes timing tests (lib.rs:1319-1386):
#[test]
fn test_reload_values_updates_map() {
    // Create initial values
    let values = Arc::new(RwLock::new(initial_values));
    
    // Modify files on disk
    fs::write(values_dir.join("ns1/values.json"), new_values).unwrap();
    
    // Force reload
    ValuesWatcher::reload_values(&values_dir, &registry, &values);
    
    // Verify new values active
    let guard = values.read().unwrap();
    assert_eq!(guard["ns1"]["enabled"], json!(true));
}
These tests verify:
  • Modification time detection
  • Atomic value updates
  • Old values persist on validation errors
  • Thread creation and termination

Best practices

Graceful value changes

When changing configuration:
  1. Consider active requests: Values can change mid-request
  2. Use circuit breakers: For critical flags that affect traffic routing
  3. Monitor metrics: Watch Sentry transactions to confirm propagation
  4. Test locally first: Verify reload works with your schema

Reload-safe code patterns

# ❌ BAD: Caching option value at module level
FEATURE_ENABLED = opts.get('feature.enabled')

def process_request():
    if FEATURE_ENABLED:  # Stale value, won't update on reload
        ...

# ✅ GOOD: Read option value when needed
def process_request():
    if opts.get('feature.enabled'):  # Fresh value on every call
        ...

Monitoring reload health

Query Sentry transactions to track reload behavior:
  • Propagation delays: Check propagation_delay_secs distribution
  • Reload frequency: Count transactions per namespace
  • Reload duration: Monitor reload_duration_ms for performance
Alert on:
  • Propagation delays > 5 minutes (indicates kubelet sync issues)
  • Reload duration > 1 second (indicates slow disk or large values)
  • Missing reload transactions (watcher thread may be dead)
