Overview
Sol RPC Router continuously monitors backend health by sending periodic RPC requests. Unhealthy backends are automatically removed from the load balancing pool until they recover.

Configuration
Health checks are configured in the `[health_check]` section:
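A minimal example, assuming the router reads TOML configuration (the key names are the ones documented below; the values shown are the documented defaults):

```toml
[health_check]
interval_secs = 30
timeout_secs = 5
method = "getSlot"
consecutive_failures_threshold = 3
consecutive_successes_threshold = 2
max_slot_lag = 50
```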
The entire `[health_check]` section is optional. If omitted, default values are used.

Parameters
`interval_secs`

Time in seconds between health check probes for each backend.

Default: 30 seconds (from src/config.rs:45)

Recommendation: Set based on your tolerance for downtime detection. Shorter intervals detect failures faster but increase overhead.

`timeout_secs`

Maximum time in seconds to wait for a health check response before considering it failed.

Default: 5 seconds (from src/config.rs:46)

`method`

RPC method to use for health check probes.

Default: `"getSlot"` (from src/config.rs:47)

Common options:
- `getSlot` - Fast, lightweight check
- `getHealth` - Explicit health endpoint (if supported)
- `getVersion` - Version information check

Choose a method that's fast and doesn't require parameters. `getSlot` is recommended for most use cases.

`consecutive_failures_threshold`

Number of consecutive failed health checks before marking a backend as unhealthy.

Default: 3 (from src/config.rs:48)

Purpose: Prevents transient failures from removing backends from the pool.

Example: With default settings, a backend must fail 3 checks in a row (90+ seconds) before being marked unhealthy.

`consecutive_successes_threshold`

Number of consecutive successful health checks before marking an unhealthy backend as healthy again.

Default: 2 (from src/config.rs:49)

Purpose: Ensures backends are stable before reintroducing them to the pool.

Example: A failed backend must pass 2 checks in a row (60+ seconds) before receiving traffic again.

`max_slot_lag`

Maximum allowed slot lag behind the network before considering a backend unhealthy.

Default: 50 slots (from src/config.rs:50)

Purpose: Detects backends that are syncing or falling behind the network.

This is compared against the highest slot seen across all backends. A backend with slot lag exceeding this threshold is marked unhealthy even if it responds successfully.
Health Check Lifecycle
Initial State
All backends start in a healthy state when the router launches.

Failure Detection
- A health check probe is sent every `interval_secs`
- If no response is received within `timeout_secs`, the check counts as a failure
- After `consecutive_failures_threshold` consecutive failures, the backend is marked unhealthy
- Unhealthy backends are removed from the load balancing pool
Recovery
- Health checks continue for unhealthy backends
- After `consecutive_successes_threshold` consecutive successes, the backend is marked healthy
- Healthy backends are reintroduced to the load balancing pool
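The failure/recovery thresholds above form a small state machine per backend. A minimal sketch of that logic (names are illustrative, not taken from the router's source):

```rust
// Tracks consecutive check results for one backend and flips its health
// state when the documented thresholds are crossed.
struct HealthTracker {
    healthy: bool,
    consecutive_failures: u32,
    consecutive_successes: u32,
    failures_threshold: u32,
    successes_threshold: u32,
}

impl HealthTracker {
    fn new(failures_threshold: u32, successes_threshold: u32) -> Self {
        Self {
            healthy: true, // backends start healthy at launch
            consecutive_failures: 0,
            consecutive_successes: 0,
            failures_threshold,
            successes_threshold,
        }
    }

    // Record one health check result; returns the backend's current state.
    fn record(&mut self, check_ok: bool) -> bool {
        if check_ok {
            self.consecutive_successes += 1;
            self.consecutive_failures = 0;
            if !self.healthy && self.consecutive_successes >= self.successes_threshold {
                self.healthy = true;
            }
        } else {
            self.consecutive_failures += 1;
            self.consecutive_successes = 0;
            if self.healthy && self.consecutive_failures >= self.failures_threshold {
                self.healthy = false;
            }
        }
        self.healthy
    }
}

fn main() {
    // Default thresholds: 3 consecutive failures, 2 consecutive successes.
    let mut t = HealthTracker::new(3, 2);
    t.record(false);
    t.record(false); // two failures: still healthy
    assert!(t.healthy);
    t.record(false); // third consecutive failure: marked unhealthy
    assert!(!t.healthy);
    t.record(true); // one success: still unhealthy
    assert!(!t.healthy);
    t.record(true); // second consecutive success: healthy again
    assert!(t.healthy);
}
```

Note that a single success resets the failure counter (and vice versa), which is what makes the thresholds "consecutive" rather than cumulative.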
Slot Lag Detection
- `getSlot` responses are compared across all backends
- If a backend's slot is more than `max_slot_lag` behind the highest seen slot, it's marked unhealthy
- The backend remains unhealthy until it catches up
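The lag rule reduces to a single comparison against the highest observed slot. An illustrative sketch (not the router's actual code):

```rust
// A backend is lagging when it is strictly more than `max_slot_lag`
// slots behind the highest slot seen across all backends.
fn is_lagging(backend_slot: u64, highest_slot: u64, max_slot_lag: u64) -> bool {
    highest_slot.saturating_sub(backend_slot) > max_slot_lag
}

fn main() {
    // With the default max_slot_lag of 50:
    assert!(!is_lagging(950, 1000, 50)); // exactly 50 behind: within the limit
    assert!(is_lagging(949, 1000, 50)); // 51 behind: marked unhealthy
}
```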
Examples
Aggressive Health Checks (Fast Failure Detection)
- Checks every 10 seconds
- 2-second timeout
- Mark unhealthy after 2 failures (20+ seconds)
- Mark healthy after 1 success (10+ seconds)
- Low tolerance for slot lag
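As a TOML fragment matching the bullets above (the `max_slot_lag` value is illustrative, since "low tolerance" is not pinned to a number):

```toml
[health_check]
interval_secs = 10
timeout_secs = 2
consecutive_failures_threshold = 2
consecutive_successes_threshold = 1
max_slot_lag = 20  # illustrative "low tolerance" value
```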
Conservative Health Checks (Stable Backends)
- Checks every 60 seconds
- 10-second timeout
- Mark unhealthy after 5 failures (300+ seconds / 5 minutes)
- Mark healthy after 3 successes (180+ seconds / 3 minutes)
- High tolerance for slot lag
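As a TOML fragment matching the bullets above (the `max_slot_lag` value is illustrative, since "high tolerance" is not pinned to a number):

```toml
[health_check]
interval_secs = 60
timeout_secs = 10
consecutive_failures_threshold = 5
consecutive_successes_threshold = 3
max_slot_lag = 200  # illustrative "high tolerance" value
```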
Conservative settings reduce overhead but increase time to detect failures.
Default Configuration
Default values are defined in the source (src/config.rs:42-52).
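Expressed as a TOML section (an assumed serialization of the documented defaults):

```toml
[health_check]
interval_secs = 30
timeout_secs = 5
method = "getSlot"
consecutive_failures_threshold = 3
consecutive_successes_threshold = 2
max_slot_lag = 50
```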
Monitoring Health Status
Backend health status is exposed via Prometheus metrics.

Best Practices
Choose appropriate intervals
Balance between fast failure detection and overhead. 30 seconds is a good starting point.
Set timeout < interval
Ensure `timeout_secs` is significantly less than `interval_secs` to prevent overlapping checks.

Use lightweight methods

`getSlot` is fast and doesn't require parameters. Avoid methods that query large amounts of data.

Tune thresholds for your network
Higher thresholds reduce false positives but increase detection time. Adjust based on your backend reliability.
Monitor slot lag
Set `max_slot_lag` based on your tolerance for stale data. Lower values ensure fresher data but may exclude slower nodes.

Troubleshooting
Backend Frequently Marked Unhealthy
Symptoms: Backend oscillates between healthy and unhealthy

Solutions:
- Increase `consecutive_failures_threshold` to tolerate transient failures
- Increase `timeout_secs` if the backend is slow but reliable
- Check backend logs for actual issues
Backend Stays Unhealthy After Recovery
Symptoms: Backend is operational but the router doesn't use it

Solutions:
- Check if slot lag exceeds `max_slot_lag`
- Decrease `consecutive_successes_threshold` for faster recovery
- Verify the backend actually responds to the configured health check method
Health Checks Timing Out
Symptoms: Frequent timeout errors in logs

Solutions:
- Increase `timeout_secs` (the backend may be slow)
- Check network latency to the backend
- Try a different health check `method`
Default Values Reference
From src/config.rs:42-52:

- `interval_secs`: 30
- `timeout_secs`: 5
- `method`: `"getSlot"`
- `consecutive_failures_threshold`: 3
- `consecutive_successes_threshold`: 2
- `max_slot_lag`: 50