
Common Issues

Startup Failures

Redis Connection Failed

Symptom:
Failed to initialize Redis KeyStore: Connection refused (os error 111)
Cause: Redis is not running or not accessible
Solutions:
  1. Start Redis:
    # Using systemd
    sudo systemctl start redis
    
    # Using Docker
    docker run -d -p 6379:6379 redis
    
  2. Verify Redis is listening:
    redis-cli ping
    # Should return: PONG
    
  3. Check Redis URL in config:
    redis_url = "redis://127.0.0.1:6379/0"
    
  4. Test connection:
    redis-cli -u redis://127.0.0.1:6379/0 ping
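
If the router starts before Redis is ready (common under systemd or Docker Compose), a small retry loop avoids the race. A sketch, assuming bash and the redis-cli invocation shown above; the attempt count and delay are arbitrary:

```shell
# Retry helper: run a command until it succeeds or attempts run out.
wait_for() {  # usage: wait_for <attempts> <delay_secs> <command...>
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep "$delay"
  done
  return 1
}

# e.g. block startup until Redis answers PING:
# wait_for 30 1 redis-cli -u redis://127.0.0.1:6379/0 ping
```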
    

Port Already in Use

Symptom:
Failed to bind HTTP server: Address already in use (os error 98)
Cause: Another process is using the configured port
Solutions:
  1. Find the process using the port:
    # HTTP port (28899)
    lsof -i :28899
    sudo netstat -tlnp | grep 28899
    
    # WebSocket port (28900)
    lsof -i :28900
    
    # Metrics port (28901)
    lsof -i :28901
    
  2. Kill the conflicting process:
    kill <pid>
    
  3. Or change the port in config:
    port = 28899        # HTTP/WS ports
    metrics_port = 28901
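
Before starting the router, all three default ports can be probed in one pass. A sketch using bash's /dev/tcp pseudo-device (the port numbers are the defaults from this page; adjust them to your config):

```shell
# For each port, attempt a TCP connect to localhost: success means
# something is already listening there.
for port in 28899 28900 28901; do
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    echo "port $port: in use"
  else
    echo "port $port: free"
  fi
done
```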
    

Configuration Validation Failed

Symptom:
Failed to load router configuration: Backend weight must be greater than 0
Cause: Invalid configuration values
Common validation errors:
  • Empty Redis URL
  • No backends defined
  • Duplicate backend labels
  • Backend weight ≤ 0
  • Method route referencing non-existent backend
  • Invalid timeout value (≤ 0)
Solution: Review configuration against validation rules:
# Valid example
[[backends]]
label = "mainnet-primary"  # Must be unique and non-empty
url = "https://api.mainnet-beta.solana.com"  # Must be valid URL
weight = 10  # Must be > 0

[proxy]
timeout_secs = 30  # Must be > 0

[method_routes]
getSlot = "mainnet-primary"  # Must reference existing backend label
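
Some of these rules can be pre-checked with plain shell before the router ever sees the file. For example, a sketch that flags duplicate backend labels; it assumes labels appear as `label = "..."` lines, and writes a throwaway file so the command can be tried as-is (point the sed at your real config.toml):

```shell
# Build a sample config with a deliberate duplicate label.
cat > /tmp/dup-check.toml <<'EOF'
[[backends]]
label = "mainnet-primary"

[[backends]]
label = "mainnet-primary"
EOF

# Extract every label value, then keep only values that repeat.
dups=$(sed -n 's/^label *= *"\([^"]*\)".*/\1/p' /tmp/dup-check.toml | sort | uniq -d)
[ -n "$dups" ] && echo "duplicate labels: $dups"
```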

Request Failures

Unauthorized (401)

Symptom:
curl 'http://localhost:28899/?api-key=invalid'
# Returns: 401 Unauthorized
Log output:
Invalid API key presented (prefix=invali...)
Solutions:
  1. Verify API key exists:
    rpc-admin list
    rpc-admin inspect <api-key>
    
  2. Create a new API key:
    rpc-admin create my-client --rate-limit 50
    
  3. Check key is active:
    rpc-admin inspect <api-key>
    # Verify: active: true
    

Rate Limited (429)

Symptom:
curl 'http://localhost:28899/?api-key=my-key'
# Returns: 429 Too Many Requests: Rate limit exceeded
Log output:
API key rate limited (prefix=my-key...)
Solutions:
  1. Check current rate limit:
    rpc-admin inspect <api-key>
    
  2. Increase rate limit:
    rpc-admin update <api-key> --rate-limit 100
    
  3. Monitor rate limit metrics:
    sum(rate(rpc_requests_total{status="429"}[1m]))
    sum(rate(rpc_requests_total{status="429"}[1m])) by (owner)
    

No Healthy Backends (503)

Symptom:
503 Service Unavailable: No healthy backends available
Log output:
No healthy backends available for request
Solutions:
  1. Check backend health status:
    curl http://localhost:28899/health | jq
    
  2. Check health metrics:
    rpc_backend_health
    sum(rpc_backend_health)  # Should be > 0
    
  3. Review health check logs:
    journalctl -u sol-rpc-router | grep "Health check"
    
  4. Common health check failures:
    • Backend is down
    • Backend is too slow (timeout)
    • Backend is lagging (slot lag)
    • Network connectivity issues
See Backend Health Issues for detailed diagnostics.

Gateway Timeout (504)

Symptom:
504 Gateway Timeout: Upstream request timed out after 30s
Cause: Backend took longer than proxy.timeout_secs to respond
Solutions:
  1. Increase proxy timeout:
    [proxy]
    timeout_secs = 60  # Increase from 30
    
  2. Reload configuration:
    kill -SIGHUP $(pidof sol-rpc-router)
    
  3. Check backend performance:
    # Test backend directly
    time curl -X POST https://api.mainnet-beta.solana.com \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}'
    
  4. Monitor backend latency:
    histogram_quantile(0.99, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le, backend))
    

Backend Health Issues

Backend Marked Unhealthy

Log output:
Backend mainnet-primary marked as UNHEALTHY after 3 consecutive failures
Diagnostic steps:
  1. Check last error:
    curl http://localhost:28899/health | jq '.backends[] | select(.label == "mainnet-primary")'
    
  2. Test backend directly:
    curl -X POST https://api.mainnet-beta.solana.com \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}'
    
  3. Check health check configuration:
    [health_check]
    interval_secs = 30
    timeout_secs = 5        # May be too low
    method = "getSlot"      # Verify backend supports this
    consecutive_failures_threshold = 3
    
  4. Common failure reasons:
    Timeout:
    Health check timed out after 5s
    
    Solution: Increase timeout_secs or investigate backend slowness.
    
    Non-200 status:
    Health check returned status: 503
    
    Solution: Backend is down or overloaded.
    
    Slot lag:
    Backend lagging: slot 123456 is 150 behind max 123606
    
    Solution: Backend is syncing; increase max_slot_lag or wait for it to catch up.
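
The slot-lag rule in that last log line is simple arithmetic, sketched here with the numbers from the example (the max_slot_lag threshold is an assumed example value):

```shell
# A backend is lagging when its slot trails the max observed slot by
# more than the configured threshold.
slot=123456
max_slot=123606
max_slot_lag=100

lag=$((max_slot - slot))
if [ "$lag" -gt "$max_slot_lag" ]; then
  echo "lagging: slot $slot is $lag behind max $max_slot"
fi
```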

Frequent Health State Flapping

Symptom:
Backend mainnet-primary marked as UNHEALTHY after 3 consecutive failures
Backend mainnet-primary marked as HEALTHY after 2 consecutive successes
Backend mainnet-primary marked as UNHEALTHY after 3 consecutive failures
Cause: Backend is intermittently failing health checks
Solutions:
  1. Increase thresholds to reduce flapping:
    [health_check]
    consecutive_failures_threshold = 5  # Increase from 3
    consecutive_successes_threshold = 3  # Increase from 2
    
  2. Increase health check interval:
    [health_check]
    interval_secs = 60  # Reduce check frequency
    
  3. Investigate backend stability:
    • Check backend logs
    • Monitor backend resource usage
    • Test network connectivity
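
The effect of the two thresholds can be sketched as a small state machine: N consecutive failures to go UNHEALTHY, M consecutive successes to recover, so isolated blips no longer flip the state. The check sequence below is made up for illustration:

```shell
# Hysteresis: a single failure resets the success streak (and vice
# versa), but only a full streak changes the state.
fail_threshold=5
success_threshold=3
state=HEALTHY
fails=0
successes=0

for result in ok fail ok fail fail fail fail fail ok ok ok; do
  if [ "$result" = fail ]; then
    fails=$((fails + 1)); successes=0
    [ "$fails" -ge "$fail_threshold" ] && state=UNHEALTHY
  else
    successes=$((successes + 1)); fails=0
    [ "$successes" -ge "$success_threshold" ] && state=HEALTHY
  fi
  echo "$result -> $state"
done
```

With the higher thresholds, the one-off `fail` early in the sequence never reaches the state; only the run of five does.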

WebSocket Issues

WebSocket Connection Failed

Log output:
WebSocket: Failed to connect to backend mainnet-primary (wss://api.mainnet-beta.solana.com): Connection refused
Solutions:
  1. Verify backend WebSocket URL:
    [[backends]]
    label = "mainnet-primary"
    url = "https://api.mainnet-beta.solana.com"
    ws_url = "wss://api.mainnet-beta.solana.com"  # Check this
    
  2. Test WebSocket endpoint directly:
    wscat -c wss://api.mainnet-beta.solana.com
    
  3. Check WebSocket metrics:
    sum(rate(ws_connections_total{status="backend_connect_failed"}[5m]))
    

No WebSocket Backends Available

Symptom:
503 Service Unavailable: No healthy WebSocket backends available
Cause: No backends have ws_url configured or all WS backends are unhealthy
Solutions:
  1. Add WebSocket URLs to backends:
    [[backends]]
    label = "mainnet-primary"
    url = "https://api.mainnet-beta.solana.com"
    ws_url = "wss://api.mainnet-beta.solana.com"  # Add this
    
  2. Check which backends support WebSocket:
    grep -A 4 "\[\[backends\]\]" config.toml | grep ws_url
    
  3. Reload configuration:
    kill -SIGHUP $(pidof sol-rpc-router)
    

Configuration Reload Issues

Reload Not Taking Effect

Symptom: Sent SIGHUP but configuration unchanged
Diagnostic steps:
  1. Check logs for reload attempt:
    journalctl -u sol-rpc-router | grep -i sighup
    
    Expected:
    Received SIGHUP, reloading configuration from config.toml
    Configuration reloaded successfully
    Router state atomically swapped
    
  2. Verify signal reached process:
    # Check if process is running
    kill -0 $(pidof sol-rpc-router) && echo "Running" || echo "Not running"
    
  3. Check for validation errors:
    journalctl -u sol-rpc-router | grep -i "failed to reload"
    
  4. Verify config file permissions:
    ls -l config.toml
    # Should be readable by router process user
    

Configuration Validation Failed on Reload

Log output:
Failed to reload configuration: Method route 'getSlot' references non-existent backend 'unknown'
Solution: Old configuration remains active. Fix validation error and retry:
  1. Review error message
  2. Fix configuration
  3. Send SIGHUP again

Debugging Tools

Log Analysis

Enable Debug Logging

RUST_LOG=debug ./sol-rpc-router --config config.toml
Log levels:
  • error: Critical failures
  • warn: Non-critical issues (health check failures, rate limits)
  • info: Normal operations (requests, reloads)
  • debug: Detailed diagnostics

Filter Logs by Component

# Health check logs only
journalctl -u sol-rpc-router | grep "Health check"

# WebSocket logs only
journalctl -u sol-rpc-router | grep "WebSocket"

# Configuration reload logs
journalctl -u sol-rpc-router | grep -i "reload\|sighup"

# Request logs with RPC method
journalctl -u sol-rpc-router | grep "rpc_method="

Structured Log Parsing

Logs use a structured key=value format for request metadata. Example log line:
POST / 192.168.1.100:54321 1.234s rpc_method=getSlot backend=mainnet-primary
Parse with awk:
# Extract requests by RPC method
journalctl -u sol-rpc-router | awk '/rpc_method=getSlot/'

# Extract requests by backend
journalctl -u sol-rpc-router | awk '/backend=mainnet-primary/'

# Extract slow requests (>1s): scan fields for a duration like "1.234s"
journalctl -u sol-rpc-router | awk '{ for (i=1; i<=NF; i++) if ($i ~ /^[0-9.]+s$/ && $i+0 > 1) { print; next } }'
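
The duration field's position shifts when journalctl prepends its own timestamp, so scanning every field for a duration-shaped token is more robust than a fixed column. A self-contained version against sample lines in the format shown above:

```shell
# Print only lines whose duration field exceeds 1 second. awk coerces
# "1.234s" to the number 1.234 for the comparison.
slow=$(printf '%s\n' \
  'POST / 192.168.1.100:54321 1.234s rpc_method=getSlot backend=mainnet-primary' \
  'POST / 192.168.1.100:54322 0.120s rpc_method=getSlot backend=mainnet-primary' |
  awk '{ for (i=1; i<=NF; i++) if ($i ~ /^[0-9.]+s$/ && $i+0 > 1) { print; next } }')
echo "$slow"
```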

Health Check Diagnostics

Manual Health Check Test

Simulate the router’s health check:
# Test with getSlot (default health check method)
curl -X POST https://api.mainnet-beta.solana.com \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' \
  --max-time 5

# Test with getBlockHeight
curl -X POST https://api.mainnet-beta.solana.com \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getBlockHeight"}' \
  --max-time 5
Expected response:
{"jsonrpc":"2.0","result":123456789,"id":1}

Check Backend Slot Lag

# Query all backends and compute each one's lag behind the highest slot
max=0
declare -A slots
for backend in "https://api.mainnet-beta.solana.com" "https://solana-api.com"; do
  slot=$(curl -s -X POST "$backend" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' \
    | jq -r '.result')
  slot=${slot:-0}  # treat unreachable backends as slot 0
  slots[$backend]=$slot
  [ "$slot" -gt "$max" ] && max=$slot
done

for backend in "${!slots[@]}"; do
  echo "$backend: slot ${slots[$backend]} (lag: $((max - slots[$backend])))"
done

Metrics-Based Debugging

Check Current Backend Health

# Query Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=rpc_backend_health' | jq

# Or query router metrics directly
curl -s http://localhost:28901/metrics | grep rpc_backend_health

Identify High-Error Backends

# Error rate by backend
sum(rate(rpc_requests_total{status!="200"}[5m])) by (backend)

# Backends with >5% error rate
sum(rate(rpc_requests_total{status!="200"}[5m])) by (backend)
/ sum(rate(rpc_requests_total[5m])) by (backend) > 0.05

Identify Slow Backends

# P99 latency by backend
histogram_quantile(0.99, 
  sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le, backend)
)

# Backends with P99 > 5s
histogram_quantile(0.99,
  sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le, backend)
) > 5

Network Diagnostics

Test Backend Connectivity

# DNS resolution
dig api.mainnet-beta.solana.com

# TCP connectivity
telnet api.mainnet-beta.solana.com 443

# TLS handshake
openssl s_client -connect api.mainnet-beta.solana.com:443 -servername api.mainnet-beta.solana.com

# Full request test
curl -v https://api.mainnet-beta.solana.com

Check Network Latency

# Ping test
ping -c 5 api.mainnet-beta.solana.com

# Traceroute
traceroute api.mainnet-beta.solana.com

# MTR (combined ping + traceroute)
mtr --report --report-cycles 10 api.mainnet-beta.solana.com

Performance Issues

High Memory Usage

Diagnostic:
ps aux | grep sol-rpc-router
top -p $(pidof sol-rpc-router)
Potential causes:
  • Too many active WebSocket connections
  • Large request/response bodies
  • Memory leak (report as bug)
Solutions:
  1. Check active WebSocket connections:
    sum(ws_active_connections)
    
  2. Monitor connection duration:
    histogram_quantile(0.99, sum(rate(ws_connection_duration_seconds_bucket[5m])) by (le))
    
  3. Restart if memory leak suspected:
    systemctl restart sol-rpc-router
    

High CPU Usage

Diagnostic:
top -p $(pidof sol-rpc-router)
Potential causes:
  • High request volume
  • Expensive regex in logs
  • Too frequent health checks
Solutions:
  1. Check request rate:
    sum(rate(rpc_requests_total[1m]))
    
  2. Reduce health check frequency:
    [health_check]
    interval_secs = 60  # Increase from 30
    
  3. Consider horizontal scaling

Getting Help

If you’ve tried the above steps and still have issues:
  1. Collect diagnostic information:
    # Router version
    ./sol-rpc-router --version
    
    # Configuration (redact sensitive URLs)
    cat config.toml
    
    # Recent logs
    journalctl -u sol-rpc-router --since "1 hour ago" > router-logs.txt
    
    # Health status
    curl http://localhost:28899/health > health-status.json
    
    # Metrics snapshot
    curl http://localhost:28901/metrics > metrics.txt
    
  2. Check existing issues: Search the GitHub issue tracker
  3. Report the bug: Include diagnostic info and steps to reproduce
