Overview

Sol RPC Router exposes Prometheus-compatible metrics at /metrics for comprehensive monitoring of request performance, backend health, WebSocket connections, and API key usage.

Metrics Endpoint

Metrics are served on a dedicated port (default 9091):
curl http://localhost:9091/metrics
Configuration in config.toml:
config.toml
metrics_port = 9091

Prometheus Configuration

The router installs a Prometheus metrics recorder with histogram buckets tuned for RPC latencies (src/main.rs:38-45):
src/main.rs
let builder = PrometheusBuilder::new()
    .set_buckets(&[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
    .expect("failed to set histogram buckets");
let handle = builder
    .install_recorder()
    .expect("failed to install Prometheus recorder");
Histogram Buckets (in seconds):
  • 0.001 (1ms) to 10.0 (10s)
  • Enables percentile estimation with histogram_quantile()
  • Covers typical RPC latencies from fast reads to slow historical queries
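These buckets feed directly into quantile estimation. A typical query against the duration histogram defined below, keeping the required le label in the aggregation:

```promql
# Estimated p95 latency per backend, computed from the bucket counters
histogram_quantile(0.95,
  sum by (backend, le) (rate(rpc_request_duration_seconds_bucket[5m]))
)
```

The estimate's precision is bounded by bucket placement: a true p95 of 0.7s is interpolated between the 0.5s and 1.0s boundaries.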

HTTP Request Metrics

rpc_request_duration_seconds

Type: Histogram
Labels: rpc_method, backend, owner
Tracks request duration from receipt to response. Implementation in src/handlers.rs:134:
src/handlers.rs
histogram!("rpc_request_duration_seconds", 
    "rpc_method" => rpc_method.clone(), 
    "backend" => backend.clone(), 
    "owner" => owner.clone())
    .record(duration);
Example Queries:
# Average latency per backend
sum by (backend) (rate(rpc_request_duration_seconds_sum[5m]))
  / sum by (backend) (rate(rpc_request_duration_seconds_count[5m]))

# Average latency for a specific RPC method
sum(rate(rpc_request_duration_seconds_sum{rpc_method="getSlot"}[5m]))
  / sum(rate(rpc_request_duration_seconds_count{rpc_method="getSlot"}[5m]))

rpc_requests_total

Type: Counter
Labels: method, status, rpc_method, backend, owner
Counts all requests with HTTP method, status code, RPC method, backend, and owner. Implementation in src/handlers.rs:135:
src/handlers.rs
counter!("rpc_requests_total", 
    "method" => method, 
    "status" => status, 
    "rpc_method" => rpc_method, 
    "backend" => backend, 
    "owner" => owner)
    .increment(1);
Example Queries:
# Total requests per second
sum(rate(rpc_requests_total[5m]))

# Requests per backend
sum by (backend) (rate(rpc_requests_total[5m]))

# Requests per RPC method
sum by (rpc_method) (rate(rpc_requests_total[5m]))
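The status label also makes error ratios straightforward to compute; for example, using only the labels documented above:

```promql
# Share of requests returning 5xx, per backend
sum by (backend) (rate(rpc_requests_total{status=~"5.."}[5m]))
  / sum by (backend) (rate(rpc_requests_total[5m]))
```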

WebSocket Metrics

ws_connections_total

Type: Counter
Labels: backend, owner, status
Tracks WebSocket connection attempts with status outcomes. Status Values:
  • connected - Successful upgrade and backend connection
  • auth_failed - Invalid API key or missing authentication
  • rate_limited - Rate limit exceeded
  • no_backend - No healthy backends with ws_url
  • backend_connect_failed - Failed to connect to backend WebSocket
  • error - Internal error (Redis failure, etc.)
Implementation in src/handlers.rs:359,371,380,394,432:
src/handlers.rs
// Successful connection
counter!("ws_connections_total", 
    "backend" => backend_label.clone(), 
    "owner" => owner.clone(), 
    "status" => "connected").increment(1);

// Auth failure
counter!("ws_connections_total", 
    "backend" => "none", 
    "owner" => "none", 
    "status" => "auth_failed").increment(1);

// Rate limited
counter!("ws_connections_total", 
    "backend" => "none", 
    "owner" => "none", 
    "status" => "rate_limited").increment(1);
Example Queries:
# Successful connections per second
rate(ws_connections_total{status="connected"}[5m])

# Failed connections per second
rate(ws_connections_total{status!="connected"}[5m])
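To see which failure mode dominates, group the non-connected attempts by their status value:

```promql
# Failed connection attempts per second, broken down by failure status
sum by (status) (rate(ws_connections_total{status!="connected"}[5m]))
```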

ws_active_connections

Type: Gauge
Labels: backend, owner
Tracks currently open WebSocket sessions. Implementation in src/handlers.rs:438,549:
src/handlers.rs
// Increment on connect
gauge!("ws_active_connections", 
    "backend" => backend_label.clone(), 
    "owner" => owner.clone())
    .increment(1.0);

// Decrement on disconnect
gauge!("ws_active_connections", 
    "backend" => backend_label.clone(), 
    "owner" => owner.clone())
    .decrement(1.0);
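Because the gauge must be decremented on every exit path, one robust pattern is an RAII guard that decrements on Drop. A minimal sketch of that pattern, using a std atomic in place of the real gauge! handle (the guard type and counter here are illustrative, not the router's actual code):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Stand-in for the ws_active_connections gauge.
static ACTIVE: AtomicI64 = AtomicI64::new(0);

struct ConnectionGuard;

impl ConnectionGuard {
    fn new() -> Self {
        // Corresponds to gauge!("ws_active_connections", ...).increment(1.0)
        ACTIVE.fetch_add(1, Ordering::Relaxed);
        ConnectionGuard
    }
}

impl Drop for ConnectionGuard {
    fn drop(&mut self) {
        // Corresponds to gauge!("ws_active_connections", ...).decrement(1.0);
        // runs on every exit path, including early returns and panics.
        ACTIVE.fetch_sub(1, Ordering::Relaxed);
    }
}

fn active() -> i64 {
    ACTIVE.load(Ordering::Relaxed)
}

fn main() {
    {
        let _session = ConnectionGuard::new();
        assert_eq!(active(), 1); // gauge reflects the open session
    } // guard dropped here
    assert_eq!(active(), 0); // gauge returns to zero on disconnect
    println!("ok");
}
```

Explicit increment/decrement calls, as in the router's handlers, work equally well as long as every disconnect path reaches the decrement.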
Example Queries:
# Total active connections
sum(ws_active_connections)

# Active connections per backend
sum by (backend) (ws_active_connections)

# Active connections per owner
sum by (owner) (ws_active_connections)

ws_messages_total

Type: Counter
Labels: backend, owner, direction
Counts WebSocket frames relayed in each direction. Direction Values:
  • client_to_backend - Messages from client to backend
  • backend_to_client - Messages from backend to client
Implementation in src/handlers.rs:461,471,508,514:
src/handlers.rs
// Client to backend
counter!("ws_messages_total", 
    "backend" => bl1.clone(), 
    "owner" => ow1.clone(), 
    "direction" => "client_to_backend")
    .increment(1);

// Backend to client
counter!("ws_messages_total", 
    "backend" => bl2.clone(), 
    "owner" => ow2.clone(), 
    "direction" => "backend_to_client")
    .increment(1);
Only Text and Binary frames are counted. Ping/Pong frames are forwarded transparently without incrementing metrics.
Example Queries:
# Total messages per second
sum(rate(ws_messages_total[5m]))

# Messages per direction
sum by (direction) (rate(ws_messages_total[5m]))

# Inbound vs outbound message ratio
sum(rate(ws_messages_total{direction="client_to_backend"}[5m]))
  / sum(rate(ws_messages_total{direction="backend_to_client"}[5m]))

ws_connection_duration_seconds

Type: Histogram
Labels: backend, owner
Tracks session duration from upgrade to disconnect. Implementation in src/handlers.rs:550:
src/handlers.rs
let duration = connect_time.elapsed().as_secs_f64();
histogram!("ws_connection_duration_seconds", 
    "backend" => backend_label.clone(), 
    "owner" => owner.clone())
    .record(duration);
Example Queries:
# Average connection duration (overall)
sum(rate(ws_connection_duration_seconds_sum[5m]))
  / sum(rate(ws_connection_duration_seconds_count[5m]))

# Average per backend
sum by (backend) (rate(ws_connection_duration_seconds_sum[5m]))
  / sum by (backend) (rate(ws_connection_duration_seconds_count[5m]))

Grafana Dashboard Panels

Request Overview

Visualization: Graph (Time Series)
sum(rate(rpc_requests_total[5m]))
Legend: Total Requests/sec

Backend Health

Visualization: Stat Panel
Query the /health endpoint and display backend status. Use a JSON API datasource or the Infinity plugin.

WebSocket Monitoring

Visualization: Stat Panel + Graph
sum(ws_active_connections)
Current count with time series.

API Key Usage

Visualization: Table
sum by (owner) (rate(rpc_requests_total[5m]))
Columns: Owner, Requests/sec

RPC Method Analysis

Visualization: Table
topk(20, sum by (rpc_method) (
  rate(rpc_requests_total[5m])
))
Top 20 methods by request rate.
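The same table pattern works for latency instead of rate; a sketch combining the duration histogram's sum and count series:

```promql
# Top 10 slowest methods by average latency
topk(10,
  sum by (rpc_method) (rate(rpc_request_duration_seconds_sum[5m]))
    / sum by (rpc_method) (rate(rpc_request_duration_seconds_count[5m]))
)
```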

Alerting Rules

High Error Rate

groups:
  - name: rpc_router_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(rpc_requests_total{status=~"5.."}[5m]))
            / sum(rate(rpc_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 5xx error rate ({{ $value | humanizePercentage }})"
          description: "More than 5% of requests are returning 5xx errors"

Backend Down

- alert: AllBackendsUnhealthy
  expr: |
    absent(backend_healthy{healthy="true"})
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "All backends are unhealthy"
    description: "No healthy backends available for routing"

High Latency

- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(rpc_request_duration_seconds_bucket[5m]))
    ) > 2.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High p95 latency ({{ $value }}s)"
    description: "95th percentile latency is above 2 seconds"

WebSocket Failures

- alert: WebSocketConnectionFailures
  expr: |
    sum(rate(ws_connections_total{status!="connected"}[5m]))
      / sum(rate(ws_connections_total[5m]))
      > 0.10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High WebSocket failure rate ({{ $value | humanizePercentage }})"
    description: "More than 10% of WebSocket connections are failing"

Rate Limit Threshold

- alert: HighRateLimitRejections
  expr: |
    sum(rate(rpc_requests_total{status="429"}[5m])) > 10
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "High rate limit rejections"
    description: "More than 10 requests/sec are being rate limited"

Health Endpoint

In addition to Prometheus metrics, the router provides a health check endpoint:
curl http://localhost:28899/health
Response (src/handlers.rs:296-310):
{
  "overall_status": "healthy",
  "backends": [
    {
      "label": "mainnet-primary",
      "healthy": true,
      "last_check": "Instant { ... }",
      "consecutive_failures": 0,
      "consecutive_successes": 15,
      "last_error": null
    },
    {
      "label": "backup-rpc",
      "healthy": false,
      "last_check": "Instant { ... }",
      "consecutive_failures": 5,
      "consecutive_successes": 0,
      "last_error": "Connection timeout"
    }
  ]
}
Fields:
  • overall_status: "healthy" if any backend is healthy, else "unhealthy"
  • backends: Array of backend health details
    • label: Backend identifier
    • healthy: Current health status
    • last_check: Timestamp of last health check
    • consecutive_failures: Failure streak count
    • consecutive_successes: Success streak count
    • last_error: Most recent error message (if any)
Use the /health endpoint for external load balancer health checks or monitoring systems that don’t support Prometheus.
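The aggregation rule for overall_status is simple enough to state in code. A hypothetical sketch of that rule (not the router's actual implementation):

```rust
// Sketch of the documented rule: the router reports "healthy" when at
// least one backend is healthy. Hypothetical helper, not actual source.
fn overall_status(backend_health: &[bool]) -> &'static str {
    if backend_health.iter().any(|&healthy| healthy) {
        "healthy"
    } else {
        "unhealthy"
    }
}

fn main() {
    // Mirrors the example response above: one healthy, one failing backend.
    assert_eq!(overall_status(&[true, false]), "healthy");
    assert_eq!(overall_status(&[false, false]), "unhealthy");
    println!("ok");
}
```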

Metric Label Cardinality

Label Values

All metrics use labels with controlled cardinality:
Label        Cardinality        Example Values
backend      Low (2-10)         mainnet-primary, backup-rpc
owner        Medium (10-1000)   API key owner identifiers
rpc_method   Low (~50)          getSlot, getTransaction, etc.
method       Very Low (2-3)     POST, GET
status       Low (~10)          200, 401, 429, 500, etc.
direction    Very Low (2)       client_to_backend, backend_to_client
High cardinality in the owner label can impact Prometheus performance with thousands of API keys. Consider aggregating by owner selectively or using recording rules.

Recording Rules

Pre-aggregate common queries to reduce query time:
groups:
  - name: rpc_router_recordings
    interval: 30s
    rules:
      # Total request rate
      - record: rpc:requests:rate5m
        expr: sum(rate(rpc_requests_total[5m]))
      
      # Request rate per backend
      - record: rpc:requests:rate5m:by_backend
        expr: sum by (backend) (rate(rpc_requests_total[5m]))
      
      # Average latency per backend
      - record: rpc:latency:avg5m:by_backend
        expr: |
          sum by (backend) (rate(rpc_request_duration_seconds_sum[5m]))
            / sum by (backend) (rate(rpc_request_duration_seconds_count[5m]))
      
      # P95 latency
      - record: rpc:latency:p95:5m
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(rpc_request_duration_seconds_bucket[5m]))
          )
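Dashboards and alert expressions can then reference the pre-computed series instead of re-evaluating the raw expression on every query, for example:

```promql
# Use the recorded p95 directly in an alert or panel
rpc:latency:p95:5m > 2.0
```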

Log Integration

Requests are logged with structured fields for correlation with metrics (src/handlers.rs:72-103):
src/handlers.rs
match (rpc_method, backend) {
    (Some(RpcMethod(m)), Some(SelectedBackend(b))) => info!(
        "{} {} {} {:?} rpc_method={} backend={}",
        method, path, addr, duration, m, b
    ),
    // ... other cases
}
Example Log Output:
INFO POST / 127.0.0.1:54321 42ms rpc_method=getSlot backend=mainnet-primary
INFO POST / 127.0.0.1:54322 156ms rpc_method=getTransaction backend=archive-node
WARN API key rate limited (prefix=abc123...)
Use log aggregation (Loki, Elasticsearch) to correlate metrics with detailed request logs. Filter by rpc_method or backend to investigate latency spikes.
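With Loki, the structured fields in the log line can be matched with simple line filters. A sketch, where the {app="sol-rpc-router"} stream selector is an assumed label that depends on your scrape configuration:

```logql
# Find slow-path requests routed to a specific backend and method
{app="sol-rpc-router"}
  |= "backend=archive-node"
  |= "rpc_method=getTransaction"
```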

Monitoring Best Practices

Organize Grafana dashboards by user role:
  • Operations: Request rate, error rate, latency, backend health
  • Performance: Latency percentiles, slowest methods, backend comparison
  • Business: Requests per owner, top methods, usage trends
  • WebSocket: Active connections, message throughput, connection duration
Set alert thresholds based on baseline behavior:
  1. Collect 1-2 weeks of metrics
  2. Calculate p95/p99 for latency, error rate, etc.
  3. Set thresholds 20-30% above normal values
  4. Adjust based on incident frequency
Configure Prometheus retention based on query patterns:
  • Raw metrics: 15-30 days (for incident investigation)
  • Recording rules: 90+ days (for trend analysis)
  • Long-term storage: Use Thanos or Cortex for historical data
Optimize dashboard queries:
  • Use recording rules for frequently-queried aggregations
  • Limit time ranges to necessary windows (5m, 1h, 24h)
  • Avoid high-cardinality group-by operations on owner label
  • Use instant queries for current-state panels (gauges, stats)

Sample Prometheus Config

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'sol-rpc-router'
    static_configs:
      - targets: ['localhost:9091']
        labels:
          environment: 'production'
          cluster: 'us-west-2'

rule_files:
  - 'rpc_router_alerts.yml'
  - 'rpc_router_recordings.yml'

Exporting Metrics

The metrics endpoint returns standard Prometheus exposition format:
curl -s http://localhost:9091/metrics | grep rpc_requests_total | head -5
Output:
rpc_requests_total{backend="mainnet-primary",method="POST",owner="client-a",rpc_method="getSlot",status="200"} 1523
rpc_requests_total{backend="mainnet-primary",method="POST",owner="client-a",rpc_method="getTransaction",status="200"} 342
rpc_requests_total{backend="backup-rpc",method="POST",owner="client-b",rpc_method="getSlot",status="200"} 891
rpc_requests_total{backend="mainnet-primary",method="POST",owner="client-a",rpc_method="sendTransaction",status="429"} 15
All metrics follow Prometheus naming conventions with _total, _seconds, _bucket suffixes for counters, histograms, and histogram buckets.