Validator Monitoring
Proper monitoring is essential for maintaining a healthy validator. This guide covers various monitoring approaches and tools.
Key Metrics to Monitor
Validator Health
- Catchup Status: Is the validator caught up with the network?
- Vote Credits: Are votes being submitted and credited?
- Block Production: Is the validator producing blocks during leader slots?
- Delinquency: Is the validator marked as delinquent?
- Skip Rate: What percentage of leader slots are being skipped?
System Resources
- CPU Usage: Should stay below 80-90%
- Memory Usage: Monitor for leaks or excessive consumption
- Disk I/O: Watch for bottlenecks
- Network Bandwidth: Track inbound/outbound traffic
- Disk Space: Ensure adequate free space
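The resource checks above can be approximated with the Python standard library. This is a minimal sketch rather than a full monitoring agent: it uses the 1-minute load average divided by core count as a rough CPU proxy, and the metric names and limits are illustrative assumptions, not values taken from the validator software.

```python
import os
import shutil


def resource_snapshot(path="/"):
    """Return a coarse view of CPU load and disk usage (Unix-like hosts only)."""
    cores = os.cpu_count() or 1
    load1, _, _ = os.getloadavg()  # 1-minute load average
    disk = shutil.disk_usage(path)  # pass the ledger mount point here
    return {
        "cpu_load_pct": round(100.0 * load1 / cores, 1),
        "disk_used_pct": round(100.0 * disk.used / disk.total, 1),
        "disk_free_gb": round(disk.free / 1e9, 1),
    }


def over_threshold(snapshot, limits):
    """Return the names of metrics whose value exceeds its configured limit."""
    return [name for name, limit in limits.items() if snapshot.get(name, 0) > limit]


# Example limits (assumed values; tune them to your own hardware):
# over_threshold(resource_snapshot("/"), {"cpu_load_pct": 90, "disk_used_pct": 85})
```

Memory and network counters are left out here; on Linux they could be read from `/proc/meminfo` and `/proc/net/dev` in the same style.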
Network Metrics
- Gossip Connectivity: Number of peers
- RPC Requests: Request rate and latency (if running RPC)
- Transaction Processing: TPS and queue depth
Command-Line Monitoring
Catchup Status
Monitor if your validator is caught up:
Vote Account Monitoring
Check vote credits and status:
- Credits: Should increase every epoch
- Commission: Your commission rate
- Last Vote: Should be recent
- Root Slot: Should advance regularly
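The same fields can also be read programmatically from the `getVoteAccounts` JSON-RPC method. The sketch below assumes an RPC endpoint on `http://localhost:8899` (the default port) and uses a placeholder vote account address; `summarize_vote_account` is a hypothetical helper name.

```python
import json
import urllib.request

RPC_URL = "http://localhost:8899"  # assumption: local RPC on the default port


def rpc_call(method, params=None, url=RPC_URL, timeout=10):
    """Send one JSON-RPC request and return the decoded response."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    ).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarize_vote_account(result, vote_pubkey):
    """Extract the fields worth watching from a getVoteAccounts result."""
    for status in ("current", "delinquent"):
        for acct in result.get(status, []):
            if acct["votePubkey"] != vote_pubkey:
                continue
            # Each epochCredits entry is [epoch, credits, previousCredits].
            latest = acct["epochCredits"][-1] if acct["epochCredits"] else None
            return {
                "delinquent": status == "delinquent",
                "commission": acct["commission"],
                "last_vote": acct["lastVote"],
                "root_slot": acct["rootSlot"],
                "epoch_credits": (latest[1] - latest[2]) if latest else 0,
            }
    return None  # vote account not present in either list


# Usage (placeholder address):
# result = rpc_call("getVoteAccounts")["result"]
# print(summarize_vote_account(result, "<YOUR_VOTE_ACCOUNT>"))
```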
Block Production
View block production statistics:
- Leader Slots: Number of slots you’re scheduled to produce
- Blocks Produced: Actual blocks produced
- Skip Rate: Should be below 5%
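Skip rate follows directly from the two counts above: the share of leader slots that did not yield a block. A minimal sketch, assuming the counts come from the `byIdentity` map of a `getBlockProduction` JSON-RPC response, where each identity maps to `[leaderSlots, blocksProduced]`; the function names are illustrative.

```python
def skip_rate(leader_slots, blocks_produced):
    """Percentage of leader slots in which no block was produced."""
    if leader_slots == 0:
        return 0.0  # no leader slots in the range, nothing to skip
    return 100.0 * (leader_slots - blocks_produced) / leader_slots


def skip_rate_for(value, identity):
    """Look up one identity in the `value` field of a getBlockProduction response."""
    entry = value.get("byIdentity", {}).get(identity)
    if entry is None:
        return None  # identity had no leader slots in the queried range
    leader_slots, blocks_produced = entry
    return skip_rate(leader_slots, blocks_produced)
```

For example, 200 leader slots with 190 blocks produced gives a skip rate of 5%, which sits right at the limit stated above.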
Gossip Network
View your validator in gossip:
Validator Monitor Command
Continuous monitoring with the monitor command shows:
- Slot and epoch
- Transaction count
- Shred insert rate
- Vote status
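A comparable rolling view of slot, epoch, and transaction count can be built by polling the `getEpochInfo` JSON-RPC method. This sketch covers only those fields (not shred insert rate or vote status); `fetch` is any caller-supplied function that returns the `result` object of a `getEpochInfo` call, and the function names are illustrative.

```python
import time


def format_epoch_info(info):
    """Render one status line from a getEpochInfo result."""
    pct = 100.0 * info["slotIndex"] / info["slotsInEpoch"]
    return (
        f"slot {info['absoluteSlot']} | epoch {info['epoch']} "
        f"({pct:.1f}% complete) | tx count {info.get('transactionCount', 'n/a')}"
    )


def watch(fetch, interval_s=5.0, iterations=3, out=print):
    """Call fetch() repeatedly and emit one formatted line per poll."""
    for i in range(iterations):
        out(format_epoch_info(fetch()))
        if i < iterations - 1:
            time.sleep(interval_s)
```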
Health Check Endpoints
RPC Health
The validator exposes a health endpoint when RPC is enabled:
- "ok": Node is healthy
- {"error": ...}: Node is unhealthy (with details)
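The two response shapes above match the `getHealth` JSON-RPC method, so a probe can be sketched as follows. It assumes RPC is listening on `http://localhost:8899` (the default port); the helper names are hypothetical.

```python
import json
import urllib.request


def interpret_health(response):
    """Map a getHealth JSON-RPC response to a (healthy, detail) pair."""
    if response.get("result") == "ok":
        return True, "ok"
    err = response.get("error", {})
    return False, err.get("message", "unknown error")


def check_health(url="http://localhost:8899", timeout=5):
    """Query getHealth; an unreachable or garbled RPC counts as unhealthy."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getHealth"}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return interpret_health(json.load(resp))
    except (OSError, ValueError) as exc:  # connection failure or invalid JSON
        return False, f"RPC check failed: {exc}"
```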
Slot Monitoring
Get current slot:
Version Check
Check validator version:
Metrics Exporters
The validator exposes metrics in various formats.
InfluxDB Metrics
The validator can send metrics to InfluxDB, configured through environment variables. Reported metrics include:
- Banking stage statistics
- Vote submission metrics
- Shred processing rates
- Replay stage performance
- Account database statistics
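The InfluxDB target is conventionally set through the `SOLANA_METRICS_CONFIG` environment variable, a comma-separated list of `host`, `db`, `u`, and `p` fields. The sketch below validates such a string before the validator is started; the values in the usage comment are placeholders, and the parser is an illustration rather than the validator's own parsing logic.

```python
REQUIRED_FIELDS = ("host", "db", "u", "p")


def parse_metrics_config(value):
    """Split a SOLANA_METRICS_CONFIG string into fields and check completeness."""
    # Note: this simple split assumes no commas inside any field value.
    fields = dict(item.split("=", 1) for item in value.split(",") if item)
    missing = [key for key in REQUIRED_FIELDS if key not in fields]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return fields


# Usage (placeholder values):
# import os
# parse_metrics_config(os.environ["SOLANA_METRICS_CONFIG"])
# e.g. "host=http://localhost:8086,db=validator,u=writer,p=secret"
```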
Prometheus Metrics
While not built-in, you can parse log output or use custom exporters to expose Prometheus metrics.
Automated Monitoring Scripts
Basic Health Check Script
Create validator-health-check.sh:
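One way the logic of such a health check could be structured is sketched here in Python, which keeps the checks easy to test. Each check is a function returning an `(ok, detail)` pair, and the runner returns an exit status suitable for cron or a service supervisor. The check names and the slot-lag threshold are assumptions for illustration.

```python
def slot_lag_check(local_slot, reference_slot, max_lag=100):
    """Pass when the local node is within max_lag slots of a reference node."""
    lag = reference_slot - local_slot
    return lag <= max_lag, f"{lag} slots behind"


def run_checks(checks):
    """Run named checks; return 0 if all pass, 1 if any fail or raise."""
    failures = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:  # a crashing check counts as a failure
            ok, detail = False, f"check raised {exc!r}"
        print(f"[{'OK' if ok else 'FAIL'}] {name}: {detail}")
        if not ok:
            failures.append(name)
    return 1 if failures else 0


# Usage: sys.exit(run_checks({"slot lag": lambda: slot_lag_check(995, 1000)}))
```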
Vote Credits Monitor
Create vote-credits-monitor.sh:
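The core of a vote credits monitor is a comparison between successive samples: credits that stop increasing suggest that votes are not landing. A minimal Python sketch of that comparison follows; the class name and the stall limit are hypothetical, and the credit total would come from whatever source you poll (for example the vote account's cumulative credits).

```python
class CreditsMonitor:
    """Flag a vote account whose credits have not grown for several polls."""

    def __init__(self, max_stalled_polls=3):
        self.max_stalled_polls = max_stalled_polls
        self.last = None
        self.stalled = 0

    def observe(self, total_credits):
        """Record one sample; return True when an alert should be raised."""
        if self.last is not None and total_credits <= self.last:
            self.stalled += 1
        else:
            self.stalled = 0  # first sample, or credits increased
        self.last = total_credits
        return self.stalled >= self.max_stalled_polls
```

Requiring several stalled polls in a row avoids alerting on a single slow interval.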
System Resource Monitoring
CPU Monitoring
Monitor validator CPU usage:
Memory Monitoring
Check memory usage:
Disk Space Monitoring
Monitor ledger directory:
Disk I/O Monitoring
Install iotop:
iostat:
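Where installing extra tools is not an option, disk throughput on Linux can be estimated from two samples of `/proc/diskstats`. This is a rough sketch: it relies on the kernel's documented field order and 512-byte sector unit, and the device name in the usage comment is a placeholder.

```python
SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats(text):
    """Return {device: sector counters} from the contents of /proc/diskstats."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # skip blank or malformed lines
        stats[parts[2]] = {
            "sectors_read": int(parts[5]),
            "sectors_written": int(parts[9]),
        }
    return stats


def throughput_mb_s(before, after, device, seconds):
    """Read and write throughput in MB/s between two parsed samples."""
    read = after[device]["sectors_read"] - before[device]["sectors_read"]
    written = after[device]["sectors_written"] - before[device]["sectors_written"]
    return (
        read * SECTOR_BYTES / seconds / 1e6,
        written * SECTOR_BYTES / seconds / 1e6,
    )


# Usage: read /proc/diskstats, sleep N seconds, read again, then call
# throughput_mb_s(parse_diskstats(first), parse_diskstats(second), "nvme0n1", N)
```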
Network Monitoring
Monitor network connections:
nload:
Log Analysis
Important Log Patterns
Monitor for errors:
Performance Metrics in Logs
Search for specific metrics:
Alert Configuration
Email Alerts
Install mailutils:
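If alerts originate from a Python monitoring script rather than the shell, the standard library can compose and send the message itself. This sketch assumes a mail transfer agent is accepting SMTP on `localhost:25`; the subject prefix and the addresses in the usage comment are placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_alert(subject, body, sender, recipient):
    """Compose a plain-text alert email."""
    msg = EmailMessage()
    msg["Subject"] = f"[validator-alert] {subject}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def send_alert(msg, host="localhost", port=25):
    """Hand the message to a local SMTP server (assumed to be running)."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(msg)


# Usage (placeholder addresses):
# send_alert(build_alert("delinquent", "Validator is delinquent",
#                        "validator@example.com", "ops@example.com"))
```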
Slack/Discord Webhooks
Send alerts to Slack:
Third-Party Monitoring Services
Solana Beach
Monitor your validator on Solana Beach:
- Search for your validator identity
- Add to favorites
- Enable notifications
Stakewiz
Track performance on Stakewiz:
- View historical performance
- Compare with other validators
- Monitor skip rate trends
Validators.app
Monitor on Validators.app:
- Real-time performance metrics
- Cluster-wide statistics
- Alert configurations
Monitoring Dashboard
Grafana + Prometheus Setup
For comprehensive monitoring, set up Grafana with Prometheus:
- Install Prometheus
- Install Grafana
- Configure node_exporter for system metrics
- Create custom exporter for validator metrics
- Import Solana Grafana dashboards from the community
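The custom exporter step can be as small as an HTTP handler that serves gauges in the Prometheus text exposition format. The sketch below uses only the standard library; the metric names, the port, and the `collect` callback are assumptions, and a production exporter would more likely use the official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: (help_text, value)} as Prometheus gauge lines."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def make_handler(collect):
    """Build a request handler that serves collect() output on /metrics."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics(collect()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


# Usage (hypothetical port and metric name):
# collect = lambda: {"validator_skip_rate_percent": ("Skip rate", 2.5)}
# HTTPServer(("0.0.0.0", 9101), make_handler(collect)).serve_forever()
```

Prometheus would then scrape this endpoint alongside node_exporter, and the resulting series can back the dashboard panels.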
Key Dashboard Panels
- Validator catchup status
- Vote credits over time
- Block production rate
- Skip rate
- System resource utilization
- Network connections
- Ledger size growth
- Transaction processing rate
Performance Benchmarking
Baseline Metrics
Establish baseline performance:
Best Practices
Monitor Continuously
Set up automated monitoring that runs 24/7 and alerts on issues.
Track Trends
Monitor trends over time, not just current values.
Set Thresholds
Configure alerts with appropriate thresholds to avoid alert fatigue.
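One common way to limit alert fatigue is to require several consecutive breaches before firing, and to fire only once per incident. A minimal sketch of that idea, with a hypothetical class name and illustrative limits:

```python
class ThresholdAlert:
    """Fire once after `consecutive` samples in a row exceed `limit`."""

    def __init__(self, limit, consecutive=3):
        self.limit = limit
        self.consecutive = consecutive
        self.breaches = 0

    def update(self, value):
        """Record one sample; return True only on the sample that trips the alert."""
        if value > self.limit:
            self.breaches += 1
        else:
            self.breaches = 0  # recovery resets the counter
        return self.breaches == self.consecutive
```

With `limit=5.0` and `consecutive=2`, the skip-rate samples 6, 7, 8, 4, 6 trigger a single alert on the second sample; the third stays quiet because the alert has already fired, and the drop to 4 resets the counter.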
Document Baselines
Record normal operating metrics for comparison.
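A baseline can be as simple as the median of each metric over a period of normal operation, stored alongside your runbooks and compared against current readings. The sketch below assumes every sample contains the same metric keys; the metric names in the usage comment are illustrative.

```python
import statistics


def build_baseline(samples):
    """Median of each metric across a list of {metric: value} samples."""
    keys = samples[0].keys()
    return {key: statistics.median(s[key] for s in samples) for key in keys}


def deviation_pct(baseline, current):
    """Percent change of each current metric relative to its baseline."""
    return {
        key: round(100.0 * (current[key] - base) / base, 1)
        for key, base in baseline.items()
        if base  # skip zero baselines to avoid division by zero
    }


# Usage:
# base = build_baseline([{"cpu_pct": 40, "skip_rate": 2}, ...])
# deviation_pct(base, {"cpu_pct": 75, "skip_rate": 3})
```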
Next Steps
Operations
Learn about operational procedures
Troubleshooting
Troubleshoot issues detected by monitoring
Configuration
Optimize configuration based on metrics