Production Monitoring
Comprehensive monitoring is essential for maintaining validator health, diagnosing issues quickly, and ensuring optimal performance.
Metrics Collection
InfluxDB Integration
Configuration: Set the SOLANA_METRICS_CONFIG environment variable to report metrics to InfluxDB.
- Production: https://metrics.solana.com:8888/
- Testing: https://metrics.solana.com:8889/
With metrics stored in InfluxDB, you can:
- Query historical performance data
- Create custom dashboards
- Set up alerts based on metrics
- Analyze trends over time
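As a minimal sketch of the configuration step, the helper below assembles a value for SOLANA_METRICS_CONFIG. It assumes the variable takes comma-separated host, db, u, and p fields; confirm the field names against the documentation for your Agave version. The host, database, and credentials shown are placeholders, not real values.

```python
def build_metrics_config(host: str, db: str, user: str, password: str) -> str:
    """Return a SOLANA_METRICS_CONFIG value as comma-separated key=value pairs.

    Assumption: the validator expects the keys host, db, u, and p.
    """
    fields = {"host": host, "db": db, "u": user, "p": password}
    for key, value in fields.items():
        # A comma inside a value would be read as a field separator.
        if not value or "," in value:
            raise ValueError(f"{key} must be non-empty and must not contain a comma")
    return ",".join(f"{key}={value}" for key, value in fields.items())


if __name__ == "__main__":
    # Placeholder values only; substitute the details for your metrics server.
    value = build_metrics_config(
        "https://metrics.example.com:8086", "example-db", "example_user", "example_password"
    )
    print(f'export SOLANA_METRICS_CONFIG="{value}"')
```

Generating the value in one place avoids hand-editing a long string, where a stray comma or space is easy to miss.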
Key Metrics to Monitor
Validator Health:
- validator_status: Online/offline status
- validator_delinquent: Delinquency flag
- validator_skip_rate: Percentage of skipped slots
- validator_root_slot: Latest root slot
- validator_vote_distance: Distance from cluster vote
Banking, PoH, and Replay:
- banking_stage_transactions_processed: TPS throughput
- banking_stage_slot_boundary_count: Slot completion rate
- poh_tick_rate: PoH tick generation rate
- replay_stage_replay_transactions: Transaction replay rate
System Resources:
- process_cpu_usage: CPU utilization percentage
- process_memory_usage: Memory usage in bytes
- process_disk_usage: Disk space usage
- network_rx_bytes: Network receive rate
- network_tx_bytes: Network transmit rate
Gossip:
- gossip_num_nodes: Number of known nodes
- gossip_crds_table_size: CRDS table size
- gossip_push_msg_count: Gossip push message count
RPC:
- rpc_request_count: Total RPC requests
- rpc_request_duration: Request latency
- rpc_error_count: RPC error rate
Health Monitoring
agave-watchtower
Overview: agave-watchtower is a monitoring tool that tracks validator health and sends notifications when issues are detected.
Basic Usage:
- Run watchtower on a separate server (not on validator)
- Configure multiple notification channels
- Monitor both validator and cluster health
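A minimal sketch of how a watchtower invocation could be assembled for one or more validators. The --validator-identity and --monitor-active-stake flags are assumed to be available in your build; check agave-watchtower --help before relying on them. The identity shown is a placeholder.

```python
import shlex


def watchtower_command(identities: list[str], monitor_active_stake: bool = True) -> list[str]:
    """Build an argv list for agave-watchtower.

    Assumption: the installed binary accepts --validator-identity (repeatable)
    and --monitor-active-stake; verify with `agave-watchtower --help`.
    """
    if not identities:
        raise ValueError("at least one validator identity is required")
    argv = ["agave-watchtower"]
    for identity in identities:
        argv += ["--validator-identity", identity]
    if monitor_active_stake:
        # Covers cluster-level health in addition to the listed validators.
        argv.append("--monitor-active-stake")
    return argv


if __name__ == "__main__":
    # Placeholder identity; replace with your validator's public key.
    print(shlex.join(watchtower_command(["YourValidatorIdentityPubkey"])))
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting issues.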
Notification Channels
Telegram Setup
Step 1: Create Bot
- Message @BotFather on Telegram
- Send command: /newbot
- Choose a name and a username for your bot (the username must end with “bot”)
- Save the HTTP API token
Step 2: Start the Bot
- Find your bot in Telegram
- Send: /start
Step 3: Add the Bot to a Group
- Create a new Telegram group
- Add your bot to the group
- Send a message to the bot: @yourbot hello
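Telegram notifications generally need the group's chat ID as well as the bot token. Assuming that is the next step here, the sketch below shows how the ID could be read from the Bot API getUpdates response after the message above has been sent. The sample payload is hand-written for illustration and the token is a placeholder.

```python
import json


def get_updates_url(token: str) -> str:
    """URL of the Bot API getUpdates method for the given bot token."""
    return f"https://api.telegram.org/bot{token}/getUpdates"


def extract_chats(payload: dict) -> dict[int, str]:
    """Map each chat ID seen in a getUpdates response to its title or username."""
    chats: dict[int, str] = {}
    for update in payload.get("result", []):
        # Group messages arrive under "message"; membership changes under "my_chat_member".
        event = update.get("message") or update.get("my_chat_member") or {}
        chat = event.get("chat")
        if chat:
            chats[chat["id"]] = chat.get("title") or chat.get("username") or ""
    return chats


# Illustrative response only; real responses contain more fields.
SAMPLE = json.loads("""
{"ok": true, "result": [
  {"update_id": 1, "message": {"message_id": 5, "text": "@yourbot hello",
    "chat": {"id": -1001234567890, "title": "Validator Alerts", "type": "supergroup"}}}
]}
""")

if __name__ == "__main__":
    print(get_updates_url("<your-bot-token>"))
    print(extract_chats(SAMPLE))
```

Group chat IDs are negative numbers, so keep the leading minus sign when copying the value.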
Discord Setup
Create Webhook:
- Open Discord server settings
- Navigate to Integrations → Webhooks
- Create webhook and copy URL
Slack Setup
Create Webhook:
- Go to Slack App Directory
- Search for “Incoming Webhooks”
- Add to workspace and select channel
- Copy webhook URL
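Before wiring a webhook URL into the monitoring setup, it can help to post a test message to it. The sketch below builds such a request for either service: Discord webhooks read the message from a content field and Slack incoming webhooks from a text field. The function names and the User-Agent string are illustrative choices, and the URL in the usage example is a placeholder.

```python
import json
import urllib.request

# JSON field that carries the message body for each service.
MESSAGE_FIELD = {"discord": "content", "slack": "text"}


def webhook_request(url: str, service: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Discord or Slack webhook."""
    body = json.dumps({MESSAGE_FIELD[service]: message}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # An explicit User-Agent; some services reject the urllib default.
        "User-Agent": "webhook-test/1.0",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send_test_message(url: str, service: str, message: str = "watchtower webhook test") -> int:
    """Send the test message and return the HTTP status code."""
    with urllib.request.urlopen(webhook_request(url, service, message), timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder URL; this only prints the request body and sends nothing.
    request = webhook_request("https://example.invalid/webhook", "slack", "hello")
    print(request.data.decode("utf-8"))
```

Calling send_test_message with the real webhook URL should make the message appear in the selected channel.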
Twilio SMS Setup
Get Credentials:
- Create Twilio account
- Get Account SID and Auth Token
- Get Twilio phone number
Running Watchtower as Service
Create Systemd Service: Create /etc/systemd/system/watchtower.service:
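As a rough sketch of what such a unit could contain, the script below renders a unit file from a template. The service user, binary path, environment file, and the --validator-identity flag are all assumptions for illustration; adjust them to your installation and check agave-watchtower --help for the flags your version supports.

```python
from string import Template

# Hypothetical unit layout; every value is filled in by render_unit().
UNIT_TEMPLATE = Template("""\
[Unit]
Description=Agave watchtower
After=network-online.target
Wants=network-online.target

[Service]
User=$user
EnvironmentFile=$env_file
ExecStart=$binary --validator-identity $identity
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
""")


def render_unit(user: str, env_file: str, binary: str, identity: str) -> str:
    """Return the text of a systemd unit that keeps watchtower running."""
    return UNIT_TEMPLATE.substitute(
        user=user, env_file=env_file, binary=binary, identity=identity
    )


if __name__ == "__main__":
    # Placeholder values: an unprivileged user, an env file holding the
    # notification credentials, and the path where the binary is installed.
    print(render_unit("sol", "/etc/watchtower/env",
                      "/usr/local/bin/agave-watchtower", "YourValidatorIdentityPubkey"))
```

Keeping notification tokens in a separate environment file, readable only by the service user, keeps them out of the unit file itself. After writing the unit, systemctl daemon-reload followed by systemctl enable --now watchtower starts the service.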
Manual Health Checks
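One quick manual check is the JSON-RPC getHealth method, which reports "ok" when the node considers itself healthy and returns an error otherwise. A minimal sketch, assuming the validator exposes RPC on the default local port 8899:

```python
import json
import urllib.request


def health_request(rpc_url: str) -> urllib.request.Request:
    """Build a JSON-RPC getHealth request for the given endpoint."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getHealth"}).encode("utf-8")
    return urllib.request.Request(
        rpc_url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def interpret_health(response: dict) -> str:
    """Summarize a getHealth response as 'healthy' or 'unhealthy: <reason>'."""
    if response.get("result") == "ok":
        return "healthy"
    error = response.get("error") or {}
    return f"unhealthy: {error.get('message', 'unknown error')}"


def check_health(rpc_url: str = "http://127.0.0.1:8899") -> str:
    """Query the node and interpret the answer (the default URL is an assumption)."""
    with urllib.request.urlopen(health_request(rpc_url), timeout=5) as response:
        return interpret_health(json.load(response))


if __name__ == "__main__":
    try:
        print(check_health())
    except OSError as exc:
        # Connection refused or timed out: the RPC port is not reachable.
        print(f"unreachable: {exc}")
```

An unreachable RPC port is itself a useful signal: either the process is down or RPC is bound to a different address.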
Check Validator Status:
Performance Dashboards
Public Grafana Dashboards
Cluster Telemetry: https://metrics.solana.com/d/monitor-edge/cluster-telemetry
Shows:
- Cluster stability
- Validator streamer stats
- Tower consensus metrics
- IP network statistics
- Snapshot status
- RPC service health
- Total prioritization fees
- Block minimum fees
- Cost tracker statistics
- Ping API metrics
- Validator responsiveness
Custom Dashboards
Grafana Setup: Deploy your own Grafana instance for custom dashboards:
- Add InfluxDB as data source
- Configure connection to metrics database
- Test connection
Useful panels to include:
- Current slot vs cluster slot
- Vote credits per epoch
- Skip rate over time
- CPU and memory usage
- Network bandwidth
- Disk I/O and space
- Transaction processing rate
Performance Baselines
Expected Performance Metrics:
PoH Tick Rate:
- Should match the cluster target (typically 160 ticks/second: 64 ticks per slot at a 400 ms target slot time)
- A slower rate indicates CPU performance issues
- Check: lscpu for clock speed
Skip Rate:
- Target: Less than 5% for a healthy validator
- 5-10%: Investigate performance
- Greater than 10%: Critical, likely delinquent
Vote Distance:
- Target: Less than 128 slots from cluster
- Higher values indicate catchup issues
Memory Usage:
- Typical: 128-256 GB utilized
- Monitor for memory leaks (constantly increasing usage)
Disk I/O:
- Ledger: High read/write during catchup
- Accounts: Steady I/O during normal operation
- Sustained high I/O can indicate a disk bottleneck
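The percentage and slot thresholds above can be turned into simple checks. The sketch below assumes the 5% and 10% figures refer to skip rate and the 128-slot figure to the distance behind the cluster; the function names are illustrative.

```python
def classify_skip_rate(skipped_slots: int, leader_slots: int) -> tuple[float, str]:
    """Return the skip rate in percent and its status band."""
    if leader_slots <= 0:
        raise ValueError("leader_slots must be positive")
    rate = 100.0 * skipped_slots / leader_slots
    if rate < 5:
        status = "healthy"
    elif rate <= 10:
        status = "investigate"
    else:
        status = "critical"
    return rate, status


def is_caught_up(validator_slot: int, cluster_slot: int, max_distance: int = 128) -> bool:
    """True when the validator is fewer than max_distance slots behind the cluster."""
    return cluster_slot - validator_slot < max_distance


if __name__ == "__main__":
    # Example figures only: 7 skipped out of 100 leader slots, 40 slots behind.
    print(classify_skip_rate(7, 100))
    print(is_caught_up(validator_slot=250_000_000, cluster_slot=250_000_040))
```

Skip rate is only meaningful over a window with enough leader slots; a single skipped slot out of four reads as 25% without indicating a real problem.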
Alerting Setup
Critical Alerts
Immediate Action Required:
- Validator delinquent
- Vote account credits stopped increasing
- Validator not in gossip
- Process crashed/stopped
- Disk space less than 10% free
- Memory usage greater than 95%
Warning Alerts
- Skip rate greater than 5%
- Vote distance greater than 100 slots
- Disk space less than 20% free
- Memory usage greater than 80%
- High CPU usage (greater than 90%) sustained
- Network errors increasing
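A minimal sketch of how the numeric thresholds in this section could be evaluated against one set of readings. Treating the looser thresholds (20% disk, 80% memory, 5% skip rate, 100 slots) as a warning tier is an interpretation of the list above, and the Sample fields are hypothetical names, not metrics emitted by the validator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One set of readings; field names are illustrative."""
    disk_free_pct: float
    memory_used_pct: float
    skip_rate_pct: float
    vote_distance_slots: int


def evaluate(sample: Sample) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for every threshold the sample crosses."""
    alerts: list[tuple[str, str]] = []
    # Check the critical threshold first so each resource raises one alert at most.
    if sample.disk_free_pct < 10:
        alerts.append(("critical", "disk space below 10% free"))
    elif sample.disk_free_pct < 20:
        alerts.append(("warning", "disk space below 20% free"))
    if sample.memory_used_pct > 95:
        alerts.append(("critical", "memory usage above 95%"))
    elif sample.memory_used_pct > 80:
        alerts.append(("warning", "memory usage above 80%"))
    if sample.skip_rate_pct > 5:
        alerts.append(("warning", "skip rate above 5%"))
    if sample.vote_distance_slots > 100:
        alerts.append(("warning", "vote distance above 100 slots"))
    return alerts


if __name__ == "__main__":
    for severity, message in evaluate(Sample(8.0, 85.0, 6.0, 50)):
        print(f"{severity}: {message}")
```

Routing critical alerts to a paging channel and warnings to a chat channel keeps the urgent signal from being buried.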
Alert Configuration Examples
Watchtower Alerts: Automatic notifications when:
- Validator becomes delinquent
- Validator returns to active status
- Identity balance low
- Vote account balance low
Log Analysis
Important Log Patterns
Healthy Operation:
Log Monitoring Commands
Real-time Monitoring:
Resource Monitoring
System Resource Monitoring
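For a quick look at system resources without extra tooling, a short script using only the Python standard library can report free disk space and CPU load. This is a sketch for Linux hosts; the ledger path is a hypothetical example.

```python
import os
import shutil


def disk_free_percent(path: str = "/") -> float:
    """Percentage of free space on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def load_per_core() -> float:
    """One-minute load average divided by the number of CPU cores (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


if __name__ == "__main__":
    # Hypothetical mount point; fall back to / if it does not exist on this host.
    ledger_path = "/mnt/ledger" if os.path.exists("/mnt/ledger") else "/"
    print(f"disk free on {ledger_path}: {disk_free_percent(ledger_path):.1f}%")
    print(f"load per core (1 min): {load_per_core():.2f}")
```

A load per core that stays above 1.0 means runnable work is queuing for CPU time, which is worth comparing against the skip rate over the same period.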
CPU Monitoring:
Automated Resource Monitoring
Using collectd:
Monitoring Best Practices
Essential Monitoring Checklist
Setup:
- agave-watchtower running on separate server
- Multiple notification channels configured
- Grafana dashboards configured
- Alert thresholds set appropriately
- Log rotation configured
- Resource monitoring enabled
Daily:
- Check validator status in dashboard
- Review alert notifications
- Verify vote credits increasing
- Check skip rate
Weekly:
- Review performance trends
- Analyze error logs
- Check disk space growth rate
- Verify backup procedures
Monthly:
- Review and tune alert thresholds
- Update monitoring dashboards
- Test alert notifications
- Document any incidents
Integration with Existing Tools
APM Solutions:
- Datadog
- New Relic
- AppDynamics
Log Aggregation:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Splunk
- Graylog
Incident Management:
- PagerDuty
- Opsgenie
- VictorOps
Troubleshooting with Monitoring Data
When issues occur, use monitoring data to:
- Establish Timeline: When did the issue start?
- Identify Symptoms: What metrics are abnormal?
- Correlate Events: What changed before the issue?
- Resource Analysis: Are resources exhausted?
- Compare Baselines: How does current state differ from normal?
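The baseline comparison can be made concrete with a simple deviation score. The sketch below expresses how far a current reading sits from a baseline in standard deviations. It is one possible approach rather than a prescribed method, and the sample figures are made up for illustration.

```python
from statistics import mean, stdev


def deviation_from_baseline(baseline: list[float], current: float) -> float:
    """Number of standard deviations between current and the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two samples")
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; compare values directly")
    return (current - mean(baseline)) / spread


if __name__ == "__main__":
    # Made-up skip-rate readings (percent) from a normal week, then today's value.
    normal_week = [2.0, 4.0, 6.0, 3.0, 5.0, 4.0, 4.0]
    score = deviation_from_baseline(normal_week, 9.0)
    print(f"today's skip rate is {score:.1f} standard deviations above normal")
```

A reading several standard deviations from the baseline is a reasonable starting point for the timeline: find when the score first left its normal range, then look for what changed just before.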