Production Troubleshooting
This guide provides systematic troubleshooting procedures for common validator issues encountered in production.
Common Issues
Validator Delinquent
Symptom: Validator marked as delinquent, not voting on blocks.
Diagnosis:
- Validator Behind Cluster:
  - Symptom: Large slot distance from cluster
  - Solution: Wait for catchup, check network/CPU performance
- Insufficient Compute Resources:
  - Symptom: High CPU usage, slow PoH tick rate
  - Solution: Check CPU governor, verify release build
- Network Connectivity Issues:
  - Symptom: Not in gossip, no network activity
  - Solution: Check firewall, network connectivity
- Disk I/O Bottleneck:
  - Symptom: High disk wait time, slow replay
  - Solution: Monitor disk I/O, consider faster drives
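The CPU governor check mentioned above can be sketched in Python. This is a minimal sketch, assuming a Linux host that exposes cpufreq under `/sys/devices/system/cpu`. Treating `performance` as the desired governor is an assumption here, not something this guide states, and the function names are hypothetical.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        # path is .../cpuN/cpufreq/scaling_governor, so the CPU name
        # is the directory two levels up.
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors


def cpus_not_in_performance_mode(governors):
    """List CPUs whose governor is anything other than 'performance'.

    Assumption: 'performance' is the governor you want on a validator.
    """
    return sorted(cpu for cpu, gov in governors.items() if gov != "performance")
```

Calling `cpus_not_in_performance_mode(read_governors())` on the validator host returns an empty list when every core reports `performance`. An empty result from `read_governors()` itself usually means cpufreq is not exposed (for example, inside some virtual machines).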
Validator Won’t Start
Symptom: Validator process exits immediately after starting.
Diagnosis:
- Port Already in Use:
  - Error: “Address already in use”
  - Solution: Check for conflicting processes, adjust ports
- Insufficient File Descriptors:
  - Error: “Too many open files”
  - Solution: Verify systemd limits
- Insufficient Memory Lock:
  - Error: “Cannot allocate memory” or “mlock failed”
  - Solution: Verify MEMLOCK limit
- Corrupted Ledger:
  - Error: “Failed to load ledger” or “RocksDB error”
  - Solution: Restore from snapshot or resync
- Invalid Genesis Hash:
  - Error: “Genesis hash mismatch”
  - Solution: Verify correct genesis hash for cluster
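The error-to-cause mapping above lends itself to a small lookup helper. The sketch below uses only the error fragments quoted in the list. The case-insensitive substring matching and the name `classify_startup_error` are assumptions for illustration, and real log lines may word these errors differently.

```python
# (error fragment, cause, solution), taken from the list above.
STARTUP_ERRORS = [
    ("Address already in use", "Port Already in Use",
     "Check for conflicting processes, adjust ports"),
    ("Too many open files", "Insufficient File Descriptors",
     "Verify systemd limits"),
    ("Cannot allocate memory", "Insufficient Memory Lock",
     "Verify MEMLOCK limit"),
    ("mlock failed", "Insufficient Memory Lock",
     "Verify MEMLOCK limit"),
    ("Failed to load ledger", "Corrupted Ledger",
     "Restore from snapshot or resync"),
    ("RocksDB error", "Corrupted Ledger",
     "Restore from snapshot or resync"),
    ("Genesis hash mismatch", "Invalid Genesis Hash",
     "Verify correct genesis hash for cluster"),
]


def classify_startup_error(log_line):
    """Return (cause, solution) for the first known fragment in log_line.

    Returns None when no fragment matches. Matching is a case-insensitive
    substring search, which is an assumption about how the errors appear.
    """
    lowered = log_line.lower()
    for fragment, cause, solution in STARTUP_ERRORS:
        if fragment.lower() in lowered:
            return cause, solution
    return None
```

Feeding the last few lines of the validator log through `classify_startup_error` gives a quick first guess at which item in the list applies. A `None` result means the failure is outside this list and the full log needs reading.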
Out of Disk Space
Symptom: Validator stops, “No space left on device” errors.
Diagnosis:
- Enable Ledger Limiting
- Clean Up Old Snapshots
- Add Storage:
  - Provision larger drives
  - Move snapshots to separate storage
- Disable Transaction History
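Before applying any of the remedies above, it can help to confirm how full the ledger volume actually is. The following is a minimal sketch using Python's standard `shutil.disk_usage`. The 90% warning threshold and the function names are assumptions chosen for illustration.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def check_disk(path, warn_at_percent=90.0):
    """Summarize usage of the filesystem that contains path.

    warn_at_percent is a hypothetical threshold, not a documented limit.
    """
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return {
        "path": path,
        "free_gib": usage.free / 2**30,
        "percent_used": pct,
        "warn": pct >= warn_at_percent,
    }
```

Point `check_disk` at the ledger, accounts, and snapshot directories separately if they live on different volumes. The percentage is computed from raw used and total bytes, so it can differ slightly from what `df` reports on filesystems that reserve blocks.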
High Skip Rate
Symptom: Validator skipping >5% of assigned slots.
Diagnosis:
- CPU Performance:
  - Slow PoH tick rate
  - Solution: Verify CPU governor, clock speed
- Disk I/O Latency:
  - High disk wait times
  - Solution: Upgrade to faster NVMe drives
- Network Latency:
  - Poor connectivity to cluster
  - Solution: Improve network connection, reduce latency
- Insufficient Memory:
  - Swapping to disk
  - Solution: Add more RAM or reduce memory usage
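The 5% threshold in the symptom is a simple ratio of skipped slots to assigned leader slots. The sketch below shows the arithmetic. It assumes you already have the two counts (for example from the output of `solana block-production`), and the function names are hypothetical.

```python
def skip_rate_percent(assigned_slots, produced_blocks):
    """Return the percentage of assigned leader slots that were skipped."""
    if assigned_slots <= 0:
        raise ValueError("assigned_slots must be positive")
    if not 0 <= produced_blocks <= assigned_slots:
        raise ValueError("produced_blocks must be between 0 and assigned_slots")
    skipped = assigned_slots - produced_blocks
    return 100.0 * skipped / assigned_slots


def is_high_skip_rate(assigned_slots, produced_blocks, threshold_percent=5.0):
    """True when the skip rate exceeds the threshold (5% by default)."""
    return skip_rate_percent(assigned_slots, produced_blocks) > threshold_percent
```

With 200 assigned slots and 188 produced blocks, 12 slots were skipped, which is a 6% skip rate and above the threshold. With 190 produced blocks the rate is exactly 5% and is not flagged.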
Memory Issues
Symptom: High memory usage, OOM kills, swapping.
Diagnosis:
- Add More RAM:
  - Mainnet validators typically need 256-512 GB
- Tune Accounts Cache
- Disable Unnecessary Features
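To tell whether the host is short on RAM or already swapping, the contents of `/proc/meminfo` can be parsed directly. This is a minimal sketch for Linux. The function names are hypothetical, and the minimum is passed in by the caller because `MemTotal` excludes memory reserved by the kernel, so a machine with 256 GB installed reports somewhat less.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most fields are reported in kB; the unit suffix is ignored.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(meminfo, min_total_gib):
    """Report total RAM in GiB and whether any swap is in use.

    min_total_gib is supplied by the caller; treating the guide's
    "256-512 GB" as GiB is an assumption made for simplicity.
    """
    kib_per_gib = 1024 * 1024
    total_gib = meminfo["MemTotal"] / kib_per_gib
    swap_used = meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
    return {
        "total_gib": total_gib,
        "below_minimum": total_gib < min_total_gib,
        "swap_in_use": swap_used > 0,
    }
```

On the validator host, read the file with `open("/proc/meminfo").read()` and pass the text to `parse_meminfo`. A `swap_in_use` value of `True` matches the swapping symptom described above.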
Network Connectivity Issues
Symptom: Validator not appearing in gossip, network errors in logs.
Diagnosis:
- Firewall Configuration
- Port Forwarding:
  - Configure router/firewall for port forwarding
  - Verify public IP address is correct
- Dynamic Port Range
- DNS Issues
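For the DNS item above, a quick resolution check can rule name lookup in or out as the cause. The sketch below uses the standard `socket.getaddrinfo` call. The function name is hypothetical, and which hostnames to test (for example your cluster's entrypoint) is left to the operator.

```python
import socket


def resolve_host(hostname):
    """Return the sorted, de-duplicated IP addresses for hostname.

    Returns an empty list when resolution fails.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

An empty result for a hostname that resolves from another machine points at the local resolver configuration rather than at the validator itself.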
Diagnostic Tools
Solana CLI Tools
solana gossip: List all validators in gossip network.
Validator Admin RPC
Enable Admin RPC:
Blockstore Inspection
Using ldb (RocksDB tool):
System Diagnostic Commands
Process Information:
Log Analysis
Critical Log Patterns
Startup Validation:
Log Analysis Script
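A log analysis pass can be as simple as counting entries per level and noting which modules emit errors. The following is a minimal sketch, assuming log lines of the form `[<timestamp> <LEVEL> <module>] <message>`. That layout, the sample module names, and the function name are assumptions, so adjust the pattern to match your actual log output.

```python
import re
from collections import Counter

# Assumed line layout: [2024-05-01T12:00:00.000000000Z ERROR some::module] text
LOG_LINE = re.compile(
    r"^\[(?P<ts>\S+)\s+(?P<level>ERROR|WARN|INFO|DEBUG|TRACE)\s+"
    r"(?P<target>[^\]]+)\]\s*(?P<msg>.*)$"
)


def summarize_log(lines):
    """Count log lines per level and ERROR lines per module.

    Lines that do not match the assumed layout are skipped.
    Returns (levels, error_targets), both collections.Counter objects.
    """
    levels = Counter()
    error_targets = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        levels[match["level"]] += 1
        if match["level"] == "ERROR":
            error_targets[match["target"]] += 1
    return levels, error_targets
```

Pass an open log file to `summarize_log` to process it line by line. `error_targets.most_common(5)` then lists the modules responsible for the most errors, which is a reasonable place to start reading.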
Performance Debugging
CPU Performance
Check CPU Speed:
Memory Profiling
Heap Profiling:
Network Performance
Bandwidth Testing:
Disk Performance
Benchmark Disk I/O:
Recovery Procedures
Ledger Recovery
Recover from Snapshot:
Accounts Database Recovery
Rebuild from Snapshot:
Vote Account Issues
Insufficient Balance:
- Typically resolves when validator catches up
- Ensure validator is running and healthy
- Monitor vote credits increasing
Getting Help
Community Resources
Discord Channels:
- #validator-support - General validator help
- #testnet-announcements - Important cluster updates
- Join: https://solana.com/discord
- Network Explorer: http://explorer.solana.com/
- Metrics Dashboard: https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry
- GitHub Issues: https://github.com/anza-xyz/agave/issues
Reporting Issues
Information to Include:
- Validator pubkey
- Cluster (mainnet/testnet/devnet)
- Agave version (agave-validator --version)
- System specifications (CPU, RAM, storage)
- Relevant log excerpts
- Steps to reproduce
- Recent changes to configuration
Before Seeking Help
Checklist:
- Checked recent logs for errors
- Verified system resources (CPU, RAM, disk)
- Confirmed network connectivity
- Reviewed recent configuration changes
- Searched existing issues/discussions
- Tried basic troubleshooting steps
- Collected relevant diagnostic information
Emergency Procedures
Validator Crash
- Check if process is running
- Review crash logs
- Check for OOM kill
- Restart if safe
- Monitor recovery
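The OOM-kill check in the steps above could be scripted roughly as follows. This sketch scans kernel log text (such as the output of `dmesg`) for the "Out of memory: Killed process" message printed by recent Linux kernels. The exact wording varies between kernel versions, so the pattern is an assumption, and the function name is hypothetical.

```python
import re

# Message format used by recent Linux kernels; older kernels word it differently.
OOM_KILL = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(kernel_log_lines, process_name=None):
    """Return [(pid, name), ...] for OOM kills found in kernel log lines.

    When process_name is given, only kills of that process name are returned.
    """
    kills = []
    for line in kernel_log_lines:
        match = OOM_KILL.search(line)
        if match and (process_name is None or match["name"] == process_name):
            kills.append((int(match["pid"]), match["name"]))
    return kills
```

If this returns an entry for the validator process, the crash was a memory problem rather than a software fault, and the Memory Issues section applies before any restart.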
Data Corruption
If ledger corruption is suspected:
- Stop validator immediately
- Backup corrupted data for analysis
- Restore from clean snapshot
- Report issue with details
Security Incident
If compromise is suspected:
- Isolate validator from network
- Assess compromise scope
- Follow security incident procedures (see security guide)
- Report to appropriate channels