Troubleshooting

This guide covers common issues you may encounter when running a Harmonic Salsa validator and their solutions.

Connection Problems

Validator Not Connecting to Network

Symptoms:
  • Validator shows 0 peers in gossip
  • Cannot find entrypoints
  • Network timeout errors
Solutions:
  1. Check network connectivity:
    ping entrypoint.mainnet-beta.solana.com
    
  2. Verify entrypoints are correct:
    # For mainnet
    --entrypoint entrypoint.mainnet-beta.solana.com:8001
    
    # For testnet
    --entrypoint entrypoint.testnet.solana.com:8001
    
  3. Check firewall settings:
    # Allow required ports
    sudo ufw allow 8000:8020/tcp
    sudo ufw allow 8000:8020/udp
    
  4. Verify ports are open:
    # Test if the gossip port is reachable (gossip runs over UDP, so pass -u; UDP probes are only indicative)
    nc -zvu entrypoint.mainnet-beta.solana.com 8001
    
  5. Check if validator is binding to correct interface:
    # Verify bind address
    --bind-address 0.0.0.0
    

Low Peer Count

Symptoms:
  • Fewer than 50 peers
  • Intermittent connectivity
Solutions:
  1. Add more entrypoints:
    --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint3.mainnet-beta.solana.com:8001
    
  2. Check for rate limiting:
    • Verify your IP is not being rate-limited
    • Check with your hosting provider
  3. Verify system resources:
    # Check open file limit
    ulimit -n
    # Should be at least 500000
    
  4. Increase connection limits if needed:
    # In /etc/security/limits.conf
    * soft nofile 1000000
    * hard nofile 1000000
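
The limit check above can be scripted so it is easy to repeat after a reboot or a configuration change. This is a minimal sketch: the helper name `nofile_ok` is hypothetical, and the default threshold is simply the 500000 figure quoted above.

```shell
# Returns success (0) when the open-file limit meets the minimum.
# $1: current limit as printed by `ulimit -n` (a number or "unlimited")
# $2: required minimum (defaults to the 500000 figure from this guide)
nofile_ok() {
    current="$1"
    minimum="${2:-500000}"
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$minimum" ]
}

# Compare the limit of the current shell against the minimum.
if nofile_ok "$(ulimit -n)"; then
    echo "open file limit OK"
else
    echo "open file limit too low: $(ulimit -n)"
fi
```

Note that this reports the limit of the shell you run it in. A validator started by systemd takes its limit from the unit file (`LimitNOFILE`), not from /etc/security/limits.conf, so check both places if the numbers disagree.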
    

RPC Connection Refused

Symptoms:
  • Cannot connect to local RPC
  • “Connection refused” errors
Solutions:
  1. Verify RPC is enabled:
    --rpc-port 8899
    
  2. Check RPC bind address:
    # For local access only
    --rpc-bind-address 127.0.0.1
    
    # For external access
    --rpc-bind-address 0.0.0.0
    
  3. Test RPC locally:
    curl -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' http://localhost:8899
    
  4. Check if port is in use:
    sudo lsof -i :8899
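
The curl test above can be wrapped so that it separates "RPC unreachable" from "RPC reachable but unhealthy". This is a sketch, not an official tool: the function names are hypothetical, and it assumes a healthy node answers `getHealth` with a `"result":"ok"` body.

```shell
# Prints the JSON-RPC request body for getHealth.
health_request() {
    printf '%s' '{"jsonrpc":"2.0","id":1,"method":"getHealth"}'
}

# Returns success when a getHealth response body reports "ok".
# $1: raw JSON response text
is_healthy() {
    printf '%s' "$1" | grep -Eq '"result"[[:space:]]*:[[:space:]]*"ok"'
}

# Queries the RPC endpoint and reports the outcome.
# $1: RPC URL (defaults to the local port used in this guide)
check_rpc() {
    url="${1:-http://localhost:8899}"
    if ! body=$(curl -sS --max-time 5 -X POST \
            -H "Content-Type: application/json" \
            -d "$(health_request)" "$url"); then
        echo "RPC unreachable at $url"
        return 1
    fi
    if is_healthy "$body"; then
        echo "RPC healthy"
    else
        echo "RPC reachable but not healthy: $body"
        return 1
    fi
}
```

Run `check_rpc` with no arguments for the local endpoint, or pass a URL to test the externally bound address.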
    

Sync Issues

Validator Not Catching Up

Symptoms:
  • Catchup percentage stuck below 100%
  • Slot number far behind network
  • High “distance from network” value
Solutions:
  1. Check system resources:
    # CPU usage should be below 90%
    # (-d, joins multiple PIDs with commas, which is the form top -p expects)
    top -p "$(pgrep -d, agave-validator)"
    
    # Check disk I/O wait
    iostat -x 5
    
  2. Verify snapshot download:
    # Check logs for snapshot download
    sudo journalctl -u agave-validator | grep snapshot
    
  3. Try downloading from different validator:
    # Add more known validators
    --known-validator <PUBKEY>
    
  4. Clear ledger and resync:
    sudo systemctl stop agave-validator
    rm -rf ~/validator-ledger
    sudo systemctl start agave-validator
    
  5. Check bandwidth:
    # Monitor network usage
    iftop
    

Slow Snapshot Download

Symptoms:
  • Snapshot download taking hours
  • Slow download speed in logs
Solutions:
  1. Increase minimum download speed:
    --minimal-snapshot-download-speed 20971520  # 20 MiB/s, given in bytes per second
    
  2. Add more known validators:
    --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
    --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ
    
  3. Check network congestion:
    # Test bandwidth
    speedtest-cli
    
  4. Consider pre-downloading snapshot:
    • Download snapshot archive from a trusted source
    • Place the archive, still compressed, in the validator's snapshots directory (the ledger directory unless --snapshots is set)
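
The value passed to --minimal-snapshot-download-speed is in bytes per second, so the 20971520 used above is 20 × 1024 × 1024. A small helper (the name is hypothetical) keeps that arithmetic out of your head when picking another threshold:

```shell
# Converts a speed in MiB/s to the bytes-per-second value the flag expects.
# $1: speed in MiB/s (whole number)
mib_per_sec_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mib_per_sec_to_bytes 20   # prints 20971520
```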

Validator Keeps Restarting During Catchup

Symptoms:
  • Validator crashes during replay
  • Out of memory errors
  • Database corruption errors
Solutions:
  1. Check memory usage:
    free -h
    # Ensure adequate free memory
    
  2. Verify disk space:
    df -h
    # Ensure at least 20% free space
    
  3. Check for disk errors:
    sudo dmesg | grep -i error
    
  4. Increase swap if needed:
    sudo fallocate -l 32G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    
  5. Verify RocksDB settings:
    # Use appropriate compression
    --rocksdb-ledger-compression lz4
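
The 20% free-space guideline above can be checked mechanically. This is a sketch with hypothetical helper names; it assumes the POSIX output format of `df -P`, where the fifth column of the second line is the used percentage.

```shell
# Prints the "Use%" column (for example "81%") for the filesystem holding a path.
# $1: path to check, such as the ledger directory
used_pct() {
    df -P "$1" | awk 'NR == 2 { print $5 }'
}

# Returns success when at least the minimum percentage of space is free.
# $1: used percentage as printed by df (for example "81%")
# $2: minimum free percentage (defaults to the 20% figure from this guide)
has_min_free() {
    used=$(printf '%s' "$1" | tr -d '%')
    min_free="${2:-20}"
    [ $((100 - used)) -ge "$min_free" ]
}
```

For example, `has_min_free "$(used_pct ~/validator-ledger)" || echo "low disk space"` prints a warning once the ledger filesystem is more than 80% full.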
    

Performance Problems

High CPU Usage

Symptoms:
  • CPU consistently above 90%
  • Validator lagging behind network
  • High system load
Solutions:
  1. Check CPU allocation:
    lscpu
    # Verify core count and speed
    
  2. Optimize thread settings:
    --unified-scheduler-handler-threads 4
    
  3. Reduce QUIC endpoints if needed:
    --num-quic-endpoints 4
    
  4. Check for competing processes:
    top
    # Look for other CPU-intensive processes
    
  5. Verify CPU governor:
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    # Should be "performance"
    
    # Set to performance mode
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
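
On machines with many cores, the governor listing is long and easy to misread. The following sketch counts the cores that are not in performance mode; the helper name is hypothetical and it only reads the same sysfs files shown above.

```shell
# Reads governor names on stdin (one per line) and prints how many
# are something other than "performance".
count_non_performance() {
    # grep exits non-zero when the count is 0, so mask that status.
    grep -vcx 'performance' || true
}

# Example, using the same sysfs path as above:
#   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | count_non_performance
```

A result of 0 means every core reports the performance governor.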
    

High Memory Usage

Symptoms:
  • Memory usage growing over time
  • System using swap heavily
  • OOM killer terminating processes
Solutions:
  1. Enable disk-based accounts index:
    --enable-accounts-disk-index \
    --accounts-index-path /mnt/nvme3/accounts-index
    
  2. Reduce accounts cache:
    --accounts-db-cache-limit-mb 8000
    
  3. Limit snapshot retention:
    --maximum-full-snapshots-to-retain 2 \
    --maximum-incremental-snapshots-to-retain 4
    
  4. Monitor for memory leaks:
    # Watch memory over time
    watch -n 60 'ps aux | grep agave-validator | grep -v grep'
    

Slow Disk I/O

Symptoms:
  • High I/O wait time
  • Slow ledger replay
  • Banking stage lag
Solutions:
  1. Verify disk performance:
    sudo fio --name=random-read --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --runtime=60
    
  2. Use separate drives:
    --ledger /mnt/nvme0/ledger \
    --accounts /mnt/nvme1/accounts \
    --snapshots /mnt/nvme2/snapshots
    
  3. Check disk health:
    sudo smartctl -a /dev/nvme0n1
    
  4. Optimize RocksDB:
    --rocksdb-shred-compaction level
    
  5. Monitor disk usage:
    sudo iotop -o
    

Network Bandwidth Exhausted

Symptoms:
  • High packet loss
  • Slow block reception
  • Network timeouts
Solutions:
  1. Monitor bandwidth usage:
    iftop -i eth0
    
  2. Limit QUIC connections:
    --tpu-max-staked-connections 1000 \
    --tpu-max-unstaked-connections 200
    
  3. Verify network hardware:
    • Confirm the interface has negotiated at least a 1 Gbps link
    • Verify no errors on interface:
      ethtool -S eth0 | grep errors
      
  4. Consider upgrade:
    • Upgrade to 10 Gbps if available
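
The interface error check above prints one line per counter, and the set of counters varies by driver. As a rough sketch, the counters can be summed into a single number that is easy to compare over time. The helper name is hypothetical, and it assumes the usual `name: value` layout of `ethtool -S` output.

```shell
# Reads `ethtool -S` output on stdin and prints the sum of every
# counter whose name contains "errors".
sum_error_counters() {
    awk -F: '/errors/ { total += $NF } END { print total + 0 }'
}

# Example:
#   ethtool -S eth0 | sum_error_counters
```

A total that keeps growing between runs points at a cabling, NIC, or switch problem rather than validator configuration.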

Voting Issues

Validator Not Voting

Symptoms:
  • No votes being submitted
  • Vote credits not increasing
  • Validator marked as delinquent
Solutions:
  1. Verify vote account configuration:
    solana vote-account $(solana-keygen pubkey ~/vote-account-keypair.json)
    
  2. Check validator identity has SOL:
    solana balance ~/validator-keypair.json
    # Should have enough for vote fees
    
  3. Verify vote account authority:
    # Ensure the validator identity is listed as the authorized voter
    solana vote-account $(solana-keygen pubkey ~/vote-account-keypair.json) | grep -i "authorized voter"
    
  4. Check for --no-voting flag:
    # Ensure not running with --no-voting
    ps aux | grep agave-validator | grep no-voting
    
  5. Review logs for vote errors:
    sudo journalctl -u agave-validator | grep -i "vote.*error"
    

High Skip Rate

Symptoms:
  • Block production skip rate above 5%
  • Missing leader slots
  • Poor performance metrics
Solutions:
  1. Check system performance:
    # Verify adequate resources
    top
    iostat -x 5
    
  2. Verify network connectivity:
    # Check latency to other validators
    ping -c 10 <peer-ip>
    
  3. Optimize block production:
    --block-production-method central-scheduler
    
  4. Check PoH speed:
    # Look for PoH speed warnings in logs
    sudo journalctl -u agave-validator | grep "PoH speed"
    
  5. Consider hardware upgrade:
    • Faster CPU for PoH
    • Faster NVMe for ledger access
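
To compare your own numbers against the 5% threshold above, skip rate is skipped leader slots divided by total leader slots. The sketch below uses hypothetical helper names and takes the two counts as arguments; where you read them from (for example `solana block-production`) is up to you.

```shell
# Prints the skip rate as a percentage with two decimals.
# $1: skipped leader slots   $2: total leader slots
skip_rate_pct() {
    awk -v skipped="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "0.00"; exit }
        printf "%.2f\n", (skipped / total) * 100
    }'
}

# Returns success when the skip rate is strictly above the limit.
# $1: skipped slots   $2: total slots   $3: limit in percent (default 5)
skip_rate_exceeds() {
    awk -v skipped="$1" -v total="$2" -v limit="${3:-5}" 'BEGIN {
        # Cross-multiply to avoid floating-point rounding at the boundary.
        exit !(total > 0 && skipped * 100 > limit * total)
    }'
}

skip_rate_pct 3 120   # prints 2.50
```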

Log Analysis

Understanding Common Errors

“Slot X is not a descendant of Y”

Cause: Fork mismatch during replay
Solution:
  • Usually resolves automatically
  • If persistent, may need to clear ledger and resync

“Transaction would exceed account data limit”

Cause: Transaction trying to allocate too much data
Solution:
  • Not a validator issue
  • Transaction submitter needs to fix

“Blockstore error: SlotNotRooted”

Cause: Accessing non-rooted slot data
Solution:
  • Usually transient
  • If persistent, check ledger integrity

“Tower vote failed”

Cause: Issue with tower voting logic
Solution:
  • Check tower file integrity
  • Review recent consensus changes
  • May need to reset tower (only if instructed)

Log Filtering

Set appropriate log levels:
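# Default info level (pass the same --ledger path the running validator uses)
agave-validator --ledger ~/validator-ledger set-log-filter info

# Debug specific components
agave-validator --ledger ~/validator-ledger set-log-filter solana_core=debug,solana_runtime=info

# Trace transaction processing
agave-validator --ledger ~/validator-ledger set-log-filter solana_runtime::message_processor=trace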

Database Issues

RocksDB Corruption

Symptoms:
  • “Corruption: bad block contents” errors
  • Validator crashes on startup
  • Database errors in logs
Solutions:
  1. Stop validator:
    sudo systemctl stop agave-validator
    
  2. Backup current state:
    cp -r ~/validator-ledger ~/validator-ledger.backup
    
  3. Clear corrupted database:
    rm -rf ~/validator-ledger/rocksdb
    
  4. Restart validator:
    sudo systemctl start agave-validator
    
  5. If issue persists, full resync:
    rm -rf ~/validator-ledger
    sudo systemctl start agave-validator
    

Accounts Database Issues

Symptoms:
  • Account verification failures
  • Accounts hash mismatch
  • Bank hash mismatch
Solutions:
  1. Verify accounts integrity:
    --accounts-db-verify-refcounts
    
  2. Clear accounts cache:
    sudo systemctl stop agave-validator
    rm -rf ~/validator-ledger/accounts_cache
    sudo systemctl start agave-validator
    
  3. Full accounts resync:
    sudo systemctl stop agave-validator
    rm -rf ~/validator-ledger/accounts
    sudo systemctl start agave-validator
    

Emergency Procedures

Validator Completely Stuck

  1. Check if process is responsive:
    pgrep agave-validator
    
  2. Try graceful shutdown:
    agave-validator --ledger ~/validator-ledger exit --force
    
  3. If unresponsive, force kill:
    sudo killall -9 agave-validator
    
  4. Check for core dumps:
    ls -lh /var/crash/
    
  5. Review system logs:
    sudo journalctl -xe
    
  6. Restart validator:
    sudo systemctl start agave-validator
    

Recovering from Delinquency

Steps:
  1. Verify validator is caught up:
    solana catchup $(solana-keygen pubkey ~/validator-keypair.json)
    
  2. Check vote account status:
    solana vote-account $(solana-keygen pubkey ~/vote-account-keypair.json)
    
  3. Monitor for vote submission:
    sudo journalctl -u agave-validator -f | grep vote
    
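  4. Allow time for the status to update:
    • Delinquent status reflects how far the last vote trails the cluster tip, so it normally clears soon after voting resumes
    • Continue monitoring vote credits, which are tallied per epoch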
  5. If still delinquent, check:
    • System resources
    • Network connectivity
    • Vote account balance

Getting Help

Diagnostic Information to Collect

When seeking help, gather:
  1. Validator version:
    agave-validator --version
    
  2. System info:
    uname -a
    lscpu
    free -h
    df -h
    
  3. Recent logs:
    sudo journalctl -u agave-validator --since "1 hour ago" > validator-logs.txt
    
  4. Configuration:
    ps aux | grep agave-validator
    
  5. Network status:
    solana gossip | grep $(solana-keygen pubkey ~/validator-keypair.json)
    solana catchup $(solana-keygen pubkey ~/validator-keypair.json)
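
The version and system-info commands above can be gathered into one directory so the output is easy to attach to a support request. This is a sketch under stated assumptions: the helper names and file names are hypothetical, and logs and network status are left out because they need sudo or a reachable cluster.

```shell
# Runs a command and saves its output; notes the problem instead of
# failing when the command is missing or exits with an error.
# $1: output file   remaining arguments: the command to run
run_or_note() {
    out="$1"
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$out" 2>&1 || echo "command exited with an error: $*" >> "$out"
    else
        echo "not available: $1" > "$out"
    fi
}

# Collects version and system information into a directory.
# $1: output directory (created if missing)
collect_system_info() {
    dir="$1"
    mkdir -p "$dir"
    run_or_note "$dir/version.txt" agave-validator --version
    run_or_note "$dir/uname.txt" uname -a
    run_or_note "$dir/lscpu.txt" lscpu
    run_or_note "$dir/memory.txt" free -h
    run_or_note "$dir/disk.txt" df -h
}
```

For example, `collect_system_info ~/validator-diagnostics` writes one file per command. Review the files before sharing them, since command lines and logs can reveal paths and addresses.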
    

Community Resources

Preventive Maintenance

Regular Backups

Backup keypairs and tower state regularly.

Monitor Metrics

Set up comprehensive monitoring and alerts.

Keep Updated

Stay current with validator software updates.

Test Changes

Test configuration changes on testnet first.

Next Steps

Configuration

Review and optimize your configuration

Monitoring

Set up better monitoring to catch issues early

Operations

Learn more operational best practices
