Diagnostic Approach
Troubleshooting Workflow
- Identify symptoms - What is failing or slow?
- Check monitoring - Review metrics and dashboards
- Examine logs - Look for errors and warnings
- Isolate scope - Single node, table, or cluster-wide?
- Verify configuration - Check flags and settings
- Test hypothesis - Make targeted changes
- Document resolution - Update runbooks
Initial Health Check
Node Issues
Node Crashes and Restarts
Symptoms:
- Node disappears from cluster
- Services not responding
- Frequent restarts visible in logs
- Out of Memory:
  - Solution: Reduce block cache size, increase RAM
  - Flag: --db_block_cache_size_bytes
- Disk Full:
  - Solution: Run log cleanup, expand disk, check compaction
  - Script: ./bin/log_cleanup.sh
- Corrupted Data:
  - Solution: See Failed Tablets section
YB-TServer Crash Loop
When a tablet server repeatedly crashes and restarts:
Recovery Steps:
Node Not Joining Cluster
Symptoms:
- New node added but not visible
- Node shows in list but no tablets assigned
- Verify --tserver_master_addrs matches master addresses
- Check firewall rules allow ports 7100, 9100
- Ensure clocks are synchronized (NTP)
- Restart tserver service if configuration changed
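The firewall check above can be scripted. The following is a minimal sketch that only tests whether a TCP connection to each port succeeds; the function names are illustrative, and the host list you pass in is your own. A failed connection does not distinguish a firewall block from a stopped process.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(hosts, ports=(7100, 9100), timeout=2.0):
    """Map every (host, port) pair to a reachable flag."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}
```

Run check_ports from the new node against the master and tserver addresses, and again in the reverse direction, since firewall rules are often asymmetric.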
Failed Tablets
Identifying Failed Tablets
Using helper script:
- Tablet UUIDs in FAILED state
- Tablet server locations
- Tombstone commands for cleanup
Recovering Failed Tablets
Automatic recovery (preferred): YugabyteDB automatically triggers remote bootstrap for most failures. Wait 15-30 minutes for automatic recovery.
Manual recovery:
Root Cause Analysis
Before tombstoning tablets, investigate the cause:
Performance Issues
High Latency
Symptoms:
- Queries taking longer than normal
- Timeout errors
- User-reported slowness
- High CPU Usage (>80%):
- Disk I/O Saturation:
- Memory Pressure:
- Network Congestion:
Write Amplification
Symptoms:
- High disk writes relative to application writes
- Slow write performance
- Disk wearing out quickly (SSDs)
- Adjust compaction triggers:
- Increase memstore size:
- Use appropriate compaction style:
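The first symptom in this section, disk writes that are high relative to application writes, can be put as a single ratio. This is a small sketch of that arithmetic only; where the two byte counts come from (disk counters, application metrics) is left to the reader and is not specified by this guide.

```python
def write_amplification(disk_bytes_written: int, app_bytes_written: int) -> float:
    """Ratio of bytes physically written to bytes the application asked to write."""
    if app_bytes_written <= 0:
        raise ValueError("app_bytes_written must be positive")
    return disk_bytes_written / app_bytes_written


# Example: 30 GiB hit the disk while the application wrote 5 GiB.
GIB = 1024 ** 3
ratio = write_amplification(30 * GIB, 5 * GIB)  # 6.0
```

Measure both counters over the same time window, and compare the ratio before and after a tuning change rather than against a fixed target.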
Query Timeouts
Symptoms:
- Client timeout errors
- “Operation timed out” messages
- Increase client timeout:
- Increase RPC timeout:
- Optimize slow queries:
  - Add appropriate indexes
  - Reduce data scanned
  - Use prepared statements
Replication Issues
High Replication Lag
Symptoms:
- Follower lag metric increasing
- Stale reads from followers
- Remote bootstrap failures
- Network Latency:
  - Check inter-node latency with ping and iperf
  - Consider preferred zones for leader placement
- Follower Node Overloaded:
  - Check CPU and disk utilization on follower
  - Consider adding more nodes
- Large Batches:
  - Monitor batch sizes in metrics
  - Adjust --consensus_max_batch_size_bytes
Split-Brain Scenarios
Symptoms:
- Multiple masters claiming leadership
- Inconsistent cluster state
- Write failures
- Check network partitions (firewall, switches)
- Verify NTP synchronization across all nodes
- If necessary, restart master processes in sequence
- In extreme cases, may need to rebuild master quorum
Disk Issues
Disk Space Running Out
Immediate Actions:
- Expand disk capacity
- Add more nodes to distribute data
- Implement data retention policies
- Enable table TTL for time-series data
- Archive or drop old data
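To catch a filling disk before it becomes an incident, usage on each data directory can be polled. The sketch below uses the standard library only; the 80% threshold is an assumption to tune for your environment, and the paths you pass in would be your own data directories.

```python
import shutil


def usage_percent(path: str) -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(paths, threshold: float = 80.0):
    """Return {path: percent_used} for every path at or above threshold."""
    flagged = {}
    for path in paths:
        pct = usage_percent(path)
        if pct >= threshold:
            flagged[path] = pct
    return flagged
```

An empty result means no monitored path has reached the threshold.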
Disk I/O Errors
Symptoms:
- I/O error messages in logs
- Failed tablets
- Node crashes
- If single disk failed:
  - Remove disk from fs_data_dirs
  - Restart tserver
  - Replace physical disk
  - Add back to cluster
- If critical disk failure:
  - Decommission entire node
  - Replace hardware
  - Re-add node to cluster
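Before deciding between the single-disk and whole-node paths, it helps to see how many I/O errors the logs contain and where they occur. This is a rough sketch that scans lines for two common error phrasings; the pattern is an assumption and may need adjusting to the exact wording in your logs.

```python
import re

# Assumed phrasings; extend the pattern to match your log format.
IO_ERROR = re.compile(r"i/o error|input/output error", re.IGNORECASE)


def find_io_errors(lines):
    """Return (line_number, text) for every line that mentions an I/O error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if IO_ERROR.search(line)
    ]
```

The function accepts any iterable of strings, so an open log file can be passed directly, for example with open(path, errors="replace").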
Connection Issues
Too Many Connections
Symptoms:
- “too many connections” errors
- Connection timeouts
- Unable to connect to database
- Increase connection limit:
- Enable connection manager:
- Implement connection pooling:
  - Use pgBouncer or application-level pooling
  - Set appropriate pool sizes
  - Configure connection timeouts
- Kill idle connections:
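For the application-level pooling option listed above, the core idea is a fixed set of connections that callers borrow and return, with a bounded wait when none are free. The following is a deliberately small sketch, not a production pool: SimplePool and its parameters are hypothetical names, connect is any zero-argument callable that returns a connection from your driver, and the sketch does no health checking or reconnection.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """Fixed-size pool: at most `size` connections ever exist."""

    def __init__(self, connect, size: int, acquire_timeout: float = 5.0):
        self._acquire_timeout = acquire_timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        """Borrow a connection; always return it to the pool afterwards."""
        try:
            conn = self._idle.get(timeout=self._acquire_timeout)
        except queue.Empty:
            raise TimeoutError("no free connection in pool") from None
        try:
            yield conn
        finally:
            self._idle.put(conn)
```

Because the pool size caps the number of open connections per application instance, the total across all instances should stay below the server's connection limit.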
SSL/TLS Connection Failures
Symptoms:
- “SSL error” messages
- Connection refused with TLS
- Certificate expired
- Certificate hostname mismatch
- Missing intermediate certificates
- Incorrect certificate permissions (must be 0600)
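The permission requirement in the last item is easy to verify programmatically. A minimal sketch, assuming a POSIX filesystem; the function name is illustrative and the path you pass is whichever certificate or key file you are checking.

```python
import os
import stat


def has_mode_0600(path: str) -> bool:
    """True if the file's permission bits are exactly 0600 (owner read/write only)."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600
```

This checks the mode bits only. It does not check ownership, expiry, or hostname matching, which need separate inspection.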
Schema Issues
DDL Operation Stuck
Symptoms:
- CREATE/ALTER/DROP not completing
- Table in transitional state
- CREATE INDEX may take time for backfill
- ALTER TABLE propagates across all tablets
- DROP TABLE waits for snapshot retention
- Check master leader status
- Review catalog manager state
- May need to restart master leader (last resort)
Monitoring and Metrics Issues
Metrics Endpoint Not Responding
Diagnosis:
- Verify webserver ports not blocked
- Check --webserver_port configuration
- Ensure --webserver_interface set correctly
- Restart service if needed
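A quick way to confirm whether an endpoint answers at all is to request it with a short timeout. The sketch below is one way to do that with the standard library; probe is an illustrative name, and the URL is supplied by the caller in the form http://<host>:<webserver_port>/<metrics_path>, with all three parts taken from your own configuration.

```python
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy so the request goes straight to the node.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 5.0):
    """Return (ok, detail) describing whether url answered with a success status."""
    try:
        with _DIRECT.open(url, timeout=timeout) as resp:
            body = resp.read()
            return True, f"HTTP {resp.status}, {len(body)} bytes"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (wrong path, for example).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Nothing answered: blocked port, wrong interface, or process down.
        return False, f"unreachable: {exc}"
```

An HTTP error status points to a path or configuration problem, while an unreachable result points to the port, interface, or process checks in the list above.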
Prometheus Scraping Failures
Check Prometheus targets:
- Update target addresses in prometheus.yml
- Check network connectivity from Prometheus host
- Verify metrics_path matches endpoints
- Increase scrape_timeout if needed
Backup and Restore Issues
Snapshot Creation Fails
Diagnosis:
- Insufficient disk space:
  - Snapshots use hardlinks but require directory space
  - Solution: Free up disk space
- Clock skew too high:
  - Solution: Fix NTP synchronization
- Tablet not responding:
  - Solution: Check tablet health, increase timeout
Restore Failures
Common Issues:
- Namespace already exists:
- Snapshot incomplete:
  - Verify snapshot state is COMPLETE
  - Check all tablet replicas included
- Version incompatibility:
  - Test restore in staging first
  - Check release notes for breaking changes
Cluster Maintenance
Rolling Restart
Perform rolling restarts to minimize downtime:
- Restart followers before leaders
- Wait for node to fully rejoin before next restart
- Monitor cluster health between restarts
- Avoid restarting during peak traffic
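The ordering and wait rules above can be captured in a small driver loop. This is a sketch of the control flow only: is_leader, restart, and is_healthy are hypothetical callables you would supply (for example, wrappers around your own service manager and health checks), and the timing defaults are assumptions.

```python
import time


def rolling_restart(nodes, is_leader, restart, is_healthy,
                    poll_interval=5.0, max_wait=600.0, sleep=time.sleep):
    """Restart nodes one at a time, followers first, waiting for each to rejoin."""
    # sorted() is stable and False sorts before True, so followers keep their
    # relative order and come ahead of leaders.
    ordered = sorted(nodes, key=lambda node: bool(is_leader(node)))
    for node in ordered:
        restart(node)
        waited = 0.0
        while not is_healthy(node):
            if waited >= max_wait:
                # Stop here rather than take a second node down.
                raise TimeoutError(f"{node} did not rejoin within {max_wait}s")
            sleep(poll_interval)
            waited += poll_interval
    return ordered
```

Raising on timeout is intentional: if one node fails to rejoin, continuing would reduce the cluster's redundancy further.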
Upgrading Cluster
Follow upgrade process carefully:
- Review release notes for breaking changes
- Test upgrade in staging environment
- Take full backup before upgrading
- Perform rolling upgrade (TServers then Masters)
- Verify functionality after each node
- Monitor for issues during upgrade window
Getting Help
Information to Gather
When seeking support, collect:
Support Resources
- Documentation: https://docs.yugabyte.com
- Community Slack: https://yugabyte-db.slack.com
- GitHub Issues: https://github.com/yugabyte/yugabyte-db/issues
- Support Portal: For YugabyteDB customers
Next Steps
- Admin Guide - Administrative tasks and tools
- Performance Tuning - Optimize cluster performance
- Monitoring - Set up proactive monitoring
- Backup and Restore - Data protection strategies

