Troubleshooting - YugabyteDB

This guide provides systematic approaches to diagnosing and resolving common YugabyteDB operational issues.

Diagnostic Approach

Troubleshooting Workflow

Identify symptoms - What is failing or slow?
Check monitoring - Review metrics and dashboards
Examine logs - Look for errors and warnings
Isolate scope - Single node, table, or cluster-wide?
Verify configuration - Check flags and settings
Test hypothesis - Make targeted changes
Document resolution - Update runbooks

Initial Health Check

# Show universe configuration (replication and placement settings)
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  get_universe_config

# List all servers
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  list_all_tablet_servers

# Check for failed tablets
./bin/yb-check-failed-tablets.sh \
  --master_addresses ip1:7100,ip2:7100,ip3:7100
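The server list can be screened for unhealthy nodes with a small filter. The sketch below is only an illustration: it assumes each tablet server row printed by list_all_tablet_servers contains a host:port pair and a status word such as ALIVE, which may differ between versions. The helper name non_alive_tservers is hypothetical.

```shell
# Print rows that look like tablet servers (contain host:port) but are not ALIVE.
# Assumption: the status appears as a separate whitespace-delimited word.
non_alive_tservers() {
  awk '/:[0-9]+/ && !/(^|[ \t])ALIVE([ \t]|$)/'
}

# Usage (requires a running cluster):
# yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
#   list_all_tablet_servers | non_alive_tservers
```

Empty output means every listed server reported ALIVE under that assumption.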

Node Issues

Node Crashes and Restarts

Symptoms:

Node disappears from cluster
Services not responding
Frequent restarts visible in logs

Diagnosis:

# Check if services are running
ps aux | grep yb-tserver
ps aux | grep yb-master

# View recent crash logs
tail -100 /home/yugabyte/tserver/logs/yb-tserver.FATAL
tail -100 /home/yugabyte/tserver/logs/yb-tserver.ERROR

# Check for OOM killer
dmesg | grep -i "out of memory"
grep -i "killed process" /var/log/messages

# Review stderr output
cat /home/yugabyte/tserver/tserver.err

Common Causes:

Out of Memory:
- Solution: Reduce block cache size, increase RAM
- Flag: --db_block_cache_size_bytes
Disk Full:
- Solution: Run log cleanup, expand disk, check compaction
- Script: ./bin/log_cleanup.sh
Corrupted Data:
- Solution: See Failed Tablets section

YB-TServer Crash Loop

When a tablet server repeatedly crashes and restarts, use the following recovery steps:

# 1. Stop the service (YugabyteDB Anywhere)
yb-server-ctl tserver stop

# 2. Find problematic tablets from logs
grep -i "FATAL\|CHECK\|SIGSEGV" /home/yugabyte/tserver/logs/yb-tserver.*
# Note the tablet UUIDs mentioned

# 3. Remove failed tablet data (run the find alone first and review the matches)
find /mnt/d0 -name '*<tablet-uuid>*' -print0 | xargs -0 rm -rf
# Repeat for every directory listed in --fs_data_dirs

# 4. Restart the service
yb-server-ctl tserver start

The cluster will automatically re-replicate missing tablets.
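Step 2 above asks you to note the tablet UUIDs by hand. A minimal sketch of extracting them automatically follows. It assumes log lines carry a "T <tablet-id> P <peer-id>" prefix with 32-character lowercase hex IDs; verify this against your own logs before relying on it. The function name tablet_ids_from_logs is hypothetical.

```shell
# Read log lines on stdin and print the unique tablet IDs they mention.
# Assumption: tablet IDs appear as "T <32 hex chars>" in each message prefix.
tablet_ids_from_logs() {
  grep -oE '(^|[^[:alnum:]])T [0-9a-f]{32}' | grep -oE '[0-9a-f]{32}$' | sort -u
}

# Usage:
# grep -hi "FATAL\|CHECK\|SIGSEGV" /home/yugabyte/tserver/logs/yb-tserver.* \
#   | tablet_ids_from_logs
```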

Node Not Joining Cluster

Symptoms:

New node added but not visible
Node shows in list but no tablets assigned

Diagnosis:

# Check master connectivity
curl http://master-ip:7000

# Verify network connectivity
ping <master-ip>
telnet <master-ip> 7100

# Check master addresses configuration
grep tserver_master_addrs /home/yugabyte/tserver/conf/server.conf

# View heartbeat status in logs
grep -i "heartbeat" /home/yugabyte/tserver/logs/yb-tserver.INFO

Resolution:

Verify --tserver_master_addrs matches master addresses
Check firewall rules allow ports 7100, 9100
Ensure clocks are synchronized (NTP)
Restart tserver service if configuration changed
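Comparing the configured master list with the expected one by eye is error-prone when the entries are ordered differently. The following sketch compares two comma-separated address lists as sets. The helper names normalize_addrs and same_addr_set are hypothetical, and how you extract the configured value from server.conf depends on your file layout.

```shell
# Normalize a comma-separated address list: one entry per line, sorted, no blanks.
normalize_addrs() {
  printf '%s\n' "$1" | tr ',' '\n' | tr -d ' ' | sed '/^$/d' | sort -u
}

# Succeed when both lists contain the same addresses, regardless of order.
same_addr_set() {
  [ "$(normalize_addrs "$1")" = "$(normalize_addrs "$2")" ]
}

# Usage:
# same_addr_set "$configured" "ip1:7100,ip2:7100,ip3:7100" \
#   || echo "tserver_master_addrs does not match the master list"
```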

Failed Tablets

Identifying Failed Tablets

Using helper script:

./bin/yb-check-failed-tablets.sh \
  --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  --binary_path /path/to/yugabyte/bin

Output provides:

Tablet UUIDs in FAILED state
Tablet server locations
Tombstone commands for cleanup

Manual check:

# List tablets for a tserver
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  list_tablets_for_tablet_server <tserver-uuid>

# Look for FAILED state in output

Recovering Failed Tablets

Automatic recovery (preferred): YugabyteDB automatically triggers remote bootstrap for most failures. Wait 15-30 minutes for automatic recovery.

Manual recovery:

# 1. Find tablet data on disk
find /mnt/d0 -name '*<tablet-uuid>*'

# 2. Delete tablet data
yb-ts-cli --server_address=<tserver-ip>:9100 \
  delete_tablet <tablet-uuid> "Manual recovery after failure"

# 3. Trigger remote bootstrap (automatic)
# Cluster will detect missing replica and re-replicate

Root Cause Analysis

Before tombstoning tablets, investigate the cause:

# Check tablet-specific logs
grep <tablet-uuid> /home/yugabyte/tserver/logs/yb-tserver.*

# Look for:
# - Corruption messages
# - Disk I/O errors
# - Memory allocation failures
# - Raft consensus issues

Performance Issues

High Latency

Symptoms:

Queries taking longer than normal
Timeout errors
User-reported slowness

Diagnosis:

# Check metrics endpoints
curl http://tserver-ip:9000/metrics | grep latency

# View slow operations log
grep "took.*ms" /home/yugabyte/tserver/logs/yb-tserver.WARNING

# Check YSQL slow queries (run in ysqlsh)
# On PostgreSQL 11-based releases the columns are mean_time and max_time
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 10;

Common Causes and Solutions:

High CPU Usage (>80%):

# Check CPU per node
top -b -n 1 | grep yb-tserver

# Solutions:
# - Add more nodes
# - Optimize queries
# - Reduce compaction threads temporarily

Disk I/O Saturation:

# Check disk utilization
iostat -x 5

# Solutions:
# - Use faster disks (NVMe)
# - Add more data directories
# - Adjust compaction settings

Memory Pressure:

# Check memory usage
curl http://tserver-ip:9000/mem-trackers

# Solutions:
# - Reduce block cache size
# - Add more RAM
# - Reduce concurrent operations

Network Congestion:

# Check network stats
iftop -i eth0

# Solutions:
# - Enable compression for cross-AZ traffic
# - Upgrade network capacity
# - Review replication factor

Write Amplification

Symptoms:

High disk writes relative to application writes
Slow write performance
Disk wearing out quickly (SSDs)

Diagnosis:

# Check compaction stats
curl http://tserver-ip:9000/metrics | grep rocksdb_compact

# Calculate write amplification
# WAF = (bytes_written_to_disk) / (bytes_written_by_app)

Solutions:

Adjust compaction triggers:

--rocksdb_level0_file_num_compaction_trigger=8

Increase memstore size:
```
--memstore_size_mb=256
```

Tune the compaction and flush rate limit:

# For YSQL workloads (268435456 bytes = 256 MiB/s)
--rocksdb_compact_flush_rate_limit_bytes_per_sec=268435456

Query Timeouts

Symptoms:

Client timeout errors
“Operation timed out” messages

Diagnosis:

# Check RPC timeout settings
grep timeout /home/yugabyte/tserver/conf/server.conf

# Review timeout errors in logs
grep -i "timeout" /home/yugabyte/tserver/logs/yb-tserver.WARNING

# Check if operations are queued
curl http://tserver-ip:9000/metrics | grep queue

Solutions:

Increase client timeout:

# For YSQL
statement_timeout = '30s'

# For YCQL
request_timeout_in_ms = 30000

Increase RPC timeout:
```
--rpc_timeout_ms=30000  # 30 seconds
```
Optimize slow queries:
- Add appropriate indexes
- Reduce data scanned
- Use prepared statements

Replication Issues

High Replication Lag

Symptoms:

Follower lag metric increasing
Stale reads from followers
Remote bootstrap failures

Diagnosis:

# Check replication lag
curl http://tserver-ip:9000/metrics | grep follower_lag_ms

# View consensus metrics
curl http://tserver-ip:9000/metrics | grep consensus

# Check for slow tablets (use ycql.<keyspace> instead of ysql.<database> for YCQL tables)
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  list_tablets ysql.<database> <table> 0 | grep LEADER

Common Causes:

Network Latency:
- Check inter-node latency with ping and iperf
- Consider preferred zones for leader placement
Follower Node Overloaded:
- Check CPU and disk utilization on follower
- Consider adding more nodes
Large Batches:
- Monitor batch sizes in metrics
- Adjust --consensus_max_batch_size_bytes
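To focus on the followers that are actually behind, the lag metric can be filtered by a threshold. The sketch below assumes the input is in Prometheus text format (metric name, optional label set in braces, value, optional timestamp), for example as served by the tablet server's Prometheus-format metrics endpoint. The function name lagging_followers and the threshold are illustrative.

```shell
# lagging_followers <threshold_ms>
# Reads Prometheus-format metric lines on stdin and prints the
# follower_lag_ms samples whose value exceeds the threshold.
lagging_followers() {
  awk -v t="$1" '
    /^follower_lag_ms/ {
      rest = $0
      # Drop the metric name and the optional {label="..."} set.
      sub(/^follower_lag_ms([{][^}]*[}])? */, "", rest)
      split(rest, f, " ")
      if (f[1] + 0 > t) print
    }'
}

# Usage:
# curl -s http://tserver-ip:9000/prometheus-metrics | lagging_followers 1000
```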

Split-Brain Scenarios

Symptoms:

Multiple masters claiming leadership
Inconsistent cluster state
Write failures

Diagnosis:

# Check master leadership
for master in ip1 ip2 ip3; do
  echo "Checking $master:"
  curl -s http://$master:7000/api/v1/is-leader
done

# Verify master quorum
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  list_all_masters

Resolution:

Check network partitions (firewall, switches)
Verify NTP synchronization across all nodes
If necessary, restart master processes in sequence
In extreme cases, may need to rebuild master quorum
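A quick sanity check for this scenario is to count how many masters report the leader role. The sketch below assumes list_all_masters prints one row per master with a role column containing the word LEADER or FOLLOWER. The function name count_master_leaders is hypothetical.

```shell
# Read list_all_masters output on stdin and print how many rows
# contain LEADER as a whole whitespace-delimited field.
count_master_leaders() {
  awk '{ for (i = 1; i <= NF; i++) if ($i == "LEADER") { n++; break } }
       END { print n + 0 }'
}

# Usage:
# leaders=$(yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
#   list_all_masters | count_master_leaders)
# [ "$leaders" -eq 1 ] || echo "expected exactly 1 master leader, found $leaders"
```

Any result other than 1 is worth investigating, although a count of 0 can also occur briefly during a normal leader election.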

Disk Issues

Disk Space Running Out

Immediate Actions:

# Check disk usage
df -h

# Run log cleanup
./bin/log_cleanup.sh --logs_disk_percent_max 5

# Clean core dumps
rm /var/yugabyte/cores/*

# Check for large WAL files
du -sh /mnt/d*/yb-data/*/wals/*

Identify Space Usage:

# Find largest directories
du -h /mnt/d0/yb-data | sort -rh | head -20

# Sum SST file sizes (ls -l reports bytes, so the total is accurate)
find /mnt/d0/yb-data/tserver/data -name "*.sst" -exec ls -l {} \; | \
  awk '{sum+=$5} END {print sum/(1024^3) " GB"}'

Long-Term Solutions:

Expand disk capacity
Add more nodes to distribute data
Implement data retention policies
Enable table TTL for time-series data
Archive or drop old data
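The df check from the immediate actions can be turned into a simple threshold alert. This sketch parses POSIX-format df output (df -P) and lists filesystems at or above a usage limit. The function name disks_over and the 85% limit are illustrative choices, not recommended values.

```shell
# disks_over <limit_percent>
# Reads `df -P` output on stdin and prints "<mount point> <use%>"
# for every filesystem whose usage is at or above the limit.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)
    if (use + 0 >= limit) print $6, use "%"
  }'
}

# Usage:
# df -P | disks_over 85
```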

Disk I/O Errors

Symptoms:

I/O error messages in logs
Failed tablets
Node crashes

Diagnosis:

# Check system logs for disk errors
dmesg | grep -i "error\|fail"
grep -i "I/O error" /var/log/messages

# Check SMART status
sudo smartctl -a /dev/sda

# Test disk health
sudo badblocks -sv /dev/sda

Resolution:

If single disk failed:
- Remove disk from fs_data_dirs
- Restart tserver
- Replace physical disk
- Add back to cluster
If critical disk failure:
- Decommission entire node
- Replace hardware
- Re-add node to cluster

Connection Issues

Too Many Connections

Symptoms:

“too many connections” errors
Connection timeouts
Unable to connect to database

Diagnosis:

-- Check current connections (YSQL)
SELECT count(*) FROM pg_stat_activity;

-- Check connection limit
SHOW max_connections;

-- View connection sources
SELECT client_addr, count(*)
FROM pg_stat_activity
GROUP BY client_addr
ORDER BY count DESC;

Solutions:

Increase connection limit:
```
--ysql_max_connections=500
```

Enable connection manager:

--enable_ysql_conn_mgr=true
--ysql_conn_mgr_max_client_connections=10000

Implement connection pooling:
- Use PgBouncer or application-level pooling
- Set appropriate pool sizes
- Configure connection timeouts

Kill idle connections:

-- Find long-idle connections
SELECT pid, usename, application_name, state, state_change
FROM pg_stat_activity
WHERE state = 'idle' 
  AND state_change < now() - interval '1 hour';

-- Terminate if needed, using a pid returned by the query above
SELECT pg_terminate_backend(<pid>);

SSL/TLS Connection Failures

Symptoms:

“SSL error” messages
Connection refused with TLS

Diagnosis:

# Test SSL connection
openssl s_client -connect tserver-ip:5433 -starttls postgres

# Check certificate validity
openssl x509 -in /path/to/cert.crt -text -noout

# Verify certificate chain
openssl verify -CAfile ca.crt server.crt

Common Issues:

Certificate expired
Certificate hostname mismatch
Missing intermediate certificates
Incorrect private key file permissions (must be 0600)
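The 0600 requirement can be verified with a portable check that does not depend on a particular stat implementation. This is a sketch only: the function name key_mode_is_0600 is hypothetical, and the path in the usage line is a placeholder for wherever your deployment stores the key file.

```shell
# Succeed when the path is a regular file whose mode is exactly rw------- (0600).
# Returns 2 when the file does not exist.
key_mode_is_0600() {
  [ -f "$1" ] || return 2
  mode=$(ls -l "$1" | cut -c1-10)
  [ "$mode" = "-rw-------" ]
}

# Usage:
# key_mode_is_0600 /path/to/server.key || chmod 0600 /path/to/server.key
```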

Schema Issues

DDL Operation Stuck

Symptoms:

CREATE/ALTER/DROP not completing
Table in transitional state

Diagnosis:

# List tables with their IDs to confirm whether the DDL took effect
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  list_tables include_table_id

# Check master logs for DDL
grep -i "create\|alter\|drop" /home/yugabyte/master/logs/yb-master.INFO

# Look for backfill status (indexes)
grep "backfill" /home/yugabyte/master/logs/yb-master.INFO

Resolution:

Most DDL operations are asynchronous:

CREATE INDEX may take time for backfill
ALTER TABLE propagates across all tablets
DROP TABLE waits for snapshot retention

If truly stuck:

Check master leader status
Review catalog manager state
May need to restart master leader (last resort)

Monitoring and Metrics Issues

Metrics Endpoint Not Responding

Diagnosis:

# Test endpoints
curl -v http://tserver-ip:9000/metrics
curl -v http://master-ip:7000/metrics

# Check if service is running
ps aux | grep yb-tserver
netstat -tlnp | grep 9000

# Check webserver logs
grep -i "webserver" /home/yugabyte/tserver/logs/yb-tserver.INFO

Solutions:

Verify webserver ports not blocked
Check --webserver_port configuration
Ensure --webserver_interface set correctly
Restart service if needed

Prometheus Scraping Failures

Check Prometheus targets:

http://prometheus:9090/targets

Look for DOWN targets and error messages. Common fixes:

Update target addresses in prometheus.yml
Check network connectivity from Prometheus host
Verify metrics_path matches endpoints
Increase scrape_timeout if needed

Backup and Restore Issues

Snapshot Creation Fails

Diagnosis:

# Check snapshot status
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  list_snapshots

# View master logs
grep -i "snapshot" /home/yugabyte/master/logs/yb-master.WARNING

Common Causes:

Insufficient disk space:
- Snapshots use hard links, so they are small at first, but the linked files are retained after compaction and consume space over time
- Solution: Free up disk space
Clock skew too high:
- Solution: Fix NTP synchronization
Tablet not responding:
- Solution: Check tablet health, increase timeout

Restore Failures

Common Issues:

Namespace already exists:

-- Drop the existing database first
DROP DATABASE IF EXISTS target_db;

Snapshot incomplete:
- Verify snapshot state is COMPLETE
- Check all tablet replicas included
Version incompatibility:
- Test restore in staging first
- Check release notes for breaking changes

Cluster Maintenance

Rolling Restart

Perform rolling restarts to minimize downtime:

# 1. Restart TServers one at a time
for tserver in tserver1 tserver2 tserver3; do
  echo "Restarting $tserver"
  ssh $tserver "yb-server-ctl tserver restart"
  sleep 60  # Wait for node to rejoin
  
  # Verify node is back
  yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
    list_all_tablet_servers | grep $tserver
done

# 2. Restart Masters one at a time
for master in master1 master2 master3; do
  echo "Restarting $master"
  ssh $master "yb-server-ctl master restart"
  sleep 60
  
  # Verify master rejoined
  yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
    list_all_masters | grep $master
done

Best Practices:

Restart followers before leaders
Wait for node to fully rejoin before next restart
Monitor cluster health between restarts
Avoid restarting during peak traffic
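The fixed sleep 60 in the loops above may be too short on a busy cluster and needlessly long on an idle one. A polling helper is one way to implement "wait for node to fully rejoin". The sketch below is generic: wait_until is a hypothetical name, and the check command you pass to it is up to you.

```shell
# wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Re-runs the command until it succeeds or the timeout elapses.
# Returns 0 on success and 1 on timeout.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Usage inside the restart loop, replacing the fixed sleep.
# node_is_back is a hypothetical check you define yourself, for example
# a function that greps list_all_tablet_servers for the node:
# wait_until 300 10 node_is_back "$tserver" || exit 1
```

Stopping the loop on timeout keeps a node that failed to rejoin from being followed by another restart.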

Upgrading Cluster

Follow upgrade process carefully:

Review release notes for breaking changes
Test upgrade in staging environment
Take full backup before upgrading
Perform rolling upgrade (Masters first, then TServers)
Verify functionality after each node
Monitor for issues during upgrade window

Getting Help

Information to Gather

When seeking support, collect:

# 1. Cluster configuration
yb-admin --master_addresses ip1:7100,ip2:7100,ip3:7100 \
  get_universe_config > universe_config.txt

# 2. Node configurations
cat /home/yugabyte/tserver/conf/server.conf
cat /home/yugabyte/master/conf/server.conf

# 3. Recent logs (last 1000 lines)
tail -1000 /home/yugabyte/tserver/logs/yb-tserver.INFO
tail -1000 /home/yugabyte/tserver/logs/yb-tserver.WARNING
tail -1000 /home/yugabyte/tserver/logs/yb-tserver.ERROR

# 4. Metrics snapshot
curl http://tserver-ip:9000/metrics > metrics_snapshot.txt

# 5. System information
uname -a
df -h
free -h
top -b -n 1

Support Resources

Documentation: https://docs.yugabyte.com
Community Slack: https://yugabyte-db.slack.com
GitHub Issues: https://github.com/yugabyte/yugabyte-db/issues
Support Portal: For YugabyteDB customers

Next Steps

Admin Guide - Administrative tasks and tools
Performance Tuning - Optimize cluster performance
Monitoring - Set up proactive monitoring
Backup and Restore - Data protection strategies
