This guide helps you troubleshoot common issues when running CockroachDB clusters.

Initial Troubleshooting Steps

When you experience issues, start with these steps:
1. Check logs for errors

Logs are generated on a per-node basis:
# View recent logs
tail -f /var/log/cockroach/cockroach.log

# Search for errors
grep ERROR /var/log/cockroach/cockroach.log

# Collect logs from all nodes
cockroach debug zip debug.zip --host=node1:26257
The debug zip command collects logs, metrics, and diagnostics from all cluster nodes into a single archive for troubleshooting.
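When a debug zip is overkill, the same error scan shown with grep above can be done programmatically. A minimal sketch, assuming a standard per-node log file (the path is an example; adjust to your logging directory):

```python
# Sketch: scan a CockroachDB log file for ERROR-level lines, similar to the
# grep command above. The log path and severity marker are examples.
from pathlib import Path

def find_errors(log_path, level="ERROR"):
    """Return log lines that contain the given severity marker."""
    matches = []
    for line in Path(log_path).read_text().splitlines():
        if level in line:
            matches.append(line)
    return matches
```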
2. Check node and cluster status

# Check node status
cockroach node status --host=node1:26257

# Check cluster status with details
cockroach node status --all --host=node1:26257

# Check node decommissioning status
cockroach node status --decommission --host=node1:26257
3. Review DB Console

Access DB Console at http://<node-host>:8080 to check:
  • Node health on Cluster Overview page
  • Under-replicated or unavailable ranges
  • CPU, memory, and disk usage
  • Recent SQL activity and slow queries

Cluster Setup Issues

Cannot Start Single-Node Cluster

Problem: Node won’t start due to existing cluster data.
ERROR: node belongs to cluster {cluster-id} but is attempting 
to connect to a gossip network for cluster {different-cluster-id}
Solution:
# Option 1: Use different directory
cockroach start-single-node \
  --store=/new/data/path \
  --insecure

# Option 2: Remove existing directory
rm -rf cockroach-data/
cockroach start-single-node --insecure
Problem: Default ports 26257 or 8080 are occupied.
ERROR: listen tcp 127.0.0.1:26257: bind: address already in use
Solution:
# Use different ports
cockroach start-single-node \
  --listen-addr=localhost:26258 \
  --http-addr=localhost:8081 \
  --insecure

# Or stop conflicting services
lsof -i :26257  # Find process using port
kill <pid>      # Stop the process
Problem: Exit status 132 (SIGILL) indicates unsupported CPU instructions.
Solution:
  • Use official CockroachDB release builds (support all x86-64 CPUs)
  • Verify binary is correct for your architecture
  • Check if running very old CPU without required instruction sets

Multi-Node Cluster Issues

Problem: Node won’t join cluster with --join flag.
W180817 17:01:56.506968 886 vendor/google.golang.org/grpc/clientconn.go:942 
Failed to dial localhost:26257: grpc: the connection is closing; please retry.
Diagnosis:
# Test connectivity
ping node1.example.com
telnet node1.example.com 26257

# Check if node is running
ps aux | grep cockroach

# Verify join address
cockroach node status --host=node1:26257
Solutions:
# Use correct join address
cockroach start \
  --certs-dir=certs \
  --advertise-addr=node2.example.com:26257 \
  --join=node1.example.com:26257 \
  --store=/data/cockroach

# Verify network connectivity
# Check firewall rules allow port 26257
# Ensure DNS resolution works for node hostnames
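The telnet check above can also be scripted when validating many nodes at once. A minimal sketch, assuming the default RPC port; hostnames are examples:

```python
# Sketch: verify that a node's RPC port is reachable before pointing --join
# at it. Equivalent to the telnet check above, but scriptable.
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: port_reachable("node1.example.com", 26257)
```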
Problem: Cluster slows down during node additions.
Cause: kv.snapshot_rebalance.max_rate set too high causes write overload.
-- Check current setting
SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;

-- Check LSM health
SELECT store_id, l0_sublevels, l0_num_files
FROM crdb_internal.kv_store_status;
Solution:
-- Reduce rebalance rate
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32 MiB';

-- Wait for compaction to catch up
-- Monitor L0 sublevels until < 20

-- Reset to default after rebalancing completes
RESET CLUSTER SETTING kv.snapshot_rebalance.max_rate;
Do not increase kv.snapshot_rebalance.max_rate to more than 2x the default without explicit guidance from Cockroach Labs.

Connection Issues

Cannot Connect with SQL Client

ERROR: cannot dial server: dial tcp 127.0.0.1:26257: connect: connection refused
Diagnosis:
# Check if node is running
ps aux | grep cockroach

# Test network connectivity
telnet localhost 26257

# Check listening ports
netstat -an | grep 26257
Solutions:
  • Ensure node is running: cockroach node status
  • Verify port number matches node configuration
  • Include flags used during node start (e.g., --port, --host)
  • Check firewall rules allow connection
ERROR: x509: certificate signed by unknown authority
Solution:
# For secure cluster, provide certificates
cockroach sql \
  --certs-dir=certs \
  --host=node1.example.com:26257

# Or specify in connection string
psql "postgresql://user@node1:26257/db?sslmode=require&sslrootcert=certs/ca.crt&sslcert=certs/client.user.crt&sslkey=certs/client.user.key"
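The same secure connection URL can be assembled programmatically for use with a PostgreSQL driver such as psycopg2. A minimal sketch; the user, host, and certificate paths are examples matching the psql string above:

```python
# Sketch: build a secure CockroachDB connection URL like the psql example
# above. Certificate paths follow the conventional certs-dir layout.
from urllib.parse import urlencode, quote

def build_dsn(user, host, port, db, certs_dir="certs"):
    """Return a postgresql:// URL with client-certificate SSL parameters."""
    params = {
        "sslmode": "require",
        "sslrootcert": f"{certs_dir}/ca.crt",
        "sslcert": f"{certs_dir}/client.{user}.crt",
        "sslkey": f"{certs_dir}/client.{user}.key",
    }
    # safe='/' keeps the certificate paths readable in the query string
    return f"postgresql://{user}@{host}:{port}/{db}?{urlencode(params, safe='/', quote_via=quote)}"
```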

Performance Issues

High Query Latency

-- Find slow queries
SELECT 
  query,
  count,
  avg_latency,
  p99_latency,
  rows_avg
FROM crdb_internal.statement_statistics
WHERE avg_latency > INTERVAL '100ms'
ORDER BY avg_latency DESC
LIMIT 20;

-- Check currently running queries
SELECT 
  query_id,
  node_id,
  user_name,
  start,
  query,
  phase
FROM crdb_internal.cluster_queries
WHERE start < (now() - INTERVAL '5s')
ORDER BY start;
Common causes:
  1. Missing indexes: Use EXPLAIN to identify full table scans
  2. High CPU usage: Check CPU metrics, reduce concurrency
  3. Disk I/O bottleneck: Monitor disk IOPS and latency
  4. Transaction contention: Check crdb_internal.cluster_contention_events
-- View contention events
SELECT 
  waiting_txn_id,
  blocking_txn_id,
  contention_duration,
  table_name,
  index_name,
  key
FROM crdb_internal.cluster_contention_events
ORDER BY contention_duration DESC
LIMIT 20;

-- Check locks
SELECT * FROM crdb_internal.cluster_locks;
Solutions:
  • Reduce transaction duration (keep transactions short)
  • Avoid hot keys (use UUID or composite keys)
  • Use SELECT FOR UPDATE to explicitly lock
  • Consider optimistic locking patterns
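The "avoid hot keys" advice above usually means generating random primary keys client-side so consecutive inserts spread across ranges. A minimal sketch; the orders table and its columns are hypothetical:

```python
# Sketch: random UUID primary keys spread inserts across ranges instead of
# piling writes onto one "hot" sequential key. Table/column names are
# hypothetical, assuming an orders(id UUID, customer_id INT, amount DECIMAL)
# schema.
import uuid

def new_order_row(customer_id, amount):
    """Build a parameterized INSERT with a freshly generated UUID key."""
    order_id = uuid.uuid4()  # random, so consecutive inserts hit different ranges
    return (
        "INSERT INTO orders (id, customer_id, amount) VALUES (%s, %s, %s)",
        (str(order_id), customer_id, amount),
    )
```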

High CPU Usage

-- Check CPU usage per node
SELECT 
  node_id,
  user_cpu_percent,
  sys_cpu_percent
FROM crdb_internal.kv_node_status;

-- Find CPU-intensive queries
SELECT 
  query,
  count,
  sum_latency,
  avg_latency
FROM crdb_internal.statement_statistics
ORDER BY sum_latency DESC
LIMIT 20;
Common causes:
  1. Excessive concurrency: Too many active queries
  2. Inefficient queries: Full table scans, missing indexes
  3. Compaction falling behind: Check LSM health
  4. Under-provisioned cluster: Need more CPU cores
Solutions:
  • Limit connection pool size to ~4x vCPU count
  • Optimize slow queries (add indexes, rewrite)
  • Scale horizontally (add more nodes)
  • Use connection pooling (PgBouncer)
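The "~4x vCPU" guideline above can be expressed directly when configuring a pool. A minimal sketch; the multiplier is a rough rule of thumb, not a hard limit:

```python
# Sketch of the pool-sizing guideline above: cap the connection pool at
# roughly 4x the cluster's vCPU count.
import os

def recommended_pool_size(vcpus=None, multiplier=4):
    """Return a suggested max connection-pool size."""
    if vcpus is None:
        vcpus = os.cpu_count() or 1
    return vcpus * multiplier
```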

Memory Issues

Symptoms: Nodes restart unexpectedly.
Diagnosis:
# Check system logs for OOM
sudo dmesg | grep -i "out of memory"
sudo dmesg | grep -i cockroach

# Check Kubernetes pod restarts
kubectl get pods | grep cockroach

# Review active query dumps
ls -lh <logging-dir>/heap_profiler/activequeryprof.*.csv
Solutions:
# Increase node memory
# or reduce SQL memory allocation
cockroach start \
  --certs-dir=certs \
  --max-sql-memory=6GiB \
  --cache=8GiB \
  ...
-- Find memory-intensive queries
SELECT 
  query,
  count,
  max_mem_usage
FROM crdb_internal.statement_statistics
WHERE max_mem_usage > 1073741824  -- 1GB
ORDER BY max_mem_usage DESC;

-- Optimize or limit these queries
  • Disable swap: sudo swapoff -a
  • Set --max-sql-memory to 25-30% of total RAM
  • Set --cache to 25% of total RAM
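The sizing rules above amount to simple arithmetic on total RAM. A minimal sketch, using the 25% figure for both flags:

```python
# Sketch of the memory-sizing rules above: allocate roughly 25% of total RAM
# each to --max-sql-memory and --cache.
def memory_flags(total_ram_gib):
    """Return suggested --max-sql-memory and --cache values for a node."""
    share = int(total_ram_gib * 0.25)
    return {"max_sql_memory": f"{share}GiB", "cache": f"{share}GiB"}
```

For example, a 32 GiB node would start with --max-sql-memory=8GiB --cache=8GiB.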

Storage Issues

Problem: Nodes shut down when disk space < 10%.
-- Check disk capacity
SELECT 
  store_id,
  node_id,
  capacity,
  available,
  used,
  (available::FLOAT / capacity::FLOAT * 100)::INT AS percent_available
FROM crdb_internal.kv_store_status
ORDER BY percent_available;
Solutions:
-- Reduce GC TTL to reclaim space faster
ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 3600;  -- 1 hour

-- Drop unused databases/tables
DROP DATABASE old_data CASCADE;

-- Expire old backups
SHOW BACKUPS IN 'external://backup-location';
-- Delete old backup directories
# Add more disk space (cloud volumes)
# AWS example
aws ec2 modify-volume --volume-id vol-xxx --size 200

# Resize filesystem after volume expansion
sudo resize2fs /dev/nvme1n1
Nodes automatically shut down when disk space falls below 10% to prevent data corruption. Monitor disk usage and set alerts at 15-20% free space.
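The shutdown and alert thresholds above translate into a simple classification of free-disk percentage. A minimal sketch, using the values from the note (below ~10% is critical, 15-20% is the alert band):

```python
# Sketch: classify free-disk percentage against the thresholds above.
def disk_status(available_bytes, capacity_bytes):
    """Return 'critical', 'warning', or 'ok' for a store's free space."""
    pct_free = available_bytes / capacity_bytes * 100
    if pct_free < 10:
        return "critical"   # node may shut down to prevent corruption
    if pct_free < 20:
        return "warning"    # reclaim space or add capacity soon
    return "ok"
```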
Problem: High L0 sublevels indicate compaction falling behind.
-- Check LSM health
SELECT 
  store_id,
  l0_sublevels,
  l0_num_files,
  storage_read_amplification
FROM crdb_internal.kv_store_status
WHERE l0_sublevels > 20;
Solutions:
-- Reduce write load temporarily
-- Wait for compaction to catch up

-- Reduce WAL sync frequency to ease disk I/O pressure
SET CLUSTER SETTING rocksdb.min_wal_sync_interval = '500µs';

-- After recovery, reset snapshot rate
RESET CLUSTER SETTING kv.snapshot_rebalance.max_rate;
Prevention:
  • Ensure adequate CPU resources
  • Don’t set snapshot rebalance rate too high
  • Monitor L0 sublevels continuously
  • Scale cluster before reaching capacity limits
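Continuous monitoring of L0 sublevels can reuse the same threshold as the query above. A minimal sketch that flags unhealthy stores from (store_id, l0_sublevels) rows:

```python
# Sketch: flag stores whose L0 sublevel count exceeds the healthy threshold
# (~20) used in the LSM health query above. Rows are (store_id, l0_sublevels)
# pairs, e.g. fetched from crdb_internal.kv_store_status.
def unhealthy_stores(rows, threshold=20):
    """Return the store IDs whose L0 sublevel count exceeds the threshold."""
    return [store_id for store_id, l0 in rows if l0 > threshold]
```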

Replication Issues

Problem: Some ranges have fewer replicas than configured.
-- Check under-replicated ranges
SHOW RANGES FROM DATABASE mydb WITH DETAILS;

-- Count under-replicated ranges
SELECT COUNT(*) 
FROM crdb_internal.ranges 
WHERE array_length(replicas, 1) < 3;  -- assuming 3 replicas

-- View replication status
SHOW CLUSTER SETTING kv.replication_reports.interval;
Common causes:
  1. Node failure or network partition
  2. Insufficient nodes for replication factor
  3. Constraint violations (locality constraints)
  4. Slow replication due to network or disk issues
Solutions:
  • Ensure all nodes are healthy and reachable
  • Verify cluster has enough nodes for replication factor
  • Check and fix zone configuration constraints
  • Monitor replication queue length and duration
Problem: Some ranges cannot serve requests (critical issue).
# Check critical nodes status
curl -X POST http://localhost:8080/_status/critical_nodes

# Look for unavailable ranges in output
Immediate actions:
  1. Check if majority of replicas are reachable
  2. Verify network connectivity between nodes
  3. Check node health (CPU, memory, disk)
  4. Review recent changes (deployments, config changes)
Recovery:
  • If nodes are down, restore them immediately
  • If data is lost, may require restore from backup
  • Contact Cockroach Labs support for guidance
Unavailable ranges indicate data that cannot be read or written. This is a critical situation requiring immediate attention.

Common Error Messages

ERROR: restart transaction: TransactionRetryWithProtoRefreshError: 
transaction deadline exceeded
Causes:
  • Transaction took too long (exceeded deadline)
  • Contention with other transactions
  • Node failures during transaction
Solutions:
  • Implement retry logic in application
  • Reduce transaction duration
  • Use AS OF SYSTEM TIME for read-only queries
  • Investigate and resolve contention
# Example retry logic (Python)
import time

from psycopg2.extensions import TransactionRollbackError

max_retries = 3
for attempt in range(max_retries):
    try:
        with conn.cursor() as cur:
            cur.execute("BEGIN")
            # ... transaction statements ...
            cur.execute("COMMIT")
        break
    except TransactionRollbackError:
        conn.rollback()  # abort the failed transaction before retrying
        if attempt == max_retries - 1:
            raise
        time.sleep(0.1 * (2 ** attempt))  # Exponential backoff
ERROR: x509: certificate has expired or is not yet valid
Solution:
# Check certificate expiration
cockroach cert list --certs-dir=certs

# Create new certificates before expiration
cockroach cert create-node \
  node1.example.com \
  --certs-dir=certs \
  --ca-key=my-safe-directory/ca.key

# Replace certificates on nodes (rolling restart)
Monitor certificate expiration dates and set up alerts 30 days before expiry.
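The 30-day alert above reduces to a date comparison once you have a certificate's notAfter timestamp (e.g. from cockroach cert list). A minimal sketch:

```python
# Sketch of the 30-day expiry alert above: given a certificate's notAfter
# timestamp, decide whether rotation is due.
from datetime import datetime, timedelta

def cert_needs_rotation(not_after, now=None, lead_days=30):
    """Return True if the certificate expires within lead_days of now."""
    now = now or datetime.utcnow()
    return not_after - now <= timedelta(days=lead_days)
```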

Getting Help

If you cannot resolve the issue:
1. Collect diagnostic information

# Create debug bundle
cockroach debug zip debug.zip \
  --host=node1:26257 \
  --certs-dir=certs

# Include time range if issue is historical
cockroach debug zip debug.zip \
  --host=node1:26257 \
  --from='2024-03-01 10:00:00' \
  --to='2024-03-01 11:00:00'
2. Contact support
