
Overview

This guide helps you identify and resolve performance issues in CockroachDB. Learn how to diagnose slow queries, high latency, resource bottlenecks, and other common problems.

Diagnostic Approach

1. Identify symptoms: determine what users are experiencing (slow queries, timeouts, high latency).
2. Gather metrics: check cluster metrics, query statistics, and system resources.
3. Isolate the cause: use EXPLAIN, logs, and monitoring to pinpoint the bottleneck.
4. Apply fixes: implement targeted optimizations based on the root cause.
5. Verify resolution: monitor metrics to confirm the issue is resolved.

Common Performance Issues

Slow Queries

Symptoms

  • Queries taking longer than expected
  • Timeouts in application
  • High P99 latencies

Diagnosis

-- Find slow queries (latency columns are in seconds; exact schema varies by version)
SELECT
  key AS query,
  count,
  service_lat_avg,
  run_lat_avg,
  max_retries
FROM crdb_internal.node_statement_statistics
WHERE service_lat_avg > 1
ORDER BY service_lat_avg DESC
LIMIT 10;
-- Analyze specific query
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 'cust123';

Solutions

-- Problem: Full table scan
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Shows: FULL SCAN

-- Solution: Create index
CREATE INDEX idx_status ON orders (status);

High Transaction Conflicts

Symptoms

  • RETRY_WRITE_TOO_OLD errors
  • RETRY_SERIALIZABLE errors
  • High transaction retry rate

Diagnosis

-- Check cumulative contention by table and index
SELECT
  table_id,
  index_id,
  num_contention_events,
  cumulative_contention_time
FROM crdb_internal.cluster_contention_events
ORDER BY cumulative_contention_time DESC
LIMIT 10;
-- View recent contention events (v22.1+)
SELECT * FROM crdb_internal.transaction_contention_events
ORDER BY collection_ts DESC
LIMIT 20;

Solutions

Transaction conflicts occur when concurrent transactions access the same rows. Reduce contention by:
  • Shortening transaction duration
  • Using SELECT FOR UPDATE
  • Avoiding hot spots
  • Batching updates
-- Problem: read-modify-write across statements; concurrent writers force retries
BEGIN;
  SELECT balance FROM accounts WHERE id = 'acc123';
  -- Application logic
  UPDATE accounts SET balance = 500 WHERE id = 'acc123';
COMMIT;

-- Solution: Lock the row
BEGIN;
  SELECT balance FROM accounts WHERE id = 'acc123' FOR UPDATE;
  UPDATE accounts SET balance = 500 WHERE id = 'acc123';
COMMIT;
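Even with FOR UPDATE, CockroachDB can still return retryable errors (SQLSTATE 40001) under contention, so clients should wrap transactions in a retry loop. A minimal sketch, assuming a DB-API-style driver; SerializationError here is a stand-in for the driver's 40001 error class (e.g. psycopg2.errors.SerializationFailure):

```python
import random
import time

class SerializationError(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""

def run_transaction(txn_fn, max_retries=5):
    """Run txn_fn, retrying on serialization failures with jittered backoff.

    txn_fn is expected to run one complete transaction (BEGIN..COMMIT)
    and raise SerializationError when CockroachDB asks for a retry.
    """
    for attempt in range(max_retries):
        try:
            return txn_fn()
        except SerializationError:
            # Back off so contending transactions spread out before retrying.
            time.sleep((2 ** attempt) * 0.01 * (1 + random.random()))
    raise RuntimeError(f"transaction failed after {max_retries} retries")
```

Keeping the retried function free of side effects outside the transaction makes retries safe to repeat.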

Hotspots and Range Splits

Symptoms

  • Uneven load distribution across nodes
  • High CPU on specific nodes
  • Sequential key inserts causing bottlenecks

Diagnosis

-- View range distribution
SHOW RANGES FROM TABLE orders;

-- Check for hotspots (range_size is in bytes; schema varies by version)
SELECT
  range_id,
  lease_holder,
  replica_localities,
  range_size
FROM crdb_internal.ranges
WHERE table_name = 'orders'
ORDER BY range_size DESC;

Solutions

-- Problem: Sequential primary key creates hotspot
CREATE TABLE events (
  id INT PRIMARY KEY,
  data JSONB
);
-- All writes go to same range

-- Solution: Use hash-sharded key
CREATE TABLE events (
  id INT PRIMARY KEY USING HASH WITH (bucket_count = 16),
  data JSONB
);

High Memory Usage

Symptoms

  • Out of memory (OOM) errors
  • Slow garbage collection
  • Node crashes

Diagnosis

-- Check memory-intensive queries (v21.2+; per-statement stats are stored as JSON)
SELECT
  metadata->>'query' AS query,
  (statistics->'statistics'->'maxMemUsage'->>'mean')::FLOAT AS max_mem_usage_mean
FROM crdb_internal.statement_statistics
ORDER BY max_mem_usage_mean DESC
LIMIT 10;

Solutions

Reduce memory usage by:
  • Adding LIMIT clauses to large result sets
  • Breaking large transactions into batches
  • Using streaming instead of loading all results
  • Reducing sort and hash join sizes
-- Problem: Large result set
SELECT * FROM orders;  -- Returns millions of rows

-- Solution: Paginate with a keyset (OFFSET still scans all skipped rows)
SELECT * FROM orders WHERE id > $1 ORDER BY id LIMIT 1000;
-- Pass the last id of the previous page as $1

-- Problem: Large join
SELECT o.*, c.* FROM orders o CROSS JOIN customers c;

-- Solution: Add filter
SELECT o.*, c.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > NOW() - INTERVAL '1 day';
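Client-side, keyset pagination can be wrapped in a generator so the application never holds the full result set in memory. A sketch assuming a fetch_page(last_id, limit) callable (an illustrative name) that runs the paginated SELECT above and returns rows ordered by id:

```python
def paginate(fetch_page, page_size=1000):
    """Yield rows in id order, fetching one bounded page at a time.

    fetch_page(last_id, limit) is assumed to execute something like
    SELECT * FROM orders WHERE id > %s ORDER BY id LIMIT %s
    and return a list of tuples whose first element is the row id.
    """
    last_id = 0
    while True:
        rows = fetch_page(last_id, page_size)
        if not rows:
            return  # no more rows
        yield from rows
        last_id = rows[-1][0]  # keyset: resume after the last id seen
```

This keeps memory proportional to page_size regardless of table size.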

Replication Lag

Symptoms

  • Followers falling behind leaseholder
  • Increased read latency
  • UNAVAILABLE range errors

Diagnosis

-- Check for unavailable or under-replicated ranges
-- (replication reports; requires admin privileges)
SELECT
  zone_id,
  total_ranges,
  unavailable_ranges,
  under_replicated_ranges
FROM system.replication_stats
WHERE unavailable_ranges > 0 OR under_replicated_ranges > 0;

Solutions

  • Increase node resources (CPU, disk I/O)
  • Reduce write load or batch writes
  • Check network latency between nodes
  • Ensure sufficient replica count

Disk I/O Bottlenecks

Symptoms

  • High disk latency
  • Slow write performance
  • storage: file is closed errors

Diagnosis

Monitor disk metrics:
  • IOPS (Input/Output Operations Per Second)
  • Read/write latency
  • Queue depth

Solutions

Disk I/O optimization:
  1. Use SSD or NVMe storage (strongly recommended for production)
  2. Separate WAL (write-ahead log) and data directories
  3. Enable write combining where possible
  4. Reduce batch sizes for large writes
  5. Monitor disk utilization and expand when needed
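The batching advice above can be applied client-side by splitting large writes into bounded multi-row INSERTs. A sketch assuming a DB-API cursor.execute-style callable and the events(id, data) table from the hotspot example (the helper name is illustrative):

```python
def insert_in_batches(execute, rows, batch_size=100):
    """Insert rows as a series of bounded multi-row INSERT statements.

    execute(sql, params) is assumed to behave like DB-API cursor.execute.
    Smaller batches keep individual writes (and their I/O bursts) small.
    """
    inserted = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # One (%s, %s) placeholder group per row in this batch.
        placeholders = ", ".join(["(%s, %s)"] * len(batch))
        params = [value for row in batch for value in row]
        execute(f"INSERT INTO events (id, data) VALUES {placeholders}", params)
        inserted += len(batch)
    return inserted
```

Tuning batch_size trades round trips against the size of each write.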

Network Latency

Symptoms

  • High cross-region query latency
  • Slow distributed queries
  • RPC timeouts

Diagnosis

-- Check query distribution
EXPLAIN (DISTSQL) SELECT * FROM orders WHERE customer_id = 'cust123';
-- Look for cross-region data movement

-- View network statistics
SELECT * FROM crdb_internal.cluster_queries
WHERE distributed = true
ORDER BY start DESC
LIMIT 10;

Solutions

-- Reduce cross-region latency for read-heavy workloads:
-- follower reads serve slightly stale data from the nearest replica
SELECT * FROM orders
AS OF SYSTEM TIME follower_read_timestamp()
WHERE created_at > now() - INTERVAL '7 days';

Diagnostic Queries

Cluster Health

-- Node status
SELECT node_id, address, build_tag, started_at, is_live
FROM crdb_internal.gossip_nodes;

-- Range health (replication reports; requires admin privileges)
SELECT
  SUM(total_ranges) AS total_ranges,
  SUM(unavailable_ranges) AS unavailable,
  SUM(under_replicated_ranges) AS underreplicated
FROM system.replication_stats;

Query Statistics

-- Top queries by total execution time (latencies in seconds)
SELECT
  key AS query,
  count,
  ROUND(service_lat_avg::NUMERIC, 4) AS avg_latency_s,
  ROUND((count * service_lat_avg)::NUMERIC, 2) AS total_time_s
FROM crdb_internal.node_statement_statistics
ORDER BY total_time_s DESC
LIMIT 20;

-- Statements that have returned errors
SELECT
  key AS query,
  count,
  last_error
FROM crdb_internal.node_statement_statistics
WHERE last_error IS NOT NULL
ORDER BY count DESC;

Resource Usage

-- Approximate table sizes (aggregating ranges can be expensive on large clusters;
-- schema varies by version)
SELECT
  table_name,
  ROUND((SUM(range_size) / 1073741824.0)::NUMERIC, 2) AS size_gb,
  COUNT(*) AS range_count
FROM crdb_internal.ranges
GROUP BY table_name
ORDER BY SUM(range_size) DESC
LIMIT 10;

-- Indexes with no reads (candidates for removal; verify before dropping)
SELECT
  ti.descriptor_name AS table_name,
  ti.index_name,
  us.total_reads,
  us.last_read
FROM crdb_internal.index_usage_statistics AS us
JOIN crdb_internal.table_indexes AS ti
  ON us.table_id = ti.descriptor_id AND us.index_id = ti.index_id
WHERE us.total_reads = 0
ORDER BY table_name;

Monitoring Tools

Admin UI

Access the Admin UI at http://<node-address>:8080:
  • Overview: Cluster health and capacity
  • Metrics: CPU, memory, disk, network
  • SQL Activity: Query statistics and slow queries
  • Network Latency: Round-trip times between nodes
  • Replication: Range distribution and health

Command-Line Tools

# Node status
cockroach node status --host=<node-address>

# List nodes
cockroach node ls --host=<node-address>

# Debug zip for support
cockroach debug zip /tmp/debug.zip --host=<node-address>

Escalation Checklist

Contact support for critical issues requiring immediate attention:
  • Cluster unavailability or data loss
  • Majority of nodes down
  • Persistent OOM errors
  • Data corruption detected
  • Security incidents
Provide the following when contacting support:
  1. Debug zip: cockroach debug zip
  2. Cluster version and configuration
  3. Reproduction steps
  4. Recent changes to cluster or workload
  5. EXPLAIN output for problematic queries
  6. Timeline of when issue started

Performance Tuning Workflow

1. Establish baseline: measure current performance (QPS, latency percentiles, error rate).
2. Identify bottleneck: use the metrics and queries above to find the limiting factor.
3. Test hypothesis: apply the fix in a staging environment first.
4. Measure impact: compare before/after metrics.
5. Deploy and monitor: roll out to production with close monitoring.
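Step 1's baseline should include latency percentiles, not just averages. A minimal nearest-rank percentile helper for latency samples collected by the application (in production, prefer the Admin UI or your metrics system):

```python
import math

def latency_percentiles(samples_ms, percentiles=(50, 95, 99)):
    """Compute nearest-rank percentiles from raw latency samples (ms)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    result = {}
    for p in percentiles:
        # Nearest-rank: smallest value with at least p% of samples at or below it.
        rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
        result[f"p{p}"] = ordered[rank]
    return result
```

Comparing these percentiles before and after a change (steps 4 and 5) shows whether a fix helped the tail, not just the mean.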

Quick Reference

Common fixes by symptom:
| Symptom | Likely Cause | Quick Fix |
| --- | --- | --- |
| Slow SELECT | Missing index | CREATE INDEX |
| High retries | Contention | SELECT FOR UPDATE, smaller txns |
| Uneven load | Hotspot | Hash sharding, range splits |
| OOM errors | Large result sets | Add LIMIT, paginate |
| High P99 | Network latency | Follower reads, regional tables |
| Slow writes | Sequential keys | UUID or hash-sharded keys |
| Index join | Missing covering index | Add STORING clause |
| Stale estimates | Old statistics | CREATE STATISTICS |
