
Overview

Monitoring is essential for maintaining healthy CockroachDB clusters. CockroachDB provides multiple monitoring interfaces including the Admin UI, HTTP endpoints, and Prometheus metrics exports for integration with external monitoring systems.

Admin UI

The Admin UI is a web-based interface for monitoring cluster health and performance.

Accessing the Admin UI

By default, the Admin UI runs on port 8080:
http://localhost:8080
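For secure clusters, the UI is served over HTTPS and requires a SQL user with a password:
-- Create a user for UI access. The name "admin" is already taken by the
-- built-in admin role, so this example uses a different, illustrative name.
CREATE USER ui_admin WITH PASSWORD 'secure_password';
GRANT admin TO ui_admin;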

Admin UI Features

The overview page displays:
  • Node count: Total nodes in cluster
  • Capacity usage: Storage utilization
  • QPS: Queries per second
  • SQL connections: Active SQL connections
  • Replication status: Under-replicated ranges
  • Node health: Live vs. dead nodes
Time-series graphs for:
  • SQL query latency (p50, p99, p99.9)
  • QPS breakdown (reads vs. writes)
  • CPU usage per node
  • Memory usage
  • Disk I/O (reads, writes, IOPS)
  • Network throughput
  • Replication lag
Visual representation of:
  • Node locations (based on locality)
  • Node health status
  • Replica distribution
  • Locality-based organization
Per-database and per-table metrics:
  • Table sizes
  • Row counts
  • Read/write activity
  • Index usage statistics
  • Replication status

Health Check Endpoints

Basic Health Check

Simple HTTP health check:
curl http://localhost:8080/health
Returns:
  • 200 OK: Node is healthy
  • 503 Service Unavailable: Node is unhealthy
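As a rough sketch of how a script might consume this endpoint, the Python below maps the status codes listed above to a coarse health state. The names `classify_health` and `check_health` are illustrative and not part of CockroachDB, and treating a failed connection as "unreachable" is an assumption of this example.

```python
import urllib.error
import urllib.request


def classify_health(status_code):
    """Map an HTTP status from /health to a coarse state (illustrative helper)."""
    if status_code == 200:
        return "healthy"
    if status_code == 503:
        return "unhealthy"
    return "unknown"


def check_health(base_url="http://localhost:8080", timeout=2.0):
    """Query /health and classify the result; connection errors count as unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return classify_health(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for non-2xx responses such as 503.
        return classify_health(err.code)
    except (urllib.error.URLError, OSError):
        # Refused connections, DNS failures, and timeouts end up here.
        return "unreachable"
```

Keeping the classification separate from the network call makes the mapping easy to unit-test without a running node.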

Readiness Check

Check whether a node is ready to serve traffic (quote the URL so the shell does not interpret the "?"):
curl 'http://localhost:8080/health?ready=1'
Returns 200 OK when:
  • Node has joined the cluster
  • Node is not draining
  • Node can communicate with a majority of the cluster
Use readiness checks for load balancer health checks and Kubernetes readiness probes.
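A deployment script may want to wait for readiness before sending traffic to a node. The following is a minimal sketch of such a polling loop, assuming the readiness URL shown above; `wait_until_ready` and `http_ready_probe` are hypothetical helper names, and the retry count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_ready_probe(base_url="http://localhost:8080", timeout=2.0):
    """Build a zero-argument probe that is True only when /health?ready=1 returns 200."""
    def probe():
        try:
            url = f"{base_url}/health?ready=1"
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Covers 503 responses (HTTPError), refused connections, and timeouts.
            return False
    return probe


def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe` up to `attempts` times; return the attempt number that
    succeeded, or None if the node never reported ready."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Typical use would be `wait_until_ready(http_ready_probe("http://node1:8080"))`, where the host name is a placeholder.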

Prometheus Metrics

CockroachDB exports metrics in Prometheus format for integration with monitoring systems.

Metrics Endpoint

Access Prometheus metrics:
curl http://localhost:8080/_status/vars
Example output:
# HELP replicas Number of replicas
# TYPE replicas gauge
replicas{store="1"} 142

# HELP capacity_available Available storage capacity
# TYPE capacity_available gauge  
capacity_available{store="1"} 95821094912

# HELP sql_conns Number of active SQL connections
# TYPE sql_conns gauge
sql_conns 23
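To make the format above concrete, here is a small Python sketch that parses such output into a dictionary. It handles only the two line shapes shown in the example (`name{label="value"} number` and `name number`); timestamps and label values containing commas or escaped quotes are out of scope. The function name `parse_metrics` is illustrative. A production setup would normally rely on Prometheus itself or an established client library.

```python
def parse_metrics(text):
    """Parse simple Prometheus text lines into {(name, labels): value}.

    `labels` is a sorted tuple of (key, value) pairs, empty for unlabeled series.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(None, 1)
        name, labels = series, ()
        if "{" in series:
            name, _, label_part = series.partition("{")
            pairs = [p for p in label_part.rstrip("}").split(",") if p]
            labels = tuple(sorted(
                (key.strip(), val.strip().strip('"'))
                for key, val in (p.split("=", 1) for p in pairs)
            ))
        samples[(name, labels)] = float(value)
    return samples
```

With the example output above, `parse_metrics(text)[("sql_conns", ())]` would give the connection count as a float.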

Load Metrics Endpoint

Subset of metrics for lightweight monitoring:
curl http://localhost:8080/_status/load
Returns key metrics:
  • CPU usage (user and system)
  • Uptime
  • Query rate
  • Active connections
The /_status/load endpoint is unauthenticated and designed for external load balancers.

Prometheus Configuration

Configure Prometheus to scrape CockroachDB metrics:
prometheus.yml
scrape_configs:
  - job_name: 'cockroachdb'
    static_configs:
      - targets: 
        - 'node1:8080'
        - 'node2:8080'
        - 'node3:8080'
    metrics_path: '/_status/vars'
    scrape_interval: 10s
  1. Add scrape configuration: Edit prometheus.yml with CockroachDB node endpoints.
  2. Reload Prometheus (the reload endpoint is only available when Prometheus is started with --web.enable-lifecycle):
     curl -X POST http://localhost:9090/-/reload
  3. Verify targets: Check the Prometheus UI at http://localhost:9090/targets
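When the node list changes often, the static target list in prometheus.yml could be generated instead of edited by hand. The sketch below renders a scrape configuration equivalent to the one shown earlier from a list of host:port strings. `render_scrape_config` is a hypothetical helper, and it emits plain text rather than using a YAML library, so it assumes target names contain no quote characters.

```python
def render_scrape_config(targets, job_name="cockroachdb",
                         metrics_path="/_status/vars", scrape_interval="10s"):
    """Render a scrape_configs section for the given host:port targets."""
    if not targets:
        raise ValueError("at least one target is required")
    lines = [
        "scrape_configs:",
        f"  - job_name: '{job_name}'",
        f"    metrics_path: '{metrics_path}'",
        f"    scrape_interval: {scrape_interval}",
        "    static_configs:",
        "      - targets:",
    ]
    # One list entry per node, nested under the targets key.
    lines += [f"          - '{target}'" for target in targets]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_scrape_config(["node1:8080", "node2:8080", "node3:8080"]))
```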

Key Metrics to Monitor

Node Health Metrics

liveness_livenodes
  • Number of live nodes in cluster
  • Alert if drops below expected count
node_id
  • Node identifier
  • Track node restarts and failures
sys_uptime
  • Node uptime in seconds
  • Detect unexpected restarts

Performance Metrics

SQL performance:
Metric | Description | Alert Threshold
sql_query_count | Total queries executed | Monitor trends
sql_exec_latency_p50 | Median query latency | > 100ms
sql_exec_latency_p99 | 99th percentile latency | > 1s
sql_exec_latency_p999 | 99.9th percentile latency | > 5s
sql_conns | Active SQL connections | > 80% of max
sql_txn_abort_count | Transaction aborts | High abort rate

Storage:
Metric | Description | Alert Threshold
capacity_available | Available disk space | < 15% free
capacity_used | Used disk space | Monitor trends
livedata | Live data bytes | Track growth
sysbytes | System data bytes | Unexpected growth
valbytes | Value bytes | Monitor trends
keybytes | Key bytes | Monitor trends

Replication:
Metric | Description | Alert Threshold
ranges_unavailable | Unavailable ranges | > 0
ranges_underreplicated | Under-replicated ranges | > 0 for extended time
ranges_overreplicated | Over-replicated ranges | > 0 for extended time
replicas | Total replicas on node | Imbalanced distribution
replicas_leaders | Raft leaders on node | Imbalanced distribution
replicas_leaseholders | Leaseholders on node | Imbalanced distribution

System resources:
Metric | Description | Alert Threshold
sys_cpu_user_percent | User CPU usage | > 80% sustained
sys_cpu_sys_percent | System CPU usage | > 30% sustained
sys_rss | Resident memory | > 90% of max
sys_go_allocbytes | Go allocated memory | Monitor trends
sys_host_disk_read_bytes | Disk read bytes | High I/O wait
sys_host_disk_write_bytes | Disk write bytes | High I/O wait
sys_host_net_recv_bytes | Network bytes received | Bandwidth saturation
sys_host_net_send_bytes | Network bytes sent | Bandwidth saturation
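A few of the thresholds above can be expressed as a simple check over a snapshot of metric values. The following is a sketch only: `evaluate_thresholds` is an illustrative name, the `expected_nodes` key is a hypothetical value supplied by the operator rather than a CockroachDB metric, and the free-space ratio is computed as available / (available + used).

```python
def evaluate_thresholds(snapshot):
    """Return the names of the checks that fire for one metrics snapshot.

    `snapshot` maps metric names to numbers; `expected_nodes` is operator-supplied.
    """
    fired = []
    total = snapshot["capacity_available"] + snapshot["capacity_used"]
    # Storage: less than 15% of capacity free.
    if total > 0 and snapshot["capacity_available"] / total < 0.15:
        fired.append("low_disk_space")
    # Replication: any unavailable range is critical.
    if snapshot["ranges_unavailable"] > 0:
        fired.append("ranges_unavailable")
    # Node health: fewer live nodes than the operator expects.
    if snapshot["liveness_livenodes"] < snapshot["expected_nodes"]:
        fired.append("node_down")
    return fired


if __name__ == "__main__":
    example = {
        "capacity_available": 10e9, "capacity_used": 90e9,
        "ranges_unavailable": 0,
        "liveness_livenodes": 3, "expected_nodes": 3,
    }
    print(evaluate_thresholds(example))
```

In practice these rules would usually live in Prometheus alerting rules; the Python form is only meant to show the arithmetic behind each threshold.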

Monitoring Queries

SQL-Based Monitoring

CockroachDB provides internal tables for monitoring:
-- crdb_internal tables are not a stable API; column names can differ between versions.
SELECT
  node_id,
  address,
  build_tag,
  started_at,
  is_live
FROM crdb_internal.gossip_nodes;

Cluster Insights

-- Check transaction contention
SELECT * FROM crdb_internal.cluster_contention_events
ORDER BY num_contention_events DESC
LIMIT 10;

-- View active queries
SELECT 
  query_id,
  node_id,
  user_name,
  start,
  query
FROM crdb_internal.cluster_queries
WHERE application_name NOT LIKE '$ internal%';

-- Check cluster settings
SELECT 
  variable,
  value,
  description
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable LIKE 'sql.defaults%';

Alerting

Critical Alerts

Set up alerts for critical conditions:
Alert: Node becomes unavailable
up{job="cockroachdb"} == 0
Response: Investigate node immediately
Alert: Ranges become unavailable
ranges_unavailable > 0
Response: Check node liveness and network connectivity
Alert: Low disk space
(capacity_available / (capacity_available + capacity_used)) < 0.15
Response: Add storage or scale cluster
Alert: p99 latency exceeds 1 second (CockroachDB latency histograms are recorded in nanoseconds, so the threshold is 1e9)
histogram_quantile(0.99, rate(sql_exec_latency_bucket[5m])) > 1e9
Response: Investigate slow queries and resource usage

Warning Alerts

Alert: Ranges under-replicated for extended time
ranges_underreplicated > 0
Duration: 10 minutes
Response: Check node capacity and replication
Alert: Sustained high CPU
sys_cpu_user_percent > 80
Duration: 5 minutes
Response: Analyze workload and consider scaling
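The "Duration" attached to each warning means the condition must hold continuously for that long before the alert fires, which is roughly what the `for:` clause of a Prometheus alerting rule expresses. The sketch below illustrates the idea on a list of timestamped samples; `sustained` is a hypothetical helper, and real Prometheus evaluation works on rule-evaluation intervals rather than raw samples.

```python
def sustained(samples, threshold, duration_s):
    """Return True if the most recent samples have all exceeded `threshold`
    for at least `duration_s` seconds.

    `samples` is a list of (unix_seconds, value) pairs sorted by time.
    """
    streak_start = None
    # Walk backwards from the newest sample until the condition stops holding.
    for timestamp, value in reversed(samples):
        if value <= threshold:
            break
        streak_start = timestamp
    if streak_start is None:
        return False  # no samples, or the newest sample is below the threshold
    return samples[-1][0] - streak_start >= duration_s


if __name__ == "__main__":
    # Made-up CPU readings taken every 60 seconds.
    cpu = [(0, 85), (60, 90), (120, 88), (180, 91), (240, 86), (300, 87)]
    print(sustained(cpu, threshold=80, duration_s=300))
```

A single dip below the threshold resets the streak, which is what keeps short spikes from paging anyone.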

Grafana Integration

Visualize CockroachDB metrics with Grafana:
  1. Add Prometheus data source: Configure Grafana to use Prometheus as a data source.
  2. Import dashboard: CockroachDB provides official Grafana dashboards. Import one by dashboard ID or from its JSON file.
  3. Customize panels: Add custom panels for your specific monitoring needs.

Sample Grafana Queries

# Query rate
rate(sql_query_count[1m])

# p99 latency
histogram_quantile(0.99, rate(sql_exec_latency_bucket[5m]))

# Available capacity
sum(capacity_available) by (cluster)

# Active SQL connections
sum(sql_conns) by (node_id)
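The p99 query relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea in Python to show where the number comes from. It is a simplified illustration, not a full reimplementation of PromQL, and the bucket data in the usage example is made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile `q` from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +Inf, as in Prometheus histograms.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank falls in the open-ended bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            width = count - prev_count
            if width == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / width
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical cumulative counts per upper bound (units are whatever the histogram uses).
    example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.7, example))
```

Because the result is interpolated, its precision is limited by the bucket boundaries: a quantile can never be resolved more finely than the bucket it lands in.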

Best Practices

  1. Monitor all nodes: Don’t rely on single node metrics
  2. Set up alerting: Proactive alerts prevent outages
  3. Track baselines: Understand normal behavior
  4. Monitor trends: Capacity planning requires historical data
  5. Use external monitoring: Don’t rely solely on internal metrics
  6. Test alerts: Verify alert delivery and runbooks
  7. Monitor the monitors: Ensure monitoring system is healthy
  8. Document thresholds: Record why alert thresholds were chosen

Troubleshooting

Metrics Not Appearing

  • Verify /_status/vars endpoint is accessible
  • Check Prometheus scrape configuration
  • Review Prometheus logs for errors
  • Ensure no firewall is blocking port 8080

High Memory Usage

  • Check --cache and --max-sql-memory settings
  • Monitor sys_rss and sys_go_allocbytes
  • Review query patterns for memory-intensive operations
  • Consider adjusting GOMEMLIMIT

Performance Degradation

  • Check CPU and disk I/O metrics
  • Review SQL query latency percentiles
  • Investigate under-replicated or unavailable ranges
  • Analyze slow query log
  • Check for transaction contention
