Monitoring

Overview

Monitoring is essential for maintaining healthy CockroachDB clusters. CockroachDB provides multiple monitoring interfaces including the Admin UI, HTTP endpoints, and Prometheus metrics exports for integration with external monitoring systems.

Admin UI

The Admin UI is a web-based interface for monitoring cluster health and performance.

Accessing the Admin UI

By default, the Admin UI runs on port 8080:

http://localhost:8080

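For secure clusters, the Admin UI is served over https and requires signing in as a SQL user with a password. The built-in admin role already exists, so create a separate user and grant it that role:

-- Create a user for Admin UI access and grant it the admin role
CREATE USER monitor_admin WITH PASSWORD 'secure_password';
GRANT admin TO monitor_admin;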

Admin UI Features

Cluster Overview

The overview page displays:

Node count: Total nodes in cluster
Capacity usage: Storage utilization
QPS: Queries per second
SQL connections: Active SQL connections
Replication status: Under-replicated ranges
Node health: Live vs. dead nodes

Metrics Dashboard

Time-series graphs for:

SQL query latency (p50, p99, p99.9)
QPS breakdown (reads vs. writes)
CPU usage per node
Memory usage
Disk I/O (reads, writes, IOPS)
Network throughput
Replication lag

Node Map

Visual representation of:

Node locations (based on locality)
Node health status
Replica distribution
Locality-based organization

Database and Table Details

Per-database and per-table metrics:

Table sizes
Row counts
Read/write activity
Index usage statistics
Replication status

Health Check Endpoints

Basic Health Check

Simple HTTP health check:

curl http://localhost:8080/health

Returns:

200 OK: Node is healthy
503 Service Unavailable: Node is unhealthy

Readiness Check

Check if node is ready to serve traffic:

curl 'http://localhost:8080/health?ready=1'

Returns 200 OK when:

Node has joined the cluster
Node is not draining
Node can communicate with majority of cluster

Use readiness checks for load balancer health checks and Kubernetes readiness probes.
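As a minimal sketch of how an external checker might consume the readiness endpoint, the Python function below treats HTTP 200 as ready and everything else (a 503 response, a refused connection, a timeout) as not ready. The function name `is_ready` and its `opener` parameter are illustrative assumptions, not part of CockroachDB, and the URL assumes an insecure local node as in the examples above.

```python
import urllib.request


def is_ready(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True only if GET <base_url>/health?ready=1 answers HTTP 200."""
    url = base_url.rstrip("/") + "/health?ready=1"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib raises HTTPError/URLError (both OSError subclasses) for
        # non-2xx responses such as 503, refused connections, and timeouts.
        return False


# Example call (requires a running node):
#   is_ready("http://localhost:8080")
```

Passing the opener as a parameter keeps the function easy to exercise without a live cluster.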

Prometheus Metrics

CockroachDB exports metrics in Prometheus format for integration with monitoring systems.

Metrics Endpoint

Access Prometheus metrics:

curl http://localhost:8080/_status/vars

Example output:

# HELP replicas Number of replicas
# TYPE replicas gauge
replicas{store="1"} 142

# HELP capacity_available Available storage capacity
# TYPE capacity_available gauge  
capacity_available{store="1"} 95821094912

# HELP sql_conns Number of active SQL connections
# TYPE sql_conns gauge
sql_conns 23
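The output above is plain text, so a few lines of Python are enough to read it for ad hoc checks. This is a rough sketch rather than a full parser: it assumes one sample per line with no trailing timestamp, keeps the label set as part of the key, and skips comment lines. The name `parse_metrics` is hypothetical.

```python
def parse_metrics(text):
    """Map each series (name plus optional {labels}) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: the value is the last space-separated field.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Example:
#   metrics = parse_metrics(body_of_status_vars)
#   metrics['replicas{store="1"}']
```

For production use, a maintained Prometheus client library is a safer choice than hand-rolled parsing.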

Load Metrics Endpoint

Subset of metrics for lightweight monitoring:

curl http://localhost:8080/_status/load

Returns key metrics:

CPU usage (user and system)
Uptime
Query rate
Active connections

The /_status/load endpoint is unauthenticated and designed for external load balancers.

Prometheus Configuration

Configure Prometheus to scrape CockroachDB metrics:

prometheus.yml

scrape_configs:
  - job_name: 'cockroachdb'
    static_configs:
      - targets: 
        - 'node1:8080'
        - 'node2:8080'
        - 'node3:8080'
    metrics_path: '/_status/vars'
    scrape_interval: 10s

Add scrape configuration

Edit prometheus.yml with CockroachDB node endpoints.

Reload Prometheus

curl -X POST http://localhost:9090/-/reload

The reload endpoint responds only when Prometheus was started with --web.enable-lifecycle.

Verify targets

Check Prometheus UI at http://localhost:9090/targets

Key Metrics to Monitor

Node Health Metrics

Liveness and Availability

liveness_livenodes

Number of live nodes in cluster
Alert if drops below expected count

node_id

Node identifier
Track node restarts and failures

sys_uptime

Node uptime in seconds
Detect unexpected restarts
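One way to turn sys_uptime into a restart signal is to look for samples where uptime goes down instead of up. The sketch below assumes a list of uptime readings for a single node, ordered by scrape time; the function name `detect_restarts` is illustrative.

```python
def detect_restarts(uptime_samples):
    """Return the indexes of samples where uptime dropped below the previous one."""
    restarts = []
    for i in range(1, len(uptime_samples)):
        if uptime_samples[i] < uptime_samples[i - 1]:
            # Uptime only decreases when the process has restarted.
            restarts.append(i)
    return restarts


# Example: detect_restarts([100, 110, 120, 5, 15]) flags index 3.
```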

Performance Metrics

SQL Performance

Metric	Description	Alert Threshold
`sql_query_count`	Total queries executed	Monitor trends
`sql_exec_latency_p50`	Median query latency	> 100ms
`sql_exec_latency_p99`	99th percentile latency	> 1s
`sql_exec_latency_p999`	99.9th percentile latency	> 5s
`sql_conns`	Active SQL connections	> 80% of max
`sql_txn_abort_count`	Transaction aborts	High abort rate

Storage Metrics

Metric	Description	Alert Threshold
`capacity_available`	Available disk space	< 15% free
`capacity_used`	Used disk space	Monitor trends
`livedata`	Live data bytes	Track growth
`sysbytes`	System data bytes	Unexpected growth
`valbytes`	Value bytes	Monitor trends
`keybytes`	Key bytes	Monitor trends

Replication Metrics

Metric	Description	Alert Threshold
`ranges_unavailable`	Unavailable ranges	> 0
`ranges_underreplicated`	Under-replicated ranges	> 0 for extended time
`ranges_overreplicated`	Over-replicated ranges	> 0 for extended time
`replicas`	Total replicas on node	Imbalanced distribution
`replicas_leaders`	Raft leaders on node	Imbalanced distribution
`replicas_leaseholders`	Leaseholders on node	Imbalanced distribution
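"Imbalanced distribution" in the table above is left open. One possible way to quantify it, shown here purely as an assumption, is the largest deviation from the per-node mean expressed as a fraction of that mean. The function name and the idea of alerting on a fixed ratio are illustrative, not a CockroachDB recommendation.

```python
def imbalance_ratio(counts_by_node):
    """Largest absolute deviation from the mean, as a fraction of the mean."""
    values = list(counts_by_node.values())
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return max(abs(v - mean) for v in values) / mean


# Example: replicas per node taken from the `replicas` metric.
#   imbalance_ratio({"n1": 150, "n2": 100, "n3": 50}) is 0.5
```

A perfectly even cluster yields 0.0; what ratio counts as a problem depends on the workload and on how recently nodes were added.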

System Resource Metrics

Metric	Description	Alert Threshold
`sys_cpu_user_percent`	User CPU usage	> 80% sustained
`sys_cpu_sys_percent`	System CPU usage	> 30% sustained
`sys_rss`	Resident memory	> 90% of max
`sys_go_allocbytes`	Go allocated memory	Monitor trends
`sys_host_disk_read_bytes`	Disk read bytes	High I/O wait
`sys_host_disk_write_bytes`	Disk write bytes	High I/O wait
`sys_host_net_recv_bytes`	Network bytes received	Bandwidth saturation
`sys_host_net_send_bytes`	Network bytes sent	Bandwidth saturation

Monitoring Queries

SQL-Based Monitoring

CockroachDB provides internal tables for monitoring:

SELECT
  node_id,
  address,
  build_tag,
  started_at,
  is_live
FROM crdb_internal.gossip_nodes;

Cluster Insights

-- Check transaction contention
SELECT * FROM crdb_internal.cluster_contention_events
ORDER BY num_contention_events DESC
LIMIT 10;

-- View active queries
SELECT 
  query_id,
  node_id,
  user_name,
  start,
  query
FROM crdb_internal.cluster_queries
WHERE application_name NOT LIKE '$ internal%';

-- Check cluster settings
SELECT 
  variable,
  value,
  description
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable LIKE 'sql.defaults%';

Alerting

Critical Alerts

Set up alerts for critical conditions:

Node Unavailability

Alert: Node becomes unavailable

up{job="cockroachdb"} == 0

Response: Investigate node immediately

Unavailable Ranges

Alert: Ranges become unavailable

ranges_unavailable > 0

Response: Check node liveness and network connectivity

Disk Space

Alert: Low disk space

(capacity_available / (capacity_available + capacity_used)) < 0.15

Response: Add storage or scale cluster
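The disk space expression above is a plain ratio of free bytes to total bytes. As a worked sketch in Python, with hypothetical helper names and the same 15% threshold:

```python
def free_fraction(capacity_available, capacity_used):
    """Fraction of total capacity that is still free."""
    total = capacity_available + capacity_used
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return capacity_available / total


def low_disk(capacity_available, capacity_used, threshold=0.15):
    """Mirror the alert condition: fire when the free fraction is below threshold."""
    return free_fraction(capacity_available, capacity_used) < threshold


# Example: 10 GiB free and 90 GiB used gives a free fraction of 0.1,
# which is below 0.15, so low_disk(10 * 2**30, 90 * 2**30) is True.
```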

High Query Latency

Alert: p99 latency exceeds 1 second (the latency histogram is recorded in nanoseconds)

histogram_quantile(0.99, rate(sql_exec_latency_bucket[5m])) > 1e9

Response: Investigate slow queries and resource usage
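To make the latency expression less opaque, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach histogram_quantile takes. It is not Prometheus's implementation: it takes bucket counts directly rather than per-second rates, and it assumes 0 < q <= 1 and a final +Inf bucket. The bucket bounds in the example are assumed to be nanoseconds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite bound instead of infinity.
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Example buckets: 50 observations <= 0.1s, 90 <= 1s, 100 <= 10s.
#   histogram_quantile(0.5, [(1e8, 50), (1e9, 90), (1e10, 100), (math.inf, 100)])
```

Because the result is interpolated, its accuracy depends on how finely the buckets are spaced around the quantile of interest.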

Warning Alerts

Under-replicated Ranges

Alert: Ranges under-replicated for extended time

ranges_underreplicated > 0

Duration: 10 minutes

Response: Check node capacity and replication

High CPU Usage

Alert: Sustained high CPU

sys_cpu_user_percent > 0.8

Duration: 5 minutes

Response: Analyze workload and consider scaling
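The duration on a warning alert means the condition must hold continuously for that long before the alert fires. The sketch below illustrates that logic in Python over a list of (timestamp in seconds, value) samples ordered by time. The function name `sustained` is hypothetical, and the example uses the under-replicated ranges condition with a 10 minute duration.

```python
def sustained(samples, threshold, duration_s):
    """True if value > threshold has held without a break for duration_s
    seconds, measured up to the most recent sample."""
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None  # condition cleared, reset the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s


# Example: ranges_underreplicated sampled once a minute.
#   sustained([(t, 3) for t in range(0, 601, 60)], threshold=0, duration_s=600)
```

A single sample back at or below the threshold resets the clock, which is what keeps brief blips during rebalancing from paging anyone.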

Grafana Integration

Visualize CockroachDB metrics with Grafana:

Add Prometheus data source

Configure Grafana to use Prometheus as a data source.

Import dashboard

CockroachDB provides official Grafana dashboards. Import one by entering its dashboard ID or uploading its JSON file.

Customize panels

Add custom panels for your specific monitoring needs.

Sample Grafana Queries

# Query rate
rate(sql_query_count[1m])

# p99 latency
histogram_quantile(0.99, rate(sql_exec_latency_bucket[5m]))

# Available capacity
sum(capacity_available) by (cluster)

# Active SQL connections
sum(sql_conns) by (node_id)
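The query rate example relies on rate() turning an ever-increasing counter into a per-second value. As a simplified illustration in Python, the sketch below averages the increase over a window of (timestamp in seconds, counter value) samples and treats any decrease as a counter reset. Prometheus's own rate() also extrapolates to the window boundaries, which this sketch leaves out; the name `simple_rate` is an assumption.

```python
def simple_rate(samples):
    """Average per-second increase of a counter across the given samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the new value
        # is itself the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Example: sql_query_count scraped every 30 seconds.
#   simple_rate([(0, 100), (30, 160), (60, 220)]) is 2.0 queries per second
```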

Best Practices

Monitoring Recommendations

Monitor all nodes: Don’t rely on single node metrics
Set up alerting: Proactive alerts prevent outages
Track baselines: Understand normal behavior
Monitor trends: Capacity planning requires historical data
Use external monitoring: Don’t rely solely on internal metrics
Test alerts: Verify alert delivery and runbooks
Monitor the monitors: Ensure monitoring system is healthy
Document thresholds: Record why alert thresholds were chosen

Troubleshooting

Metrics Not Appearing

Verify /_status/vars endpoint is accessible
Check Prometheus scrape configuration
Review Prometheus logs for errors
Ensure no firewall blocking port 8080

High Memory Usage

Check --cache and --max-sql-memory settings
Monitor sys_rss and sys_go_allocbytes
Review query patterns for memory-intensive operations
Consider adjusting GOMEMLIMIT

Performance Degradation

Check CPU and disk I/O metrics
Review SQL query latency percentiles
Investigate under-replicated or unavailable ranges
Analyze slow query log
Check for transaction contention
