Overview
Monitoring is essential for maintaining healthy CockroachDB clusters. CockroachDB provides multiple monitoring interfaces, including the Admin UI, HTTP endpoints, and Prometheus metrics exports for integration with external monitoring systems.
Admin UI
The Admin UI is a web-based interface for monitoring cluster health and performance.
Accessing the Admin UI
By default, the Admin UI runs on port 8080.
Admin UI Features
Cluster Overview
The overview page displays:
- Node count: Total nodes in cluster
- Capacity usage: Storage utilization
- QPS: Queries per second
- SQL connections: Active SQL connections
- Replication status: Under-replicated ranges
- Node health: Live vs. dead nodes
Metrics Dashboard
Time-series graphs for:
- SQL query latency (p50, p99, p99.9)
- QPS breakdown (reads vs. writes)
- CPU usage per node
- Memory usage
- Disk I/O (reads, writes, IOPS)
- Network throughput
- Replication lag
Node Map
Visual representation of:
- Node locations (based on locality)
- Node health status
- Replica distribution
- Locality-based organization
Database and Table Details
Per-database and per-table metrics:
- Table sizes
- Row counts
- Read/write activity
- Index usage statistics
- Replication status
Health Check Endpoints
Basic Health Check
Simple HTTP health check:
- 200 OK: Node is healthy
- 503 Service Unavailable: Node is unhealthy
Readiness Check
Check whether the node is ready to serve traffic. The check returns 200 OK when:
- Node has joined the cluster
- Node is not draining
- Node can communicate with majority of cluster
Use readiness checks for load balancer health checks and Kubernetes readiness probes.
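The status-code semantics above can be sketched as a small probe. This is a minimal illustration, not a reference client: it assumes the health endpoint path is `/health` and that readiness is requested with `?ready=1`, and the base URL passed in is a placeholder for one of your nodes.

```python
import urllib.error
import urllib.request


def interpret_health_status(status_code: int) -> str:
    """Map an HTTP status code from a health endpoint to a coarse state."""
    if status_code == 200:
        return "healthy"
    if status_code == 503:
        return "unhealthy"
    return "unknown"


def probe(base_url: str, ready: bool = False, timeout: float = 2.0) -> str:
    """Return 'healthy', 'unhealthy', 'unknown', or 'unreachable' for one node.

    Assumption: the endpoint is /health, with ?ready=1 for the readiness check.
    """
    url = base_url.rstrip("/") + "/health" + ("?ready=1" if ready else "")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_health_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code still carries meaning.
        return interpret_health_status(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return "unreachable"
```

For example, `probe("http://localhost:8080", ready=True)` would report whether a local node is ready for traffic, assuming it serves HTTP on the default port.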
Prometheus Metrics
CockroachDB exports metrics in Prometheus format for integration with monitoring systems.
Metrics Endpoint
Prometheus metrics are available at the /_status/vars endpoint.
Load Metrics Endpoint
The load endpoint exposes a subset of metrics for lightweight monitoring:
- CPU usage (user and system)
- Uptime
- Query rate
- Active connections
The /_status/load endpoint is unauthenticated and designed for external load balancers.
Prometheus Configuration
Configure Prometheus to scrape CockroachDB metrics in prometheus.yml.
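As a rough sketch of what such a scrape configuration might contain, the helper below renders a minimal prometheus.yml as text. The job name, scrape interval, and target hostnames are illustrative assumptions; only the `/_status/vars` metrics path and port 8080 come from this page.

```python
def render_scrape_config(targets, job_name="cockroachdb", interval="10s"):
    """Render a minimal prometheus.yml that scrapes /_status/vars on each target.

    job_name and interval are illustrative defaults, not recommended values.
    """
    target_list = ", ".join(f"'{t}'" for t in targets)
    return (
        "global:\n"
        f"  scrape_interval: {interval}\n"
        "scrape_configs:\n"
        f"  - job_name: '{job_name}'\n"
        "    metrics_path: '/_status/vars'\n"
        "    static_configs:\n"
        f"      - targets: [{target_list}]\n"
    )


# Hypothetical three-node cluster; replace with your own host:port pairs.
print(render_scrape_config(["node1:8080", "node2:8080", "node3:8080"]))
```

A secure cluster would likely also need `scheme: https` and TLS settings in the job, which this sketch leaves out.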
Key Metrics to Monitor
Node Health Metrics
Liveness and Availability
liveness_livenodes
- Number of live nodes in cluster
- Alert if drops below expected count
- Node identifier
- Track node restarts and failures
- Node uptime in seconds
- Detect unexpected restarts
Performance Metrics
SQL Performance
| Metric | Description | Alert Threshold |
|---|---|---|
| sql_query_count | Total queries executed | Monitor trends |
| sql_exec_latency_p50 | Median query latency | > 100ms |
| sql_exec_latency_p99 | 99th percentile latency | > 1s |
| sql_exec_latency_p999 | 99.9th percentile latency | > 5s |
| sql_conns | Active SQL connections | > 80% of max |
| sql_txn_abort_count | Transaction aborts | High abort rate |
Storage Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| capacity_available | Available disk space | < 15% free |
| capacity_used | Used disk space | Monitor trends |
| livedata | Live data bytes | Track growth |
| sysbytes | System data bytes | Unexpected growth |
| valbytes | Value bytes | Monitor trends |
| keybytes | Key bytes | Monitor trends |
Replication Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| ranges_unavailable | Unavailable ranges | > 0 |
| ranges_underreplicated | Under-replicated ranges | > 0 for extended time |
| ranges_overreplicated | Over-replicated ranges | > 0 for extended time |
| replicas | Total replicas on node | Imbalanced distribution |
| replicas_leaders | Raft leaders on node | Imbalanced distribution |
| replicas_leaseholders | Leaseholders on node | Imbalanced distribution |
System Resource Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| sys_cpu_user_percent | User CPU usage | > 80% sustained |
| sys_cpu_sys_percent | System CPU usage | > 30% sustained |
| sys_rss | Resident memory | > 90% of max |
| sys_go_allocbytes | Go allocated memory | Monitor trends |
| sys_host_disk_read_bytes | Disk read bytes | High I/O wait |
| sys_host_disk_write_bytes | Disk write bytes | High I/O wait |
| sys_host_net_recv_bytes | Network bytes received | Bandwidth saturation |
| sys_host_net_send_bytes | Network bytes sent | Bandwidth saturation |
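To make the thresholds in the tables above concrete, here is a minimal sketch that parses Prometheus text-format output and checks two of them: unavailable ranges and free disk space. It is a simplification rather than a full exposition-format parser. It assumes lines have the form `name{labels} value` with no trailing timestamp, and it approximates total capacity as `capacity_used + capacity_available`, which is an assumption made for illustration.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: [values]}.

    Simplification: assumes 'name{labels} value' lines with no timestamp.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        try:
            number = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
        samples.setdefault(name, []).append(number)
    return samples


def evaluate(samples, min_free_fraction=0.15):
    """Return a list of alert names triggered by the parsed samples."""
    def total(name):
        # Sum across label sets (for example, one sample per store).
        return sum(samples.get(name, []))

    alerts = []
    if total("ranges_unavailable") > 0:
        alerts.append("unavailable ranges")
    used, available = total("capacity_used"), total("capacity_available")
    # Assumption: used + available approximates total capacity.
    if used + available > 0 and available / (used + available) < min_free_fraction:
        alerts.append("low disk space")
    return alerts
```

In practice the input text would come from a node's `/_status/vars` response. The "sustained" and "extended time" thresholds need state across samples, so they are not covered by this single-snapshot check.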
Monitoring Queries
SQL-Based Monitoring
CockroachDB provides internal tables for monitoring.
Cluster Insights
Alerting
Critical Alerts
Set up alerts for critical conditions:
Node Unavailability
Unavailable Ranges
Disk Space
Alert: Low disk space
Response: Add storage or scale the cluster
High Query Latency
Alert: p99 latency exceeds threshold
Response: Investigate slow queries and resource usage
Warning Alerts
Under-replicated Ranges
Alert: Ranges under-replicated for extended time
Duration: 10 minutes
Response: Check node capacity and replication
High CPU Usage
Alert: Sustained high CPU
Duration: 5 minutes
Response: Analyze workload and consider scaling
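The warning alerts above fire only when a condition holds for a given duration. A minimal sketch of that "sustained breach" logic follows; the class name and fields are hypothetical, and in a real deployment this would more likely be expressed as an alerting rule in the monitoring system than as application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedAlert:
    """Fire only after a value has stayed above a threshold for duration_s seconds."""
    threshold: float
    duration_s: float
    breach_started: Optional[float] = None  # time the current breach began

    def observe(self, value: float, now: float) -> bool:
        """Record one sample taken at time `now` (seconds); return True if firing."""
        if value <= self.threshold:
            # Condition cleared: reset so a new breach starts its own timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.duration_s


# Under-replicated ranges: threshold 0, duration 10 minutes.
under_replicated = SustainedAlert(threshold=0, duration_s=600)
```

Resetting the timer on any healthy sample keeps brief spikes from paging anyone, at the cost of delaying alerts for a condition that flaps.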
Grafana Integration
Visualize CockroachDB metrics with Grafana.
Sample Grafana Queries
Best Practices
Monitoring Recommendations
- Monitor all nodes: Don’t rely on single node metrics
- Set up alerting: Proactive alerts prevent outages
- Track baselines: Understand normal behavior
- Monitor trends: Capacity planning requires historical data
- Use external monitoring: Don’t rely solely on internal metrics
- Test alerts: Verify alert delivery and runbooks
- Monitor the monitors: Ensure monitoring system is healthy
- Document thresholds: Record why alert thresholds were chosen
Troubleshooting
Metrics Not Appearing
- Verify the /_status/vars endpoint is accessible
- Check Prometheus scrape configuration
- Review Prometheus logs for errors
- Ensure no firewall is blocking port 8080
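A quick way to separate network problems from configuration problems is to test whether a TCP connection to the node's HTTP port succeeds from the Prometheus host. The sketch below assumes the default port 8080 mentioned above; the function name is illustrative.

```python
import socket


def can_connect(host: str, port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure, and timeout.
        return False
```

If this returns False from the Prometheus host but True on the node itself, a firewall or network policy is a likely cause. If it returns True everywhere, the scrape configuration is the next place to look.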
High Memory Usage
- Check --cache and --max-sql-memory settings
- Monitor sys_rss and sys_go_allocbytes
- Review query patterns for memory-intensive operations
- Consider adjusting GOMEMLIMIT
Performance Degradation
- Check CPU and disk I/O metrics
- Review SQL query latency percentiles
- Investigate under-replicated or unavailable ranges
- Analyze slow query log
- Check for transaction contention
See Also
- Metrics and Observability - Detailed metrics reference
- Logging - Log collection and analysis
- Configuration - Performance tuning settings