Built-in Monitoring Tools
CockroachDB includes several tools to help you monitor cluster workloads and performance.
DB Console
The DB Console provides a web UI for monitoring cluster health and performance. Access it at http://<node-host>:8080 (default port).
Key features:
- Real-time cluster metrics and time-series data
- Node status and health information
- SQL statement and transaction statistics
- Storage and replication status
- Hardware resource utilization
Metrics in DB Console are collected every 10 seconds by default and stored within the cluster. Data is retained at 10-second granularity for 10 days, and at 30-minute granularity for 90 days.
Prometheus Endpoint
Each node exports time-series metrics in Prometheus format at:
- http://<node-host>:8080/_status/vars (legacy)
- http://<node-host>:8080/metrics (v25.3+)
Scraping these endpoints with an external Prometheus server lets you:
- Collect metrics at custom intervals
- Store historical data according to your retention requirements
- Create custom dashboards and alerts
- Access metrics even when the cluster is unavailable
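As a rough illustration of what the endpoint returns, the sketch below parses Prometheus text-format lines into a dictionary. The sample payload is invented for illustration (only the metric names come from this page), the helper names parse_metrics and fetch_metrics are hypothetical, and the parser assumes samples carry no trailing timestamp.

```python
from urllib.request import urlopen

# Invented sample payload for illustration; real output contains many more series.
SAMPLE = """\
# HELP sys_uptime Process uptime
# TYPE sys_uptime gauge
sys_uptime 3600
capacity_available{store="1"} 5e+10
"""


def parse_metrics(text):
    """Parse Prometheus text-format samples into {name: [(labels, value), ...]}.

    Assumes each sample line is `name{labels} value` with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        head, raw_value = line.rsplit(None, 1)  # value is the last field
        name, _, labels = head.partition("{")
        labels = labels.rstrip("}")
        samples.setdefault(name, []).append((labels, float(raw_value)))
    return samples


def fetch_metrics(base_url, path="/_status/vars", timeout=5.0):
    """Fetch and parse metrics from one node, e.g. base_url='http://localhost:8080'."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


if __name__ == "__main__":
    for name, series in parse_metrics(SAMPLE).items():
        print(name, series)
```

In practice a Prometheus server does this scraping for you; a script like this is mainly useful for spot checks against a single node.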
Setting Up Prometheus Monitoring
Install Prometheus
- Download the Prometheus tarball for your OS from prometheus.io/download
- Extract the binary and add it to your system PATH
- Verify the installation by running prometheus --version
Setting Up Alertmanager
Key Metrics to Monitor
Node Health
Node status and availability
Monitor whether nodes are online and responsive:
- Metric: Check node status on DB Console Cluster Overview page
- Alert: Node has been down for 15+ minutes
- Prometheus metric: sys_uptime (node uptime in seconds)
CPU utilization
Monitor CPU consumption by CockroachDB processes:
- Healthy range: 30-80% sustained usage
- Warning: Consistently >80% indicates potential CPU starvation
- Prometheus metric: sys_cpu_combined_percent_normalized
Memory usage
Track memory allocated to SQL and storage:
- SQL Memory: Monitor sql.mem.current for active query memory
- Warning: SQL memory >75% of --max-sql-memory indicates potential OOM risk
- Prometheus metric: sys_rss (resident set size)
Performance Metrics
SQL query latency
Track time between query receipt and execution completion:
- P99 latency: 99th percentile response time
- Healthy: Under 100ms for simple queries
- Prometheus metric: sql_exec_latency
Queries per second (QPS)
Monitor query throughput:
- Metric: Total SELECT, INSERT, UPDATE, DELETE statements per second
- Use: Identify traffic patterns and capacity planning
- Prometheus metrics: sql_select_count, sql_insert_count, etc.
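The per-statement metrics are cumulative counters, so a QPS figure has to be derived from the difference between two scrapes. The sketch below shows one way to do that by hand; the function name qps and the dictionary-based input shape are hypothetical, and it assumes a counter that goes backwards has been reset by a node restart.

```python
def qps(prev_counts, curr_counts, elapsed_seconds):
    """Estimate queries per second from two snapshots of cumulative counters.

    prev_counts / curr_counts map metric names (e.g. 'sql_select_count')
    to counter values taken elapsed_seconds apart.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total = 0.0
    for name, curr in curr_counts.items():
        prev = prev_counts.get(name, curr)  # unseen counter contributes nothing
        delta = curr - prev
        if delta < 0:
            # Assumption: the counter restarted from zero (node restart).
            delta = curr
        total += delta
    return total / elapsed_seconds


if __name__ == "__main__":
    before = {"sql_select_count": 1000, "sql_insert_count": 200}
    after = {"sql_select_count": 1600, "sql_insert_count": 260}
    print(qps(before, after, 10))  # 66.0
```

When the metrics are already in Prometheus, the built-in rate() function performs the same calculation, including reset handling.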
Storage Health
Disk capacity
Monitor available storage space:
- Warning: Under 15% free space
- Critical: Under 10% free space (nodes may shut down)
- Prometheus metrics: capacity_available, capacity_used
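The thresholds above can be turned into a simple check. This is a minimal sketch, not an official formula: the function name disk_status is hypothetical, and it approximates total capacity as capacity_available + capacity_used, which may differ slightly from the capacity the store itself reports.

```python
WARNING_FREE_FRACTION = 0.15   # under 15% free: warning
CRITICAL_FREE_FRACTION = 0.10  # under 10% free: critical


def disk_status(capacity_available, capacity_used):
    """Classify a store's free space as 'ok', 'warning', or 'critical'.

    Assumption: total capacity is approximated as available + used bytes.
    """
    total = capacity_available + capacity_used
    if total <= 0:
        raise ValueError("capacity values must sum to a positive number")
    free_fraction = capacity_available / total
    if free_fraction < CRITICAL_FREE_FRACTION:
        return "critical"
    if free_fraction < WARNING_FREE_FRACTION:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(disk_status(50e9, 50e9))  # ok
    print(disk_status(12e9, 88e9))  # warning
    print(disk_status(5e9, 95e9))   # critical
```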
LSM health
Monitor Log-Structured Merge-tree health:
- Healthy: Level 0 files under 20
- Warning: Level 0 files 20-100 (compaction falling behind)
- Critical: Level 0 files over 100 (inverted LSM)
- Prometheus metric: storage_l0_sublevels
Read amplification
Track disk reads per logical SQL statement:
- Healthy: Under 10
- Warning: 10-20 (compaction may be struggling)
- Critical: Over 20 (indicates LSM health issues)
- Prometheus metric: storage_read_amplification
Replication Status
Under-replicated ranges
Ranges with fewer replicas than configured:
- Healthy: 0 under-replicated ranges
- Warning: Any under-replicated ranges indicate data at risk
- Prometheus metric: ranges_underreplicated
Unavailable ranges
Health Check Endpoints
CockroachDB provides HTTP endpoints for health checks:
/health Endpoint
Basic node liveness check:
- Returns: HTTP 200 if node is running
- Returns: Connection refused if node is down
/health?ready=1 Endpoint
Node readiness check for load balancers:
- Returns: HTTP 200 if node is ready to serve traffic
- Returns: HTTP 503 if node is draining or cannot reach cluster majority
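The responses listed above can be probed from a script. The following is a minimal sketch using only the Python standard library; the function names interpret_ready and probe_ready and the returned state labels are hypothetical, and any status code other than 200 or 503 is simply reported as "unknown".

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def interpret_ready(status_code):
    """Map a /health?ready=1 result to a coarse node state.

    status_code is None when no HTTP response was received at all.
    """
    if status_code is None:
        return "down"        # connection refused or timed out
    if status_code == 200:
        return "ready"       # node can serve traffic
    if status_code == 503:
        return "not_ready"   # draining or cannot reach a cluster majority
    return "unknown"


def probe_ready(base_url, timeout=2.0):
    """Probe one node, e.g. base_url='http://localhost:8080'."""
    try:
        with urlopen(f"{base_url}/health?ready=1", timeout=timeout) as resp:
            return interpret_ready(resp.status)
    except HTTPError as err:
        # urlopen raises HTTPError for non-2xx responses such as 503.
        return interpret_ready(err.code)
    except (URLError, OSError):
        return interpret_ready(None)


if __name__ == "__main__":
    print(probe_ready("http://localhost:8080"))
```

Most load balancers have this kind of check built in, so a script like this is mainly useful in deployment tooling that needs to wait for a node to become ready.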
Use /health?ready=1 in load balancer health checks to automatically route traffic away from nodes during rolling upgrades or maintenance.
Grafana Dashboards
Visualize metrics with Grafana dashboards:
Install Grafana
Download and install from grafana.com/grafana/download
Add Prometheus datasource
Configure Prometheus as a datasource in Grafana:
- Name: Prometheus
- Type: Prometheus
- URL: http://localhost:9090
- Access: Direct
Best Practices
Alerting strategy
- Configure alerts for critical metrics (unavailable ranges, node down, low disk space)
- Set appropriate thresholds to avoid alert fatigue
- Use different notification channels for different severity levels
- Test alert notifications regularly
- Document runbooks for common alert scenarios
Metric retention
- Store high-resolution metrics for recent data (10s granularity for 1-7 days)
- Use downsampled metrics for historical data (5m granularity for 30-90 days)
- Archive long-term metrics to cold storage if needed for compliance
- Plan storage capacity based on retention requirements
Dashboard organization
- Create separate dashboards for different teams (SRE, developers, management)
- Include business metrics alongside technical metrics
- Use consistent time ranges across related dashboards
- Add annotations for deployments and configuration changes
- Keep dashboards simple and focused on actionable insights