Active monitoring of your CockroachDB cluster is critical for maintaining health and performance in production. This guide covers the monitoring and observability tools available for CockroachDB.

Built-in Monitoring Tools

CockroachDB includes several tools to help you monitor cluster workloads and performance.
If a cluster becomes unavailable, most built-in monitoring tools also become unavailable. Configure a third-party monitoring tool like Prometheus or Datadog to collect metrics periodically from the Prometheus endpoint. These metrics are stored outside the cluster and can help troubleshoot what led to an outage.

DB Console

The DB Console provides a web UI for monitoring cluster health and performance. Access it at http://<node-host>:8080 (default port). Key features:
  • Real-time cluster metrics and time-series data
  • Node status and health information
  • SQL statement and transaction statistics
  • Storage and replication status
  • Hardware resource utilization
Metrics in DB Console are collected every 10 seconds by default and stored within the cluster. Data is retained at 10-second granularity for 10 days, and at 30-minute granularity for 90 days.
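As a rough worked example of what those retention windows imply, the sketch below counts how many samples a single time series holds at each resolution. The function name is illustrative only; the inputs are the granularities and retention periods stated above.

```python
from datetime import timedelta


def samples_per_series(granularity: timedelta, retention: timedelta) -> int:
    """Number of samples one time series holds at a given resolution."""
    return int(retention / granularity)


# 10-second granularity kept for 10 days
high_res = samples_per_series(timedelta(seconds=10), timedelta(days=10))
# 30-minute granularity kept for 90 days
low_res = samples_per_series(timedelta(minutes=30), timedelta(days=90))

print(high_res, low_res)  # prints "86400 4320"
```

Multiplying these counts by the number of metrics and nodes gives a first-order feel for how much time-series data the cluster stores internally.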

Prometheus Endpoint

Each node exports time-series metrics in Prometheus format at:
  • http://<node-host>:8080/_status/vars (legacy)
  • http://<node-host>:8080/metrics (v25.3+)
You can scrape these endpoints with Prometheus or other monitoring tools to:
  • Collect metrics at custom intervals
  • Store historical data according to your retention requirements
  • Create custom dashboards and alerts
  • Keep previously collected metrics accessible even when the cluster is unavailable
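To get a feel for what a scraper sees, here is a minimal sketch that parses the Prometheus text format into a dictionary. It is a simplified parser, not a replacement for a real Prometheus client: it ignores HELP/TYPE metadata and optional timestamps. The function names and the sample payload are made up for illustration, and `fetch_metrics` assumes an insecure cluster reachable over plain HTTP.

```python
import urllib.request


def parse_prometheus_text(text: str) -> dict:
    """Parse Prometheus text exposition format into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled series: everything up to the last "}" is the series key
            head, _, rest = line.rpartition("}")
            series = head + "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[series] = float(fields[0])  # fields[1], if any, is a timestamp
    return samples


def fetch_metrics(base_url: str, path: str = "/_status/vars", timeout: float = 5.0) -> dict:
    """Fetch and parse metrics from one node (assumes plain HTTP, no auth)."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Hypothetical sample payload, not real cluster output
SAMPLE = """\
# HELP sys_uptime Process uptime
# TYPE sys_uptime gauge
sys_uptime 3600
ranges_unavailable{store="1"} 0
"""

parsed = parse_prometheus_text(SAMPLE)
print(parsed)
```

Against a running node you would call something like `fetch_metrics("http://node1:8080")`; secure clusters additionally need HTTPS and certificate handling, which this sketch leaves out.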

Setting Up Prometheus Monitoring

Step 1: Install Prometheus

  1. Download the Prometheus tarball for your OS from prometheus.io/download
  2. Extract the binary and add it to your system PATH
  3. Verify installation:
prometheus --version
Step 2: Configure Prometheus

Create a prometheus.yml configuration file:
global:
  scrape_interval: 10s
  external_labels:
    cluster: 'cockroachdb-prod'

scrape_configs:
  - job_name: 'cockroachdb'
    metrics_path: '/_status/vars'
    scheme: 'http'  # Use 'https' for secure clusters
    static_configs:
      - targets: 
        - 'node1:8080'
        - 'node2:8080'
        - 'node3:8080'

rule_files:
  - 'rules/aggregation.rules.yml'
  - 'rules/alerts.rules.yml'
For production clusters, update the targets field with actual hostnames and ports for all nodes. Ensure your network allows TCP communication on the specified ports.
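If the node list changes often, you may prefer to generate the scrape section rather than edit it by hand. The following is a minimal sketch that renders the same structure as the configuration above from a list of hostnames. The function name and defaults are hypothetical, and it builds the YAML with plain string formatting to avoid a third-party dependency.

```python
def render_scrape_config(hosts, port=8080, scheme="http", metrics_path="/_status/vars"):
    """Render a scrape_configs section for the given CockroachDB hosts."""
    lines = [
        "scrape_configs:",
        "  - job_name: 'cockroachdb'",
        f"    metrics_path: '{metrics_path}'",
        f"    scheme: '{scheme}'",
        "    static_configs:",
        "      - targets:",
    ]
    # One target entry per node, in host:port form
    lines += [f"        - '{host}:{port}'" for host in hosts]
    return "\n".join(lines) + "\n"


print(render_scrape_config(["node1", "node2", "node3"]))
```

The output can be pasted into prometheus.yml in place of the hand-written scrape_configs block; the global and rule_files sections are unaffected.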
Step 3: Download alerting rules

CockroachDB provides pre-built Prometheus rules:
mkdir rules
cd rules

# Download aggregation rules
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/aggregation.rules.yml

# Download alerting rules
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/alerts.rules.yml
Step 4: Start Prometheus

prometheus --config.file=prometheus.yml
Access the Prometheus UI at http://localhost:9090

Setting Up Alertmanager

Step 1: Install Alertmanager

# Download from prometheus.io/download
# Extract and add to PATH
alertmanager --version
Step 2: Configure notifications

Edit alertmanager.yml to specify notification receivers:
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#database-alerts'
        # Double quotes so that YAML turns \n into a real line break
        text: "Summary: {{ .CommonAnnotations.summary }}\nDescription: {{ .CommonAnnotations.description }}"

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
Step 3: Start Alertmanager

alertmanager --config.file=alertmanager.yml
Access the Alertmanager UI at http://localhost:9093
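Once Alertmanager is running, one way to confirm that notifications actually reach your receivers is to post a synthetic alert to its v2 HTTP API. The sketch below builds such a payload and sends it. The alert name, severity label, and annotation text are placeholders invented for this example, and it assumes Alertmanager is listening on localhost:9093 without authentication.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(cluster, now=None, duration=timedelta(minutes=5)):
    """Build a one-alert payload for Alertmanager's POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return [{
        # Labels are placeholders; 'cluster' matches the group_by setting above
        "labels": {"alertname": "TestNotification", "cluster": cluster, "severity": "info"},
        "annotations": {
            "summary": "Test alert",
            "description": "Synthetic alert to verify notification delivery.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + duration).isoformat(),  # alert resolves itself after this time
    }]


def send_alerts(alerts, base_url="http://localhost:9093"):
    """POST alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


payload = build_test_alert("cockroachdb-prod")
print(json.dumps(payload, indent=2))
# send_alerts(payload)  # uncomment when Alertmanager is running
```

With the route shown earlier, the test alert should be grouped by alertname and cluster and delivered to the default receiver after the group_wait interval.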

Key Metrics to Monitor

Node Health

Monitor whether nodes are online and responsive:
  • Metric: Check node status on DB Console Cluster Overview page
  • Alert: Node has been down for 15+ minutes
  • Prometheus metric: sys_uptime (node uptime in seconds)
# Alert when node is down
up{job="cockroachdb"} == 0
Monitor CPU consumption by CockroachDB processes:
  • Healthy range: 30-80% sustained usage
  • Warning: Consistently >80% indicates potential CPU starvation
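  • Prometheus metric: sys_cpu_combined_percent_normalized (gauge, reported as a fraction from 0 to 1)
# Alert on high CPU usage (the metric is a gauge, so average it instead of applying rate())
avg_over_time(sys_cpu_combined_percent_normalized[5m]) > 0.8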
Track memory allocated to SQL and storage:
  • SQL Memory: Monitor sql.mem.root.current for active query memory
  • Warning: SQL memory >75% of --max-sql-memory indicates potential OOM risk
  • Prometheus metric: sys_rss (resident set size)
# Alert on high memory usage
sys_rss / sys_totalmem > 0.9

Performance Metrics

Track time between query receipt and execution completion:
  • P99 latency: 99th percentile response time
  • Healthy: Under 100ms for simple queries
  • Prometheus metric: sql_service_latency (histogram, recorded in nanoseconds)
# Alert when P99 latency exceeds 1 second (1e9 ns)
histogram_quantile(0.99, rate(sql_service_latency_bucket[5m])) > 1e9
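To make the P99 calculation less opaque, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets using linear interpolation, which is the general approach histogram_quantile takes. It is a simplified illustration rather than Prometheus's exact implementation, and the bucket boundaries and counts are invented. CockroachDB records latency histograms in nanoseconds, so the result is converted to milliseconds for readability.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: iterable of (upper_bound, cumulative_count), including
    a final (math.inf, total_count) entry.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts; upper bounds are in nanoseconds
example_buckets = [(1e6, 50), (1e7, 90), (1e8, 99), (1e9, 100), (math.inf, 100)]

p99_ns = histogram_quantile(0.99, example_buckets)
print(f"P99 = {p99_ns / 1e6:.0f} ms")
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the buckets are spaced around the quantile of interest.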
Monitor query throughput:
  • Metric: Total SELECT, INSERT, UPDATE, DELETE statements per second
  • Use: Identify traffic patterns and capacity planning
  • Prometheus metric: sql_select_count, sql_insert_count, etc.
# Total QPS
sum(rate(sql_select_count[5m])) + 
sum(rate(sql_insert_count[5m])) + 
sum(rate(sql_update_count[5m])) + 
sum(rate(sql_delete_count[5m]))
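The statement counters only ever increase (until a node restarts), which is why the query wraps them in rate(). The sketch below shows the underlying idea: divide the increase between samples by the elapsed time, treating a drop in value as a counter reset. It is a simplified stand-in for rate() that omits Prometheus's window extrapolation, and the sample values are invented.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_timestamp, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, a node restart),
        # so everything counted since the reset is new.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical samples taken 10 seconds apart
selects = [(0, 1000), (10, 1500), (20, 2100)]
inserts = [(0, 200), (10, 260), (20, 300)]

total_qps = per_second_rate(selects) + per_second_rate(inserts)
print(total_qps)
```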

Storage Health

Monitor available storage space:
  • Warning: Under 15% free space
  • Critical: Under 10% free space (nodes may shut down)
  • Prometheus metric: capacity_available, capacity_used
# Alert on low disk space
(capacity_available / (capacity_available + capacity_used)) < 0.15
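The same thresholds can be applied outside Prometheus, for example in a pre-deployment check script. This is a minimal sketch that mirrors the expression above; the function name is hypothetical, and the 15% and 10% cutoffs are the warning and critical levels listed in this section.

```python
def disk_space_status(available_bytes, used_bytes, warn=0.15, crit=0.10):
    """Classify free disk space as 'ok', 'warning', or 'critical'."""
    total = available_bytes + used_bytes
    if total <= 0:
        raise ValueError("available + used must be positive")
    free_fraction = available_bytes / total
    if free_fraction < crit:
        return "critical"
    if free_fraction < warn:
        return "warning"
    return "ok"


# Hypothetical store with 12 GiB free and 88 GiB used
print(disk_space_status(12 * 2**30, 88 * 2**30))  # prints "warning"
```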
Monitor Log-Structured Merge-tree health:
  • Healthy: Fewer than 20 Level 0 sublevels
  • Warning: 20-100 Level 0 sublevels (compaction falling behind)
  • Critical: More than 100 Level 0 sublevels (inverted LSM)
  • Prometheus metric: storage_l0_sublevels
# Alert on unhealthy LSM
storage_l0_sublevels > 20
Track disk reads per logical SQL statement:
  • Healthy: Under 10
  • Warning: 10-20 (compaction may be struggling)
  • Critical: Over 20 (indicates LSM health issues)
  • Prometheus metric: rocksdb_read_amplification

Replication Status

Ranges with fewer replicas than configured:
  • Healthy: 0 under-replicated ranges
  • Warning: Any under-replicated ranges indicate data at risk
  • Prometheus metric: ranges_underreplicated
# Alert on under-replicated ranges
sum(ranges_underreplicated) > 0
Ranges that cannot serve requests:
  • Critical: Any unavailable ranges require immediate attention
  • Prometheus metric: ranges_unavailable
# Alert on unavailable ranges
sum(ranges_unavailable) > 0
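Expressions like these can also be evaluated on demand through the Prometheus HTTP API, which is handy for scripts and runbooks. The sketch below builds an instant-query URL and extracts values from the JSON response. It assumes Prometheus is reachable at localhost:9090 without authentication; the helper names and the sample response are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base_url="http://localhost:9090"):
    """URL for an instant query against the Prometheus HTTP API."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def extract_values(response):
    """Return [(labels, value)] from an instant-vector query response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    return [(r["metric"], float(r["value"][1])) for r in response["data"]["result"]]


def run_query(expr, base_url="http://localhost:9090"):
    """Execute the query and return the parsed values."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        return extract_values(json.load(resp))


# Hypothetical response for: sum(ranges_unavailable) > 0
sample_response = {
    "status": "success",
    "data": {"resultType": "vector", "result": [{"metric": {}, "value": [1700000000.0, "3"]}]},
}
print(extract_values(sample_response))
```

Note that a comparison such as `> 0` filters the result, so an empty list means the condition does not currently hold.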

Health Check Endpoints

CockroachDB provides HTTP endpoints for health checks:

/health Endpoint

Basic node liveness check:
curl http://localhost:8080/health
  • Returns: HTTP 200 if node is running
  • Returns: Connection refused if node is down

/health?ready=1 Endpoint

Node readiness check for load balancers:
# Quote the URL so the shell does not interpret the ? character
curl 'http://localhost:8080/health?ready=1'
  • Returns: HTTP 200 if node is ready to serve traffic
  • Returns: HTTP 503 if node is draining or cannot reach cluster majority
Use /health?ready=1 in load balancer health checks to automatically route traffic away from nodes during rolling upgrades or maintenance.
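If you need the readiness check inside your own tooling rather than a load balancer, the logic is small. Here is a minimal sketch that probes the endpoint and maps the outcomes described above to a routing decision. The function names and return strings are made up for this example, and it assumes a node reachable over plain HTTP.

```python
import urllib.error
import urllib.request


def classify_ready(status):
    """Map an HTTP status code (or None if unreachable) to a decision."""
    if status is None:
        return "down"        # connection refused or timed out
    if status == 200:
        return "ready"       # safe to route traffic to this node
    if status == 503:
        return "not_ready"   # draining or cannot reach a cluster majority
    return "unknown"


def probe_ready(base_url="http://localhost:8080", timeout=2.0):
    """Call /health?ready=1 on one node and classify the result."""
    try:
        with urllib.request.urlopen(base_url + "/health?ready=1", timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code    # non-2xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        status = None        # node is unreachable
    return classify_ready(status)


print(classify_ready(503))  # prints "not_ready"
```

Treating "not_ready" and "down" separately lets a script distinguish a node that is draining on purpose from one that has stopped responding.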

Grafana Dashboards

Visualize metrics with Grafana dashboards:
Step 1: Install Grafana

Download and install from grafana.com/grafana/download
Step 2: Add Prometheus datasource

Configure Prometheus as a datasource in Grafana:
  • Name: Prometheus
  • Type: Prometheus
  • URL: http://localhost:9090
  • Access: Direct
Step 3: Import CockroachDB dashboards

Download and import pre-built dashboards:
# Runtime dashboard (uptime, memory, CPU)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/runtime.json

# Storage dashboard (storage availability)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/storage.json

# SQL dashboard (queries and transactions)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/sql.json

# Replication dashboard (replica operations)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/replication.json
Import these JSON files through the Grafana UI.

Best Practices

  • Configure alerts for critical metrics (unavailable ranges, node down, low disk space)
  • Set appropriate thresholds to avoid alert fatigue
  • Use different notification channels for different severity levels
  • Test alert notifications regularly
  • Document runbooks for common alert scenarios
  • Store high-resolution metrics for recent data (10s granularity for 1-7 days)
  • Use downsampled metrics for historical data (5m granularity for 30-90 days)
  • Archive long-term metrics to cold storage if needed for compliance
  • Plan storage capacity based on retention requirements
  • Create separate dashboards for different teams (SRE, developers, management)
  • Include business metrics alongside technical metrics
  • Use consistent time ranges across related dashboards
  • Add annotations for deployments and configuration changes
  • Keep dashboards simple and focused on actionable insights
