Monitoring and Observability

Active monitoring of your CockroachDB cluster is critical for maintaining health and performance in production. This guide covers the monitoring and observability tools available for CockroachDB.

Built-in Monitoring Tools

CockroachDB includes several tools to help you monitor cluster workloads and performance.

If a cluster becomes unavailable, most built-in monitoring tools also become unavailable. Configure a third-party monitoring tool like Prometheus or Datadog to collect metrics periodically from the Prometheus endpoint. These metrics are stored outside the cluster and can help troubleshoot what led to an outage.

DB Console

The DB Console provides a web UI for monitoring cluster health and performance. Access it at http://<node-host>:8080 (default port). Key features:

Real-time cluster metrics and time-series data
Node status and health information
SQL statement and transaction statistics
Storage and replication status
Hardware resource utilization

Metrics in DB Console are collected every 10 seconds by default and stored within the cluster. Data is retained at 10-second granularity for 10 days, and at 30-minute granularity for 90 days.

Prometheus Endpoint

Each node exports time-series metrics in Prometheus format at:

http://<node-host>:8080/_status/vars (legacy)
http://<node-host>:8080/metrics (v25.3+)

You can scrape these endpoints with Prometheus or other monitoring tools to:

Collect metrics at custom intervals
Store historical data according to your retention requirements
Create custom dashboards and alerts
Keep access to previously collected metrics even when the cluster is unavailable
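As a rough illustration of what a scraper receives from these endpoints, the sketch below parses a few lines of Prometheus text format into a dictionary. The sample payload, its label names, and the helper name parse_metrics are illustrative assumptions; real node output contains many more series, and production scraping is better left to Prometheus itself or an established client library.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into {(name, labels): value}.

    A simplified sketch: it skips comments and assumes label values contain
    no spaces and that lines carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


# Hypothetical sample payload, shaped like a node's metrics output.
SAMPLE = """\
# HELP sys_uptime Process uptime
# TYPE sys_uptime gauge
sys_uptime{node_id="1"} 86400
ranges_unavailable{store="1"} 0
"""

metrics = parse_metrics(SAMPLE)
print(metrics[("sys_uptime", 'node_id="1"')])
```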

Setting Up Prometheus Monitoring

Install Prometheus

Download the Prometheus tarball for your OS from prometheus.io/download
Extract the binary and add it to your system PATH
Verify installation:

prometheus --version

Configure Prometheus

Create a prometheus.yml configuration file:

global:
  scrape_interval: 10s
  external_labels:
    cluster: 'cockroachdb-prod'

scrape_configs:
  - job_name: 'cockroachdb'
    metrics_path: '/_status/vars'
    scheme: 'http'  # Use 'https' for secure clusters
    static_configs:
      - targets: 
        - 'node1:8080'
        - 'node2:8080'
        - 'node3:8080'

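rule_files:
  - 'rules/aggregation.rules.yml'
  - 'rules/alerts.rules.yml'

# Forward firing alerts to Alertmanager (9093 is its default port)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']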

For production clusters, update the targets field with actual hostnames and ports for all nodes. Ensure your network allows TCP communication on the specified ports.

Download alerting rules

CockroachDB provides pre-built Prometheus rules:

mkdir rules
cd rules

# Download aggregation rules
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/aggregation.rules.yml

# Download alerting rules
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/alerts.rules.yml

Start Prometheus

prometheus --config.file=prometheus.yml

Access the Prometheus UI at http://localhost:9090

Setting Up Alertmanager

Install Alertmanager

# Download from prometheus.io/download
# Extract and add to PATH
alertmanager --version

Configure notifications

Edit alertmanager.yml to specify notification receivers:

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#database-alerts'
        # Double quotes are required so that YAML turns \n into a line break
        text: "Summary: {{ .CommonAnnotations.summary }}\nDescription: {{ .CommonAnnotations.description }}"

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

Start Alertmanager

alertmanager --config.file=alertmanager.yml

Access the Alertmanager UI at http://localhost:9093

Key Metrics to Monitor

Node Health

Node status and availability

Monitor whether nodes are online and responsive:

Metric: Check node status on the DB Console Cluster Overview page
Alert: Node has been down for 15+ minutes
Prometheus metric: sys_uptime (node uptime in seconds)

# Alert when node is down
up{job="cockroachdb"} == 0
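The expression above is true as soon as a single scrape fails, whereas the alert described is for a node that stays down for 15 minutes or more. In a Prometheus alerting rule that hold time is normally expressed with a for: clause. The sketch below shows the same idea in plain Python; the function name down_for and the sample data are hypothetical, and Prometheus evaluates this on its own schedule rather than exactly as written here.

```python
def down_for(samples, hold_seconds):
    """Return True if the newest samples show up == 0 continuously
    for at least hold_seconds.

    samples is a list of (unix_timestamp, up_value) pairs, oldest first.
    """
    if not samples or samples[-1][1] != 0:
        return False
    latest_ts = samples[-1][0]
    down_since = latest_ts
    # Walk backwards until the node was last seen up.
    for ts, up in reversed(samples):
        if up != 0:
            break
        down_since = ts
    return latest_ts - down_since >= hold_seconds


# Hypothetical scrape results: one successful scrape, then 16 failures a minute apart.
history = [(0, 1)] + [(60 * i, 0) for i in range(1, 17)]
print(down_for(history, hold_seconds=15 * 60))
```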

CPU utilization

Monitor CPU consumption by CockroachDB processes:

Healthy range: 30-80% sustained usage
Warning: Consistently >80% indicates potential CPU starvation
Prometheus metric: sys_cpu_combined_percent_normalized

# Alert on high CPU usage (the metric is a gauge reported as a 0-1 fraction)
avg_over_time(sys_cpu_combined_percent_normalized[5m]) > 0.8

Memory usage

Track memory allocated to SQL and storage:

SQL Memory: Monitor sql.mem.current for active query memory
Warning: SQL memory >75% of --max-sql-memory indicates potential OOM risk
Prometheus metric: sys_rss (resident set size)

# Alert on high memory usage
sys_rss / sys_totalmem > 0.9

Performance Metrics

SQL query latency

Track time between query receipt and execution completion:

P99 latency: 99th percentile response time
Healthy: Under 100ms for simple queries
Prometheus metric: sql_exec_latency

# Alert on high P99 latency (values are in nanoseconds; 1e9 ns = 1 s)
histogram_quantile(0.99, rate(sql_exec_latency_bucket[5m])) > 1e9
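To make the P99 calculation less opaque, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach Prometheus documents for histogram_quantile(). The bucket values are invented for illustration, and the sketch assumes bucket bounds are expressed in nanoseconds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts; bounds assumed to be nanoseconds.
buckets = [(1e6, 50), (1e7, 90), (1e8, 99), (1e9, 100), (math.inf, 100)]
p99_ms = histogram_quantile(0.99, buckets) / 1e6
print(f"P99 is roughly {p99_ms:.0f} ms")
```

Because the estimate is interpolated within a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.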

Queries per second (QPS)

Monitor query throughput:

Metric: Total SELECT, INSERT, UPDATE, DELETE statements per second
Use: Identify traffic patterns and capacity planning
Prometheus metric: sql_select_count, sql_insert_count, etc.

# Total QPS
sum(rate(sql_select_count[5m])) + 
sum(rate(sql_insert_count[5m])) + 
sum(rate(sql_update_count[5m])) + 
sum(rate(sql_delete_count[5m]))
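The query above works because each statement metric is a monotonically increasing counter, so throughput is the increase divided by the elapsed time. The sketch below reproduces that arithmetic with two samples per counter. The sample values are hypothetical, and the helper is a simplification: Prometheus's rate() also extrapolates to the window edges, which this does not.

```python
def counter_rate(first, last):
    """Per-second rate between two (unix_timestamp, counter_value) samples.

    A drop in the counter is treated as a reset (for example, a node
    restart), in which case the newer value counts as the full increase.
    """
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


# Hypothetical counter samples taken 300 seconds apart, keyed by metric name.
window = {
    "sql_select_count": ((0, 120_000), (300, 150_000)),
    "sql_insert_count": ((0, 40_000), (300, 46_000)),
    "sql_update_count": ((0, 9_000), (300, 10_500)),
    "sql_delete_count": ((0, 1_000), (300, 1_300)),
}
total_qps = sum(counter_rate(a, b) for a, b in window.values())
print(total_qps)
```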

Storage Health

Disk capacity

Monitor available storage space:

Warning: Under 15% free space
Critical: Under 10% free space (nodes may shut down)
Prometheus metric: capacity_available, capacity_used

# Alert on low disk space
(capacity_available / (capacity_available + capacity_used)) < 0.15
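The same free-space ratio can be evaluated outside Prometheus, for example in a capacity report. This minimal sketch applies the 15% and 10% thresholds listed above; the function name disk_status and the returned labels are illustrative choices, not part of CockroachDB.

```python
def disk_status(capacity_available, capacity_used):
    """Classify free disk space using the 15% / 10% thresholds above."""
    total = capacity_available + capacity_used
    if total <= 0:
        raise ValueError("total capacity must be positive")
    free = capacity_available / total
    if free < 0.10:
        return "critical"
    if free < 0.15:
        return "warning"
    return "ok"


# Hypothetical store with 12 GiB free out of 100 GiB.
print(disk_status(12 * 2**30, 88 * 2**30))
```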

LSM health

Monitor Log-Structured Merge-tree health:

Healthy: Level 0 sublevels under 20
Warning: Level 0 sublevels 20-100 (compaction falling behind)
Critical: Level 0 sublevels over 100 (inverted LSM)
Prometheus metric: storage_l0_sublevels

# Alert on unhealthy LSM
storage_l0_sublevels > 20

Read amplification

Track the average number of disk reads needed per logical read operation:

Healthy: Under 10
Warning: 10-20 (compaction may be struggling)
Critical: Over 20 (indicates LSM health issues)
Prometheus metric: storage_read_amplification

Replication Status

Under-replicated ranges

Ranges with fewer replicas than configured:

Healthy: 0 under-replicated ranges
Warning: Any under-replicated ranges indicate data at risk
Prometheus metric: ranges_underreplicated

# Alert on under-replicated ranges
sum(ranges_underreplicated) > 0

Unavailable ranges

Ranges that cannot serve requests:

Critical: Any unavailable ranges require immediate attention
Prometheus metric: ranges_unavailable

# Alert on unavailable ranges
sum(ranges_unavailable) > 0

Health Check Endpoints

CockroachDB provides HTTP endpoints for health checks:

`/health` Endpoint

Basic node liveness check:

curl http://localhost:8080/health

Returns: HTTP 200 if node is running
Returns: Connection refused if node is down

`/health?ready=1` Endpoint

Node readiness check for load balancers:

curl 'http://localhost:8080/health?ready=1'

Returns: HTTP 200 if node is ready to serve traffic
Returns: HTTP 503 if node is draining or cannot reach cluster majority

Use /health?ready=1 in load balancer health checks to automatically route traffic away from nodes during rolling upgrades or maintenance.
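For custom tooling that needs the same signal a load balancer would use, the readiness check can be sketched with the Python standard library. The function name node_is_ready and the two-second timeout are assumptions for illustration; a secure cluster would use an https URL and need appropriate certificate handling, which is not shown here.

```python
import urllib.error
import urllib.request


def node_is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/health?ready=1 answers HTTP 200."""
    url = base_url.rstrip("/") + "/health?ready=1"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP 503 (draining), connection refused, or timeout: not ready.
        return False
```

A caller might run node_is_ready("http://localhost:8080") before sending traffic to a node, treating any False result as a reason to try another node.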

Grafana Dashboards

Visualize metrics with Grafana dashboards:

Install Grafana

Download and install from grafana.com/grafana/download

Add Prometheus datasource

Configure Prometheus as a datasource in Grafana:

Name: Prometheus
Type: Prometheus
URL: http://localhost:9090
Access: Direct

Import CockroachDB dashboards

Download and import pre-built dashboards:

# Runtime dashboard (uptime, memory, CPU)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/runtime.json

# Storage dashboard (storage availability)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/storage.json

# SQL dashboard (queries and transactions)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/sql.json

# Replication dashboard (replica operations)
curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/by-cluster/replication.json

Import these JSON files in the Grafana UI.

Best Practices

Alerting strategy

Configure alerts for critical metrics (unavailable ranges, node down, low disk space)
Set appropriate thresholds to avoid alert fatigue
Use different notification channels for different severity levels
Test alert notifications regularly
Document runbooks for common alert scenarios

Metric retention

Store high-resolution metrics for recent data (10s granularity for 1-7 days)
Use downsampled metrics for historical data (5m granularity for 30-90 days)
Archive long-term metrics to cold storage if needed for compliance
Plan storage capacity based on retention requirements
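For capacity planning, the Prometheus documentation gives a rough sizing formula: retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample (typically 1-2 bytes after compression). The sketch below applies it; the series count per node is a made-up figure, so measure your own cluster before relying on the result.

```python
def prometheus_disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster: 3 nodes exposing about 1,500 series each,
# scraped every 10 seconds and kept for 7 days.
estimate = prometheus_disk_bytes(series=3 * 1_500, scrape_interval_s=10, retention_days=7)
print(f"{estimate / 1e9:.2f} GB")
```

Lengthening the scrape interval or shortening retention reduces the estimate proportionally, which is the trade-off behind keeping high-resolution data only for recent days.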

Dashboard organization

Create separate dashboards for different teams (SRE, developers, management)
Include business metrics alongside technical metrics
Use consistent time ranges across related dashboards
Add annotations for deployments and configuration changes
Keep dashboards simple and focused on actionable insights
