Overview

CockroachDB exposes hundreds of metrics for monitoring cluster health, performance, and resource usage. These metrics are available through the Admin UI, Prometheus endpoints, and SQL queries.

Metrics Architecture

CockroachDB metrics are organized into several categories:
  • Node metrics: Per-node system resources and health
  • Store metrics: Per-store storage and replication statistics
  • SQL metrics: Query execution and connection statistics
  • Replication metrics: Raft consensus and replica health
  • Storage engine metrics: LSM tree and compaction statistics

Accessing Metrics

Prometheus Endpoint

The primary metrics endpoint exports Prometheus-format metrics:
curl http://localhost:8080/_status/vars
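The /_status/vars endpoint is unauthenticated and designed for external monitoring systems such as Prometheus. In this output, dots and hyphens in metric names are replaced with underscores, so sql.query.count is exported as sql_query_count.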
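To make the output format concrete, here is a minimal Python sketch that parses Prometheus text exposition lines into (name, labels, value) tuples. The sample payload, its help text, and the function name parse_prometheus_text are illustrative assumptions, not actual CockroachDB output; a production scraper would normally rely on Prometheus itself or an existing client library.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that does not look like a sample
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, shaped like what a scrape might return.
SAMPLE_OUTPUT = """\
# HELP sql_conns Number of active SQL connections
# TYPE sql_conns gauge
sql_conns{node_id="1"} 12
# TYPE capacity_available gauge
capacity_available{node_id="1",store="1"} 5.0e+10
"""

for name, labels, value in parse_prometheus_text(SAMPLE_OUTPUT):
    print(name, labels, value)
```

In practice the text would come from an HTTP GET against /_status/vars rather than from an inline string.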

Load Metrics Endpoint

Lightweight endpoint for basic health metrics:
curl http://localhost:8080/_status/load
Returns:
  • sys.cpu.user.percent: User CPU usage
  • sys.cpu.sys.percent: System CPU usage
  • sys.uptime: Node uptime in seconds

Admin UI

Visual metrics dashboard:
http://localhost:8080/#/metrics

SQL Queries

Query metrics via internal tables:
SELECT * FROM crdb_internal.kv_store_status;
SELECT * FROM crdb_internal.node_metrics;

Core Metrics Reference

Node Health Metrics

| Metric | Type | Description |
|---|---|---|
| liveness.livenodes | Gauge | Number of live nodes in cluster |
| liveness.heartbeatlatency | Histogram | Liveness heartbeat latency |
| liveness.heartbeatfailures | Counter | Failed liveness heartbeats |
| liveness.epochincrements | Counter | Liveness epoch increments |

| Metric | Type | Description |
|---|---|---|
| sys.cpu.user.percent | Gauge | User CPU usage (0-100%) |
| sys.cpu.sys.percent | Gauge | System CPU usage (0-100%) |
| sys.rss | Gauge | Resident set size (bytes) |
| sys.go.allocbytes | Gauge | Go allocated memory (bytes) |
| sys.uptime | Gauge | Node uptime (seconds) |
| sys.fd.open | Gauge | Open file descriptors |
| sys.fd.softlimit | Gauge | File descriptor soft limit |

| Metric | Type | Description |
|---|---|---|
| sys.host.disk.read.bytes | Counter | Disk bytes read |
| sys.host.disk.write.bytes | Counter | Disk bytes written |
| sys.host.disk.read.count | Counter | Disk read operations |
| sys.host.disk.write.count | Counter | Disk write operations |
| sys.host.disk.iopsinprogress | Gauge | I/O operations in progress |
| sys.host.disk.weightediopsinprogress | Gauge | Weighted I/O queue length |

| Metric | Type | Description |
|---|---|---|
| sys.host.net.recv.bytes | Counter | Network bytes received |
| sys.host.net.send.bytes | Counter | Network bytes sent |
| sys.host.net.recv.packets | Counter | Network packets received |
| sys.host.net.send.packets | Counter | Network packets sent |

SQL Metrics

| Metric | Type | Description |
|---|---|---|
| sql.conns | Gauge | Active SQL connections |
| sql.new_conns | Counter | New SQL connections created |
| sql.bytesIn | Counter | SQL bytes received |
| sql.bytesOut | Counter | SQL bytes sent |

| Metric | Type | Description |
|---|---|---|
| sql.query.count | Counter | Total queries executed |
| sql.select.count | Counter | SELECT queries |
| sql.insert.count | Counter | INSERT queries |
| sql.update.count | Counter | UPDATE queries |
| sql.delete.count | Counter | DELETE queries |
| sql.ddl.count | Counter | DDL statements |
| sql.exec.latency | Histogram | Query execution latency |
| sql.service.latency | Histogram | End-to-end query latency |

| Metric | Type | Description |
|---|---|---|
| sql.txn.begin.count | Counter | Transactions started |
| sql.txn.commit.count | Counter | Transactions committed |
| sql.txn.abort.count | Counter | Transactions aborted |
| sql.txn.rollback.count | Counter | Transactions rolled back |
| sql.txn.latency | Histogram | Transaction latency |

Storage Metrics

| Metric | Type | Description |
|---|---|---|
| capacity | Gauge | Total storage capacity (bytes) |
| capacity.available | Gauge | Available storage (bytes) |
| capacity.used | Gauge | Used storage (bytes) |
| livebytes | Gauge | Live data (bytes) |
| sysbytes | Gauge | System data (bytes) |
| valbytes | Gauge | Value bytes |
| keybytes | Gauge | Key bytes |
| intentbytes | Gauge | Intent bytes (uncommitted) |

| Metric | Type | Description |
|---|---|---|
| storage.l0-num-files | Gauge | L0 SSTable count |
| storage.l0-sublevels | Gauge | L0 sublevels |
| storage.marked-for-compaction-files | Gauge | Files marked for compaction |
| storage.compactions | Counter | Compaction operations |
| storage.flushes | Counter | Memtable flushes |
| storage.flush.bytes | Counter | Bytes flushed |

| Metric | Type | Description |
|---|---|---|
| rocksdb.read.bytes | Counter | RocksDB bytes read |
| rocksdb.write.bytes | Counter | RocksDB bytes written |
| rocksdb.block.cache.hits | Counter | Block cache hits |
| rocksdb.block.cache.misses | Counter | Block cache misses |
| rocksdb.bloom.filter.prefix.useful | Counter | Bloom filter hits |

Replication Metrics

| Metric | Type | Description |
|---|---|---|
| ranges | Gauge | Total ranges on node |
| ranges.unavailable | Gauge | Unavailable ranges |
| ranges.underreplicated | Gauge | Under-replicated ranges |
| ranges.overreplicated | Gauge | Over-replicated ranges |
| ranges.quiescent | Gauge | Quiescent ranges |
| replicas | Gauge | Total replicas on node |
| replicas.leaders | Gauge | Raft leader replicas |
| replicas.leaseholders | Gauge | Lease holder replicas |

| Metric | Type | Description |
|---|---|---|
| raft.commandsapplied | Counter | Raft commands applied |
| raft.process.commandcommit.latency | Histogram | Command commit latency |
| raft.process.logcommit.latency | Histogram | Log commit latency |
| raft.ticks | Counter | Raft ticks |
| raft.rcvd.heartbeat | Counter | Heartbeats received |
| raft.enqueued.pending | Gauge | Pending Raft proposals |

| Metric | Type | Description |
|---|---|---|
| leases.success | Counter | Successful lease acquisitions |
| leases.error | Counter | Failed lease acquisitions |
| leases.transfers.success | Counter | Successful lease transfers |
| leases.transfers.error | Counter | Failed lease transfers |
| leases.expiration | Counter | Lease expirations |

KV Metrics

| Metric | Type | Description |
|---|---|---|
| txn.commits | Counter | KV transactions committed |
| txn.aborts | Counter | KV transactions aborted |
| txn.restarts | Histogram | Transaction restarts |
| requests.slow.raft | Counter | Slow Raft proposals |
| requests.slow.latch | Counter | Slow latch acquisitions |
| requests.slow.lease | Counter | Slow lease acquisitions |

Metric Types

CockroachDB uses standard Prometheus metric types:
Counter
  • Monotonically increasing value
  • Examples: sql.query.count, txn.commits
  • Use rate() or increase() in Prometheus
Gauge
  • Current value that can increase or decrease
  • Examples: sql.conns, capacity.available
  • Use directly or with avg_over_time()
Histogram
  • Distribution of values in buckets
  • Examples: sql.exec.latency, raft.process.commandcommit.latency
  • Use histogram_quantile() for percentiles
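The difference between these types is easiest to see in how each is turned into a number worth graphing. The following Python sketch approximates what rate() does for a counter and what histogram_quantile() does for cumulative buckets. It is a simplified illustration under stated assumptions (the function names and sample data are hypothetical, and Prometheus applies extra extrapolation rules not reproduced here).

```python
def counter_rate(samples):
    """Per-second increase of a counter from (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a node
    restart), so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def bucket_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound; the estimate interpolates
    linearly inside the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf") or count == lower_count:
                return lower_bound  # nothing to interpolate within
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical counter samples with a reset between t=10 and t=20.
query_count = [(0, 100), (10, 150), (20, 20), (30, 60)]
print(counter_rate(query_count))

# Hypothetical latency buckets in seconds: (upper bound, cumulative count).
latency_buckets = [(0.01, 50), (0.1, 90), (1.0, 100), (float("inf"), 100)]
print(bucket_quantile(0.95, latency_buckets))
```

A gauge needs no such treatment: its latest sample is already the value of interest.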

Prometheus Queries

Common Query Patterns

# Queries per second
rate(sql_query_count[1m])

# By node
sum(rate(sql_query_count[1m])) by (node_id)

Advanced Queries

# Transaction abort ratio: aborted transactions per transaction started
rate(sql_txn_abort_count[5m]) / rate(sql_txn_begin_count[5m])
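The ratio query above divides one counter's increase by another's over the same window. A minimal Python sketch of the same arithmetic, using two hypothetical snapshots of the exported counters, shows why the zero-denominator case needs handling (the function name and snapshot values are assumptions for illustration):

```python
def abort_ratio(earlier, later):
    """Fraction of started transactions that aborted between two snapshots."""
    begun = later["sql_txn_begin_count"] - earlier["sql_txn_begin_count"]
    aborted = later["sql_txn_abort_count"] - earlier["sql_txn_abort_count"]
    if begun <= 0:
        # No transactions started in the window, or a counter was reset.
        return None
    return aborted / begun


# Hypothetical snapshots taken five minutes apart.
snapshot_a = {"sql_txn_begin_count": 1000, "sql_txn_abort_count": 10}
snapshot_b = {"sql_txn_begin_count": 1500, "sql_txn_abort_count": 35}
print(abort_ratio(snapshot_a, snapshot_b))
```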

Metric Labels

Metrics include labels for filtering and aggregation:
| Label | Description | Example |
|---|---|---|
| cluster | Cluster identifier | production |
| node_id | Node identifier | 1, 2, 3 |
| store | Store identifier | 1, 2 |
| instance | Node address | node1:8080 |
| job | Prometheus job name | cockroachdb |
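Aggregating by a label means collapsing every series that shares a label value into one total, which is what sum(...) by (node_id) does in PromQL. A small Python sketch of that grouping, with hypothetical per-store samples and an assumed helper name sum_by:

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values grouped by one label.

    `samples` is a list of (labels, value) pairs; series that lack the
    label are grouped under the empty string.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-store samples for a single metric.
store_samples = [
    ({"node_id": "1", "store": "1"}, 10.0),
    ({"node_id": "1", "store": "2"}, 5.0),
    ({"node_id": "2", "store": "3"}, 7.0),
]
print(sum_by(store_samples, "node_id"))
```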

Alerting Rules

Prometheus Alert Examples

- alert: CockroachDBNodeDown
  expr: up{job="cockroachdb"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "CockroachDB node {{ $labels.instance }} is down"
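The for: 1m clause means the expression must hold continuously for a minute before the alert fires; a single failed scrape does not page anyone. The Python sketch below models that behavior over a series of evaluations. It is an approximation of the pending-then-firing logic, with a hypothetical function name and hypothetical evaluation timestamps.

```python
def first_firing_time(evaluations, hold_seconds):
    """Return the timestamp at which the alert would first fire, or None.

    `evaluations` is a time-ordered list of (timestamp, condition_true).
    The alert is pending from the first true evaluation and fires once the
    condition has stayed true for at least `hold_seconds`.
    """
    pending_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            pending_since = None  # condition cleared, start over
            continue
        if pending_since is None:
            pending_since = timestamp
        if timestamp - pending_since >= hold_seconds:
            return timestamp
    return None


# Hypothetical evaluations every 15 seconds; True means up == 0.
evaluations = [
    (0, True), (15, True), (30, False),   # brief blip, then recovery
    (45, True), (60, True), (75, True), (90, True), (105, True),
]
print(first_firing_time(evaluations, 60))
```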

SQL Metrics Queries

Query metrics via SQL:
SELECT
  node_id,
  store_id,
  (metrics->>'capacity.available')::BIGINT as available_bytes,
  (metrics->>'livebytes')::BIGINT as live_bytes,
  (metrics->>'replicas')::INT as replicas
FROM crdb_internal.kv_store_status;
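Once the store metrics are in hand, a common follow-up is to derive disk utilization from them. The Python sketch below assumes each row's metrics column has been fetched as a JSON string containing the capacity and capacity.available keys from the storage table above; the row contents and the function name are hypothetical, and no database driver is involved.

```python
import json


def store_utilization(metrics_json):
    """Return the used fraction of a store (0.0 to 1.0), or None if unknown."""
    metrics = json.loads(metrics_json)
    capacity = float(metrics.get("capacity", 0))
    available = float(metrics.get("capacity.available", 0))
    if capacity <= 0:
        return None  # capacity not reported yet
    return 1.0 - available / capacity


# Hypothetical value of the metrics column for one store.
row_metrics = '{"capacity": 1000, "capacity.available": 250, "livebytes": 500}'
print(store_utilization(row_metrics))
```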

Best Practices

  1. Monitor all nodes: Aggregate metrics across entire cluster
  2. Use percentiles: p99 and p999 reveal tail latencies
  3. Set up alerting: Proactive alerts prevent outages
  4. Track baselines: Understand normal operating ranges
  5. Monitor trends: Detect gradual degradation
  6. Use labels: Filter and aggregate by node, store, etc.
  7. Avoid high cardinality: Don’t create metrics per-user or per-query
  8. Sample appropriately: 10-30s scrape interval is typical
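To illustrate why item 2 recommends percentiles, the following Python sketch compares the mean with the median and p99 for a hypothetical latency sample in which 2% of requests are slow. The data and the nearest-rank percentile helper are illustrative assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is a number from 0 to 100."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [10] * 98 + [2000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)

print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

The mean lands between the two groups and describes neither, while p50 and p99 show the typical request and the slow tail separately.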

Troubleshooting

High Query Latency

Investigate with these metrics:
  • sql.exec.latency - Query execution time
  • sql.service.latency - End-to-end latency
  • requests.slow.latch - Latch contention
  • requests.slow.raft - Raft consensus delays

Storage Issues

Monitor:
  • capacity.available - Available disk space
  • storage.l0-sublevels - Compaction backlog
  • rocksdb.block.cache.hits and rocksdb.block.cache.misses - Cache efficiency (hit ratio)
  • sys.host.disk.iopsinprogress - I/O queue depth
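Cache efficiency is usually judged as a hit ratio derived from the rocksdb.block.cache.hits and rocksdb.block.cache.misses counters listed in the storage table. A minimal Python sketch, assuming the inputs are increases over the same time window (the function name and numbers are hypothetical):

```python
def cache_hit_ratio(hits_delta, misses_delta):
    """Fraction of block cache lookups served from cache in a window."""
    lookups = hits_delta + misses_delta
    if lookups == 0:
        return None  # no lookups in the window, ratio is undefined
    return hits_delta / lookups


# Hypothetical counter increases over the last five minutes.
print(cache_hit_ratio(900, 100))
```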

Replication Problems

Check:
  • ranges.unavailable - Critical replication issues
  • ranges.underreplicated - Missing replicas
  • liveness.livenodes - Node availability
  • leases.transfers.error - Lease transfer failures
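The first three checks can be combined into a simple triage function. The Python sketch below takes a snapshot of metric values and the number of nodes the operator expects; the function name, the finding strings, and the snapshot are hypothetical. It leaves out leases.transfers.error because that metric is a counter and is better judged by its rate than by its absolute value.

```python
def replication_findings(metrics, expected_nodes):
    """Return a list of findings from a snapshot of metric values.

    `metrics` maps metric names to their latest values; names that are
    missing from the snapshot are treated as 0.
    """
    findings = []
    if metrics.get("ranges.unavailable", 0) > 0:
        findings.append("unavailable ranges present")
    if metrics.get("ranges.underreplicated", 0) > 0:
        findings.append("under-replicated ranges present")
    if metrics.get("liveness.livenodes", 0) < expected_nodes:
        findings.append("fewer live nodes than expected")
    return findings


# Hypothetical snapshot from a three-node cluster with one node down.
snapshot = {
    "ranges.unavailable": 0,
    "ranges.underreplicated": 3,
    "liveness.livenodes": 2,
}
print(replication_findings(snapshot, expected_nodes=3))
```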
