Overview
CockroachDB exposes hundreds of metrics for monitoring cluster health, performance, and resource usage. These metrics are available through the Admin UI, Prometheus endpoints, and SQL queries.
Metrics Architecture
CockroachDB metrics are organized into several categories:
- Node metrics: Per-node system resources and health
- Store metrics: Per-store storage and replication statistics
- SQL metrics: Query execution and connection statistics
- Replication metrics: Raft consensus and replica health
- Storage engine metrics: LSM tree and compaction statistics
Accessing Metrics
Prometheus Endpoint
The primary metrics endpoint, /_status/vars, exports metrics in Prometheus format. It is unauthenticated and designed for external monitoring systems.
Load Metrics Endpoint
Lightweight endpoint for basic health metrics:
- sys.cpu.user.percent: User CPU usage
- sys.cpu.sys.percent: System CPU usage
- sys.uptime: Node uptime in seconds
Admin UI
The Admin UI provides a visual metrics dashboard.
SQL Queries
Metrics can also be queried with SQL through internal tables.
Core Metrics Reference
Node Health Metrics
Liveness Metrics
| Metric | Type | Description |
|---|---|---|
| liveness.livenodes | Gauge | Number of live nodes in cluster |
| liveness.heartbeatlatency | Histogram | Liveness heartbeat latency |
| liveness.heartbeatfailures | Counter | Failed liveness heartbeats |
| liveness.epochincrements | Counter | Liveness epoch increments |
System Resource Metrics
| Metric | Type | Description |
|---|---|---|
| sys.cpu.user.percent | Gauge | User CPU usage (0-100%) |
| sys.cpu.sys.percent | Gauge | System CPU usage (0-100%) |
| sys.rss | Gauge | Resident set size (bytes) |
| sys.go.allocbytes | Gauge | Go allocated memory (bytes) |
| sys.uptime | Gauge | Node uptime (seconds) |
| sys.fd.open | Gauge | Open file descriptors |
| sys.fd.softlimit | Gauge | File descriptor soft limit |
Disk I/O Metrics
| Metric | Type | Description |
|---|---|---|
| sys.host.disk.read.bytes | Counter | Disk bytes read |
| sys.host.disk.write.bytes | Counter | Disk bytes written |
| sys.host.disk.read.count | Counter | Disk read operations |
| sys.host.disk.write.count | Counter | Disk write operations |
| sys.host.disk.iopsinprogress | Gauge | I/O operations in progress |
| sys.host.disk.weightediopsinprogress | Gauge | Weighted I/O queue length |
Network Metrics
| Metric | Type | Description |
|---|---|---|
| sys.host.net.recv.bytes | Counter | Network bytes received |
| sys.host.net.send.bytes | Counter | Network bytes sent |
| sys.host.net.recv.packets | Counter | Network packets received |
| sys.host.net.send.packets | Counter | Network packets sent |
SQL Metrics
Connection Metrics
| Metric | Type | Description |
|---|---|---|
| sql.conns | Gauge | Active SQL connections |
| sql.new_conns | Counter | New SQL connections created |
| sql.bytesIn | Counter | SQL bytes received |
| sql.bytesOut | Counter | SQL bytes sent |
Query Execution Metrics
| Metric | Type | Description |
|---|---|---|
| sql.query.count | Counter | Total queries executed |
| sql.select.count | Counter | SELECT queries |
| sql.insert.count | Counter | INSERT queries |
| sql.update.count | Counter | UPDATE queries |
| sql.delete.count | Counter | DELETE queries |
| sql.ddl.count | Counter | DDL statements |
| sql.exec.latency | Histogram | Query execution latency |
| sql.service.latency | Histogram | End-to-end query latency |
Transaction Metrics
| Metric | Type | Description |
|---|---|---|
| sql.txn.begin.count | Counter | Transactions started |
| sql.txn.commit.count | Counter | Transactions committed |
| sql.txn.abort.count | Counter | Transactions aborted |
| sql.txn.rollback.count | Counter | Transactions rolled back |
| sql.txn.latency | Histogram | Transaction latency |
Storage Metrics
Capacity Metrics
| Metric | Type | Description |
|---|---|---|
| capacity | Gauge | Total storage capacity (bytes) |
| capacity.available | Gauge | Available storage (bytes) |
| capacity.used | Gauge | Used storage (bytes) |
| livebytes | Gauge | Live data (bytes) |
| sysbytes | Gauge | System data (bytes) |
| valbytes | Gauge | Value bytes |
| keybytes | Gauge | Key bytes |
| intentbytes | Gauge | Intent bytes (uncommitted) |
LSM Metrics
| Metric | Type | Description |
|---|---|---|
| storage.l0-num-files | Gauge | L0 SSTable count |
| storage.l0-sublevels | Gauge | L0 sublevels |
| storage.marked-for-compaction-files | Gauge | Files marked for compaction |
| storage.compactions | Counter | Compaction operations |
| storage.flushes | Counter | Memtable flushes |
| storage.flush.bytes | Counter | Bytes flushed |
Read/Write Metrics
| Metric | Type | Description |
|---|---|---|
| rocksdb.read.bytes | Counter | RocksDB bytes read |
| rocksdb.write.bytes | Counter | RocksDB bytes written |
| rocksdb.block.cache.hits | Counter | Block cache hits |
| rocksdb.block.cache.misses | Counter | Block cache misses |
| rocksdb.bloom.filter.prefix.useful | Counter | Bloom filter hits |
Replication Metrics
Range Metrics
| Metric | Type | Description |
|---|---|---|
| ranges | Gauge | Total ranges on node |
| ranges.unavailable | Gauge | Unavailable ranges |
| ranges.underreplicated | Gauge | Under-replicated ranges |
| ranges.overreplicated | Gauge | Over-replicated ranges |
| ranges.quiescent | Gauge | Quiescent ranges |
| replicas | Gauge | Total replicas on node |
| replicas.leaders | Gauge | Raft leader replicas |
| replicas.leaseholders | Gauge | Lease holder replicas |
Raft Metrics
| Metric | Type | Description |
|---|---|---|
| raft.commandsapplied | Counter | Raft commands applied |
| raft.process.commandcommit.latency | Histogram | Command commit latency |
| raft.process.logcommit.latency | Histogram | Log commit latency |
| raft.ticks | Counter | Raft ticks |
| raft.rcvd.heartbeat | Counter | Heartbeats received |
| raft.enqueued.pending | Gauge | Pending Raft proposals |
Lease Metrics
| Metric | Type | Description |
|---|---|---|
| leases.success | Counter | Successful lease acquisitions |
| leases.error | Counter | Failed lease acquisitions |
| leases.transfers.success | Counter | Successful lease transfers |
| leases.transfers.error | Counter | Failed lease transfers |
| leases.expiration | Counter | Lease expirations |
KV Metrics
KV Operation Metrics
| Metric | Type | Description |
|---|---|---|
| txn.commits | Counter | KV transactions committed |
| txn.aborts | Counter | KV transactions aborted |
| txn.restarts | Histogram | Transaction restarts |
| requests.slow.raft | Counter | Slow Raft proposals |
| requests.slow.latch | Counter | Slow latch acquisitions |
| requests.slow.lease | Counter | Slow lease acquisitions |
Metric Types
CockroachDB uses standard Prometheus metric types:
Metric Type Descriptions
Counter
- Monotonically increasing value
- Examples: sql.query.count, txn.commits
- Use rate() or increase() in Prometheus
Gauge
- Current value that can increase or decrease
- Examples: sql.conns, capacity.available
- Use directly or with avg_over_time()
Histogram
- Distribution of values in buckets
- Examples: sql.exec.latency, raft.process.commandcommit.latency
- Use histogram_quantile() for percentiles
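Counters are read as rates because their raw value only ever grows. The sketch below illustrates the idea behind rate() in plain Python. It is a simplification: Prometheus also extrapolates to the edges of the query window, which this code does not attempt. The sample values are hypothetical, and `counter_rate` is an illustrative helper, not part of any CockroachDB or Prometheus API.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase across samples, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing overall")

    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, after a node
        # restart), so the new value is itself the increase since the reset.
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed


# Hypothetical sql.query.count samples taken 10 seconds apart,
# with a reset between the second and third sample.
samples = [(0.0, 100.0), (10.0, 160.0), (20.0, 20.0), (30.0, 80.0)]
print(f"queries per second: {counter_rate(samples):.2f}")
```

Treating a decrease as a reset is what keeps a restart from producing a large negative rate.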
Prometheus Queries
Common Query Patterns
Advanced Queries
Metric Labels
Metrics include labels for filtering and aggregation:
| Label | Description | Example |
|---|---|---|
| cluster | Cluster identifier | production |
| node_id | Node identifier | 1, 2, 3 |
| store | Store identifier | 1, 2 |
| instance | Node address | node1:8080 |
| job | Prometheus job name | cockroachdb |
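To show how labels support filtering, here is a minimal Python sketch that parses sample lines in the Prometheus text exposition format and groups one metric by its store label. The input text is made up for illustration, and the exact label set a real node emits may differ. The parser is deliberately small: it does not unescape label values or handle trailing timestamps.

```python
import re
from typing import Dict, Iterator, Tuple

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: name="value".
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        yield name, dict(_LABEL.findall(raw_labels or "")), float(value)


# Hypothetical scrape output; real output contains many more series.
EXAMPLE = """\
# HELP capacity_available Available storage capacity
# TYPE capacity_available gauge
capacity_available{node_id="1",store="1"} 1.5e+10
capacity_available{node_id="1",store="2"} 9e+09
sql_conns{node_id="1"} 12
"""

available_by_store = {
    labels["store"]: value
    for name, labels, value in parse_samples(EXAMPLE)
    if name == "capacity_available"
}
print(available_by_store)
```

In practice a Prometheus server does this parsing for you; the sketch is only meant to make the label structure concrete.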
Alerting Rules
Prometheus Alert Examples
SQL Metrics Queries
Query metrics via SQL:
Best Practices
Metrics Best Practices
- Monitor all nodes: Aggregate metrics across entire cluster
- Use percentiles: p99 and p999 reveal tail latencies
- Set up alerting: Proactive alerts prevent outages
- Track baselines: Understand normal operating ranges
- Monitor trends: Detect gradual degradation
- Use labels: Filter and aggregate by node, store, etc.
- Avoid high cardinality: Don’t create metrics per-user or per-query
- Sample appropriately: 10-30s scrape interval is typical
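The percentile advice above relies on estimating quantiles from histogram buckets. The following Python sketch shows the linear interpolation idea behind histogram_quantile() for cumulative buckets. It is a simplified illustration, not the Prometheus implementation, and the bucket boundaries and counts are hypothetical.

```python
import math
from typing import Sequence, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: Sequence[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets ending in +Inf."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Cannot interpolate into +Inf or an empty bucket;
                # fall back to the highest known lower bound.
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds: 900 of 1000 requests <= 10ms.
latency_buckets = [(0.01, 900), (0.1, 990), (1.0, 1000), (math.inf, 1000)]
p50 = histogram_quantile(0.50, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
print(f"p50={p50:.4f}s p99={p99:.4f}s")
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the buckets are spaced around the quantile of interest.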
Troubleshooting
High Query Latency
Investigate with these metrics:
- sql.exec.latency - Query execution time
- sql.service.latency - End-to-end latency
- requests.slow.latch - Latch contention
- requests.slow.raft - Raft consensus delays
Storage Issues
Monitor:
- capacity.available - Available disk space
- storage.l0-sublevels - Compaction backlog
- rocksdb.block.cache.hit-rate - Cache efficiency
- sys.host.disk.iopsinprogress - I/O queue depth
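Cache efficiency can also be derived from the rocksdb.block.cache.hits and rocksdb.block.cache.misses counters listed in the reference tables. The Python sketch below computes a hit ratio over a window from two counter snapshots. The snapshot values are hypothetical, the helper name is illustrative, and the code assumes neither counter reset during the window.

```python
from typing import Optional, Tuple

Snapshot = Tuple[float, float]  # (cache hits, cache misses) counter values


def cache_hit_ratio(start: Snapshot, end: Snapshot) -> Optional[float]:
    """Fraction of block cache lookups served from cache between snapshots."""
    hits = end[0] - start[0]
    misses = end[1] - start[1]
    if hits < 0 or misses < 0:
        raise ValueError("counter decreased; a reset occurred in the window")
    lookups = hits + misses
    if lookups == 0:
        return None  # no cache activity in the window
    return hits / lookups


# Hypothetical counter values scraped one minute apart.
ratio = cache_hit_ratio(start=(10_000, 2_000), end=(19_000, 3_000))
print(f"block cache hit ratio: {ratio:.2%}")
```

Using deltas rather than raw totals keeps the ratio focused on recent behavior instead of the node's entire uptime.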
Replication Problems
Check:
- ranges.unavailable - Critical replication issues
- ranges.underreplicated - Missing replicas
- liveness.livenodes - Node availability
- leases.transfers.error - Lease transfer failures
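These checks can be combined into a simple triage function. The Python sketch below takes a snapshot of gauge values, however they were collected, and returns findings in priority order. The function name, the thresholds (any value above zero, and fewer live nodes than expected), and the snapshot itself are assumptions made for illustration, not recommended production settings.

```python
from typing import Dict, List


def replication_findings(metrics: Dict[str, float], expected_nodes: int) -> List[str]:
    """Return human-readable findings for a snapshot of replication gauges."""
    findings: List[str] = []
    # Unavailable ranges come first because they are the most critical signal.
    if metrics.get("ranges.unavailable", 0) > 0:
        findings.append("critical: ranges.unavailable is above zero")
    if metrics.get("ranges.underreplicated", 0) > 0:
        findings.append("warning: ranges.underreplicated is above zero")
    # A missing value is treated as healthy here; stricter handling may be wanted.
    if metrics.get("liveness.livenodes", expected_nodes) < expected_nodes:
        findings.append("warning: liveness.livenodes is below the expected node count")
    return findings


# Hypothetical snapshot from a three-node cluster with one node down.
snapshot = {
    "ranges.unavailable": 0,
    "ranges.underreplicated": 4,
    "liveness.livenodes": 2,
}
for finding in replication_findings(snapshot, expected_nodes=3):
    print(finding)
```

leases.transfers.error is left out because it is a counter and would need a rate over time rather than a single snapshot.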
See Also
- Monitoring - Monitoring setup and configuration
- Logging - Log collection and analysis
- Configuration - Performance tuning settings