CockroachDB’s storage layer is built on RocksDB, a high-performance embedded key-value store based on LevelDB. This layer provides persistent storage with multi-version concurrency control (MVCC) and efficient compaction.

Why RocksDB?

CockroachDB chose RocksDB for several compelling reasons:

Performance

Optimized for high write throughput and low read latency on SSDs.

LSM-Tree Architecture

Log-Structured Merge Tree design provides efficient writes and background compaction.

Production-Proven

Used by Facebook, LinkedIn, and many other large-scale systems.

Rich Feature Set

Supports prefix iteration, column families, snapshots, and custom comparators.
RocksDB is a variant of Google’s LevelDB with improvements for multi-threaded workloads and better configurability.

Architecture Overview

Store Organization

Node to Store Mapping

From docs/design.md:
Nodes contain one or more stores. Each store should be placed on a unique disk. Internally, each store contains a single instance of RocksDB with a block cache shared amongst all of the stores in a node.
CockroachDB Node
├── Store 1 (SSD /dev/sda)
│   ├── RocksDB instance
│   ├── Range Replica 1
│   ├── Range Replica 5
│   └── Range Replica 9
├── Store 2 (SSD /dev/sdb)
│   ├── RocksDB instance
│   ├── Range Replica 2
│   ├── Range Replica 6
│   └── Range Replica 10
└── Store 3 (HDD /dev/sdc)
    ├── RocksDB instance
    ├── Range Replica 3
    ├── Range Replica 7
    └── Range Replica 11
Multiple stores per node allow better utilization of multiple disks and provide isolation between different storage tiers (SSD vs. HDD).

Range Replicas in Stores

No two replicas of the same range are ever placed on the same store, or even on the same node.
Each store:
  • Contains multiple range replicas
  • Shares a single RocksDB instance across all ranges
  • Uses key prefixes to distinguish range data
  • Maintains store-local metadata (unreplicated)

Storage Engine Interface

CockroachDB abstracts storage through the Engine interface defined in pkg/storage/engine.go:
// Simplified from pkg/storage/engine.go
type Engine interface {
    // Core operations
    Get(key MVCCKey) ([]byte, error)
    Put(key MVCCKey, value []byte) error
    Delete(key MVCCKey) error
    
    // Iteration
    NewIterator(opts IterOptions) Iterator
    
    // Batch operations
    NewBatch() Batch
    ApplyBatchRepr(repr []byte) error
    
    // Snapshots
    NewSnapshot() Reader
    
    // Maintenance
    Flush() error
    Compact() error
}
This abstraction provides several benefits:
  • Testing: Can swap in in-memory implementations for tests
  • Flexibility: Easier to experiment with different storage engines
  • Portability: Storage-specific logic contained in wrapper layer
  • Feature additions: Can add MVCC, encryption, etc. transparently

Multi-Version Concurrency Control (MVCC)

CockroachDB implements MVCC by storing multiple timestamped versions of each key, allowing lock-free reads of historical data.

MVCC Keys

From pkg/storage/engine_key.go:
// MVCCKey is a versioned key
type MVCCKey struct {
    Key       roachpb.Key  // Logical key
    Timestamp hlc.Timestamp // Version timestamp
}
Physical encoding:
[Key][Timestamp Suffix]
  • Key: User-specified key bytes
  • Timestamp: Hybrid Logical Clock timestamp (wall time + logical)
  • Timestamp encoded in descending order for efficient iteration
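To make the descending-timestamp ordering concrete, here is a minimal sketch (hypothetical `Timestamp` and `MVCCKey` types; the real encoding packs the timestamp into a binary key suffix rather than using a comparator like this):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Timestamp is a simplified hybrid logical clock value (wall time + logical).
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// MVCCKey pairs a logical key with a version timestamp.
type MVCCKey struct {
	Key       []byte
	Timestamp Timestamp
}

// Less orders MVCC keys ascending by key bytes, then *descending* by
// timestamp, so a seek to (key, readTs) lands on the newest version <= readTs.
func (a MVCCKey) Less(b MVCCKey) bool {
	if c := bytes.Compare(a.Key, b.Key); c != 0 {
		return c < 0
	}
	// Newer timestamps sort first within the same key.
	if a.Timestamp.WallTime != b.Timestamp.WallTime {
		return a.Timestamp.WallTime > b.Timestamp.WallTime
	}
	return a.Timestamp.Logical > b.Timestamp.Logical
}

func main() {
	keys := []MVCCKey{
		{Key: []byte("a"), Timestamp: Timestamp{WallTime: 100}},
		{Key: []byte("a"), Timestamp: Timestamp{WallTime: 300}},
		{Key: []byte("a"), Timestamp: Timestamp{WallTime: 200}},
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i].Less(keys[j]) })
	for _, k := range keys {
		fmt.Printf("%s@%d\n", k.Key, k.Timestamp.WallTime)
	}
	// Prints a@300, then a@200, then a@100: newest version first.
}
```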

Versioned Values

From the design document:
Cockroach maintains historical versions of values by storing them with associated commit timestamps. Reads and scans can specify a snapshot time to return the most recent writes prior to the snapshot timestamp.
Example:
Key: /users/1/name

[/users/1/name @ t=100] -> "Alice"
[/users/1/name @ t=200] -> "Alice Smith"
[/users/1/name @ t=300] -> "Alicia Smith"
Read at t=150 returns “Alice”
Read at t=250 returns “Alice Smith”
Read at t=350 returns “Alicia Smith”

MVCC Operations

Read the value of a key at a specific timestamp:
func MVCCGet(
    engine Reader,
    key roachpb.Key,
    timestamp hlc.Timestamp,
) (value roachpb.Value, err error)
Seeks to MVCCKey{key, timestamp} and reads first version ≤ timestamp.
Write a new version of a key:
func MVCCPut(
    engine ReadWriter,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
) error
Writes MVCCKey{key, timestamp} -> value. Does not delete old versions.
Delete a key by writing a tombstone:
func MVCCDelete(
    engine ReadWriter,
    key roachpb.Key,
    timestamp hlc.Timestamp,
) error
Writes an empty value at the timestamp. Old versions remain for historical reads.
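A toy in-memory store (hypothetical names, not CockroachDB code) that mirrors these three operations, including the tombstone behavior:

```go
package main

import (
	"fmt"
	"sort"
)

// version is one timestamped value of a key.
type version struct {
	ts      int64
	value   string
	deleted bool // tombstone written by Delete
}

// mvccStore keeps every timestamped version of each key.
type mvccStore struct {
	versions map[string][]version // per key, sorted by ascending ts
}

func newMVCCStore() *mvccStore { return &mvccStore{versions: map[string][]version{}} }

// Put appends a new version; old versions are never overwritten.
func (s *mvccStore) Put(key string, ts int64, value string) {
	s.insert(key, version{ts: ts, value: value})
}

// Delete writes a tombstone; historical reads below ts still see the value.
func (s *mvccStore) Delete(key string, ts int64) {
	s.insert(key, version{ts: ts, deleted: true})
}

func (s *mvccStore) insert(key string, v version) {
	vs := append(s.versions[key], v)
	sort.Slice(vs, func(i, j int) bool { return vs[i].ts < vs[j].ts })
	s.versions[key] = vs
}

// Get returns the newest version written at or before readTs.
func (s *mvccStore) Get(key string, readTs int64) (string, bool) {
	vs := s.versions[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].ts <= readTs {
			if vs[i].deleted {
				return "", false
			}
			return vs[i].value, true
		}
	}
	return "", false // key did not exist yet at readTs
}

func main() {
	s := newMVCCStore()
	s.Put("/users/1/name", 100, "Alice")
	s.Put("/users/1/name", 200, "Alice Smith")
	s.Delete("/users/1/name", 300)
	fmt.Println(s.Get("/users/1/name", 150)) // Alice true
	fmt.Println(s.Get("/users/1/name", 250)) // Alice Smith true
	fmt.Println(s.Get("/users/1/name", 350)) // "" false (tombstone)
}
```

Note how the delete at t=300 hides the value from later reads without destroying the history: reads at earlier timestamps still succeed.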

Garbage Collection

From the design document:
Older versions of values are garbage collected by the system during compaction according to a user-specified expiration interval. In order to support long-running scans (e.g. for MapReduce), all versions have a minimum expiration.
1. Identify old versions: the GC process scans for versions older than the GC threshold.
2. Respect the minimum TTL: a minimum version retention period is honored so long-running scans can complete.
3. Delete during compaction: a RocksDB compaction filter removes eligible old versions.
4. Update intent resolution: ensures no active transactions reference the old versions.
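The retention rule can be sketched as a pure function (hypothetical `gcOldVersions`; in practice this decision runs inside a RocksDB compaction filter): versions at or above the GC threshold are kept, as is the newest version below it, which is still needed to serve reads at exactly the threshold.

```go
package main

import "fmt"

// gcOldVersions takes a key's version timestamps in ascending order and a
// GC threshold (now minus the retention TTL). It keeps every version at or
// above the threshold, plus the newest version below it.
func gcOldVersions(tss []int64, threshold int64) []int64 {
	var kept []int64
	newestBelow := -1
	for i, ts := range tss {
		if ts >= threshold {
			kept = append(kept, ts)
		} else {
			newestBelow = i // ascending order: last match is the newest below
		}
	}
	if newestBelow >= 0 {
		kept = append([]int64{tss[newestBelow]}, kept...)
	}
	return kept
}

func main() {
	// Versions at t=100, 200, 300 with a GC threshold of t=250:
	fmt.Println(gcOldVersions([]int64{100, 200, 300}, 250)) // [200 300]
}
```

Only the t=100 version is collected: t=200 survives because it is the version a read at t=250 would observe.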

Key Space Organization

CockroachDB uses key prefixes to organize different types of data:

System Keys

From docs/design.md:
Global keys (\x02..., \x03...):
  • Meta1 range metadata: \x02<key> → range info
  • Meta2 range metadata: \x03<key> → range info
  • Cluster-wide allocators and configuration
Store local keys (unreplicated):
  • Store identity
  • Store-specific metadata
  • Not replicated via Raft
Range local keys:
  • Transaction records: \x01k<key>txn-<txnID>
  • Range metadata associated with global keys
Replicated Range ID local keys:
  • Range lease state
  • Abort span entries
  • Updated via Raft operations
Unreplicated Range ID local keys:
  • Raft state
  • Raft log entries
  • Local to each replica

Table Data Keys

From the SQL section of docs/design.md:
Database ID: 51
Table ID: 42
Primary key: "Apple"
Column ID: 69 (address), 66 (url)

Physical keys:
/51/42/Apple/69 -> "1 Infinite Loop, Cupertino, CA"
/51/42/Apple/66 -> "http://apple.com/"
Prefix compression in RocksDB makes this scheme efficient despite repeated prefixes.
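The logical layout of these keys can be sketched as follows (hypothetical `tableKey` helper; the real encoding is binary and order-preserving, not a printable string):

```go
package main

import "fmt"

// tableKey renders the table-data key layout described above:
// /<databaseID>/<tableID>/<primaryKey>/<columnID>.
func tableKey(dbID, tableID int, pk string, colID int) string {
	return fmt.Sprintf("/%d/%d/%s/%d", dbID, tableID, pk, colID)
}

func main() {
	fmt.Println(tableKey(51, 42, "Apple", 69)) // /51/42/Apple/69
	fmt.Println(tableKey(51, 42, "Apple", 66)) // /51/42/Apple/66
}
```

Because all columns of a row share the `/db/table/pk` prefix, they sort adjacently, which is what makes prefix compression effective.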

RocksDB Integration

LSM Tree Structure

Write Path:
Client Write
       │
       ▼
┌─────────────┐
│  MemTable   │ (In-memory, sorted)
│   (Active)  │
└──────┬──────┘
       │ Full
       ▼
┌─────────────┐
│  MemTable   │ (Being flushed)
│ (Immutable) │
└──────┬──────┘
       │ Flush
       ▼
┌─────────────┐
│  L0 SSTable │ (On disk)
└──────┬──────┘
       │ Compaction
       ▼
┌─────────────┐
│  L1 SSTable │
└──────┬──────┘
       │ Compaction
       ▼
┌─────────────┐
│  L2-L6...   │
└─────────────┘

Write Amplification

LSM trees have inherent write amplification: data is written multiple times during compaction. Example:
  • Initial write: 1x (to MemTable)
  • Flush to L0: 1x
  • Compact to L1: 1x
  • Compact to L2+: 1x or more
Total: 4-10x write amplification is typical

Read Amplification

Reads may need to check multiple levels:
1. Check MemTable (active)
2. Check MemTable (immutable)
3. Check L0 SSTables (may overlap)
4. Check L1 SSTable
5. Check L2 SSTable
... continue until found
Bloom filters reduce read amplification by quickly determining if a key might exist in an SSTable before performing expensive I/O.
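A minimal Bloom filter sketch shows the "definitely absent" vs. "possibly present" check (hypothetical types; RocksDB's filters use more efficient block-based hashing):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: it can answer "definitely absent" or
// "possibly present", letting a read skip an SSTable without any disk I/O.
type bloom struct {
	bits []bool
	k    int // number of hash probes per key
}

func newBloom(m, k int) *bloom { return &bloom{bits: make([]bool, m), k: k} }

// idx derives the i-th probe position for key using a salted FNV hash.
func (b *bloom) idx(key string, i int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, key)
	return int(h.Sum64() % uint64(len(b.bits)))
}

func (b *bloom) Add(key string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.idx(key, i)] = true
	}
}

// MayContain reports false only when the key was definitely never added;
// true means the SSTable must actually be searched.
func (b *bloom) MayContain(key string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.idx(key, i)] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	f.Add("/users/1/name")
	fmt.Println(f.MayContain("/users/1/name")) // true: must search the SSTable
	fmt.Println(f.MayContain("/users/2/name")) // almost certainly false: skip it
}
```

False positives are possible (a "possibly present" key may still be absent), but false negatives are not, so skipping on `false` is always safe.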

Compaction Strategies

RocksDB supports multiple compaction strategies.
Level Compaction (default for CockroachDB):
  • Each level contains non-overlapping SSTables (except L0)
  • Size of each level grows exponentially
  • Good space efficiency
  • More write amplification
Universal Compaction:
  • All files at same level
  • Periodic full compaction
  • Less write amplification
  • More space amplification

Batch Operations

From pkg/storage/batch.go:
type Batch interface {
    Engine
    Commit() error
    Repr() []byte
}
Batches provide atomic operations:

1. Create a batch:

batch := engine.NewBatch()

2. Add operations:

batch.Put(key1, value1)
batch.Put(key2, value2)
batch.Delete(key3)

3. Commit atomically:

err := batch.Commit()

All operations in a batch are atomic: either all succeed or all fail.
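The batch pattern can be sketched against a plain map (hypothetical `memBatch`; a real engine makes the commit atomic through a single write-ahead-log record):

```go
package main

import "fmt"

// op is one buffered write; a batch collects ops without touching the store.
type op struct {
	key, value string
	del        bool
}

// memBatch buffers operations in memory; they become visible only when
// Commit applies them all in one step.
type memBatch struct {
	ops []op
	db  map[string]string
}

func (b *memBatch) Put(k, v string) { b.ops = append(b.ops, op{key: k, value: v}) }
func (b *memBatch) Delete(k string) { b.ops = append(b.ops, op{key: k, del: true}) }

// Commit applies every buffered op. In a real engine this is a single
// atomic WAL write, so after a crash either all ops persist or none do.
func (b *memBatch) Commit() {
	for _, o := range b.ops {
		if o.del {
			delete(b.db, o.key)
		} else {
			b.db[o.key] = o.value
		}
	}
	b.ops = nil
}

func main() {
	db := map[string]string{"stale": "x"}
	batch := &memBatch{db: db}
	batch.Put("k1", "v1")
	batch.Delete("stale")
	fmt.Println(db["k1"]) // "" - nothing visible before Commit
	batch.Commit()
	fmt.Println(db["k1"]) // v1
}
```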

Snapshots

Snapshots provide a consistent view of the database at a point in time, useful for backups and long-running scans.
snap := engine.NewSnapshot()
defer snap.Close()

// Read from snapshot - sees consistent state
value, err := snap.Get(key)

// Meanwhile, writes continue to engine
engine.Put(key, newValue)

// Snapshot still sees old value
oldValue, _ := snap.Get(key)
Implementation:
  • RocksDB maintains reference counts on SSTables
  • Snapshot prevents compaction of referenced data
  • Minimal overhead for short-lived snapshots

Performance Tuning

Block Cache

Shared across all stores in a node:
// Simplified configuration
cache := NewRocksDBCache(cacheSize)
engine := NewRocksDB(config, cache)
Default: 25% of system memory
  • Reduces read amplification
  • Caches frequently accessed blocks
  • Shared across all ranges in stores
  • LRU eviction policy
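The LRU eviction policy can be sketched with Go's container/list (a simplification: RocksDB's block cache is sharded and concurrency-safe, and evicts by byte size rather than entry count):

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	block []byte
}

// lruCache sketches LRU eviction: a map for O(1) lookup plus a list
// ordered from most- to least-recently used.
type lruCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> list element holding an *entry
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lruCache) Get(key string) ([]byte, bool) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el) // touch: now most recently used
		return el.Value.(*entry).block, true
	}
	return nil, false
}

func (c *lruCache) Put(key string, block []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).block = block
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, block})
	if c.order.Len() > c.cap {
		old := c.order.Back() // least recently used entry
		c.order.Remove(old)
		delete(c.items, old.Value.(*entry).key)
	}
}

func main() {
	cache := newLRU(2)
	cache.Put("block-a", []byte("...a..."))
	cache.Put("block-b", []byte("...b..."))
	cache.Get("block-a")                    // touch a
	cache.Put("block-c", []byte("...c...")) // evicts b, the LRU entry
	_, ok := cache.Get("block-b")
	fmt.Println(ok) // false
}
```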

Write Buffering

MemTable Size: Larger MemTables reduce flush frequency but use more memory.
Write Buffer Count: Multiple MemTables allow concurrent flushing.
Trade-off: Memory usage vs. write amplification.

Compaction Threads

RocksDB supports background compaction threads. CockroachDB configures this based on core count to balance throughput and latency.

Storage Metrics

Key metrics exposed:

Disk Usage

Total bytes stored per store

Compaction Stats

Bytes read/written during compaction

Read/Write Throughput

Operations per second and bytes/sec

LSM Health

Number of SSTables per level
Monitor via:
  • CockroachDB Admin UI
  • Prometheus metrics endpoint
  • SHOW RANGES SQL command

Backup and Restore

The storage layer supports efficient backup.
Full Backup:
// Create RocksDB checkpoint
checkpoint := engine.CreateCheckpoint(path)
Incremental Backup:
// Export SSTable files since timestamp
files := engine.ExportFilesModifiedSince(timestamp)
Backups are consistent at the MVCC timestamp level, ensuring transactional consistency without stopping writes.

Encryption at Rest

CockroachDB can encrypt stored data:
CockroachDB MVCC Layer
         |
         v
Encryption Layer (AES-256)
         |
         v
  RocksDB Engine
         |
         v
    Disk Storage
Implemented via:
  • Custom RocksDB Env
  • Encrypts before writing to disk
  • Decrypts when reading
  • Key management via external KMS

Implementation Files

Key source locations:
Core Storage:
  • pkg/storage/engine.go - Engine interface
  • pkg/storage/batch.go - Batch operations
  • pkg/storage/pebble.go - Pebble implementation (RocksDB alternative)
MVCC:
  • pkg/storage/mvcc.go - MVCC operations
  • pkg/storage/engine_key.go - MVCC key encoding
RocksDB Wrapper:
  • c-deps/ - RocksDB C++ library
  • pkg/storage/ - Go wrapper and integration

Further Reading

Replication Layer

How Raft uses storage

Transaction Layer

MVCC and transactions
