Why RocksDB?
CockroachDB chose RocksDB for several compelling reasons:

- Performance: Optimized for high write throughput and low read latency on SSDs.
- LSM-Tree Architecture: Log-Structured Merge Tree design provides efficient writes and background compaction.
- Production-Proven: Used by Facebook, LinkedIn, and many other large-scale systems.
- Rich Feature Set: Supports prefix iteration, column families, snapshots, and custom comparators.
RocksDB is a variant of Google’s LevelDB with improvements for multi-threaded workloads and better configurability.
Architecture Overview
Store Organization
Node to Store Mapping
From `docs/design.md`:
Nodes contain one or more stores. Each store should be placed on a unique disk. Internally, each store contains a single instance of RocksDB with a block cache shared amongst all of the stores in a node.
Range Replicas in Stores
Each store:

- Contains multiple range replicas
- Shares a single RocksDB instance across all ranges
- Uses key prefixes to distinguish range data
- Maintains store-local metadata (unreplicated)
Storage Engine Interface
CockroachDB abstracts storage through the `Engine` interface defined in `pkg/storage/engine.go`:
Why an abstraction layer?
- Testing: Can swap in in-memory implementations for tests
- Flexibility: Easier to experiment with different storage engines
- Portability: Storage-specific logic contained in wrapper layer
- Feature additions: Can add MVCC, encryption, etc. transparently
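To make the testing benefit concrete, here is a minimal sketch of an Engine-style abstraction. The names (`Engine`, `memEngine`) are hypothetical illustrations, not CockroachDB's actual interface; the point is that callers code against a narrow contract and tests can swap in an in-memory implementation for a real RocksDB-backed one.

```go
package main

import "fmt"

// Engine is the narrow storage contract the rest of the system codes against.
// (Sketch only; the real interface in pkg/storage/engine.go is much richer.)
type Engine interface {
	Put(key, value string)
	Get(key string) (string, bool)
	Delete(key string)
}

// memEngine is an in-memory Engine, useful for unit tests.
type memEngine struct {
	data map[string]string
}

func newMemEngine() *memEngine { return &memEngine{data: map[string]string{}} }

func (e *memEngine) Put(key, value string) { e.data[key] = value }

func (e *memEngine) Get(key string) (string, bool) {
	v, ok := e.data[key]
	return v, ok
}

func (e *memEngine) Delete(key string) { delete(e.data, key) }

func main() {
	var eng Engine = newMemEngine() // callers see only the interface
	eng.Put("a", "1")
	v, ok := eng.Get("a")
	fmt.Println(v, ok) // 1 true
}
```

Because production code holds only the `Engine` interface, features like MVCC or encryption can be layered in as wrappers without touching callers.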
Multi-Version Concurrency Control (MVCC)
CockroachDB implements MVCC by storing multiple timestamped versions of each key, allowing lock-free reads of historical data.
MVCC Keys
From `pkg/storage/engine_key.go`:
- Key: User-specified key bytes
- Timestamp: Hybrid Logical Clock timestamp (wall time + logical)
- Timestamp encoded in descending order for efficient iteration
Versioned Values
From the design document: "Cockroach maintains historical versions of values by storing them with associated commit timestamps. Reads and scans can specify a snapshot time to return the most recent writes prior to the snapshot timestamp."

Example: a read at t=350 returns "Alicia Smith", the most recent version written at or before that timestamp.
MVCC Operations
MVCCGet
Read the value of a key at a specific timestamp: seeks to `MVCCKey{key, timestamp}` and reads the first version ≤ timestamp.

MVCCPut
Write a new version of a key: writes `MVCCKey{key, timestamp} -> value`. Does not delete old versions.

MVCCDelete
Delete a key by writing a tombstone: writes an empty value at the timestamp. Old versions remain for historical reads.
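The three operations above can be modeled with a toy in-memory MVCC store. This is a sketch of the semantics, not CockroachDB's implementation (which operates on the encoded keys described earlier); the names `mvccGet`, `mvccPut`, and `mvccDelete` mirror the operations but the types are invented for the example.

```go
package main

import "fmt"

// version is one timestamped value for a key; a tombstone is a version
// with deleted=true and no value.
type version struct {
	ts      uint64
	value   string
	deleted bool
}

type mvccStore struct{ versions map[string][]version }

func newMVCCStore() *mvccStore { return &mvccStore{versions: map[string][]version{}} }

// mvccPut writes a new version; old versions are retained.
func (s *mvccStore) mvccPut(key string, ts uint64, value string) {
	s.versions[key] = append(s.versions[key], version{ts: ts, value: value})
}

// mvccDelete writes a tombstone at ts; historical reads still see old values.
func (s *mvccStore) mvccDelete(key string, ts uint64) {
	s.versions[key] = append(s.versions[key], version{ts: ts, deleted: true})
}

// mvccGet returns the most recent version at or below ts.
func (s *mvccStore) mvccGet(key string, ts uint64) (string, bool) {
	var best *version
	for i := range s.versions[key] {
		v := &s.versions[key][i]
		if v.ts <= ts && (best == nil || v.ts > best.ts) {
			best = v
		}
	}
	if best == nil || best.deleted {
		return "", false
	}
	return best.value, true
}

func main() {
	s := newMVCCStore()
	s.mvccPut("name", 100, "Alice Smith")
	s.mvccPut("name", 300, "Alicia Smith")
	v, _ := s.mvccGet("name", 350)
	fmt.Println(v) // Alicia Smith
	v, _ = s.mvccGet("name", 200)
	fmt.Println(v) // Alice Smith
}
```

Note how a delete is just another version: reads at timestamps before the tombstone still succeed, which is what makes lock-free historical reads possible.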
Garbage Collection
From the design document: "Older versions of values are garbage collected by the system during compaction according to a user-specified expiration interval. In order to support long-running scans (e.g. for MapReduce), all versions have a minimum expiration."
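The pruning rule can be sketched as follows. This is a model of the policy only (the invented `gcVersions` helper, not CockroachDB's GC code): versions older than the GC threshold are garbage, except that the newest version at or below the threshold must survive so reads at the threshold still see a value.

```go
package main

import "fmt"

type ver struct{ ts uint64 }

// gcVersions drops versions below the GC threshold while keeping the newest
// version at or below it, which may still be visible to reads.
func gcVersions(versions []ver, threshold uint64) []ver {
	// Find the newest version at or below the threshold.
	var keepFrom uint64
	for _, v := range versions {
		if v.ts <= threshold && v.ts > keepFrom {
			keepFrom = v.ts
		}
	}
	// Everything older than keepFrom is garbage.
	out := versions[:0]
	for _, v := range versions {
		if v.ts >= keepFrom {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	vs := []ver{{100}, {200}, {300}, {400}}
	fmt.Println(gcVersions(vs, 250)) // [{200} {300} {400}]: only t=100 is garbage
}
```

The minimum expiration mentioned above simply bounds how low the threshold may be set, so a long-running scan never has its snapshot garbage-collected out from under it.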
Key Space Organization
CockroachDB uses key prefixes to organize different types of data:

System Keys
From `docs/design.md`:
Global keys (`\x02...`, `\x03...`):
- Meta1 range metadata: `\x02<key>` → range info
- Meta2 range metadata: `\x03<key>` → range info
- Cluster-wide allocators and configuration

Store-local keys:
- Store identity
- Store-specific metadata
- Not replicated via Raft

Range-local replicated keys:
- Transaction records: `\x01k<key>txn-<txnID>`
- Range metadata associated with global keys
- Range lease state
- Abort span entries
- Updated via Raft operations

Range-local unreplicated keys:
- Raft state
- Raft log entries
- Local to each replica
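The prefix-based layout above can be illustrated with a small classifier. This is a sketch keyed off the prefix bytes listed here (`\x01`, `\x02`, `\x03`); CockroachDB's real key parsing lives in its `keys` package and handles many more cases.

```go
package main

import "fmt"

// classifyKey buckets a raw key by its leading prefix byte, mirroring the
// key space layout above.
func classifyKey(key []byte) string {
	if len(key) == 0 {
		return "invalid"
	}
	switch key[0] {
	case 0x01:
		return "local" // store- and range-local keys (txn records, Raft state, ...)
	case 0x02:
		return "meta1" // first level of range metadata
	case 0x03:
		return "meta2" // second level of range metadata
	default:
		return "user" // table data and other global keys
	}
}

func main() {
	fmt.Println(classifyKey([]byte{0x02, 'a'})) // meta1
	fmt.Println(classifyKey([]byte("table-row-key"))) // user
}
```

Because all data lives in one RocksDB instance per store, this kind of prefix dispatch is what keeps system metadata, Raft state, and table data cleanly separated in the same sorted key space.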
Table Data Keys
From the SQL section of `docs/design.md`:
RocksDB Integration
LSM Tree Structure
Write Amplification
Read Amplification
Reads may need to check multiple levels: the MemTable first, then L0 SSTables, then each deeper level until the key is found.

Compaction Strategies
RocksDB supports multiple compaction strategies:

Level Compaction (default for CockroachDB):
- Each level contains non-overlapping SSTables (except L0)
- Size of each level grows exponentially
- Good space efficiency
- More write amplification

Universal Compaction:
- All files at same level
- Periodic full compaction
- Less write amplification
- More space amplification
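The exponential level sizing can be made concrete with a few lines of arithmetic. The helper below is illustrative (the growth factor corresponds to RocksDB's `max_bytes_for_level_multiplier`, commonly 10; the 64 MiB base is an arbitrary example, not CockroachDB's configured value).

```go
package main

import "fmt"

// levelCapacities computes leveled compaction's target sizes: each level's
// capacity is the previous level's multiplied by a growth factor.
func levelCapacities(baseBytes, factor int64, levels int) []int64 {
	out := make([]int64, levels)
	size := baseBytes
	for i := 0; i < levels; i++ {
		out[i] = size
		size *= factor
	}
	return out
}

func main() {
	// With a 64 MiB L1 and factor 10: 64 MiB, 640 MiB, 6400 MiB, 64000 MiB.
	for i, c := range levelCapacities(64<<20, 10, 4) {
		fmt.Printf("L%d: %d MiB\n", i+1, c>>20)
	}
}
```

The exponential growth is why read amplification stays logarithmic in data size: a handful of levels covers terabytes.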
Batch Operations
From `pkg/storage/batch.go`:
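The batching idea can be sketched as follows. The `writeBatch` type and its methods are hypothetical illustrations, not the actual `pkg/storage/batch.go` API: mutations are buffered and become visible only when the batch commits.

```go
package main

import "fmt"

// op is one buffered mutation.
type op struct {
	key, value string
	del        bool
}

// writeBatch buffers mutations and applies them on Commit. A real engine
// writes the whole batch under a single WAL entry so it is all-or-nothing.
type writeBatch struct{ ops []op }

func (b *writeBatch) Put(key, value string) {
	b.ops = append(b.ops, op{key: key, value: value})
}

func (b *writeBatch) Delete(key string) {
	b.ops = append(b.ops, op{key: key, del: true})
}

// Commit applies all buffered ops to the store in order, then resets the batch.
func (b *writeBatch) Commit(store map[string]string) {
	for _, o := range b.ops {
		if o.del {
			delete(store, o.key)
		} else {
			store[o.key] = o.value
		}
	}
	b.ops = nil
}

func main() {
	store := map[string]string{}
	b := &writeBatch{}
	b.Put("a", "1")
	b.Put("b", "2")
	b.Delete("a")
	fmt.Println(len(store)) // 0: nothing visible before Commit
	b.Commit(store)
	fmt.Println(store["b"], len(store)) // 2 1
}
```

Atomic batches are what let a single Raft command apply many key writes (data, MVCC metadata, stats) without exposing intermediate states.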
Snapshots
Snapshots provide a consistent view of the database at a point in time, useful for backups and long-running scans.
- RocksDB maintains reference counts on SSTables
- Snapshot prevents compaction of referenced data
- Minimal overhead for short-lived snapshots
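Under the hood, RocksDB snapshots are built on sequence numbers: every write is tagged with an increasing sequence, a snapshot just records the current one, and reads through the snapshot ignore newer entries. The toy store below models that mechanism (types and method names are invented for the sketch).

```go
package main

import "fmt"

// entry is one write, tagged with its sequence number.
type entry struct {
	seq   uint64
	value string
}

type seqStore struct {
	seq  uint64
	data map[string][]entry
}

func newSeqStore() *seqStore { return &seqStore{data: map[string][]entry{}} }

// Put assigns the next sequence number to the write.
func (s *seqStore) Put(key, value string) {
	s.seq++
	s.data[key] = append(s.data[key], entry{seq: s.seq, value: value})
}

// Snapshot captures the current sequence number; that's all a snapshot is.
func (s *seqStore) Snapshot() uint64 { return s.seq }

// GetAt reads the newest value written at or before snapshot snap.
func (s *seqStore) GetAt(key string, snap uint64) (string, bool) {
	var best *entry
	for i := range s.data[key] {
		e := &s.data[key][i]
		if e.seq <= snap && (best == nil || e.seq > best.seq) {
			best = e
		}
	}
	if best == nil {
		return "", false
	}
	return best.value, true
}

func main() {
	s := newSeqStore()
	s.Put("k", "v1")
	snap := s.Snapshot()
	s.Put("k", "v2") // write after the snapshot
	v, _ := s.GetAt("k", snap)
	fmt.Println(v) // v1
}
```

Taking a snapshot is cheap because nothing is copied; the cost comes later, since SSTables holding versions the snapshot can still see must be retained until it is released.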
Performance Tuning
Block Cache
Configuration
Shared across all stores in a node. Default: 25% of system memory.
Benefits
- Reduces read amplification
- Caches frequently accessed blocks
- Shared across all ranges in stores
- LRU eviction policy
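The LRU eviction policy can be sketched with a minimal cache: a hit moves the block to the front of a list, and when the cache is full the block at the back (least recently used) is evicted. This is an illustration of the policy, not RocksDB's actual block cache, which is sharded and size-aware.

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache tracks which cached blocks were used most recently.
type lruCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // block id -> list element (element value is the id)
}

func newLRUCache(cap int) *lruCache {
	return &lruCache{cap: cap, order: list.New(), items: map[string]*list.Element{}}
}

// Get reports whether the block is cached, refreshing its recency on a hit.
func (c *lruCache) Get(key string) bool {
	el, ok := c.items[key]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// Add inserts a block, evicting the least recently used block when full.
func (c *lruCache) Add(key string) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		back := c.order.Back()
		c.order.Remove(back)
		delete(c.items, back.Value.(string))
	}
	c.items[key] = c.order.PushFront(key)
}

func main() {
	c := newLRUCache(2)
	c.Add("block1")
	c.Add("block2")
	c.Get("block1") // touch block1 so block2 becomes least recently used
	c.Add("block3") // evicts block2
	fmt.Println(c.Get("block1"), c.Get("block2")) // true false
}
```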
Write Buffering
MemTable Size: Larger MemTables reduce flush frequency but use more memory.
Write Buffer Count: Multiple MemTables allow concurrent flushing.
Trade-off: Memory usage vs. write amplification.

Compaction Threads
Compactions run on a background thread pool; more threads help compaction keep up with heavy write loads, at the cost of CPU and I/O bandwidth.
Storage Metrics
Key metrics exposed:

Disk Usage
Total bytes stored per store
Compaction Stats
Bytes read/written during compaction
Read/Write Throughput
Operations per second and bytes/sec
LSM Health
Number of SSTables per level
These metrics are available via:

- CockroachDB Admin UI
- Prometheus metrics endpoint
- The `SHOW RANGES` SQL command
Backup and Restore
The storage layer supports efficient backups. A full backup is consistent at the MVCC timestamp level, ensuring transactional consistency without stopping writes.
Encryption at Rest
CockroachDB can encrypt stored data:

- Custom RocksDB Env
- Encrypts before writing to disk
- Decrypts when reading
- Key management via external KMS
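The encrypt-before-write, decrypt-on-read idea can be sketched with a symmetric stream cipher. This is a toy model of what an encrypting Env does, not CockroachDB's implementation: AES-CTR applied to the bytes on their way to and from disk, with the demo key and nonce standing in for material that would really come from the KMS.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// xorStream applies AES-CTR to data. CTR mode is symmetric: applying it
// twice with the same key and nonce returns the original bytes, so the
// same function serves as both encrypt (on write) and decrypt (on read).
func xorStream(key, nonce, data []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	out := make([]byte, len(data))
	cipher.NewCTR(block, nonce).XORKeyStream(out, data)
	return out
}

func main() {
	key := make([]byte, 16)   // demo key; real keys come from the external KMS
	nonce := make([]byte, 16) // per-file nonce/IV
	plain := []byte("sstable bytes")
	enc := xorStream(key, nonce, plain) // what hits the disk
	dec := xorStream(key, nonce, enc)   // what the reader sees
	fmt.Println(string(dec)) // sstable bytes
}
```

Doing this at the Env layer means RocksDB itself is unaware of encryption: every file it writes is transformed transparently on the way to disk.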
Implementation Files
Key source locations:

Core Storage:
- `pkg/storage/engine.go` - Engine interface
- `pkg/storage/batch.go` - Batch operations
- `pkg/storage/pebble.go` - Pebble implementation (RocksDB alternative)
- `pkg/storage/mvcc.go` - MVCC operations
- `pkg/storage/engine_key.go` - MVCC key encoding

RocksDB Integration:
- `c-deps/` - RocksDB C++ library
- `pkg/storage/` - Go wrapper and integration
Further Reading

- Replication Layer: How Raft uses storage
- Transaction Layer: MVCC and transactions