RocksDB is a high-performance embedded key-value store optimized for fast storage, particularly flash drives. Built on LevelDB, it implements a Log-Structured Merge (LSM) tree design that provides excellent write performance and configurable trade-offs between write, read, and space amplification.

Core Components

RocksDB’s architecture consists of several key components working together:

MemTable

The MemTable is an in-memory data structure that holds the most recent writes. When you write data to RocksDB, it first goes into the MemTable:
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;
options.write_buffer_size = 64 << 20; // 64MB MemTable

rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);

// Writes go to MemTable first
db->Put(rocksdb::WriteOptions(), "key1", "value1");
MemTables are backed by a Write-Ahead Log (WAL) to ensure durability. If the process crashes, writes can be recovered from the WAL.
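Reads consult the MemTable before any on-disk data, so a write is visible to subsequent reads immediately. Continuing the example above (a sketch, reusing the `db` handle and the `key1` write from the previous block):

```cpp
#include <rocksdb/db.h>

// Get() checks the active MemTable (and any immutable MemTables
// awaiting flush) before falling back to SST files on disk.
std::string value;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "key1", &value);
if (s.ok()) {
  // value == "value1", served from the MemTable if not yet flushed
} else if (s.IsNotFound()) {
  // Key is in neither the MemTables nor any SST file
}
```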

SST Files (Sorted String Tables)

When a MemTable reaches its size limit, it’s flushed to disk as an immutable SST file. SST files are organized into levels:
  • Level 0 (L0): Contains SST files produced by recently flushed MemTables. Files may have overlapping key ranges.
  • Level 1-N: Organized files with non-overlapping key ranges within each level.
// Configure level structure
options.max_bytes_for_level_base = 256 << 20; // 256MB for L1
options.max_bytes_for_level_multiplier = 10;  // Each level is 10x larger

Block Cache

The block cache stores frequently accessed data blocks from SST files in memory, improving read performance:
#include <rocksdb/cache.h>

// Create a 512MB LRU cache
std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(512 << 20);

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;

options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
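The cache object created above can also be inspected at runtime, which is useful when tuning its capacity (a sketch reusing the `cache` variable from the previous block):

```cpp
// GetUsage() reports the memory currently held by cached blocks;
// GetCapacity() reports the configured limit (512MB here).
size_t used = cache->GetUsage();
size_t capacity = cache->GetCapacity();
```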

LSM Tree Design

RocksDB uses an LSM tree structure that optimizes for write performance:
  1. Write to WAL for durability
  2. Insert into MemTable
  3. Return success to client
  4. Background flush when MemTable is full
  5. Background compaction to organize data
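Steps 1-3 happen synchronously on the writing thread for every `Write()` call, so grouping several updates into one atomic batch amortizes the WAL append. A sketch, assuming an open `db` handle:

```cpp
#include <rocksdb/write_batch.h>

// All operations in the batch hit the WAL in a single append
// and are applied to the MemTable atomically.
rocksdb::WriteBatch batch;
batch.Put("key1", "value1");
batch.Put("key2", "value2");
batch.Delete("old_key");
rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
```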

Amplification Trade-offs

RocksDB provides flexible trade-offs between three amplification factors:
Write Amplification

The ratio of bytes written to storage versus bytes written by the application. Compaction rewrites data multiple times as it moves through levels. Typical values: 10-30x for level compaction, 2-5x for universal compaction.

Read Amplification

The number of disk reads required to satisfy a query. More levels and overlapping files increase read amplification. Typical values: 1-10 disk reads depending on configuration.

Space Amplification

The ratio of database size on disk versus actual data size. Obsolete data isn't removed until compaction rewrites the files containing it. Typical values: 1.1-2x depending on compaction strategy.

Multi-threaded Compaction

RocksDB supports multi-threaded compactions for improved performance with large datasets:
// Enable parallel compactions (deprecated; superseded by max_background_jobs)
options.max_background_compactions = 4;
options.max_background_flushes = 2;

// Modern unified thread pool (recommended)
options.max_background_jobs = 6; // Total background threads
For databases storing multiple terabytes, multi-threaded compaction is essential for keeping up with write throughput and maintaining read performance.
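When background compaction threads compete with foreground reads for disk bandwidth, their I/O can be capped with a rate limiter (the 100MB/s figure here is an illustrative value, not a recommendation):

```cpp
#include <rocksdb/rate_limiter.h>

// Cap total background write I/O (flushes + compactions) so
// compaction bursts don't starve foreground reads.
options.rate_limiter.reset(
    rocksdb::NewGenericRateLimiter(100 << 20));  // 100MB/s
```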

Column Families

RocksDB supports multiple column families within a single database, each with independent configuration:
#include <rocksdb/db.h>
#include <vector>

rocksdb::DB* db;
std::vector<rocksdb::ColumnFamilyHandle*> handles;

// Open with multiple column families
std::vector<rocksdb::ColumnFamilyDescriptor> column_families;
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()));
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    "my_cf", rocksdb::ColumnFamilyOptions()));

rocksdb::Status s = rocksdb::DB::Open(
    rocksdb::DBOptions(), "/tmp/testdb", column_families, &handles, &db);
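Once open, each returned handle addresses its column family; `handles[i]` corresponds to `column_families[i]` in the order passed to `Open`. A sketch continuing the example above:

```cpp
// handles[1] is "my_cf" from the descriptors above
db->Put(rocksdb::WriteOptions(), handles[1], "key", "value");

std::string value;
db->Get(rocksdb::ReadOptions(), handles[1], "key", &value);

// Handles must be destroyed before the DB object is deleted
for (auto* handle : handles) {
  db->DestroyColumnFamilyHandle(handle);
}
delete db;
```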

Durability and Recovery

RocksDB ensures durability through the Write-Ahead Log:
rocksdb::WriteOptions write_options;

// Wait for WAL sync (durable write)
write_options.sync = true;
db->Put(write_options, "key", "value");

// Fast write (may lose data on crash)
write_options.sync = false;
db->Put(write_options, "key", "value");
On database restart, RocksDB replays the WAL to recover any writes that were in MemTables but not yet flushed to SST files.
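A flush can also be requested explicitly; once the MemTable's contents are persisted as an SST file, the corresponding WAL segments are no longer needed for recovery. A sketch, assuming an open `db` handle:

```cpp
// Synchronously persist the current MemTable to an SST file,
// shrinking the amount of WAL that must be replayed after a crash.
rocksdb::Status s = db->Flush(rocksdb::FlushOptions());
```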

Performance Characteristics

RocksDB is optimized for:
  • High write throughput: Sequential writes to WAL and MemTable
  • Fast point lookups: Bloom filters and block cache
  • Efficient range scans: Sorted data within SST files
  • Large datasets: Multi-threaded compaction handles terabytes of data
  • Flash storage: Designed for SSDs and NVMe drives
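The bloom filters behind the fast point lookups are opt-in per table factory. A sketch extending the `table_options` from the Block Cache section above (10 bits per key is a common starting point):

```cpp
#include <rocksdb/filter_policy.h>

// With ~10 bits per key, the filter lets reads skip most SST
// files that cannot contain the key, without touching disk.
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
```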

Next Steps

LSM Tree Design

Deep dive into how LSM trees work

Column Families

Learn about organizing data with column families

Compaction

Understand compaction strategies

Write-Ahead Log

Explore WAL for durability
