RocksDB is a high-performance embedded key-value store optimized for fast storage, particularly flash drives. Built on LevelDB, it implements a Log-Structured Merge (LSM) tree design that provides excellent write performance and configurable trade-offs between write, read, and space amplification.

Core Components

RocksDB’s architecture consists of several key components working together:

MemTable

The MemTable is an in-memory data structure that holds the most recent writes. When you write data to RocksDB, it first goes into the MemTable:
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;
options.write_buffer_size = 64 << 20; // 64MB MemTable

rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);

// Writes go to MemTable first
db->Put(rocksdb::WriteOptions(), "key1", "value1");
MemTables are backed by a Write-Ahead Log (WAL) to ensure durability. If the process crashes, writes can be recovered from the WAL.
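Reads consult the MemTable before any on-disk data, so a write is visible to subsequent reads immediately. Continuing the example above (a sketch, reusing the `db` handle and the `key1` write from the previous block):

```cpp
#include <rocksdb/db.h>

// Get() checks the active MemTable (and any immutable MemTables
// awaiting flush) before falling back to SST files on disk.
std::string value;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "key1", &value);
if (s.ok()) {
  // value == "value1", served from the MemTable if not yet flushed
} else if (s.IsNotFound()) {
  // Key is in neither the MemTables nor any SST file
}
```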

SST Files (Sorted String Tables)

When a MemTable reaches its size limit, it’s flushed to disk as an immutable SST file. SST files are organized into levels:
  • Level 0 (L0): Contains SST files produced by recently flushed MemTables. Files may have overlapping key ranges.
  • Level 1-N: Organized files with non-overlapping key ranges within each level.
// Configure level structure
options.max_bytes_for_level_base = 256 << 20; // 256MB for L1
options.max_bytes_for_level_multiplier = 10;  // Each level is 10x larger

Block Cache

The block cache stores frequently accessed data blocks from SST files in memory, improving read performance:
#include <rocksdb/cache.h>

// Create a 512MB LRU cache
std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(512 << 20);

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;

options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
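The cache object created above can also be inspected at runtime, which is useful when tuning its capacity (a sketch reusing the `cache` variable from the previous block):

```cpp
// GetUsage() reports the memory currently held by cached blocks;
// GetCapacity() reports the configured limit (512MB here).
size_t used = cache->GetUsage();
size_t capacity = cache->GetCapacity();
```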

LSM Tree Design

RocksDB uses an LSM tree structure that optimizes for write performance:
  1. Write to WAL for durability
  2. Insert into MemTable
  3. Return success to client
  4. Background flush when MemTable is full
  5. Background compaction to organize data
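Steps 1-3 happen synchronously on the writing thread for every `Write()` call, so grouping several updates into one atomic batch amortizes the WAL append. A sketch, assuming an open `db` handle:

```cpp
#include <rocksdb/write_batch.h>

// All operations in the batch hit the WAL in a single append
// and are applied to the MemTable atomically.
rocksdb::WriteBatch batch;
batch.Put("key1", "value1");
batch.Put("key2", "value2");
batch.Delete("old_key");
rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
```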

Amplification Trade-offs

RocksDB provides flexible trade-offs between three amplification factors:
Write Amplification

The ratio of bytes written to storage versus bytes written by the application. Compaction rewrites data multiple times as it moves through levels. Typical values: 10-30x for level compaction, 2-5x for universal compaction.

Read Amplification

The number of disk reads required to satisfy a query. More levels and overlapping files increase read amplification. Typical values: 1-10 disk reads depending on configuration.

Space Amplification

The ratio of database size on disk versus actual data size. Obsolete data isn't removed until compaction rewrites the files containing it. Typical values: 1.1-2x depending on compaction strategy.

Multi-threaded Compaction

RocksDB supports multi-threaded compactions for improved performance with large datasets:
// Enable parallel compactions (deprecated; superseded by max_background_jobs)
options.max_background_compactions = 4;
options.max_background_flushes = 2;

// Modern unified thread pool (recommended)
options.max_background_jobs = 6; // Total background threads
For databases storing multiple terabytes, multi-threaded compaction is essential for keeping up with write throughput and maintaining read performance.
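When background compaction threads compete with foreground reads for disk bandwidth, their I/O can be capped with a rate limiter (the 100MB/s figure here is an illustrative value, not a recommendation):

```cpp
#include <rocksdb/rate_limiter.h>

// Cap total background write I/O (flushes + compactions) so
// compaction bursts don't starve foreground reads.
options.rate_limiter.reset(
    rocksdb::NewGenericRateLimiter(100 << 20));  // 100MB/s
```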

Column Families

RocksDB supports multiple column families within a single database, each with independent configuration:
#include <rocksdb/db.h>
#include <vector>

rocksdb::DB* db;
std::vector<rocksdb::ColumnFamilyHandle*> handles;

// Open with multiple column families
std::vector<rocksdb::ColumnFamilyDescriptor> column_families;
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()));
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    "my_cf", rocksdb::ColumnFamilyOptions()));

rocksdb::Status s = rocksdb::DB::Open(
    rocksdb::DBOptions(), "/tmp/testdb", column_families, &handles, &db);
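Once open, each returned handle addresses its column family; `handles[i]` corresponds to `column_families[i]` in the order passed to `Open`. A sketch continuing the example above:

```cpp
// handles[1] is "my_cf" from the descriptors above
db->Put(rocksdb::WriteOptions(), handles[1], "key", "value");

std::string value;
db->Get(rocksdb::ReadOptions(), handles[1], "key", &value);

// Handles must be destroyed before the DB object is deleted
for (auto* handle : handles) {
  db->DestroyColumnFamilyHandle(handle);
}
delete db;
```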

Durability and Recovery

RocksDB ensures durability through the Write-Ahead Log:
rocksdb::WriteOptions write_options;

// Wait for WAL sync (durable write)
write_options.sync = true;
db->Put(write_options, "key", "value");

// Fast write (may lose data on crash)
write_options.sync = false;
db->Put(write_options, "key", "value");
On database restart, RocksDB replays the WAL to recover any writes that were in MemTables but not yet flushed to SST files.
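A flush can also be requested explicitly; once the MemTable's contents are persisted as an SST file, the corresponding WAL segments are no longer needed for recovery. A sketch, assuming an open `db` handle:

```cpp
// Synchronously persist the current MemTable to an SST file,
// shrinking the amount of WAL that must be replayed after a crash.
rocksdb::Status s = db->Flush(rocksdb::FlushOptions());
```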

Performance Characteristics

RocksDB is optimized for:
  • High write throughput: Sequential writes to WAL and MemTable
  • Fast point lookups: Bloom filters and block cache
  • Efficient range scans: Sorted data within SST files
  • Large datasets: Multi-threaded compaction handles terabytes of data
  • Flash storage: Designed for SSDs and NVMe drives
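The bloom filters behind the fast point lookups are opt-in per table factory. A sketch extending the `table_options` from the Block Cache section above (10 bits per key is a common starting point):

```cpp
#include <rocksdb/filter_policy.h>

// With ~10 bits per key, the filter lets reads skip most SST
// files that cannot contain the key, without touching disk.
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
```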

Next Steps

LSM Tree Design

Deep dive into how LSM trees work

Column Families

Learn about organizing data with column families

Compaction

Understand compaction strategies

Write-Ahead Log

Explore WAL for durability
