The Log-Structured Merge (LSM) tree is the foundational data structure behind RocksDB’s high performance. It optimizes for write-heavy workloads by batching writes in memory and organizing data hierarchically on disk.

What is an LSM Tree?

An LSM tree is a data structure that maintains key-value pairs across multiple levels of storage:
  • Memory (MemTable): Active write buffer
  • Disk (SST Files): Immutable sorted files organized in levels
Unlike B-trees, which update pages in place, LSM trees append writes and periodically merge sorted runs. This design makes writes much faster but requires background compaction.

Write Path

When data is written to RocksDB, it follows this path:
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;

// Configure MemTable size
options.write_buffer_size = 64 << 20; // 64MB
options.max_write_buffer_number = 3;  // Max 3 MemTables in memory

rocksdb::Status open_status = rocksdb::DB::Open(options, "/tmp/testdb", &db);
if (!open_status.ok()) { /* handle error */ }

// Write operation
rocksdb::WriteOptions write_opts;
rocksdb::Status s = db->Put(write_opts, "user:1001", "{\"name\":\"Alice\"}");

Step-by-Step Write Process

First, the write is appended to the Write-Ahead Log (WAL) for durability:
rocksdb::WriteOptions write_opts;
write_opts.sync = true; // Force fsync for durability
write_opts.disableWAL = false; // Ensure WAL is enabled

db->Put(write_opts, "key", "value");
The WAL ensures data isn’t lost if the process crashes before MemTable is flushed.

Level Structure

RocksDB organizes SST files into multiple levels, each with specific characteristics:

Level 0 (L0)

L0 contains SST files recently flushed from MemTables:
  • Files may have overlapping key ranges
  • Read queries must check all L0 files
  • Controlled by level0_file_num_compaction_trigger
// Configure L0 behavior
options.level0_file_num_compaction_trigger = 4; // Compact when 4 L0 files
options.level0_slowdown_writes_trigger = 20;    // Slow writes at 20 files
options.level0_stop_writes_trigger = 36;        // Stop writes at 36 files
Too many L0 files degrade read performance, since every L0 file must be checked on each read; RocksDB will slow or stop writes to let compaction catch up.

Levels 1 through N

In levels L1 and below, files within a level have non-overlapping key ranges:
// Configure level sizes
options.max_bytes_for_level_base = 256 << 20;    // 256MB for L1
options.max_bytes_for_level_multiplier = 10;     // Each level 10x larger

// L1: 256MB
// L2: 2.56GB
// L3: 25.6GB
// L4: 256GB
// etc.
  • Non-overlapping: Each key exists in at most one file per level
  • Sorted: Files are ordered by key range
  • Binary search: Can quickly locate the correct file for a key
  • Compaction targets: Files are compacted from Li to Li+1

Read Path

Reading data requires checking multiple locations:
std::string value;
rocksdb::ReadOptions read_opts;
rocksdb::Status s = db->Get(read_opts, "user:1001", &value);

if (s.ok()) {
  // Found the value
} else if (s.IsNotFound()) {
  // Key doesn't exist
}

Read Search Order

  1. Active MemTable - Check current write buffer (O(log n))
  2. Immutable MemTables - Check MemTables being flushed
  3. Block Cache - Check if block is cached in memory
  4. L0 SST Files - Check all L0 files (newest to oldest)
  5. L1-LN SST Files - Binary search to find correct file per level
// Enable block cache for faster reads
std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(512 << 20);

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;

// Enable bloom filters to skip files
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
Bloom filters can quickly determine if a key is NOT in an SST file, avoiding expensive disk reads. With a 10-bit filter, false positive rate is about 1%.

Range Scans

LSM trees efficiently support range queries:
// Create an iterator
rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());

// Range scan: all users from 1000-2000
for (it->Seek("user:1000"); it->Valid(); it->Next()) {
  rocksdb::Slice key = it->key();
  if (key.compare("user:2000") > 0) break;
  
  std::cout << key.ToString() << ": " << it->value().ToString() << std::endl;
}
// Check iterator status after the loop to detect I/O errors
if (!it->status().ok()) { /* handle error */ }

delete it;
Iterators merge-sort data from all levels, providing a consistent view of the database.

Merge Algorithms

RocksDB uses efficient merge algorithms to combine data from multiple levels. A point lookup searches from newest to oldest data:
  1. MemTable
  2. L0 (all files, newest first)
  3. L1 (binary search to file)
  4. L2 (binary search to file)
The search stops at the first occurrence of the key, so newer values shadow older ones.

Optimizing LSM Performance

Write Optimization

// Larger MemTables reduce flush frequency
options.write_buffer_size = 128 << 20; // 128MB

// More MemTables reduce write stalls
options.max_write_buffer_number = 4;

// Increase L1 size to reduce compaction from L0
options.max_bytes_for_level_base = 512 << 20; // 512MB

Read Optimization

// Larger block cache improves read performance
std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(2ULL << 30); // 2GB (use ULL to avoid int overflow)

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;

// Bloom filters reduce disk reads
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

// Index blocks in cache for faster lookup
table_options.cache_index_and_filter_blocks = true;

Space Optimization

// Aggressive compaction reduces space amplification
options.max_bytes_for_level_multiplier = 8; // Smaller multiplier

// Enable compression
options.compression = rocksdb::kSnappyCompression;

// Use different compression for bottom level
options.bottommost_compression = rocksdb::kZSTD;

LSM vs B-Tree

| Characteristic      | LSM Tree                    | B-Tree                     |
|---------------------|-----------------------------|----------------------------|
| Write Performance   | Excellent (sequential)      | Good (random)              |
| Read Performance    | Good (with optimization)    | Excellent                  |
| Write Amplification | Higher (10-30x)             | Lower (2-5x)               |
| Space Amplification | Higher (1.3-2x)             | Lower (1.1-1.3x)           |
| Compaction          | Required                    | Not needed                 |
| Best For            | Write-heavy, large datasets | Read-heavy, small datasets |
LSM trees excel when write throughput is critical and you can tolerate background compaction. B-trees are better for read-heavy workloads or when consistent low latency is required.

Next Steps

Compaction

Learn how compaction optimizes LSM structure

Architecture

Understand overall RocksDB architecture

Write-Ahead Log

Deep dive into WAL for durability

Tuning Guide

Optimize LSM for your workload
