The Log-Structured Merge (LSM) tree is the foundational data structure behind RocksDB’s high performance. It optimizes for write-heavy workloads by batching writes in memory and organizing data hierarchically on disk.

What is an LSM Tree?

An LSM tree is a data structure that maintains key-value pairs across multiple levels of storage:
  • Memory (MemTable): Active write buffer
  • Disk (SST Files): Immutable sorted files organized in levels
Unlike B-trees, which update pages in place, LSM trees append writes and periodically merge sorted runs. This design makes writes much faster but requires background compaction.

Write Path

When data is written to RocksDB, it follows this path:
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;

// Configure MemTable size
options.write_buffer_size = 64 << 20; // 64MB
options.max_write_buffer_number = 3;  // Max 3 MemTables in memory

rocksdb::Status open_status = rocksdb::DB::Open(options, "/tmp/testdb", &db);
if (!open_status.ok()) { /* handle error */ }

// Write operation
rocksdb::WriteOptions write_opts;
rocksdb::Status s = db->Put(write_opts, "user:1001", "{\"name\":\"Alice\"}");

Step-by-Step Write Process

First, the write is appended to the Write-Ahead Log (WAL) for durability:
rocksdb::WriteOptions write_opts;
write_opts.sync = true; // Force fsync for durability
write_opts.disableWAL = false; // Ensure WAL is enabled

db->Put(write_opts, "key", "value");
The WAL ensures data isn’t lost if the process crashes before MemTable is flushed.

Level Structure

RocksDB organizes SST files into multiple levels, each with specific characteristics:

Level 0 (L0)

L0 contains SST files recently flushed from MemTables:
  • Files may have overlapping key ranges
  • Read queries must check all L0 files
  • Controlled by level0_file_num_compaction_trigger
// Configure L0 behavior
options.level0_file_num_compaction_trigger = 4; // Compact when 4 L0 files
options.level0_slowdown_writes_trigger = 20;    // Slow writes at 20 files
options.level0_stop_writes_trigger = 36;        // Stop writes at 36 files
Too many L0 files degrade read performance, since every L0 file must be checked on each read; RocksDB will slow or stop writes to let compaction catch up.

Levels 1 through N

In levels L1 and below, files within a level have non-overlapping key ranges:
// Configure level sizes
options.max_bytes_for_level_base = 256 << 20;    // 256MB for L1
options.max_bytes_for_level_multiplier = 10;     // Each level 10x larger

// L1: 256MB
// L2: 2.56GB
// L3: 25.6GB
// L4: 256GB
// etc.
  • Non-overlapping: Each key exists in at most one file per level
  • Sorted: Files are ordered by key range
  • Binary search: Can quickly locate the correct file for a key
  • Compaction targets: Files are compacted from Li to Li+1

Read Path

Reading data requires checking multiple locations:
std::string value;
rocksdb::ReadOptions read_opts;
rocksdb::Status s = db->Get(read_opts, "user:1001", &value);

if (s.ok()) {
  // Found the value
} else if (s.IsNotFound()) {
  // Key doesn't exist
}

Read Search Order

  1. Active MemTable - Check current write buffer (O(log n))
  2. Immutable MemTables - Check MemTables being flushed
  3. Block Cache - Check if block is cached in memory
  4. L0 SST Files - Check all L0 files (newest to oldest)
  5. L1-LN SST Files - Binary search to find correct file per level
// Enable block cache for faster reads
std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(512 << 20);

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;

// Enable bloom filters to skip files
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
Bloom filters can quickly determine if a key is NOT in an SST file, avoiding expensive disk reads. With a 10-bit filter, false positive rate is about 1%.

Range Scans

LSM trees efficiently support range queries:
// Create an iterator
rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());

// Range scan: all users from 1000-2000
for (it->Seek("user:1000"); it->Valid(); it->Next()) {
  rocksdb::Slice key = it->key();
  if (key.compare("user:2000") > 0) break;
  
  std::cout << key.ToString() << ": " << it->value().ToString() << std::endl;
}
// Check iterator status after the loop to detect I/O errors
if (!it->status().ok()) { /* handle error */ }

delete it;
Iterators merge-sort data from all levels, providing a consistent view of the database.

Merge Algorithms

RocksDB uses efficient merge algorithms to combine data from multiple levels. A point lookup searches from newest to oldest data:
  1. MemTable
  2. L0 (all files, newest first)
  3. L1 (binary search to file)
  4. L2 (binary search to file)
The search stops at the first occurrence of the key, so newer values shadow older ones.

Optimizing LSM Performance

Write Optimization

// Larger MemTables reduce flush frequency
options.write_buffer_size = 128 << 20; // 128MB

// More MemTables reduce write stalls
options.max_write_buffer_number = 4;

// Increase L1 size to reduce compaction from L0
options.max_bytes_for_level_base = 512 << 20; // 512MB

Read Optimization

// Larger block cache improves read performance
std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(2ULL << 30); // 2GB (use ULL to avoid int overflow)

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;

// Bloom filters reduce disk reads
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

// Index blocks in cache for faster lookup
table_options.cache_index_and_filter_blocks = true;

Space Optimization

// Aggressive compaction reduces space amplification
options.max_bytes_for_level_multiplier = 8; // Smaller multiplier

// Enable compression
options.compression = rocksdb::kSnappyCompression;

// Use different compression for bottom level
options.bottommost_compression = rocksdb::kZSTD;

LSM vs B-Tree

| Characteristic      | LSM Tree                    | B-Tree                     |
|---------------------|-----------------------------|----------------------------|
| Write Performance   | Excellent (sequential)      | Good (random)              |
| Read Performance    | Good (with optimization)    | Excellent                  |
| Write Amplification | Higher (10-30x)             | Lower (2-5x)               |
| Space Amplification | Higher (1.3-2x)             | Lower (1.1-1.3x)           |
| Compaction          | Required                    | Not needed                 |
| Best For            | Write-heavy, large datasets | Read-heavy, small datasets |
LSM trees excel when write throughput is critical and you can tolerate background compaction. B-trees are better for read-heavy workloads or when consistent low latency is required.

Next Steps

Compaction

Learn how compaction optimizes LSM structure

Architecture

Understand overall RocksDB architecture

Write-Ahead Log

Deep dive into WAL for durability

Tuning Guide

Optimize LSM for your workload
