Deep dive into Log-Structured Merge tree architecture in RocksDB
The Log-Structured Merge (LSM) tree is the foundational data structure behind RocksDB’s high performance. It optimizes for write-heavy workloads by batching writes in memory and organizing data hierarchically on disk.
An LSM tree is a data structure that maintains key-value pairs across multiple levels of storage:
Memory (MemTable): Active write buffer
Disk (SST Files): Immutable sorted files organized in levels
Unlike B-trees which update in place, LSM trees append writes and periodically merge sorted runs. This design makes writes much faster but requires background compaction.
First, the write is appended to the Write-Ahead Log (WAL) for durability:
rocksdb::WriteOptions write_opts;write_opts.sync = true; // Force fsync for durabilitywrite_opts.disableWAL = false; // Ensure WAL is enableddb->Put(write_opts, "key", "value");
The WAL ensures data isn’t lost if the process crashes before MemTable is flushed.
Next, the key-value pair is inserted into the active MemTable:
// MemTable keeps data sorted by key// Uses skip list by default for O(log n) insertsoptions.memtable_factory.reset(rocksdb::NewSkipListFactory());// Alternative: Hash skip list for point lookups// options.memtable_factory.reset(// rocksdb::NewHashSkipListRepFactory());
The MemTable maintains sorted order for efficient iteration.
When the MemTable reaches its size limit, it becomes immutable and is flushed to disk as an L0 SST file:
// Trigger flush when MemTable is fulloptions.write_buffer_size = 64 << 20;// Maximum immutable MemTables before blocking writesoptions.max_write_buffer_number = 3;// Manual flushrocksdb::FlushOptions flush_opts;flush_opts.wait = true;db->Flush(flush_opts);
Bloom filters can quickly determine if a key is NOT in an SST file, avoiding expensive disk reads. With a 10-bit filter, false positive rate is about 1%.
LSM trees excel when write throughput is critical and you can tolerate background compaction. B-trees are better for read-heavy workloads or when consistent low latency is required.