Core Components
RocksDB’s architecture consists of several key components working together.
MemTable
The MemTable is an in-memory data structure that holds the most recent writes. When you write data to RocksDB, it lands in the MemTable before it reaches any SST file.
MemTables are backed by a Write-Ahead Log (WAL) to ensure durability. If the process crashes, writes that were only in the MemTable can be recovered from the WAL.
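The relationship between the MemTable and the WAL can be illustrated with a toy model. This is a minimal sketch in plain Python, not RocksDB's API: the `MemTable` class, its `wal` list, and its method names are hypothetical stand-ins chosen for illustration.

```python
class MemTable:
    """Toy MemTable (hypothetical): a dict plus a list standing in for the WAL."""

    def __init__(self):
        self.wal = []    # stand-in for the on-disk Write-Ahead Log
        self.data = {}   # in-memory key -> value map

    def put(self, key, value):
        self.wal.append((key, value))  # log the write first, for durability
        self.data[key] = value         # then apply it in memory

    def get(self, key):
        return self.data.get(key)

    def sorted_items(self):
        # Entries come out in key order, as they would when flushed to an SST file.
        return sorted(self.data.items())


mt = MemTable()
mt.put("b", "2")
mt.put("a", "1")
mt.put("b", "3")  # overwrites the earlier value for "b"
```

Every `put` is logged, including overwrites, while the in-memory map keeps only the latest value per key.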
SST Files (Sorted String Tables)
When a MemTable reaches its size limit, it’s flushed to disk as an immutable SST file. SST files are organized into levels:
- Level 0 (L0): Contains recently flushed MemTables. Files may have overlapping key ranges.
- Level 1-N: Organized files with non-overlapping key ranges within each level.
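The difference between L0 and the deeper levels shows up in how a point lookup searches them. The sketch below is an illustrative model only: `make_file` and `lookup` are hypothetical helpers, each file is modeled as a tuple of smallest key, largest key, and a dict, and deletions (tombstones) are ignored.

```python
from bisect import bisect_right


def make_file(items):
    """Model an SST file as (smallest_key, largest_key, {key: value})."""
    data = dict(items)
    return (min(data), max(data), data)


def lookup(key, l0_files, levels):
    # L0: key ranges may overlap, so every file is a candidate (newest first).
    for smallest, largest, data in l0_files:
        if smallest <= key <= largest and key in data:
            return data[key]
    # L1..N: files are sorted and non-overlapping, so a binary search
    # over the smallest keys finds the only file that could hold the key.
    for files in levels:
        starts = [f[0] for f in files]
        i = bisect_right(starts, key) - 1
        if i >= 0:
            smallest, largest, data = files[i]
            if key <= largest and key in data:
                return data[key]
    return None


# L0 files listed newest first; both cover the key "a".
l0 = [make_file([("a", "new"), ("m", "x")]), make_file([("a", "old"), ("z", "y")])]
l1 = [make_file([("a", "oldest"), ("c", "1")]), make_file([("k", "2"), ("p", "3")])]
```

Searching L0 costs one check per file, while each deeper level costs at most one file, which is why a large number of L0 files tends to hurt read performance.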
Block Cache
The block cache stores frequently accessed data blocks from SST files in memory, improving read performance.
LSM Tree Design
RocksDB uses an LSM tree structure that optimizes for write performance. The write path proceeds in this order:
- Write to WAL for durability
- Insert into MemTable
- Return success to client
- Background flush when MemTable is full
- Background compaction to organize data
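The write path steps above can be sketched as a small in-memory model. This is an assumption-laden illustration, not RocksDB code: `TinyLSM` and its fields are hypothetical, the MemTable limit is counted in entries rather than bytes, and flush and compaction run synchronously instead of in the background.

```python
class TinyLSM:
    """Toy LSM tree (hypothetical) showing WAL, MemTable, flush, and compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.wal = []       # stand-in for the Write-Ahead Log
        self.memtable = {}
        self.l0 = []        # flushed sorted runs, newest first
        self.l1 = []        # one compacted sorted run

    def put(self, key, value):
        self.wal.append((key, value))   # 1. write to WAL
        self.memtable[key] = value      # 2. insert into MemTable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                # 4. flush when full (synchronous here)
        return True                     # 3. report success to the caller

    def flush(self):
        self.l0.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                   # flushed data no longer needs the log

    def compact(self):
        merged = dict(self.l1)
        for run in reversed(self.l0):   # oldest first, so newer values win
            merged.update(run)
        self.l1 = sorted(merged.items())
        self.l0 = []


db = TinyLSM(memtable_limit=2)
for k, v in [("b", "1"), ("a", "1"), ("b", "2"), ("c", "1")]:
    db.put(k, v)
l0_runs_before_compaction = len(db.l0)
db.compact()
```

After compaction the two overlapping L0 runs collapse into one sorted run in which only the newest value for each key survives.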
Amplification Trade-offs
RocksDB provides flexible trade-offs between three amplification factors:
Write Amplification Factor (WAF)
The ratio of bytes written to storage versus bytes written by the application. Compaction rewrites data multiple times as it moves through levels.
Typical values: 10-30x for leveled compaction, 2-5x for universal compaction.
Read Amplification Factor (RAF)
The number of disk reads required to satisfy a query. More levels and overlapping files increase read amplification.
Typical values: 1-10 disk reads, depending on configuration.
Space Amplification Factor (SAF)
The ratio of database size on disk versus actual data size. Obsolete data stays on disk until compaction removes it.
Typical values: 1.1-2x, depending on compaction strategy.
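The three factors defined above are simple ratios and counts, so a short calculation may help make them concrete. The function names and the sample numbers below are hypothetical. The read estimate is a common worst-case approximation for a point lookup with no Bloom filter or cache hit (every L0 file plus one file per deeper level), not a figure reported by RocksDB.

```python
def write_amplification(bytes_written_to_storage, bytes_written_by_app):
    """WAF: bytes written to storage per byte written by the application."""
    return bytes_written_to_storage / bytes_written_by_app


def space_amplification(db_size_on_disk, live_data_size):
    """SAF: on-disk size relative to the size of the live data."""
    return db_size_on_disk / live_data_size


def worst_case_point_reads(l0_file_count, nonempty_deeper_levels):
    """Rough RAF bound: each L0 file may be read, plus one file per deeper level."""
    return l0_file_count + nonempty_deeper_levels


# Hypothetical workload: the application wrote 10 GB, storage absorbed 150 GB.
waf = write_amplification(150, 10)
# Hypothetical database: 120 GB on disk holding 100 GB of live data.
saf = space_amplification(120, 100)
# Hypothetical shape: 4 files in L0 and 5 non-empty deeper levels.
raf = worst_case_point_reads(4, 5)
```

With these sample inputs the results fall inside the typical ranges quoted above: a WAF of 15, an SAF of 1.2, and up to 9 disk reads.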
Multi-threaded Compaction
RocksDB supports multi-threaded compactions for improved performance with large datasets. For databases storing multiple terabytes, multi-threaded compaction is essential for keeping up with write throughput and maintaining read performance.
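One way to picture why compaction parallelizes well: merges over disjoint key ranges do not depend on each other. The sketch below illustrates only that decomposition. `compact_range` and `parallel_compact` are hypothetical names, and because of CPython's global interpreter lock the threads here demonstrate the structure rather than a real speedup. In RocksDB itself, the background thread budget is a configuration setting (for example the `max_background_jobs` option).

```python
from concurrent.futures import ThreadPoolExecutor


def compact_range(runs):
    """Merge the sorted runs (oldest first) of one key range; newer values win."""
    merged = {}
    for run in runs:
        merged.update(run)
    return sorted(merged.items())


def parallel_compact(partitions, max_workers=4):
    # Each partition covers a disjoint key range, so the merges are independent.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compact_range, partitions))


partitions = [
    [[("a", 1), ("c", 1)], [("a", 2), ("b", 2)]],  # keys a..c
    [[("x", 1)], [("x", 2), ("y", 2)]],            # keys x..y
]
result = parallel_compact(partitions)
```

`pool.map` returns results in input order, so the output partitions stay sorted by key range.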
Column Families
RocksDB supports multiple column families within a single database, each with independent configuration.
Durability and Recovery
RocksDB ensures durability through the Write-Ahead Log. On database restart, RocksDB replays the WAL to recover any writes that were in MemTables but not yet flushed to SST files.
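The replay step can be sketched with an append-only file. This is a simplified model under stated assumptions: the JSON-lines record format, the `wal_append` and `wal_replay` helpers, and the file name are all hypothetical and unrelated to RocksDB's actual WAL format.

```python
import json
import os
import tempfile


def wal_append(path, key, value):
    """Append one record and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"k": key, "v": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())


def wal_replay(path):
    """Rebuild the in-memory table by re-applying every logged write in order."""
    memtable = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                memtable[record["k"]] = record["v"]
    return memtable


wal_path = os.path.join(tempfile.mkdtemp(), "toy.wal")
wal_append(wal_path, "user:1", "alice")
wal_append(wal_path, "user:2", "bob")
wal_append(wal_path, "user:1", "carol")

# Simulate a restart: the in-memory state is gone, only the log remains.
recovered = wal_replay(wal_path)
```

Replaying in log order means a later write to the same key overrides an earlier one, so the recovered table matches the state before the crash.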
Performance Characteristics
RocksDB is optimized for:
- High write throughput: Sequential appends to the WAL and in-memory inserts into the MemTable
- Fast point lookups: Bloom filters and block cache
- Efficient range scans: Sorted data within SST files
- Large datasets: Multi-threaded compaction handles terabytes of data
- Flash storage: Designed for SSDs and NVMe drives
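The Bloom filters mentioned in the list above speed up point lookups by letting a read skip files that cannot contain the key. Below is a minimal, generic Bloom filter sketch. The class name, sizes, and the SHA-256-based hashing are assumptions for illustration and do not reflect RocksDB's filter implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical): no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means the key is definitely absent; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bf = BloomFilter()
bf.add("user:1")
bf.add("user:2")
empty = BloomFilter()
```

A negative answer is always correct, so a lookup can skip the file without touching disk. A positive answer still requires reading the file to confirm.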
Next Steps
- LSM Tree Design: Deep dive into how LSM trees work
- Column Families: Learn about organizing data with column families
- Compaction: Understand compaction strategies
- Write-Ahead Log: Explore WAL for durability