Understanding RocksDB’s Write-Ahead Log for durability and crash recovery
The Write-Ahead Log (WAL) is RocksDB’s mechanism for ensuring durability. Every write operation is first appended to the WAL before being applied to the MemTable, guaranteeing that no committed data is lost even if the process crashes.
    #include <rocksdb/db.h>

    rocksdb::DB* db;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::DB::Open(options, "/tmp/testdb", &db);

    // Write operation
    rocksdb::WriteOptions write_opts;
    write_opts.sync = true;  // Wait for WAL to be fsynced
    rocksdb::Status s = db->Put(write_opts, "key", "value");
    // At this point, data is durable on disk in the WAL
First, the operation is serialized and appended to the WAL file:
    rocksdb::WriteOptions write_opts;

    // Synchronous write (wait for fsync)
    write_opts.sync = true;
    db->Put(write_opts, "key", "value");

    // Asynchronous write (WAL buffered in OS cache)
    write_opts.sync = false;
    db->Put(write_opts, "key", "value");

With sync=true, the write latency includes the disk fsync time (typically 1-10ms). With sync=false, latency is much lower, but recent writes may be lost if the whole system crashes. A crash of the process alone does not lose them, because the record has already been handed to the operating system.
After WAL append, the operation is applied to the MemTable:
    // MemTable is an in-memory write buffer
    options.write_buffer_size = 64 << 20;  // 64MB MemTable
    options.max_write_buffer_number = 3;   // Max 3 MemTables
The MemTable provides fast read access to recent writes.
Once both steps complete, success is returned:
    #include <iostream>

    rocksdb::Status s = db->Put(write_opts, "key", "value");
    if (s.ok()) {
      // Write is durable (if sync=true)
      // or buffered in OS (if sync=false)
    } else {
      // Write failed, check error
      std::cerr << s.ToString() << std::endl;
    }
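The ordering of the three steps matters: because the log append comes first, the MemTable never holds a write that the log does not. The following is a minimal sketch of that ordering using only the standard library. It is a toy model, not RocksDB code; `ToyDb`, its `capacity` limit (used here to force a failed append), and the record encoding are all hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model of the write path: log first, then MemTable, then report status.
// Not RocksDB code; ToyDb and its members are hypothetical names.
class ToyDb {
 public:
  // Simulates a log device that rejects appends once `capacity` records exist.
  explicit ToyDb(std::size_t capacity) : capacity_(capacity) {}

  bool Put(const std::string& key, const std::string& value) {
    // Step 1: append to the log. On failure, stop before touching the MemTable.
    if (log_.size() >= capacity_) return false;
    log_.push_back(key + "=" + value);
    // Step 2: apply to the in-memory table.
    memtable_[key] = value;
    // Step 3: report success.
    return true;
  }

  bool Contains(const std::string& key) const {
    return memtable_.count(key) > 0;
  }
  std::size_t LogSize() const { return log_.size(); }

 private:
  std::size_t capacity_;
  std::vector<std::string> log_;
  std::map<std::string, std::string> memtable_;
};
```

With `ToyDb db(1)`, the first `Put` succeeds and the second returns false without changing the in-memory table, which mirrors how a failed write surfaces as a non-OK status.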
For a synchronous write, set sync to true so that the WAL is fsynced before the call returns:

    rocksdb::WriteOptions sync_write;
    sync_write.sync = true;  // fsync after every write

    // This write is durable even if system crashes immediately after
    db->Put(sync_write, "critical_key", "critical_value");
Latency: 1-10ms depending on storage device. Use for critical data that cannot be lost.
For an asynchronous write, leave sync set to false:

    rocksdb::WriteOptions async_write;
    async_write.sync = false;  // Don't wait for fsync

    // Much faster, but recent writes may be lost on system crash
    db->Put(async_write, "key", "value");

Latency: 10-100μs depending on system. The WAL record is buffered in the OS page cache and reaches disk when the kernel writes it back or when a later sync covers it.
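A third option is to issue many writes without sync and then sync the WAL manually:

    rocksdb::WriteOptions async_write;
    async_write.sync = false;

    // Write many operations without sync
    for (int i = 0; i < 1000; i++) {
      db->Put(async_write, "key" + std::to_string(i), "value");
    }

    // Manually sync the WAL
    // (releases without FlushWALOptions accept db->FlushWAL(true) instead)
    rocksdb::FlushWALOptions wal_opts;
    wal_opts.sync = true;
    db->FlushWAL(wal_opts);
    // Now all previous writes are durable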
This approach provides good throughput while maintaining durability at application-defined boundaries (e.g., end of transaction).
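To make the trade-off concrete, here is a small standard-library sketch of what survives a system crash when syncs happen only at chosen boundaries. It is a worst-case toy model, not RocksDB code: `ToyWal` and `SurvivorsAfterCrash` are hypothetical names, and a real crash may preserve some unsynced records that the kernel happened to write back.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model: records sit in a volatile buffer until Sync() makes them durable.
// ToyWal is a hypothetical name, not a RocksDB class.
class ToyWal {
 public:
  void Append(const std::string& record) { unsynced_.push_back(record); }

  // Stands in for fsync: everything appended so far becomes durable.
  void Sync() {
    durable_.insert(durable_.end(), unsynced_.begin(), unsynced_.end());
    unsynced_.clear();
  }

  // Simulates a machine crash in the worst case: the unsynced tail is lost.
  void Crash() { unsynced_.clear(); }

  std::size_t DurableCount() const { return durable_.size(); }

 private:
  std::vector<std::string> durable_;
  std::vector<std::string> unsynced_;
};

// Appends `total` records, syncing after every `batch` records (batch > 0),
// then crashes. Returns how many records survive.
std::size_t SurvivorsAfterCrash(std::size_t total, std::size_t batch) {
  ToyWal wal;
  for (std::size_t i = 0; i < total; ++i) {
    wal.Append("key" + std::to_string(i));
    if ((i + 1) % batch == 0) wal.Sync();
  }
  wal.Crash();
  return wal.DurableCount();
}
```

In this model, writing 1050 records with a sync every 100 leaves 1000 durable records after the crash: everything up to the last boundary survives and the tail after it does not.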
The total size of WAL files is bounded with max_total_wal_size:

    rocksdb::Options options;

    // Limit total WAL size (triggers flush)
    options.max_total_wal_size = 1 << 30;  // 1GB
    // When exceeded, column families backed by the oldest live WAL are flushed to SST
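The flush trigger can be pictured as a selection over column families: once the WAL total passes the limit, the families still pinned to the oldest WAL file are the ones to flush, since flushing them lets that file be released. The sketch below models only that selection using the standard library. `ToyColumnFamily`, `oldest_unflushed_wal`, and `PickFlushCandidates` are hypothetical names for illustration, and RocksDB's internal bookkeeping is more involved.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bookkeeping for one column family.
struct ToyColumnFamily {
  std::string name;
  // Lowest WAL file number that still holds unflushed data for this family.
  std::uint64_t oldest_unflushed_wal;
};

// Returns the names of the column families to flush when the total WAL size
// exceeds the limit: those pinned to the oldest live WAL. Returns an empty
// list while the total is within the limit.
std::vector<std::string> PickFlushCandidates(
    const std::vector<ToyColumnFamily>& cfs, std::uint64_t total_wal_bytes,
    std::uint64_t max_total_wal_size) {
  std::vector<std::string> picked;
  if (cfs.empty() || total_wal_bytes <= max_total_wal_size) return picked;

  // Find the oldest WAL number referenced by any column family.
  std::uint64_t oldest = cfs.front().oldest_unflushed_wal;
  for (const auto& cf : cfs) {
    oldest = std::min(oldest, cf.oldest_unflushed_wal);
  }

  // Every family still holding data in that WAL must be flushed.
  for (const auto& cf : cfs) {
    if (cf.oldest_unflushed_wal == oldest) picked.push_back(cf.name);
  }
  return picked;
}
```

For families pinned to WAL numbers 7, 3, and 5, only the one at 3 is selected once the limit is exceeded. A rarely written column family can therefore hold old WAL files alive until it is flushed.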
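Obsolete WAL files can be archived instead of deleted. Note that these two option names are capitalized in RocksDB:

    rocksdb::Options options;

    // Archive WAL files instead of deleting
    options.WAL_ttl_seconds = 3600;    // Keep for 1 hour
    options.WAL_size_limit_MB = 1024;  // Keep up to 1GB
    // Archived WAL files moved to archive/ subdirectory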
When RocksDB opens a database, it replays the WAL:
    #include <iostream>

    rocksdb::DB* db;
    rocksdb::Options options;

    // Open database (automatically replays WAL)
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
    if (s.ok()) {
      // Writes that were in the WAL but not yet in SST files have been replayed
      // Database is ready for reads and writes
    } else {
      std::cerr << "Recovery failed: " << s.ToString() << std::endl;
    }
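Replay itself is conceptually simple: records are applied in the order they were logged, so the rebuilt state matches what the MemTable held before the crash. Here is a minimal standard-library sketch of that idea. `ToyOp`, `ToyRecord`, and `Replay` are hypothetical names, and erasing a key on delete is a simplification of how a real engine tracks deletions.

```cpp
#include <map>
#include <string>
#include <vector>

enum class ToyOp { kPut, kDelete };

// One logged operation. Hypothetical layout, not RocksDB's record format.
struct ToyRecord {
  ToyOp op;
  std::string key;
  std::string value;  // Unused for deletes.
};

// Rebuilds in-memory state by applying log records in the order they were
// written. Later records override earlier ones for the same key.
std::map<std::string, std::string> Replay(const std::vector<ToyRecord>& log) {
  std::map<std::string, std::string> memtable;
  for (const auto& rec : log) {
    if (rec.op == ToyOp::kPut) {
      memtable[rec.key] = rec.value;
    } else {
      memtable.erase(rec.key);
    }
  }
  return memtable;
}
```

Replaying put a=1, put b=2, put a=3, delete b yields a single entry mapping a to 3, which shows why log order has to be preserved during recovery.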
All column families in a database share a single WAL:

    std::vector<rocksdb::ColumnFamilyHandle*> handles;
    std::vector<rocksdb::ColumnFamilyDescriptor> column_families;
    column_families.push_back(rocksdb::ColumnFamilyDescriptor(
        rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()));
    column_families.push_back(rocksdb::ColumnFamilyDescriptor(
        "users", rocksdb::ColumnFamilyOptions()));
    column_families.push_back(rocksdb::ColumnFamilyDescriptor(
        "posts", rocksdb::ColumnFamilyOptions()));

    rocksdb::DBOptions db_opts;
    db_opts.create_if_missing = true;
    // Without this, Open fails if "users" or "posts" does not exist yet
    db_opts.create_missing_column_families = true;

    rocksdb::DB* db;
    rocksdb::Status s = rocksdb::DB::Open(db_opts, "/tmp/testdb",
                                          column_families, &handles, &db);

    // All writes to any column family go to the same WAL
    db->Put(rocksdb::WriteOptions(), handles[0], "config", "v1");
    db->Put(rocksdb::WriteOptions(), handles[1], "user:1", "Alice");
    db->Put(rocksdb::WriteOptions(), handles[2], "post:1", "Hello");
Sharing a WAL ensures atomic writes across column families and simplifies recovery, but means WAL size is affected by writes to all column families.
A WriteBatch groups several operations into one atomic write:

    #include <rocksdb/write_batch.h>

    rocksdb::WriteBatch batch;

    // Add multiple operations
    batch.Put("key1", "value1");
    batch.Put("key2", "value2");
    batch.Delete("key3");

    // Write atomically to WAL and MemTable
    rocksdb::WriteOptions write_opts;
    write_opts.sync = true;
    rocksdb::Status s = db->Write(write_opts, &batch);
    // All operations succeed or all fail together
The entire WriteBatch is written as a single record in the WAL, ensuring atomicity even across crashes.
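One way to see why a single record gives atomicity: if each record carries its length and a checksum, a reader can tell a complete record from one that was cut off mid-write, and it discards the incomplete one as a whole. The sketch below illustrates this with a made-up framing of `[length][checksum][payload]` and FNV-1a as a stand-in checksum. All function names are hypothetical, and RocksDB's actual log format and checksum differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a, used here only as a simple stand-in checksum.
std::uint32_t ToyChecksum(const std::string& data) {
  std::uint32_t h = 2166136261u;
  for (unsigned char c : data) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Appends a 32-bit value in little-endian byte order.
void AppendU32(std::string* out, std::uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    out->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Reads a little-endian 32-bit value starting at `pos`.
std::uint32_t ReadU32(const std::string& in, std::size_t pos) {
  std::uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<std::uint32_t>(static_cast<unsigned char>(in[pos + i]))
         << (8 * i);
  }
  return v;
}

// Appends one record whose payload is the whole serialized batch.
void AppendBatchRecord(std::string* log, const std::string& batch_payload) {
  AppendU32(log, static_cast<std::uint32_t>(batch_payload.size()));
  AppendU32(log, ToyChecksum(batch_payload));
  log->append(batch_payload);
}

// Returns the payloads of all intact records, stopping at the first record
// that is truncated or fails its checksum.
std::vector<std::string> ReadIntactBatches(const std::string& log) {
  std::vector<std::string> batches;
  std::size_t pos = 0;
  while (log.size() - pos >= 8) {
    const std::uint32_t len = ReadU32(log, pos);
    const std::uint32_t sum = ReadU32(log, pos + 4);
    if (log.size() - pos - 8 < len) break;  // Torn write: payload incomplete.
    const std::string payload = log.substr(pos + 8, len);
    if (ToyChecksum(payload) != sum) break;  // Corrupt record.
    batches.push_back(payload);
    pos += 8 + len;
  }
  return batches;
}
```

If two batches are appended and the last few bytes of the log are then cut off, `ReadIntactBatches` returns only the first batch. None of the second batch's operations are applied, which is the all-or-nothing behavior described above.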