The Write-Ahead Log (WAL) is RocksDB’s mechanism for ensuring durability. Every write operation is first appended to the WAL before being applied to the MemTable, guaranteeing that no committed data is lost even if the process crashes.

What is the WAL?

The Write-Ahead Log is an append-only file that records all write operations:
  • Sequential writes for high throughput
  • Crash recovery by replaying operations
  • Durability guarantees when fsync is enabled
  • Shared across all column families
The WAL is the foundation of RocksDB’s durability. Without it, only data flushed to SST files would survive a crash.

How WAL Works

Every write operation follows this sequence:
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;

rocksdb::DB::Open(options, "/tmp/testdb", &db);

// Write operation
rocksdb::WriteOptions write_opts;
write_opts.sync = true; // Wait for WAL to be fsynced

rocksdb::Status s = db->Put(write_opts, "key", "value");
// At this point, data is durable on disk in the WAL

Write Path with WAL

First, the operation is serialized and appended to the WAL file:
rocksdb::WriteOptions write_opts;

// Synchronous write (wait for fsync)
write_opts.sync = true;
db->Put(write_opts, "key", "value");

// Asynchronous write (WAL buffered in OS cache)
write_opts.sync = false;
db->Put(write_opts, "key", "value");
With sync=true, the write latency includes disk fsync time (typically 1-10ms). With sync=false, latency is much lower but recent writes may be lost on system crash.

Durability Modes

RocksDB provides different durability guarantees:

Synchronous Writes

Maximum durability with higher latency:
rocksdb::WriteOptions sync_write;
sync_write.sync = true; // fsync after every write

// This write is durable even if system crashes immediately after
db->Put(sync_write, "critical_key", "critical_value");
Latency: 1-10ms depending on storage device. Use for critical data that cannot be lost.

Asynchronous Writes

Higher throughput with relaxed durability:
rocksdb::WriteOptions async_write;
async_write.sync = false; // Don't wait for fsync

// Much faster, but recent writes may be lost on system crash
db->Put(async_write, "key", "value");
Latency: 10-100μs depending on system. WAL is buffered in OS page cache and fsynced periodically.

Manual WAL Sync

Batch multiple writes, then sync:
rocksdb::WriteOptions async_write;
async_write.sync = false;

// Write many operations without sync
for (int i = 0; i < 1000; i++) {
  db->Put(async_write, "key" + std::to_string(i), "value");
}

// Manually sync the WAL
db->FlushWAL(true /* sync */);
// Now all previous writes are durable
This approach provides good throughput while maintaining durability at application-defined boundaries (e.g., end of transaction).

WAL Configuration

WAL Size Limits

rocksdb::Options options;

// Limit total WAL size; when exceeded, the column families whose
// data still lives in the oldest WAL file are flushed so that
// file can be released
options.max_total_wal_size = 1ULL << 30; // 1GB

WAL Archival

Keep old WAL files for backup or replication:
rocksdb::Options options;

// Archive WAL files instead of deleting
options.WAL_ttl_seconds = 3600;        // Keep for 1 hour
options.WAL_size_limit_MB = 1024;      // Keep up to 1GB

// Archived WAL files moved to archive/ subdirectory

WAL Recovery Mode

Control how RocksDB handles corrupted WAL on recovery:
rocksdb::Options options;

// Strict mode: fail recovery on any missing or corrupted log data
options.wal_recovery_mode = 
    rocksdb::WALRecoveryMode::kAbsoluteConsistency;

// Tolerant mode: allow corruption only at the tail of the log
// (e.g. a record half-written when the machine crashed)
// options.wal_recovery_mode = 
//     rocksdb::WALRecoveryMode::kTolerateCorruptedTailRecords;

// Point-in-time recovery (the default): replay up to the first
// corruption and discard everything after it
// options.wal_recovery_mode = 
//     rocksdb::WALRecoveryMode::kPointInTimeRecovery;

Crash Recovery

When RocksDB opens a database, it replays the WAL:
#include <iostream>
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;

// Open database (automatically replays WAL)
rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);

if (s.ok()) {
  // All committed writes from WAL are now in MemTable
  // Database is ready for reads and writes
} else {
  std::cerr << "Recovery failed: " << s.ToString() << std::endl;
}

Recovery Process

RocksDB scans the database directory for WAL files:
/tmp/testdb/
  000003.log  (active WAL)
  000002.log  (old WAL, not yet deleted)
  MANIFEST-000001
  ...
Each operation in the WAL is replayed:
  • Puts: Inserted into MemTable
  • Deletes: Marked as deleted in MemTable
  • Merges: Applied to MemTable
Operations are applied in the same order as original writes.
After replay completes:
  • Old WAL files are deleted (unless archived)
  • New WAL file is created for future writes
  • Database is ready for reads and writes

WAL and Column Families

All column families share a single WAL:
std::vector<rocksdb::ColumnFamilyHandle*> handles;
std::vector<rocksdb::ColumnFamilyDescriptor> column_families;

column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()));
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    "users", rocksdb::ColumnFamilyOptions()));
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    "posts", rocksdb::ColumnFamilyOptions()));

rocksdb::DB* db;
rocksdb::DB::Open(rocksdb::DBOptions(), "/tmp/testdb", 
                   column_families, &handles, &db);

// All writes to any column family go to the same WAL
db->Put(rocksdb::WriteOptions(), handles[0], "config", "v1");
db->Put(rocksdb::WriteOptions(), handles[1], "user:1", "Alice");
db->Put(rocksdb::WriteOptions(), handles[2], "post:1", "Hello");
Sharing a WAL ensures atomic writes across column families and simplifies recovery, but means WAL size is affected by writes to all column families.

Atomic Writes

WriteBatch provides atomic multi-key writes:
#include <rocksdb/write_batch.h>

rocksdb::WriteBatch batch;

// Add multiple operations
batch.Put("key1", "value1");
batch.Put("key2", "value2");
batch.Delete("key3");

// Write atomically to WAL and MemTable
rocksdb::WriteOptions write_opts;
write_opts.sync = true;

rocksdb::Status s = db->Write(write_opts, &batch);
// All operations succeed or all fail together
The entire WriteBatch is written as a single record in the WAL, ensuring atomicity even across crashes.

Disabling WAL

For non-critical writes, you can disable WAL:
rocksdb::WriteOptions no_wal;
no_wal.disableWAL = true;

// Fast write, but not durable until flushed to SST
db->Put(no_wal, "temp_key", "temp_value");
Warning: Data written with WAL disabled is lost if the process crashes before the MemTable is flushed. Only use for temporary or reconstructible data.

Use Cases for Disabled WAL

  • Bulk loading: Fast initial data load, then call Flush()
  • Temporary cache: Data that can be regenerated
  • Replicas: Secondary instances that can re-sync from primary

WAL Performance Tuning

Reduce Sync Overhead

rocksdb::DBOptions options;

// Use faster fdatasync instead of fsync (Linux)
options.use_fsync = false;

// Batch multiple writes before syncing
options.manual_wal_flush = true;

// Application controls when to sync:
db->FlushWAL(true /* sync */);

WAL Recycling

rocksdb::DBOptions options;

// Reuse WAL files instead of creating new ones
options.recycle_log_file_num = 10; // Keep 10 WAL files for reuse

// Reduces file creation overhead

Concurrent Writes

rocksdb::DBOptions options;

// Allow concurrent inserts into the MemTable (WAL appends are
// still serialized through the write group's leader)
options.allow_concurrent_memtable_write = true;

// Higher write throughput with multiple writer threads

Monitoring WAL

Get WAL Statistics

// List all WAL files
std::vector<std::unique_ptr<rocksdb::LogFile>> wal_files;
db->GetSortedWalFiles(wal_files);

for (const auto& wal : wal_files) {
  std::cout << "WAL: " << wal->PathName() 
            << " Size: " << wal->SizeFileBytes() 
            << " Sequence: " << wal->StartSequence()
            << std::endl;
}

Get WAL Size

uint64_t total_wal_bytes = 0;
std::vector<std::unique_ptr<rocksdb::LogFile>> wal_files;
db->GetSortedWalFiles(wal_files);
for (const auto& wal : wal_files) {
  total_wal_bytes += wal->SizeFileBytes();
}
std::cout << "Total WAL size: " << total_wal_bytes << " bytes" << std::endl;

Best Practices

  1. Use sync=true for critical data that cannot be lost
  2. Batch writes with WriteBatch and sync periodically for throughput
  3. Set max_total_wal_size to prevent unbounded WAL growth
  4. Archive WALs if you need point-in-time recovery or replication
  5. Monitor WAL size and flush frequency to identify issues
  6. Disable WAL only for temporary or reconstructible data
  7. Use recycle_log_file_num to reduce file creation overhead

Next Steps

Architecture

See how WAL fits into RocksDB architecture

Snapshots

Learn about consistent point-in-time reads

Compaction

Understand when WAL files are deleted

Backup & Restore

Use WAL for point-in-time recovery
