Write-Ahead Log (WAL)

The Write-Ahead Log (WAL) is RocksDB’s mechanism for ensuring durability. Every write operation is first appended to the WAL before being applied to the MemTable, guaranteeing that no committed data is lost even if the process crashes.

What is the WAL?

The Write-Ahead Log is an append-only file that records all write operations:

Sequential writes for high throughput
Crash recovery by replaying operations
Durability guarantees when fsync is enabled
Shared across all column families

The WAL is the foundation of RocksDB’s durability. Without it, only data flushed to SST files would survive a crash.
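The log-then-apply idea can be sketched with a toy model that uses only the standard library. `ToyStore`, its `use_wal` flag, and the vector standing in for the log file are hypothetical names made up for this illustration, not RocksDB APIs; real WAL records are serialized to a file rather than kept in memory.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: log_ stands in for the WAL file (it survives a crash) and
// memtable_ for the in-memory write buffer (it does not).
class ToyStore {
 public:
  // Log first, then apply to the in-memory table.
  void Put(const std::string& key, const std::string& value,
           bool use_wal = true) {
    if (use_wal) log_.emplace_back(key, value);
    memtable_[key] = value;
  }

  // A process crash wipes memory; the log is left untouched.
  void Crash() { memtable_.clear(); }

  // Recovery replays the log in its original order.
  void Recover() {
    for (const auto& rec : log_) memtable_[rec.first] = rec.second;
  }

  bool Contains(const std::string& key) const {
    return memtable_.count(key) > 0;
  }

  std::string Get(const std::string& key) const {
    auto it = memtable_.find(key);
    assert(it != memtable_.end());  // caller must check Contains() first
    return it->second;
  }

 private:
  std::vector<std::pair<std::string, std::string>> log_;
  std::map<std::string, std::string> memtable_;
};
```

In this sketch, a key written with `use_wal = false` is gone after `Crash()` followed by `Recover()`, while logged keys come back with their latest value.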

How WAL Works

Every write operation follows this sequence:

#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;

rocksdb::DB::Open(options, "/tmp/testdb", &db);

// Write operation
rocksdb::WriteOptions write_opts;
write_opts.sync = true; // Wait for WAL to be fsynced

rocksdb::Status s = db->Put(write_opts, "key", "value");
// At this point, data is durable on disk in the WAL

Write Path with WAL

1. Append to WAL
2. Apply to MemTable
3. Return to Client

First, the operation is serialized and appended to the WAL file:

rocksdb::WriteOptions write_opts;

// Synchronous write (wait for fsync)
write_opts.sync = true;
db->Put(write_opts, "key", "value");

// Asynchronous write (WAL buffered in OS cache)
write_opts.sync = false;
db->Put(write_opts, "key", "value");

With sync=true, the write latency includes disk fsync time (typically 1-10ms). With sync=false, latency is much lower but recent writes may be lost on system crash.
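To put the latency figures in perspective, here is a back-of-the-envelope helper (a hypothetical function, not part of RocksDB) for one writer thread that waits for each write to finish before issuing the next. It ignores the fact that concurrent writers can share a single sync, so treat the result as a rough single-thread ceiling.

```cpp
#include <cassert>

// Upper bound on writes per second for a single writer that issues writes
// back to back, given the latency of one write in milliseconds.
double MaxSerialWritesPerSecond(double latency_ms) {
  assert(latency_ms > 0.0);
  return 1000.0 / latency_ms;
}
```

With the 1-10ms range quoted above, `MaxSerialWritesPerSecond(1.0)` gives 1000 and `MaxSerialWritesPerSecond(10.0)` gives 100 writes per second.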

After WAL append, the operation is applied to the MemTable:

// MemTable is an in-memory write buffer
options.write_buffer_size = 64 << 20; // 64MB MemTable
options.max_write_buffer_number = 3;  // Max 3 MemTables

The MemTable provides fast read access to recent writes.

Once both steps complete, success is returned:

rocksdb::Status s = db->Put(write_opts, "key", "value");

if (s.ok()) {
  // Write is durable (if sync=true)
  // or buffered in OS (if sync=false)
} else {
  // Write failed, check error
  std::cerr << s.ToString() << std::endl;
}

Durability Modes

RocksDB provides different durability guarantees:

Synchronous Writes

Maximum durability with higher latency:

rocksdb::WriteOptions sync_write;
sync_write.sync = true; // fsync after every write

// This write is durable even if system crashes immediately after
db->Put(sync_write, "critical_key", "critical_value");

Latency: 1-10ms depending on storage device. Use for critical data that cannot be lost.

Asynchronous Writes

Higher throughput with relaxed durability:

rocksdb::WriteOptions async_write;
async_write.sync = false; // Don't wait for fsync

// Much faster, but recent writes may be lost on system crash
db->Put(async_write, "key", "value");

Latency: 10-100μs depending on system. The WAL write lands in the OS page cache and the operating system writes it to disk later; RocksDB does not wait for an fsync on the write path in this mode.

Manual WAL Sync

Batch multiple writes, then sync:

rocksdb::WriteOptions async_write;
async_write.sync = false;

// Write many operations without sync
for (int i = 0; i < 1000; i++) {
  db->Put(async_write, "key" + std::to_string(i), "value");
}

// Manually sync the WAL
rocksdb::FlushWALOptions wal_opts;
wal_opts.sync = true;
db->FlushWAL(wal_opts);
// Now all previous writes are durable

This approach provides good throughput while maintaining durability at application-defined boundaries (e.g., end of transaction).
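The trade-off can be made concrete with a small counter model. `ToySyncTracker` is a hypothetical class written for this page, not a RocksDB type: it only counts writes, treating "appended" as "in the OS buffer" and "durable" as "covered by the last sync".

```cpp
#include <cassert>
#include <cstddef>

// Toy model: tracks how many writes a system crash could lose right now.
class ToySyncTracker {
 public:
  // A write reached the log buffer but has not been synced yet.
  void Append() { ++appended_; }

  // A sync makes every write appended so far durable.
  void Sync() { durable_ = appended_; }

  // Writes appended after the last sync.
  std::size_t AtRisk() const {
    assert(durable_ <= appended_);
    return appended_ - durable_;
  }

  std::size_t Durable() const { return durable_; }

 private:
  std::size_t appended_ = 0;
  std::size_t durable_ = 0;
};
```

After 1000 calls to `Append()` all 1000 writes are at risk; one `Sync()` brings that to zero, which is why syncing at transaction boundaries bounds the loss to the transaction in flight.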

WAL Configuration

WAL Size Limits

rocksdb::Options options;

// Limit total WAL size (triggers flush)
options.max_total_wal_size = 1 << 30; // 1GB

// When exceeded, RocksDB flushes the column families whose unflushed data
// still lives in the oldest WAL file, so that file can be deleted
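A simplified sketch of that decision, assuming each WAL file records which column families still have unflushed data in it. `ToyWal` and `CfsToFlush` are hypothetical names for illustration and do not mirror RocksDB's internal bookkeeping.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

struct ToyWal {
  std::uint64_t size_bytes;
  std::set<std::string> unflushed_cfs;  // column families with data only here
};

// `wals` is ordered oldest first. Returns the column families to flush so the
// oldest file can be released, or an empty set if the total is within limit.
std::set<std::string> CfsToFlush(const std::vector<ToyWal>& wals,
                                 std::uint64_t max_total_wal_size) {
  std::uint64_t total = 0;
  for (const auto& wal : wals) total += wal.size_bytes;
  if (wals.empty() || total <= max_total_wal_size) {
    return std::set<std::string>();
  }
  return wals.front().unflushed_cfs;
}
```

One consequence worth noting: a rarely written column family can pin an old WAL file, so the size limit forces a flush of that column family even though its MemTable is nearly empty.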

WAL Archival

Keep old WAL files for backup or replication:

rocksdb::Options options;

// Archive WAL files instead of deleting
options.WAL_ttl_seconds = 3600;        // Keep for 1 hour
options.WAL_size_limit_MB = 1024;      // Keep up to 1GB

// Archived WAL files moved to archive/ subdirectory

WAL Recovery Mode

Control how RocksDB handles corrupted WAL on recovery:

rocksdb::Options options;

// Strict mode: fail on any corruption
options.wal_recovery_mode = 
    rocksdb::WALRecoveryMode::kAbsoluteConsistency;

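// Tail-tolerant mode: accept an incomplete record at the end of the log,
// fail on corruption anywhere else
// options.wal_recovery_mode = 
//     rocksdb::WALRecoveryMode::kTolerateCorruptedTailRecords;

// Point-in-time recovery (the default): stop at first corruption
// options.wal_recovery_mode = 
//     rocksdb::WALRecoveryMode::kPointInTimeRecovery;

// Salvage mode: skip corrupt records and keep replaying
// options.wal_recovery_mode = 
//     rocksdb::WALRecoveryMode::kSkipAnyCorruptedRecords;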
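The difference between the modes is easiest to see on a toy log. The sketch below is a simplified model, not RocksDB's recovery code: `ToyMode`, `ToyRecord`, and `Replay` are hypothetical names, and the `corrupt` flag stands in for a failed checksum. The four toy modes loosely correspond to `kAbsoluteConsistency`, `kTolerateCorruptedTailRecords`, `kPointInTimeRecovery`, and `kSkipAnyCorruptedRecords`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class ToyMode { kStrict, kTolerateTail, kPointInTime, kSkipCorrupt };

struct ToyRecord {
  std::string payload;
  bool corrupt;  // stands in for a checksum mismatch
};

struct ToyResult {
  bool ok;                             // false means "refuse to open"
  std::vector<std::string> recovered;  // payloads replayed, in log order
};

ToyResult Replay(const std::vector<ToyRecord>& log, ToyMode mode) {
  ToyResult result{true, {}};
  for (std::size_t i = 0; i < log.size(); ++i) {
    if (!log[i].corrupt) {
      result.recovered.push_back(log[i].payload);
      continue;
    }
    const bool is_tail = (i + 1 == log.size());
    switch (mode) {
      case ToyMode::kStrict:  // any corruption is fatal
        result.ok = false;
        result.recovered.clear();
        return result;
      case ToyMode::kTolerateTail:  // only a damaged last record is tolerated
        if (is_tail) return result;
        result.ok = false;
        result.recovered.clear();
        return result;
      case ToyMode::kPointInTime:  // keep the clean prefix, drop the rest
        return result;
      case ToyMode::kSkipCorrupt:  // drop this record, keep going
        break;
    }
  }
  return result;
}
```

For a log `a, b(corrupt), c`, the strict and tail-tolerant modes fail, point-in-time recovers only `a`, and skip mode recovers `a` and `c`, which may leave the database in a state that never existed.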

Crash Recovery

When RocksDB opens a database, it replays the WAL:

rocksdb::DB* db;
rocksdb::Options options;

// Open database (automatically replays WAL)
rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);

if (s.ok()) {
  // Writes recovered from the WAL are readable again (held in a MemTable,
  // or already flushed to an SST file during recovery)
  // Database is ready for reads and writes
} else {
  std::cerr << "Recovery failed: " << s.ToString() << std::endl;
}

Recovery Process

Step 1: Identify WAL Files

RocksDB scans the database directory for WAL files:

/tmp/testdb/
  000003.log  (active WAL)
  000002.log  (old WAL, not yet deleted)
  MANIFEST-000001
  ...

Step 2: Replay Operations

Each operation in the WAL is replayed:

Puts: Inserted into MemTable
Deletes: Marked as deleted in MemTable
Merges: Applied to MemTable

Operations are applied in the same order as original writes.
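Why order matters can be shown with a toy replay loop. This is an illustrative sketch with hypothetical names (`OpType`, `Op`, `ReplayOps`); merge is modeled as string append purely as an assumption, since the real result depends on the configured merge operator, and delete is modeled as erasing the key instead of writing a tombstone.

```cpp
#include <map>
#include <string>
#include <vector>

enum class OpType { kPut, kDelete, kMerge };

struct Op {
  OpType type;
  std::string key;
  std::string value;  // unused for kDelete
};

// Applies the logged operations to an empty table in log order.
std::map<std::string, std::string> ReplayOps(const std::vector<Op>& log) {
  std::map<std::string, std::string> table;
  for (const Op& op : log) {
    switch (op.type) {
      case OpType::kPut:
        table[op.key] = op.value;
        break;
      case OpType::kDelete:
        table.erase(op.key);
        break;
      case OpType::kMerge:
        table[op.key] += op.value;  // assumed "append" merge semantics
        break;
    }
  }
  return table;
}
```

Replaying `Put(k,a), Merge(k,b), Delete(k), Put(k,c)` leaves `k = "c"`; replaying the same operations in a different order generally produces a different table, so recovery has to preserve the original sequence.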

Step 3: Resume Normal Operation

After replay completes:

Old WAL files are deleted (unless archived)
New WAL file is created for future writes
Database is ready for reads and writes

WAL and Column Families

All column families share a single WAL:

std::vector<rocksdb::ColumnFamilyHandle*> handles;
std::vector<rocksdb::ColumnFamilyDescriptor> column_families;

column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()));
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    "users", rocksdb::ColumnFamilyOptions()));
column_families.push_back(rocksdb::ColumnFamilyDescriptor(
    "posts", rocksdb::ColumnFamilyOptions()));

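rocksdb::DBOptions db_options;
db_options.create_if_missing = true;
// "users" and "posts" may not exist yet on a fresh database
db_options.create_missing_column_families = true;

rocksdb::DB* db;
rocksdb::Status s = rocksdb::DB::Open(db_options, "/tmp/testdb",
                                      column_families, &handles, &db);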

// All writes to any column family go to the same WAL
db->Put(rocksdb::WriteOptions(), handles[0], "config", "v1");
db->Put(rocksdb::WriteOptions(), handles[1], "user:1", "Alice");
db->Put(rocksdb::WriteOptions(), handles[2], "post:1", "Hello");

Sharing a WAL ensures atomic writes across column families and simplifies recovery, but means WAL size is affected by writes to all column families.

Atomic Writes

WriteBatch provides atomic multi-key writes:

#include <rocksdb/write_batch.h>

rocksdb::WriteBatch batch;

// Add multiple operations
batch.Put("key1", "value1");
batch.Put("key2", "value2");
batch.Delete("key3");

// Write atomically to WAL and MemTable
rocksdb::WriteOptions write_opts;
write_opts.sync = true;

rocksdb::Status s = db->Write(write_opts, &batch);
// All operations succeed or all fail together

The entire WriteBatch is written as a single record in the WAL, ensuring atomicity even across crashes.
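A toy framing scheme shows why a single record gives all-or-nothing behavior on replay. The format below is invented for this illustration and is not RocksDB's on-disk layout: one byte for the entry count, then one length byte plus the bytes of each entry. `EncodeBatch` and `DecodeBatch` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Serializes all entries of a batch into one record.
std::string EncodeBatch(const std::vector<std::string>& entries) {
  assert(entries.size() <= 255);  // count must fit in one byte
  std::string record(1, static_cast<char>(entries.size()));
  for (const std::string& e : entries) {
    assert(e.size() <= 255);  // length must fit in one byte
    record.push_back(static_cast<char>(e.size()));
    record += e;
  }
  return record;
}

// Returns true and fills `out` only if the whole record is intact. A torn
// record yields no entries at all, never a partial batch.
bool DecodeBatch(const std::string& record, std::vector<std::string>* out) {
  out->clear();
  if (record.empty()) return false;
  const std::size_t count = static_cast<unsigned char>(record[0]);
  std::vector<std::string> entries;
  std::size_t pos = 1;
  for (std::size_t i = 0; i < count; ++i) {
    if (pos >= record.size()) return false;
    const std::size_t len = static_cast<unsigned char>(record[pos++]);
    if (pos + len > record.size()) return false;
    entries.push_back(record.substr(pos, len));
    pos += len;
  }
  if (pos != record.size()) return false;  // trailing garbage
  *out = entries;
  return true;
}
```

If a crash truncates the record part-way through the last entry, `DecodeBatch` rejects the whole record, so none of the batch's operations are replayed.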

Disabling WAL

For non-critical writes, you can disable WAL:

rocksdb::WriteOptions no_wal;
no_wal.disableWAL = true;

// Fast write, but not durable until flushed to SST
db->Put(no_wal, "temp_key", "temp_value");

Warning: Data written with WAL disabled is lost if the process crashes before the MemTable is flushed. Only use for temporary or reconstructible data.

Use Cases for Disabled WAL

Bulk loading: Fast initial data load, then call Flush()
Temporary cache: Data that can be regenerated
Replicas: Secondary instances that can re-sync from primary

WAL Performance Tuning

Reduce Sync Overhead

rocksdb::DBOptions options;

// Use faster fdatasync instead of fsync (Linux)
options.use_fsync = false;

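// Buffer WAL writes inside RocksDB until the application calls FlushWAL();
// writes still in that buffer are lost if the process crashes first
options.manual_wal_flush = true;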

// Application controls when to sync:
rocksdb::FlushWALOptions wal_opts;
wal_opts.sync = true;
db->FlushWAL(wal_opts);

WAL Recycling

rocksdb::DBOptions options;

// Reuse WAL files instead of creating new ones
options.recycle_log_file_num = 10; // Keep 10 WAL files for reuse

// Reduces file creation overhead

Concurrent Writes

rocksdb::DBOptions options;

// Allow multiple threads to insert into the MemTable in parallel.
// WAL appends themselves are still grouped and written serially.
options.allow_concurrent_memtable_write = true;

// Higher write throughput with multiple writer threads

Monitoring WAL

Get WAL Statistics

// List all WAL files (GetSortedWalFiles takes the vector by reference)
std::vector<std::unique_ptr<rocksdb::WalFile>> wal_files;
db->GetSortedWalFiles(wal_files);

for (const auto& wal : wal_files) {
  std::cout << "WAL: " << wal->PathName() 
            << " Size: " << wal->SizeFileBytes() 
            << " Sequence: " << wal->StartSequence()
            << std::endl;
}

Get WAL Size

std::string value;
// GetProperty returns false if this RocksDB build does not recognize the name
if (db->GetProperty("rocksdb.total-wal-size", &value)) {
  std::cout << "Total WAL size: " << value << " bytes" << std::endl;
}

Best Practices

Use sync=true for critical data that cannot be lost
Batch writes with WriteBatch and sync periodically for throughput
Set max_total_wal_size to prevent unbounded WAL growth
Archive WALs if you need point-in-time recovery or replication
Monitor WAL size and flush frequency to identify issues
Disable WAL only for temporary or reconstructible data
Use recycle_log_file_num to reduce file creation overhead

Next Steps

Architecture

See how WAL fits into RocksDB architecture

Snapshots

Learn about consistent point-in-time reads

Compaction

Understand when WAL files are deleted

Backup & Restore

Use WAL for point-in-time recovery
