Overview
RotorTree implements durability through a write-ahead log (WAL) combined with periodic checkpointing. This design provides:
- Immediate durability – Insertions are fsynced to the WAL before returning
- Fast recovery – Replay only uncommitted WAL entries after a checkpoint
- Zero-copy deserialization – WAL entries use wincode for efficient frame parsing
- Memory efficiency – Checkpoints allow WAL truncation and mmap-backed storage
WAL Frame Structure
Every WAL entry is written as a framed record with CRC32 integrity checking:
- Corruption detection – CRC32 over length + payload
- Truncation resilience – Partial writes at tail are safely ignored
- Sequential append – No random access required during recovery
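The framing properties above can be sketched in a few lines. This is a minimal illustration, not RotorTree's actual code: it assumes a `[len: u32 LE][crc32(len || payload): u32 LE][payload]` layout, and the exact field order on disk may differ.

```rust
// CRC-32 (IEEE, reflected), bitwise variant for self-containment.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if low bit set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let len = (payload.len() as u32).to_le_bytes();
    let mut crc_input = len.to_vec();
    crc_input.extend_from_slice(payload);
    let mut frame = len.to_vec();
    frame.extend_from_slice(&crc32(&crc_input).to_le_bytes());
    frame.extend_from_slice(payload);
    frame
}

/// Returns the payload and bytes consumed, or None for a torn/corrupt record.
fn decode_frame(buf: &[u8]) -> Option<(&[u8], usize)> {
    if buf.len() < 8 {
        return None; // truncated header at the tail
    }
    let len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let stored_crc = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    if buf.len() < 8 + len {
        return None; // truncated payload at the tail: safely ignored
    }
    let mut crc_input = buf[0..4].to_vec();
    crc_input.extend_from_slice(&buf[8..8 + len]);
    if crc32(&crc_input) != stored_crc {
        return None; // corruption detected
    }
    Some((&buf[8..8 + len], 8 + len))
}
```

Because the CRC covers the length prefix as well as the payload, a torn write cannot masquerade as a shorter valid record.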
Frame Implementation
Frames are serialized using the wincode schema:
- Sequence numbers – Every entry has a monotonic seq for gap detection (src/storage/wal.rs:50)
- Cow semantics – Batch payloads avoid allocation during serialization (src/storage/wal.rs:21-24)
- Versioned enums – Future WAL format changes won’t break existing data
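The three properties above can be sketched together. The variant and field names here are assumed for illustration and are not RotorTree's actual wincode schema:

```rust
use std::borrow::Cow;

// Hypothetical frame record shape.
#[derive(Debug)]
enum WalRecord<'a> {
    // V1 batch insert. Cow lets serialization borrow the caller's buffer
    // (no allocation), while deserialization can own its bytes if needed.
    InsertBatchV1 { seq: u64, payload: Cow<'a, [u8]> },
    // Future WAL format changes add new versioned variants instead of
    // mutating InsertBatchV1, so existing data keeps decoding.
}

// Gap detection: replay rejects any entry whose seq is not prev + 1.
fn is_contiguous(prev_seq: u64, next_seq: u64) -> bool {
    next_seq == prev_seq + 1
}
```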
WAL Header
The WAL file begins with a header containing tree configuration. If you change the branching factor N or MAX_DEPTH, you must delete the existing WAL or create a new data directory.
Checkpoint Policies
Checkpointing materializes the in-memory tree state to disk, allowing WAL truncation. Choose a policy based on your workload:
Policy Selection Guide
Manual – Full Control
Use when:
- You want deterministic checkpoint timing
- Coordinating with application-level transactions
- Running benchmarks or tests
Trade-offs:
- WAL grows unbounded until explicit checkpoint
- Recovery time increases with WAL size
EveryNEntries – Predictable Recovery
Use when:
- You want bounded recovery time (max N entries to replay)
- Write rate is relatively uniform
- Prefer simple tuning knobs
Trade-offs:
- May checkpoint too frequently if batch insertions are large
MemoryThreshold – Bounded Memory
Use when:
- Write sizes vary dramatically (mix of single/batch inserts)
- You want to cap uncommitted memory usage
- Running with limited RAM
Implementation: Tracks uncheckpointed_memory_bytes atomically (src/storage/mod.rs:129, 541-542, 578-580).
OnClose – Minimal Overhead
Use when:
- Tree is short-lived or batch-processed
- Recovery time on restart is acceptable
- Want zero checkpoint overhead during operation
Trade-offs:
- Entire WAL must be replayed on recovery
- No disk space reclaimed until close
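The four policies can be compared as a single decision function. This is a sketch of the triggering logic only; the enum and field names are assumed here and are not RotorTree's public API:

```rust
// Hypothetical policy enum mirroring the four sections above.
enum CheckpointPolicy {
    Manual,               // checkpoint() calls only
    EveryNEntries(u64),   // bounded replay: at most N entries
    MemoryThreshold(u64), // bounded uncommitted bytes
    OnClose,              // single checkpoint when the tree closes
}

fn should_auto_checkpoint(
    policy: &CheckpointPolicy,
    entries_since_checkpoint: u64,
    uncheckpointed_bytes: u64,
) -> bool {
    match policy {
        // Both are caller-driven: never triggered by the write path.
        CheckpointPolicy::Manual | CheckpointPolicy::OnClose => false,
        CheckpointPolicy::EveryNEntries(n) => entries_since_checkpoint >= *n,
        CheckpointPolicy::MemoryThreshold(bytes) => uncheckpointed_bytes >= *bytes,
    }
}
```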
Checkpoint Workflow
Checkpoints run on a background thread (src/storage/mod.rs:753-791):
- Atomicity – checkpoint.meta is written last using an atomic rename (src/storage/checkpoint.rs:158-169)
- Idempotency – A partial checkpoint can be safely retried
- Snapshot isolation – Insertions continue on the tree while checkpoint runs
Checkpoint Thread Lifecycle
The checkpoint thread waits on a condvar and wakes when:
- The auto-checkpoint threshold is met (src/storage/mod.rs:388-394)
- Manual checkpoint() is called
- The tree is closing (src/storage/mod.rs:656-664)
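A minimal sketch of that wake-up protocol, assuming a shared signal struct (the names here are illustrative, not RotorTree's):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// The checkpoint thread sleeps on a condvar until a wake reason is posted.
#[derive(Default)]
struct CheckpointSignal {
    reason: Mutex<Option<&'static str>>,
    cv: Condvar,
}

fn spawn_checkpoint_thread(sig: Arc<CheckpointSignal>) -> thread::JoinHandle<&'static str> {
    thread::spawn(move || {
        let mut reason = sig.reason.lock().unwrap();
        // Loop guards against spurious wakeups.
        while reason.is_none() {
            reason = sig.cv.wait(reason).unwrap(); // sleep until notified
        }
        reason.take().unwrap() // e.g. "threshold", "manual", or "close"
    })
}
```

Posting the reason under the mutex before notifying means the wakeup is never lost, even if the thread has not started waiting yet.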
Shard Files
Each tree level’s committed chunks are stored in shard files:
- mmap limits – Some systems limit mmap size; sharding stays under those limits
- Partial mmap – Can mmap only hot shards if tiering is configured
- Parallel writes – Multiple shards can be written concurrently (future optimization)
Shard Addressing
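A hypothetical addressing scheme for fixed-size chunks packed into fixed-capacity shard files: a chunk id maps to a shard index and a byte offset. The constants below are assumed for illustration, not RotorTree's actual values:

```rust
const CHUNKS_PER_SHARD: u64 = 1024; // assumed shard capacity
const CHUNK_BYTES: u64 = 4096;      // assumed fixed chunk size

/// Maps a chunk id to (shard file index, byte offset within that shard).
fn shard_addr(chunk_id: u64) -> (u64, u64) {
    let shard = chunk_id / CHUNKS_PER_SHARD;
    let offset = (chunk_id % CHUNKS_PER_SHARD) * CHUNK_BYTES;
    (shard, offset)
}
```

With this scheme each shard file is capped at `CHUNKS_PER_SHARD * CHUNK_BYTES` bytes, which is what keeps individual mmaps under system limits.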
Recovery Process
On RotorTree::open(), recovery happens in two phases:
Phase 1: Checkpoint Recovery
If data/checkpoint.meta exists:
Verification overhead: Recomputing the root hash requires hashing every node in the tree. Only enable verify_checkpoint if you suspect disk corruption.
Phase 2: WAL Replay
After loading the checkpoint (or starting from an empty state), the WAL is replayed. Replay stops at the first frame that fails validation, which indicates one of:
- Partial write that was never fsynced (safe to ignore)
- Corruption (tail CRC check failed earlier)
- File was manually truncated
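A sketch of that tail-tolerant replay loop over a buffer of `[len: u32 LE][checksum: u32 LE][payload]` frames. A trivial additive checksum stands in for CRC32 here to keep the example short; the stop conditions mirror the three cases above:

```rust
// Stand-in checksum; the real implementation uses CRC32.
fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

// Helper to build one frame (used by callers/tests).
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut v = (payload.len() as u32).to_le_bytes().to_vec();
    v.extend_from_slice(&checksum(payload).to_le_bytes());
    v.extend_from_slice(payload);
    v
}

/// Replays frames sequentially; returns how many entries were applied.
/// Any invalid tail ends replay instead of aborting recovery.
fn replay(mut buf: &[u8]) -> usize {
    let mut applied = 0;
    loop {
        if buf.len() < 8 {
            return applied; // clean EOF or torn header at the tail
        }
        let len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
        if buf.len() < 8 + len {
            return applied; // torn payload: partial write, safe to ignore
        }
        let stored = u32::from_le_bytes(buf[4..8].try_into().unwrap());
        if checksum(&buf[8..8 + len]) != stored {
            return applied; // corruption: tail check failed
        }
        applied += 1; // applying the entry to the tree would happen here
        buf = &buf[8 + len..];
    }
}
```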
Flush Policies
WAL writes are buffered in memory and flushed based on FlushPolicy:
Flush Thread
With FlushPolicy::Interval(duration), a background thread runs (src/storage/mod.rs:728-751):
- 10ms interval ≈ 100 flushes/sec max throughput
- Increase interval for higher batch throughput, lower durability guarantee
- Use Manual for custom flush logic (e.g., flush on commit)
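The interval flusher can be sketched as follows; the counter stands in for the real fsync, and the function names are assumed:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Background thread that "flushes" every `interval` until stopped.
/// Durability therefore lags writes by at most one interval.
fn spawn_flusher(
    interval: Duration,
    stop: Arc<AtomicBool>,
    flushes: Arc<AtomicUsize>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            thread::sleep(interval);
            flushes.fetch_add(1, Ordering::Relaxed); // fsync would happen here
        }
    })
}
```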
Durability Tokens
Insertions return a DurabilityToken that you can .wait() on:
Alternatively, use insert_durable(), which flushes and waits automatically (src/storage/mod.rs:592-596).
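A plausible shape for such a token, sketched with a mutex and condvar (the type and method names are assumed, not RotorTree's actual API):

```rust
use std::sync::{Arc, Condvar, Mutex};

// Shared between the inserting caller and the flush thread.
#[derive(Clone, Default)]
struct DurabilityToken(Arc<(Mutex<bool>, Condvar)>);

impl DurabilityToken {
    /// Blocks until the entry's WAL bytes have been fsynced.
    fn wait(&self) {
        let (lock, cv) = &*self.0;
        let mut durable = lock.lock().unwrap();
        while !*durable {
            durable = cv.wait(durable).unwrap();
        }
    }

    /// Called by the flush thread after a successful fsync.
    fn mark_durable(&self) {
        let (lock, cv) = &*self.0;
        *lock.lock().unwrap() = true;
        cv.notify_all();
    }
}
```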
Crash Consistency
RotorTree guarantees:
- No data loss – All fsynced WAL entries are replayed on recovery
- No corruption – CRC32 detects any partial writes or bit flips
- Consistent state – Tree is always in a valid state after recovery (root hash matches tree contents)
Caveats:
- Entries in the WAL buffer (not fsynced) are lost on crash
- If the checkpoint.meta write partially succeeds, recovery falls back to full WAL replay
Atomic Writes
Critical files use atomic write-rename (src/storage/checkpoint.rs:355-370): checkpoint.meta, header.bin, and tails.bin are either fully written or not visible.
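The write-rename pattern itself is small enough to show in full. This is a generic sketch using only the standard library, not RotorTree's implementation (which also fsyncs the parent directory):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Writes `bytes` to `path` atomically: readers see either the old file
/// or the complete new one, never a partial write.
fn atomic_write(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // data must be durable before it becomes visible
    fs::rename(&tmp, path)?; // rename is atomic on POSIX filesystems
    Ok(())
}
```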
On some filesystems (e.g., ext4 with data=writeback), directory fsync may not provide full durability. Consider using data=ordered or data=journal for mission-critical deployments.
Monitoring Checkpoint Activity
You can block until a checkpoint completes:
Best Practices
High Throughput
- Use CheckpointPolicy::MemoryThreshold(256_MB)
- Set FlushPolicy::Interval(50ms) for batching
- Call insert_many() instead of insert() in loops
Low Latency
- Use insert_durable() for immediate fsync
- Set FlushPolicy::Interval(1ms) or Manual
- Checkpoint frequently to keep recovery fast
Memory Constrained
- Use CheckpointPolicy::MemoryThreshold(32_MB)
- Enable memory tiering (see memory-tiering)
- Monitor the uncheckpointed_memory_bytes counter
Fast Recovery
- Use CheckpointPolicy::EveryNEntries(1000)
- Keep the WAL small (frequent checkpoints)
- Disable verify_checkpoint for faster startup
Related Topics
- Memory Tiering – How checkpoints enable mmap-backed storage
- Concurrency – Lock-free snapshots during checkpoint
- Storage Architecture – Overall storage design
