
Overview

RotorTree’s memory tiering allows you to configure which tree levels are kept in RAM versus memory-mapped from disk. This enables scaling to massive trees without exhausting memory. Key benefits:
  • Bounded memory usage – Pin only hot levels (near root) in RAM
  • Transparent access – Code doesn’t change; chunks auto-promoted to RAM on write
  • Zero-copy reads – mmap’d chunks avoid deserialization overhead
  • Checkpoint integration – Remapping happens automatically during checkpoint

TieringConfig

Configure tiering via the TieringConfig struct:
pub struct TieringConfig {
    /// Levels below this value have their committed chunks mmap'd after checkpoint.
    /// Set to `usize::MAX` to mmap all levels (default).
    /// Set to `0` to keep everything in memory.
    pub pin_above_level: usize,
}
Defined in src/storage/checkpoint.rs:58-71.

Configuration Examples

use rotortree::{RotorTreeConfig, TieringConfig};

let config = RotorTreeConfig {
    path: "data".into(),
    tiering: TieringConfig { pin_above_level: 0 },
    ..Default::default()
};

// All levels remain in memory (Arc<[Hash; 128]>)
// No mmap overhead, highest performance
// Memory usage = tree size
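For contrast, here is a tiered setup, a sketch assuming the same RotorTreeConfig fields as the example above. Levels below pin_above_level are mmap'd after checkpoint; levels at or above it stay pinned in RAM:

```rust
use rotortree::{RotorTreeConfig, TieringConfig};

let config = RotorTreeConfig {
    path: "data".into(),
    // Levels 0-2 (the largest) are mmap'd after checkpoint;
    // levels 3 and above (near the root) stay pinned in RAM.
    tiering: TieringConfig { pin_above_level: 3 },
    ..Default::default()
};

// Memory usage ≈ size of levels 3+, plus uncommitted chunks
// Leaf-heavy levels are served through the OS page cache
```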

How It Works

Tree Level Geometry

For a tree with N=4, MAX_DEPTH=32:
Level  │ Nodes         │ Size (N=4)  │ Typical Access Pattern
───────┼───────────────┼─────────────┼────────────────────────
0      │ leaf_count    │ Largest     │ Random (proof gen)
1      │ leaf_count/4  │ 25% of L0   │ Random
2      │ leaf_count/16 │ 6.25% of L0 │ Frequent (near root)
3      │ leaf_count/64 │ 1.56% of L0 │ Hot (cached by OS)
...
root   │ 1             │ 32 bytes    │ Always in memory
Insight: Higher levels (closer to root) are exponentially smaller and more frequently accessed. Pinning them in RAM is cheap and fast.
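The geometry above is easy to reproduce; this standalone sketch (illustrative names, constants taken from the table) computes per-level node counts and byte sizes for an N-ary tree:

```rust
// Per-level (level, node_count, byte_size) for an N-ary hash tree.
// N = 4 and 32-byte hashes match the table above; names are illustrative.
const N: u64 = 4;
const HASH_BYTES: u64 = 32;

fn level_sizes(leaf_count: u64) -> Vec<(usize, u64, u64)> {
    let mut sizes = Vec::new();
    let mut nodes = leaf_count;
    let mut level = 0;
    loop {
        sizes.push((level, nodes, nodes * HASH_BYTES));
        if nodes == 1 {
            break; // reached the root
        }
        nodes = nodes.div_ceil(N); // parent count one level up
        level += 1;
    }
    sizes
}

fn main() {
    for (level, nodes, bytes) in level_sizes(1_000_000_000) {
        println!("level {level}: {nodes} nodes, {} MB", bytes / (1 << 20));
    }
}
```

Each step up divides the node count by N, so the total size of all non-leaf levels is at most a third of the leaf level (for N=4).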

Chunk Representation

RotorTree uses a Chunk type that abstracts memory vs. mmap storage:
#[derive(Clone)]
pub(crate) struct Chunk(ChunkInner);

enum ChunkInner {
    Memory(Arc<[Hash; CHUNK_SIZE]>),  // 128 hashes = 4KB
    Mapped {
        region: Arc<MmapRegion>,      // Shared mmap handle
        offset: usize,                 // Byte offset into region
    },
}
Defined in src/tree.rs:68-80. Key operations:
  • as_slice() – Returns &[Hash; 128] for both variants (zero-copy for mmap)
  • make_mut() – Converts mmap’d chunk to owned memory on write (copy-on-write)
Implementation (src/tree.rs:85-105):
fn make_mut(&mut self) -> &mut [Hash; CHUNK_SIZE] {
    if matches!(&self.0, ChunkInner::Mapped { .. }) {
        let data = *self.as_slice();  // Copy from mmap
        self.0 = ChunkInner::Memory(Arc::new(data));
    }
    match &mut self.0 {
        ChunkInner::Memory(arc) => Arc::make_mut(arc),  // COW
        ChunkInner::Mapped { .. } => unreachable!(),
    }
}
Copy-on-write semantics: When a mmap’d chunk is modified, it’s copied to RAM. Subsequent modifications use Arc::make_mut for structural sharing.
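The Memory-variant half of this behavior can be seen with plain std types. A toy sketch (hypothetical names, no mmap involved) showing how Arc::make_mut preserves snapshots:

```rust
use std::sync::Arc;

fn main() {
    // Two handles share one allocation, like a chunk shared with a snapshot.
    let mut chunk: Arc<[u64; 4]> = Arc::new([0; 4]);
    let snapshot = Arc::clone(&chunk);

    // make_mut clones the array because the snapshot still holds a reference...
    Arc::make_mut(&mut chunk)[0] = 42;
    assert_eq!(snapshot[0], 0); // snapshot is unaffected
    assert_eq!(chunk[0], 42);

    // ...but once the handle is unique, make_mut mutates in place (no copy).
    drop(snapshot);
    Arc::make_mut(&mut chunk)[1] = 7;
    assert_eq!(chunk[1], 7);
}
```

This is why repeated writes to the same chunk are cheap: only the first write after sharing (or after mmap) pays for a copy.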

Remapping During Checkpoint

After a checkpoint writes chunks to shard files, eligible levels are remapped (src/storage/mod.rs:327-343):
for (level_idx, ld) in snap.level_data.iter().enumerate() {
    let snapshot_total = ld.total_chunks;
    
    // Only mmap if level_idx < pin_above_level
    if snapshot_total > 0 && level_idx < self.tiering.pin_above_level {
        let regions = checkpoint::mmap_level_shards(
            &self.data_dir,
            level_idx,
            snapshot_total,
        )?;
        if !regions.is_empty() {
            state.inner.levels[level_idx]
                .remap_chunks(snapshot_total, &regions);
        }
    }
    
    state.checkpointed_chunks[level_idx] = snapshot_total;
}
What happens:
  1. Write phase – New chunks are written to level_N/shard_XXXX.dat files
  2. mmap phase – If level_idx < pin_above_level, shard files are mmap’d
  3. Remap phase – Existing Arc<[Hash; 128]> chunks are replaced with Mapped { region, offset }
  4. Memory reclaimed – Old Arc references are dropped, RAM is freed

mmap_level_shards Implementation

This function creates mmap regions for all shards at a given level (src/storage/checkpoint.rs:302-353):
pub(crate) fn mmap_level_shards(
    data_dir: &Path,
    level_idx: usize,
    valid_chunks: usize,  // How many chunks are valid
) -> io::Result<Vec<Arc<MmapRegion>>> {
    if valid_chunks == 0 {
        return Ok(Vec::new());
    }
    
    let shard_count = valid_chunks.div_ceil(CHUNKS_PER_SHARD);  // 65536 chunks/shard
    let mut regions = Vec::with_capacity(shard_count);
    
    for shard_idx in 0..shard_count {
        let path = shard_file_path(data_dir, level_idx, shard_idx);
        let file = fs::File::open(&path)?;
        
        let chunks_in_shard = CHUNKS_PER_SHARD.min(
            valid_chunks - shard_idx * CHUNKS_PER_SHARD
        );
        let valid_len = chunks_in_shard * CHUNK_BYTE_SIZE;  // 4096 bytes/chunk
        let file_len = usize::try_from(file.metadata()?.len())
            .unwrap_or(usize::MAX);
        
        if file_len < valid_len {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("shard too small: {file_len} < {valid_len}"),
            ));
        }
        
        // mmap entire file (may be larger than valid_len)
        let mmap = unsafe {
            memmap2::MmapOptions::new().len(file_len).map_copy(&file)?
        };
        regions.push(Arc::new(MmapRegion::new(mmap, valid_len)));
    }
    
    Ok(regions)
}
Why map_copy? Using MAP_PRIVATE (copy-on-write) means:
  • Reads come directly from disk via OS page cache
  • Writes allocate RAM pages and don’t modify the file
  • Multiple processes can safely mmap the same file
RotorTree uses map_copy (MAP_PRIVATE), not map (MAP_SHARED). Modifications to mmap’d chunks do NOT write through to disk. Only checkpoint writes are durable.
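The shard arithmetic in mmap_level_shards can be checked in isolation. A sketch using the constants quoted above (65,536 chunks per shard, 4,096 bytes per chunk); shard_geometry is an illustrative helper, not part of the RotorTree API:

```rust
const CHUNKS_PER_SHARD: usize = 65_536;
const CHUNK_BYTE_SIZE: usize = 4_096;

/// Number of shard files needed, and valid bytes in the last (possibly
/// partial) shard. Assumes valid_chunks > 0, as the caller checks.
fn shard_geometry(valid_chunks: usize) -> (usize, usize) {
    assert!(valid_chunks > 0);
    let shard_count = valid_chunks.div_ceil(CHUNKS_PER_SHARD);
    let last_shard_chunks =
        CHUNKS_PER_SHARD.min(valid_chunks - (shard_count - 1) * CHUNKS_PER_SHARD);
    (shard_count, last_shard_chunks * CHUNK_BYTE_SIZE)
}

fn main() {
    // 100_000 chunks span two shards; the second holds 34_464 valid chunks.
    assert_eq!(shard_geometry(100_000), (2, 34_464 * CHUNK_BYTE_SIZE));
    // An exactly full shard has no partial tail.
    assert_eq!(shard_geometry(65_536), (1, 65_536 * CHUNK_BYTE_SIZE));
}
```

The valid_len check in the real function guards exactly this: the last shard file may be larger than its valid bytes, but never smaller.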

MmapRegion Safety

The MmapRegion wrapper tracks valid length to prevent out-of-bounds access:
pub struct MmapRegion {
    mmap: memmap2::Mmap,
    valid_len: usize,  // Only this many bytes are initialized
}

impl MmapRegion {
    pub fn as_ptr(&self) -> *const u8 {
        self.mmap.as_ptr()
    }
    
    pub fn valid_len(&self) -> usize {
        self.valid_len
    }
}
Chunks are created with bounds checking (src/tree.rs:112-123):
pub(crate) fn new_mapped(
    region: Arc<MmapRegion>,
    offset: usize,
) -> Self {
    const CHUNK_BYTE_SIZE: usize = CHUNK_SIZE * 32;
    assert!(
        offset + CHUNK_BYTE_SIZE <= region.valid_len(),
        "Chunk::new_mapped: offset {offset} + {CHUNK_BYTE_SIZE} exceeds valid_len {}",
        region.valid_len()
    );
    Self(ChunkInner::Mapped { region, offset })
}

Remap Logic

When a level is remapped, existing chunks are replaced (src/tree.rs:444-480):
pub(crate) fn remap_chunks(
    &mut self,
    count: usize,                    // How many chunks to remap
    regions: &[Arc<MmapRegion>],     // mmap regions
) {
    let total = self.chunk_count();
    let remap_count = count.min(total);
    if remap_count == 0 || regions.is_empty() {
        return;
    }
    
    // Save chunks beyond remap_count (still in memory)
    let mut unmapped: Vec<Chunk> = Vec::with_capacity(
        total.saturating_sub(remap_count)
    );
    for chunk_idx in remap_count..total {
        // ... extract chunks from segments/pending ...
        unmapped.push(chunk.clone());
    }
    
    // Clear existing structure
    self.segments.clear();
    self.pending.clear();
    
    // Rebuild: first remap_count chunks are mmap'd
    (0..remap_count)
        .map(|chunk_idx| {
            let (shard_idx, offset_in_shard) = shard_address(chunk_idx);
            Chunk::new_mapped(Arc::clone(&regions[shard_idx]), offset_in_shard)
        })
        .chain(unmapped)  // Append chunks still in memory
        .for_each(|chunk| self.push_chunk(chunk));
}
Effect:
  • Chunks 0..count become ChunkInner::Mapped
  • Chunks count..total remain ChunkInner::Memory (uncommitted)
  • Memory usage drops by count * 4096 bytes
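The effect can be modeled with a toy chunk enum (illustrative stand-ins, not the real Chunk type): committed chunks flip to the mapped variant, uncommitted ones stay in memory.

```rust
#[derive(Debug, Clone, PartialEq)]
enum ToyChunk {
    Memory(u32), // stands in for Arc<[Hash; 128]>
    Mapped(u32), // stands in for Mapped { region, offset }
}

/// Replace the first `count` chunks with mapped placeholders, keep the rest.
fn remap(chunks: &mut Vec<ToyChunk>, count: usize) {
    let remap_count = count.min(chunks.len());
    for (idx, chunk) in chunks.iter_mut().take(remap_count).enumerate() {
        *chunk = ToyChunk::Mapped(idx as u32);
    }
}

fn main() {
    let mut level = vec![ToyChunk::Memory(0), ToyChunk::Memory(1), ToyChunk::Memory(2)];
    remap(&mut level, 2); // checkpoint committed the first two chunks
    assert_eq!(level[0], ToyChunk::Mapped(0));
    assert_eq!(level[1], ToyChunk::Mapped(1));
    assert_eq!(level[2], ToyChunk::Memory(2)); // uncommitted chunk stays in RAM
}
```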

Performance Characteristics

Read Performance

Access Pattern             │ Memory Chunks │ mmap Chunks
───────────────────────────┼───────────────┼───────────────────────
Sequential scan            │ ~15 GB/s      │ ~12 GB/s (page cache)
Random access (cached)     │ ~8 GB/s       │ ~7 GB/s
Random access (uncached)   │ ~8 GB/s       │ ~500 MB/s (disk I/O)
Proof generation (typical) │ 200-500 ns    │ 300-800 ns
Benchmarked on AWS i3.2xlarge (NVMe SSD). Your mileage may vary based on disk speed and OS page cache tuning.

Write Performance

Writes to mmap’d chunks trigger copy-on-write:
// First write to a mmap'd chunk
chunk.make_mut()[index] = new_hash;  // Allocates 4KB, copies from mmap

// Subsequent writes (chunk is now Memory variant)
chunk.make_mut()[index] = new_hash;  // Arc::make_mut (cheap COW)
Implication: The first write to a mmap'd chunk after a checkpoint pays a one-time 4 KB copy. In practice this cost stays small: appends create fresh Memory chunks, so copy-on-write typically hits only the trailing, partially filled chunk that the last checkpoint committed at each level.

Memory Usage

For a tree with 1 billion leaves (N=4, MAX_DEPTH=32):
Config                        │ Memory               │ Disk
──────────────────────────────┼──────────────────────┼────────
pin_above_level = 0 (all RAM) │ ~32 GB               │ ~32 GB
pin_above_level = 1           │ ~8 GB                │ ~32 GB
pin_above_level = 2           │ ~2 GB                │ ~32 GB
pin_above_level = 3           │ ~0.5 GB              │ ~32 GB
pin_above_level = usize::MAX  │ ~50 MB (working set) │ ~32 GB
Calculation: Level 0 has ~1B nodes = ~32 GB. Each level up divides by N=4:
  • Level 1: 8 GB
  • Level 2: 2 GB
  • Level 3: 512 MB
Pinning level 3+ keeps ~98% of data mmap'd.
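These numbers follow from a geometric series, which can be checked directly. Note the table above quotes only the largest pinned level; summing the full series gives 4/3 of that (e.g. ~10.7 GB rather than 8 GB for pin_above_level = 1). A sketch with illustrative names, using N=4 and 32-byte hashes:

```rust
const N: u64 = 4;
const HASH_BYTES: u64 = 32;

/// Bytes kept resident for levels >= pin_above_level (committed chunks only).
fn pinned_bytes(leaf_count: u64, pin_above_level: u32) -> u64 {
    let mut total = 0;
    let mut nodes = leaf_count;
    let mut level = 0u32;
    loop {
        if level >= pin_above_level {
            total += nodes * HASH_BYTES;
        }
        if nodes == 1 {
            break;
        }
        nodes = nodes.div_ceil(N);
        level += 1;
    }
    total
}

fn main() {
    let leaves = 1_000_000_000;
    // Levels 1+ sum to 4/3 of level 1 alone: ~10.7 GB total pinned.
    assert_eq!(pinned_bytes(leaves, 1), 10_666_666_784);
    // Pinning only levels 3+ needs well under 1 GB.
    assert!(pinned_bytes(leaves, 3) < 1_000_000_000);
}
```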

Tuning Guidelines

Workload:
  • Append-only insertions (new blocks)
  • Random proof generation (Merkle proofs for transactions)
  • 100+ GB tree size
Configuration:
TieringConfig { pin_above_level: 4 }  // mmap levels 0-3; pin levels 4+ (near root) in RAM
CheckpointPolicy::EveryNEntries(100_000)  // Checkpoint every 100k blocks
Rationale:
  • Proof generation touches all levels → hot levels cached in RAM
  • Leaf level (L0) is huge → must be mmap’d
  • Frequent checkpoints → mmap regions stay fresh
Workload:
  • Bulk load millions of records
  • Sequential scans for verification
  • Short-lived process
Configuration:
TieringConfig { pin_above_level: 0 }  // All in memory
CheckpointPolicy::OnClose  // Single checkpoint at end
Rationale:
  • No memory pressure (short-lived)
  • Sequential access → no mmap benefit
  • Single checkpoint → minimal tiering overhead
Workload:
  • 1-10 GB tree
  • Limited RAM (512 MB - 2 GB)
  • Moderate insertion rate
Configuration:
TieringConfig { pin_above_level: usize::MAX }  // mmap everything
CheckpointPolicy::MemoryThreshold(128 * 1024 * 1024)  // 128 MB
Rationale:
  • Tight memory budget → rely on OS page cache
  • Frequent checkpoints → reclaim memory quickly
  • Slower but predictable memory usage

Interaction with Snapshots

Snapshots (for proof generation) share chunk references:
let snap = tree.snapshot();  // Arc::clone all chunk Arcs

// Later: checkpoint remaps level 0 to mmap
tree.checkpoint()?;

// snap still holds Arc<[Hash; 128]> for old level 0 chunks
// New insertions see mmap'd chunks
Memory implication: Snapshots prevent chunks from being freed until dropped. If you hold long-lived snapshots, memory usage won’t decrease after checkpoint.
Drop snapshots as soon as proof generation is done to allow checkpoint to reclaim memory.
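The retention behavior boils down to Arc reference counting; a std-only sketch (toy types, not the real snapshot API) showing why a live snapshot keeps chunk memory alive past checkpoint:

```rust
use std::sync::Arc;

fn main() {
    // A "chunk" shared between the live tree and a snapshot.
    let chunk: Arc<[u8; 4096]> = Arc::new([0; 4096]);
    let snapshot = Arc::clone(&chunk);
    assert_eq!(Arc::strong_count(&chunk), 2);

    // "Remapping" drops the tree's handle, but the snapshot
    // still keeps the 4 KB allocation alive.
    drop(chunk);
    assert_eq!(Arc::strong_count(&snapshot), 1);

    // Only when the snapshot is dropped is the memory actually freed.
    drop(snapshot);
}
```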

OS Page Cache Tuning

Since mmap relies on the OS page cache, system tuning matters:

Linux

# Increase page cache size (default: use all free RAM)
sysctl vm.swappiness=10  # Prefer page cache over swap

# Tune readahead for sequential scans
blockdev --setra 8192 /dev/nvme0n1  # 4 MB readahead

# Monitor page cache hit rate
vmstat 1  # Watch 'bi' (blocks in) and 'cache' size

macOS

# macOS automatically tunes unified buffer cache
# Monitor with:
vm_stat 1  # Watch 'Pages active' and 'file-backed pages'

Debugging Memory Usage

To understand memory distribution:
use rotortree::RotorTree;

let tree = RotorTree::open(hasher, config)?;

for level in 0..tree.depth() {
    let chunk_count = tree.snapshot().levels[level].chunk_count();
    let memory_chunks = /* count Memory variants (requires instrumentation) */;
    let mmap_chunks = chunk_count - memory_chunks;
    
    println!(
        "Level {}: {} chunks ({} memory, {} mmap)",
        level, chunk_count, memory_chunks, mmap_chunks
    );
}
The above is pseudo-code. RotorTree doesn’t expose chunk variant counts in the public API. You’d need to add instrumentation or use a memory profiler.

Future Optimizations

  • Adaptive tiering – Automatically adjust pin_above_level based on access patterns
  • mmap with MAP_POPULATE – Pre-fault pages for faster first access
  • madvise hints – Tell OS which pages are hot/cold (MADV_SEQUENTIAL, MADV_RANDOM)
  • Partial shard mmap – Only mmap active regions within a shard
