
Performance Tuning

RotorTree exposes tuning knobs at both compile time and runtime. This guide covers optimization techniques for different workloads.

Performance Overview

From benchmarks on M4 Pro (14 cores, 48 GB RAM):

Single-Threaded

~1-2M leaves/sec. Best for: Low variance, predictable latency

Parallel (8 cores)

~10-20M leaves/sec. Best for: Balanced throughput/latency

Parallel (14 cores)

~190M leaves/sec peak. Best for: Maximum throughput
Single-threaded mode has the lowest run-to-run variance. Use it when latency predictability matters more than raw throughput.

Compile-Time Constants

These require modifying source code and recompiling. Located in src/tree.rs:

CHUNK_SIZE (Default: 128)

pub(crate) const CHUNK_SIZE: usize = 128;
What it does:
  • Defines hashes per chunk for structural sharing
  • Each chunk = CHUNK_SIZE × 32 bytes = 4 KB (default)
Tradeoffs:
  • Larger (256): Fewer Arc allocations, lower snapshot cost, coarser copy-on-write granularity
  • Smaller (64): Finer-grained sharing, higher snapshot cost, more Arc overhead
When to change:
  • Increase to 256 if you take many snapshots and want lower memory overhead
  • Decrease to 64 if you rarely snapshot and want finer CoW granularity

CHUNKS_PER_SEGMENT (Default: 256)

const CHUNKS_PER_SEGMENT: usize = 256;
What it does:
  • Controls how many chunks are frozen into a single Arc slab
  • Each segment = CHUNKS_PER_SEGMENT × CHUNK_SIZE × 32 bytes = 1 MB (default)
Tradeoffs:
  • Larger (512): Fewer allocations, longer freezing pauses
  • Smaller (128): More frequent allocations, smoother latency
When to change:
  • Increase for batch-heavy workloads
  • Decrease for interactive workloads with latency constraints
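The chunk and segment sizes above compose: a segment is just `CHUNKS_PER_SEGMENT` chunks frozen into one Arc slab. A quick sketch of that memory math, assuming the default constants and 32-byte hashes:

```rust
// Illustrative memory math for the default compile-time constants
// (values mirror src/tree.rs defaults; 32-byte hashes assumed).
const HASH_SIZE: usize = 32;
const CHUNK_SIZE: usize = 128;
const CHUNKS_PER_SEGMENT: usize = 256;

fn main() {
    // One chunk of hashes: 128 * 32 bytes = 4 KB
    let chunk_bytes = CHUNK_SIZE * HASH_SIZE;
    // One frozen segment: 256 chunks * 4 KB = 1 MB
    let segment_bytes = CHUNKS_PER_SEGMENT * chunk_bytes;

    assert_eq!(chunk_bytes, 4 * 1024);
    assert_eq!(segment_bytes, 1024 * 1024);
    println!(
        "chunk = {} KB, segment = {} MB",
        chunk_bytes / 1024,
        segment_bytes / (1024 * 1024)
    );
}
```

Doubling `CHUNK_SIZE` to 256 doubles both numbers, which is why it halves Arc allocations but coarsens copy-on-write granularity.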

PAR_CHUNK_SIZE (Default: 64)

#[cfg(feature = "parallel")]
const PAR_CHUNK_SIZE: usize = 64;
What it does:
  • Defines parents per rayon work unit
  • Smaller = more parallelism, more scheduling overhead
Tradeoffs:
  • Larger (128): Less overhead, coarser parallelism
  • Smaller (32): Finer parallelism, better load balancing, more overhead
When to change:
  • Increase for machines with fewer cores (4-8)
  • Decrease for machines with many cores (16+)
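To see why smaller work units help many-core machines, count the rayon work units a batch produces (a back-of-envelope sketch; `work_units` is an illustrative helper, not a library function):

```rust
// Hypothetical helper: number of rayon work units for a given parent count,
// using the default PAR_CHUNK_SIZE of 64 parents per work unit.
const PAR_CHUNK_SIZE: usize = 64;

fn work_units(parent_count: usize) -> usize {
    parent_count.div_ceil(PAR_CHUNK_SIZE)
}

fn main() {
    // With N = 4, a 1M-leaf batch has 250_000 first-level parents.
    let units = work_units(250_000);
    assert_eq!(units, 3907);
    println!("{units} work units"); // ample load balancing across 14 cores
}
```

With only a handful of cores, thousands of work units are pure scheduling overhead, hence the advice to increase `PAR_CHUNK_SIZE` on 4-8 core machines.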

MAX_FRAME_PAYLOAD (Default: 128 MB)

const MAX_FRAME_PAYLOAD: usize = 128 * 1024 * 1024;
What it does:
  • Maximum WAL/checkpoint frame payload size
  • Limits insert_many batch size
When to change:
  • Increase if you insert more than 128 MB of leaves in a single batch
  • Formula: batch_size × 32 bytes ≤ MAX_FRAME_PAYLOAD
  • Default allows ~4M leaves per batch
Increasing MAX_FRAME_PAYLOAD increases memory usage for WAL buffering. Ensure you have sufficient RAM.
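The batch-size formula can be checked directly, assuming 32-byte leaves:

```rust
// The default frame payload cap and the 32-byte leaf size it divides into.
const MAX_FRAME_PAYLOAD: usize = 128 * 1024 * 1024;
const LEAF_SIZE: usize = 32;

fn main() {
    // batch_size * 32 bytes <= MAX_FRAME_PAYLOAD
    let max_leaves = MAX_FRAME_PAYLOAD / LEAF_SIZE;
    assert_eq!(max_leaves, 4_194_304); // the "~4M leaves per batch" figure
    println!("max leaves per batch: {max_leaves}");
}
```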

Runtime Configuration

ROTORTREE_PARALLEL_THRESHOLD (Default: 1024)

Controls when parallelism kicks in:
export ROTORTREE_PARALLEL_THRESHOLD=2048
cargo run --release --features parallel
What it does:
  • Minimum parent count before rayon parallelism is used
  • Below threshold = single-threaded
  • Above threshold = parallel
Tuning guide:
1. Measure your batch sizes

let leaves = vec![...];
let parent_count = leaves.len().div_ceil(N); // N = branching factor
println!("Parent count: {}", parent_count);
2. Set threshold based on workload

  • Small batches (<1000 leaves): ROTORTREE_PARALLEL_THRESHOLD=4096
  • Medium batches (1K-100K leaves): ROTORTREE_PARALLEL_THRESHOLD=1024 (default)
  • Large batches (>100K leaves): ROTORTREE_PARALLEL_THRESHOLD=512
3. Benchmark and iterate

cargo bench --features parallel -- tree_bench_parallel
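The dispatch rule described above can be sketched as a predicate (a simplification: whether the library compares with `>=` or `>` at exactly the threshold is an assumption, and `use_parallel` is an illustrative name, not an API):

```rust
// Sketch of the threshold rule: go parallel only when the parent count
// for a batch clears ROTORTREE_PARALLEL_THRESHOLD.
fn use_parallel(leaf_count: usize, n: usize, threshold: usize) -> bool {
    leaf_count.div_ceil(n) >= threshold
}

fn main() {
    let threshold = 1024; // default ROTORTREE_PARALLEL_THRESHOLD

    // 500 leaves with N = 4 -> 125 parents -> single-threaded
    assert!(!use_parallel(500, 4, threshold));

    // 100K leaves with N = 4 -> 25_000 parents -> parallel
    assert!(use_parallel(100_000, 4, threshold));
}
```

This is why small-batch workloads benefit from raising the threshold: batches that would never amortize rayon's scheduling cost stay on the single-threaded path.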

Rayon Thread Pool

Control parallelism explicitly:
rayon::ThreadPoolBuilder::new()
    .num_threads(8)  // Limit to 8 threads
    .stack_size(2 * 1024 * 1024)  // 2 MB stack per thread
    .build_global()
    .unwrap();

// All insert_many calls now use max 8 threads
let tree = LeanIMT::<Blake3Hasher, 4, 20>::new(Blake3Hasher);
tree.insert_many(&large_batch).unwrap();
Set the thread pool before creating any trees. It’s a global configuration.

Tree Parameters

Branching Factor (N)

Choose N based on your depth vs leaf-count tradeoff:
LeanIMT::<Blake3Hasher, 2, 20>::new(Blake3Hasher)
Characteristics:
  • Deepest trees
  • Smallest proofs (1 sibling per level)
  • Slowest insertion (most levels to update)
  • Best for: ZK circuits with size constraints
Capacity comparison:
N  | MAX_DEPTH | Max Leaves | Typical Proof Size
2  | 20        | ~1M        | ~640 bytes
4  | 16        | ~4.3B      | ~2 KB
8  | 10        | ~1B        | ~2.5 KB
16 | 8         | ~4.3B      | ~4 KB
Recommendation: Use N=4 for most workloads. It provides the best balance between insertion speed, proof size, and tree depth.
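The capacity column is just N^MAX_DEPTH, and proof size grows as roughly (N-1) siblings of 32 bytes per level (an estimate of the table's figures, not the exact wire format):

```rust
// Capacity and approximate proof size for a branching factor n and depth d.
fn capacity(n: u64, depth: u32) -> u64 {
    n.pow(depth)
}

fn proof_bytes(n: u64, depth: u64) -> u64 {
    depth * (n - 1) * 32 // (n-1) 32-byte siblings per level
}

fn main() {
    assert_eq!(capacity(2, 20), 1_048_576);       // ~1M leaves
    assert_eq!(proof_bytes(2, 20), 640);          // ~640 bytes, as in the table
    assert_eq!(capacity(16, 8), 4_294_967_296);   // ~4.3B leaves
    assert_eq!(capacity(16, 8), capacity(4, 16)); // same capacity, fewer levels
}
```

Note the tradeoff the last assertion makes concrete: N=16 reaches the same capacity as N=4 in half the levels, at the cost of 15 siblings per level in every proof.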

MAX_DEPTH

Set based on expected leaf count:
// Example: sizing MAX_DEPTH for 1 billion leaves with N=4
let expected_leaves: u64 = 1_000_000_000;
let required_depth = (expected_leaves as f64).log(4.0).ceil() as usize; // 15

// Add buffer for growth
let safe_depth = required_depth + 2; // 17
Setting MAX_DEPTH too high wastes memory (allocates array of that size). Setting it too low causes insertion failures. Add 10-20% headroom.

Storage Configuration

Flush Policy Tuning

FlushPolicy::Interval(Duration::from_millis(100))
Effect: fsyncs are batched every 100 ms. Less frequent fsyncs mean higher throughput but a larger data-loss window (up to 100 ms of acknowledged writes on crash).

Checkpoint Policy Tuning

CheckpointPolicy::OnClose
Effect: Never checkpoint during operation, only on close
Pros: Maximum write throughput, no checkpoint pauses
Cons: Longer recovery time (must replay the full WAL)
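One way these policies combine in practice, sketched for a write-heavy service (the field set mirrors the bulk-loading example later in this guide; the specific values are illustrative, not recommendations):

```rust
use rotortree::{
    Blake3Hasher, RotorTree, RotorTreeConfig,
    FlushPolicy, CheckpointPolicy, TieringConfig,
};
use std::{path::PathBuf, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = RotorTreeConfig {
        path: PathBuf::from(".db"),
        // Batch fsyncs: up to 100 ms of acknowledged writes at risk on crash.
        flush_policy: FlushPolicy::Interval(Duration::from_millis(100)),
        // Defer checkpointing until close for maximum write throughput.
        checkpoint_policy: CheckpointPolicy::OnClose,
        tiering: TieringConfig::default(),
        verify_checkpoint: true,
    };

    let tree = RotorTree::<Blake3Hasher, 4, 20>::open(Blake3Hasher, config)?;
    // ... inserts ...
    tree.close()?; // checkpoint happens here under OnClose
    Ok(())
}
```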

Tiering Configuration

Balance memory usage vs proof generation speed:
TieringConfig { pin_above_level: 0 }
Effect: Keep all levels in memory
Memory: ~1 GB per 100M leaves (with N=4)
Proof gen: ~10-50 μs (constant)
With mmap, the OS page cache acts as a transparent cache. First proof access is slow (disk I/O), subsequent accesses are fast (page cache hit).
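The "~1 GB per 100M leaves" figure follows from the geometric series of internal levels (a back-of-envelope estimate; `pinned_bytes` is an illustrative helper, not an API):

```rust
// With branching factor n, internal levels hold leaves/n + leaves/n^2 + ...
// ≈ leaves / (n - 1) nodes, each a 32-byte hash.
fn pinned_bytes(leaves: u64, n: u64) -> u64 {
    (leaves / (n - 1)) * 32
}

fn main() {
    let bytes = pinned_bytes(100_000_000, 4);
    // ~33.3M internal nodes * 32 bytes ≈ 1.07 GB
    println!("~{:.2} GB of pinned internal nodes", bytes as f64 / 1e9);
}
```

Leaves themselves (100M × 32 B = 3.2 GB) live in the storage tier, which is why pinning only the levels above them costs roughly a gigabyte.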

Benchmarking Your Workload

RotorTree includes comprehensive benchmarks:
# List all benchmarks
cargo bench -- --list

# Run single-threaded benchmarks
cargo bench --bench tree_bench

# Run concurrent benchmarks
cargo bench --features concurrent --bench tree_bench_concurrent

# Run parallel benchmarks
cargo bench --features parallel --bench tree_bench_parallel

# Run storage benchmarks
cargo bench --features storage --bench tree_bench_storage

# Run all benchmarks (~380 total)
cargo bench --features concurrent,parallel,storage

Custom Benchmark Example

use divan::{Bencher, black_box};
use rotortree::{LeanIMT, Blake3Hasher};

#[divan::bench(args = [100, 1_000, 10_000, 100_000])]
fn insert_many_bench(bencher: Bencher, batch_size: usize) {
    let mut tree = LeanIMT::<Blake3Hasher, 4, 20>::new(Blake3Hasher);
    
    let leaves: Vec<[u8; 32]> = (0..batch_size)
        .map(|i| {
            let mut h = [0u8; 32];
            h[..8].copy_from_slice(&i.to_le_bytes());
            h
        })
        .collect();
    
    bencher.bench_local(|| {
        tree.insert_many(black_box(&leaves)).unwrap();
    });
}

fn main() {
    divan::main();
}
Run with:
cargo bench --features parallel

Optimization Checklist

1. Choose the right mode

  • In-memory only? → LeanIMT (no storage feature)
  • Need persistence? → RotorTree (storage feature)
  • Multiple threads? → Add concurrent feature
  • Large batches? → Add parallel feature
2. Select branching factor

  • Start with N=4 for most workloads
  • Use N=2 only if proof size is critical (ZK circuits)
  • Use N=8 or N=16 for maximum throughput
3. Tune parallel threshold

export ROTORTREE_PARALLEL_THRESHOLD=1024  # Default
# Increase for smaller batches, decrease for larger
4. Configure storage policies

  • Write-heavy? → CheckpointPolicy::OnClose
  • Memory-constrained? → CheckpointPolicy::MemoryThreshold(...)
  • Fast recovery? → CheckpointPolicy::EveryNEntries(...)
5. Optimize tiering

  • Fast proofs? → TieringConfig { pin_above_level: 0 }
  • Low memory? → TieringConfig::default()
6. Benchmark and measure

cargo bench --features parallel,storage

Real-World Example: Bulk Loading

From the repository’s bulk_load.rs example:
use rotortree::{
    Blake3Hasher, RotorTree, RotorTreeConfig,
    FlushPolicy, CheckpointPolicy, TieringConfig,
};
use std::{env, path::PathBuf, time::{Duration, Instant}};

const N: usize = 4;
const MAX_DEPTH: usize = 14;

fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    let total_leaves: u64 = env_or("TOTAL_LEAVES", 100_000_000);
    let block_size: u64 = env_or("BLOCK_SIZE", 1_000_000);
    
    // Optimized for bulk loading
    let config = RotorTreeConfig {
        path: PathBuf::from(".db"),
        flush_policy: FlushPolicy::Interval(Duration::from_millis(10)),
        checkpoint_policy: CheckpointPolicy::MemoryThreshold(256 * 1024 * 1024),
        tiering: TieringConfig::default(),
        verify_checkpoint: false,  // Skip verification for speed
    };
    
    let tree = RotorTree::<Blake3Hasher, N, MAX_DEPTH>::open(
        Blake3Hasher,
        config
    ).unwrap();
    
    let mut inserted = 0u64;
    
    while inserted < total_leaves {
        let this_block = block_size.min(total_leaves - inserted);
        let leaves: Vec<[u8; 32]> = (inserted..inserted + this_block)
            .map(|i| {
                let mut h = [0u8; 32];
                h[..8].copy_from_slice(&i.to_le_bytes());
                h
            })
            .collect();
        
        let block_start = Instant::now();
        let (_root, token) = tree.insert_many(&leaves).unwrap();
        let insert_time = block_start.elapsed();
        
        token.wait();
        let durable_time = block_start.elapsed();
        
        inserted += this_block;
        
        let block_ins_per_sec = this_block as f64 / insert_time.as_secs_f64();
        let durable_ins_per_sec = this_block as f64 / durable_time.as_secs_f64();
        
        println!(
            "{} leaves: {:.0} inserts/sec, {:.0} durable/sec",
            inserted, block_ins_per_sec, durable_ins_per_sec
        );
    }
    
    tree.close().unwrap();
}
Run with:
export TOTAL_LEAVES=100000000
export BLOCK_SIZE=1000000
export ROTORTREE_PARALLEL_THRESHOLD=512

cargo run --release --example bulk_load --features storage,parallel,blake3

Performance Anti-Patterns

Don’t:
  • Insert one leaf at a time in a loop (use insert_many)
  • Call flush() after every insert (use FlushPolicy::Interval)
  • Checkpoint too frequently (use MemoryThreshold or OnClose)
  • Use concurrent feature if you have single-threaded access
  • Set ROTORTREE_PARALLEL_THRESHOLD too low (increases overhead)
  • Verify checkpoint on every recovery in production

Next Steps

Feature Flags

Explore all available feature flags

In-Memory Trees

Learn about the fastest LeanIMT mode

Persistent Storage

Configure WAL and checkpointing

Proof Generation

Optimize proof generation performance
