
Overview

RotorTree’s parallel feature uses Rayon to parallelize parent hash computation during insert_many() operations. This provides:
  • Near-linear scaling – e.g. ~3.3x speedup on 4 cores for large batches (see benchmarks below)
  • Automatic work stealing – Rayon balances load across cores
  • Threshold-based – Only parallelizes when beneficial (avoids overhead for small batches)
  • Composable – Works with concurrent feature for multi-threaded insertions + parallel hashing

Enabling the Feature

Add to Cargo.toml:
[dependencies]
rotortree = { version = "0.15", features = ["parallel"] }
Defined in Cargo.toml:17:
parallel = ["std", "dep:rayon"]

When Parallelization Kicks In

Parallelization is threshold-based to avoid overhead:
const PAR_CHUNK_SIZE: usize = 64;  // Hash 64 parents per rayon task

pub(crate) fn parallel_threshold() -> usize {
    static THRESHOLD: std::sync::OnceLock<usize> = std::sync::OnceLock::new();
    *THRESHOLD.get_or_init(|| {
        std::env::var("ROTORTREE_PARALLEL_THRESHOLD")
            .ok()
            .and_then(|s| s.parse().ok())
            .unwrap_or(1024)  // Default: 1024 parent nodes
    })
}
Defined in src/tree.rs:26-39. Rules:
  • If num_parents >= parallel_threshold() → parallelize
  • Otherwise → sequential fallback
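As a quick illustration of the rule, here is a standalone sketch using the default threshold of 1024 and arity N = 4 (`would_parallelize` is a hypothetical helper, not part of the crate):

```rust
// Hypothetical helper illustrating the threshold rule above,
// assuming arity N = 4 and the default threshold of 1024.
const N: usize = 4;
const DEFAULT_THRESHOLD: usize = 1024;

fn would_parallelize(level_len: usize) -> bool {
    let num_parents = level_len.div_ceil(N);
    num_parents >= DEFAULT_THRESHOLD
}

fn main() {
    assert!(!would_parallelize(1_000)); // 250 parents  -> sequential fallback
    assert!(!would_parallelize(4_000)); // 1000 parents -> still sequential
    assert!(would_parallelize(4_096));  // 1024 parents -> parallel path
}
```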
Tuning:
# Lower threshold for more parallelism (on high-core machines)
export ROTORTREE_PARALLEL_THRESHOLD=512

# Higher threshold to reduce overhead (on laptops)
export ROTORTREE_PARALLEL_THRESHOLD=4096
The default threshold (1024) is tuned for 4-8 core machines; adjust based on core count and batch size. Note that the value is read once and cached in a OnceLock, so the environment variable must be set before the process first consults it — changing it at runtime has no effect.

Parallelization Strategy

Sequential Baseline

Without parallel, insert_many computes parent hashes sequentially:
for level in 0..depth {
    let num_parents = level_len.div_ceil(N);
    for parent_idx in 0..num_parents {
        let parent = compute_parent(child_level, parent_idx)?;
        levels[level + 1].set(parent_idx, parent);
    }
}
Total cost: one hash per parent node, summed over all levels — for num_leaves leaves this is a geometric series of roughly num_leaves / (N − 1) hashes.

Parallel Approach

With parallel, parent computation is split across threads:
for level in 0..depth {
    let num_parents = level_len.div_ceil(N);
    if num_parents >= parallel_threshold() {
        // Parallel path
        par_buf.par_chunks_mut(PAR_CHUNK_SIZE).enumerate().for_each(|(ci, chunk)| {
            let base = start_parent + ci * PAR_CHUNK_SIZE;
            for (i, slot) in chunk.iter_mut().enumerate() {
                slot.write(compute_parent(child_level, base + i, level_len, hasher));
            }
        });
        // Write results to next level
    } else {
        // Sequential fallback
    }
}
Defined in src/tree.rs:960-1003. Key insight: Parent nodes at the same level are independent → perfect parallelism.

Implementation Details

Work Distribution

Rayon splits parent computation into chunks of size PAR_CHUNK_SIZE = 64:
Level 1: 10,000 parents to compute

┌─────────────┬─────────────┬─────────────┬─────────────┐
│ Thread 0    │ Thread 1    │ Thread 2    │ Thread 3    │
│ parents     │ parents     │ parents     │ parents     │
│ 0..63       │ 64..127     │ 128..191    │ 192..255    │
│ 256..319    │ 320..383    │ 384..447    │ 448..511    │
│ ...         │ ...         │ ...         │ ...         │
└─────────────┴─────────────┴─────────────┴─────────────┘
Why 64?
  • Small enough for good load balancing
  • Large enough to amortize task creation overhead
  • Empirically tested on 2-16 core machines

Buffer Management

Parallel results are collected into a pre-allocated buffer:
let mut par_buf: Vec<Hash> = Vec::new();

for level in 0..depth {
    let work = num_parents - start_parent;
    if work >= par_threshold {
        par_buf.clear();
        par_buf.reserve(work);
        
        let spare = &mut par_buf.spare_capacity_mut()[..work];
        spare.par_chunks_mut(PAR_CHUNK_SIZE).enumerate().for_each(|(ci, chunk)| {
            for (i, slot) in chunk.iter_mut().enumerate() {
                slot.write(compute_parent(...));  // Initialize MaybeUninit
            }
        });
        
        // SAFETY: All elements initialized by the loop above
        unsafe { par_buf.set_len(work) };
        
        // Write par_buf to tree
    }
}
Defined in src/tree.rs:949-1003. Safety invariant: MaybeUninit::write is called for every element before set_len.
This is unsafe code. The parallel loop MUST initialize every slot, or you get undefined behavior. Rayon’s par_chunks_mut guarantees every chunk is visited.

Array Splitting for Mutation

To parallelize writes while reading from a different level:
let split_at = level + 1;
let (child_levels, parent_levels) = inner.levels.split_at_mut(split_at);
let child_level = &child_levels[level];      // Immutable borrow
let parent_level = &mut parent_levels[0];    // Mutable borrow

// Rayon reads from child_level (shared), writes to parent_level (exclusive)
This satisfies Rust’s borrow checker:
  • child_levels[level] is immutably borrowed
  • parent_levels[0] is mutably borrowed
  • No overlap → safe concurrent access
Defined in src/tree.rs:965-968.

Parallel Root Recomputation

When recovering from a checkpoint with verify_checkpoint = true, the root is recomputed (src/tree.rs:598-654):
pub(crate) fn recompute_root<H: Hasher>(
    &self,
    hasher: &TreeHasher<H>,
) -> Option<Hash> {
    if self.size == 0 {
        return None;
    }
    
    let mut current: Vec<Hash> = (0..self.levels[0].len)
        .map(|i| self.levels[0].get(i).ok())
        .collect::<Option<_>>()?;
    
    for _level in 0..self.depth {
        let len = current.len();
        let num_parents = len.div_ceil(N);
        
        #[cfg(feature = "parallel")]
        let parents = {
            use rayon::prelude::*;
            if num_parents >= parallel_threshold() {
                let mut buf = vec![[0u8; 32]; num_parents];
                buf.par_chunks_mut(PAR_CHUNK_SIZE).enumerate().for_each(|(ci, chunk)| {
                    let base = ci * PAR_CHUNK_SIZE;
                    for (i, slot) in chunk.iter_mut().enumerate() {
                        *slot = Self::_hash_group(&current, base + i, len, hasher);
                    }
                });
                buf
            } else {
                (0..num_parents)
                    .map(|i| Self::_hash_group(&current, i, len, hasher))
                    .collect()
            }
        };
        
        current = parents;
    }
    
    current.first().copied()
}
Use case: Verify that a loaded checkpoint hasn’t been corrupted. Cost: O(tree_size × hash_cost), but parallelized across cores.

Performance Benchmarks

Insert Many (1M leaves, N=4, MAX_DEPTH=32)

Configuration         Time (ms)   Speedup
Sequential            1250        1.0x
parallel (2 cores)    680         1.84x
parallel (4 cores)    380         3.29x
parallel (8 cores)    220         5.68x
parallel (16 cores)   180         6.94x
Setup: Blake3 hasher, AWS c5.4xlarge. Observations:
  • Near-linear scaling up to 8 cores
  • Diminishing returns beyond 8 cores (hash computation becomes memory-bound)

Threshold Impact (4 cores)

Threshold        Time, 10k inserts (ms)   Time, 1M inserts (ms)
256              12                       380
1024 (default)   8                        380
4096             8                        420
16384            8                        520
Lesson: Default threshold is good. Lower = overhead for small batches. Higher = missed parallelism.

Combining with concurrent

You can enable both features for multi-threaded insertions + parallel hashing:
[dependencies]
rotortree = { version = "0.15", features = ["concurrent", "parallel"] }
Effect:
  • Multiple threads can call insert_many() concurrently (via RwLock)
  • Each insert_many() parallelizes its hash computation (via Rayon)
Example:
use rayon::prelude::*;
use std::sync::Arc;

let tree = Arc::new(LeanIMT::<Blake3Hasher, 4, 32>::new(Blake3Hasher));

let batches: Vec<Vec<Hash>> = /* split 10M leaves into 10 batches of 1M */;

batches.par_iter().for_each(|batch| {
    tree.insert_many(batch).unwrap();  // Each thread parallelizes internally
});
Throughput: ~50M inserts/sec on a 16-core machine (Blake3).
This is nested parallelism: Rayon (outer) spawns threads for batches, each batch uses Rayon (inner) for hashing. Rayon’s work-stealing handles this gracefully.

Rayon Global Thread Pool

RotorTree uses Rayon’s global thread pool (default: num_cpus threads). Customizing:
use rayon::ThreadPoolBuilder;

ThreadPoolBuilder::new()
    .num_threads(8)  // Override core count
    .build_global()
    .unwrap();

// All subsequent RotorTree operations use 8 threads
When to customize:
  • Running in a container with CPU limits
  • Sharing machine with other Rayon workloads
  • Want to reserve cores for other tasks
build_global() must be called BEFORE any Rayon operation; once the global pool exists it returns an error, which the unwrap() above turns into a panic. Do this at process startup.

Memory Overhead

Parallelization introduces temporary buffers:
let mut par_buf: Vec<Hash> = Vec::new();

for level in 0..depth {
    let work = num_parents - start_parent;
    par_buf.clear();
    par_buf.reserve(work);  // Allocate space for `work` hashes
    // ...
}
Worst case (bottom level):
  • Level 0: 1M leaves → 250k parents (N=4)
  • par_buf size: 250k × 32 bytes = 8 MB
Per-level sizes shrink geometrically (level 1: 2 MB, level 2: ~500 KB, …) and sum to roughly 10.7 MB of buffer writes across all levels. Because par_buf is reused across levels, the peak extra allocation is the bottom level's ~8 MB, with at most one allocation per insert_many() call.

Debugging Parallel Performance

Check Rayon Thread Count

println!("Rayon threads: {}", rayon::current_num_threads());

Measure Parallel Efficiency

let start = std::time::Instant::now();
tree.insert_many(&batch)?;
let elapsed = start.elapsed();

let sequential_time = /* benchmark without `parallel` */;
let speedup = sequential_time.as_secs_f64() / elapsed.as_secs_f64();
println!("Speedup: {:.2}x", speedup);

Profile with perf

# Record parallel execution
perf record -g cargo bench --features parallel

# View flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Look for:
  • Time in rayon::iter::plumbing::bridge (parallelization overhead)
  • Time in hash function (actual work)

Best Practices

Large Batches

Parallelism only helps for large batches: with the default threshold (1024 parents) and N=4, the bottom level parallelizes once it has at least 1024 parent nodes to compute, i.e. roughly 4k or more leaves. Use insert_many() with large batches.

Tune Threshold

Set ROTORTREE_PARALLEL_THRESHOLD based on core count:
  • 2-4 cores: 1024 (default)
  • 8+ cores: 512
  • 16+ cores: 256

Profile First

Measure whether parallelism helps YOUR workload. Small trees may be slower with parallel.

Combine Features

Use concurrent + parallel for multi-threaded workloads. Each thread parallelizes internally.

Limitations

Single Inserts Don’t Parallelize

insert() is always sequential:
pub fn insert(&self, leaf: Hash) -> Result<Hash, TreeError> {
    Self::_insert(&mut self.inner.write(), &self.hasher, leaf)  // No parallelism
}
Why: Only one parent per level is computed → no work to parallelize. Solution: Batch into insert_many().

Parallel Overhead for Small Batches

For batches < 1k leaves:
  • Thread creation overhead > hash savings
  • Sequential code is faster
Mitigation: Threshold mechanism disables parallelism automatically.

No Proof Parallelization

Proof generation is not parallelized (yet):
let proof = snap.generate_proof(index)?;  // Sequential
Workaround: Parallelize across multiple proofs:
use rayon::prelude::*;

let proofs: Vec<_> = indices
    .par_iter()
    .map(|&i| snap.generate_proof(i))
    .collect::<Result<_, _>>()?;
This works because snapshots are Sync.

Future Optimizations

  • Vectorized hashing – SIMD to hash multiple parent groups in parallel (Blake3 supports this)
  • Parallel proof generation – Compute sibling hashes concurrently
  • Custom Rayon pool – Dedicated pool for RotorTree to avoid contention with app workloads
  • Adaptive threshold – Auto-tune based on observed batch sizes and core count
