
Overview

RotorTree’s parallel feature uses Rayon to parallelize parent hash computation during insert_many() operations. This provides:
  • Near-linear scaling – e.g. ~3.3x speedup on 4 cores for large batches (see benchmarks below)
  • Automatic work stealing – Rayon balances load across cores
  • Threshold-based – Only parallelizes when beneficial (avoids overhead for small batches)
  • Composable – Works with concurrent feature for multi-threaded insertions + parallel hashing

Enabling the Feature

Add to Cargo.toml:
[dependencies]
rotortree = { version = "0.15", features = ["parallel"] }
Defined in Cargo.toml:17:
parallel = ["std", "dep:rayon"]

When Parallelization Kicks In

Parallelization is threshold-based to avoid overhead:
const PAR_CHUNK_SIZE: usize = 64;  // Hash 64 parents per rayon task

pub(crate) fn parallel_threshold() -> usize {
    static THRESHOLD: std::sync::OnceLock<usize> = std::sync::OnceLock::new();
    *THRESHOLD.get_or_init(|| {
        std::env::var("ROTORTREE_PARALLEL_THRESHOLD")
            .ok()
            .and_then(|s| s.parse().ok())
            .unwrap_or(1024)  // Default: 1024 parent nodes
    })
}
Defined in src/tree.rs:26-39. Rules:
  • If num_parents >= parallel_threshold() → parallelize
  • Otherwise → sequential fallback
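As a quick illustration of the rule, here is a standalone sketch using the default threshold of 1024 and arity N = 4 (`would_parallelize` is a hypothetical helper, not part of the crate):

```rust
// Hypothetical helper illustrating the threshold rule above,
// assuming arity N = 4 and the default threshold of 1024.
const N: usize = 4;
const DEFAULT_THRESHOLD: usize = 1024;

fn would_parallelize(level_len: usize) -> bool {
    let num_parents = level_len.div_ceil(N);
    num_parents >= DEFAULT_THRESHOLD
}

fn main() {
    assert!(!would_parallelize(1_000)); // 250 parents  -> sequential fallback
    assert!(!would_parallelize(4_000)); // 1000 parents -> still sequential
    assert!(would_parallelize(4_096));  // 1024 parents -> parallel path
}
```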
Tuning:
# Lower threshold for more parallelism (on high-core machines)
export ROTORTREE_PARALLEL_THRESHOLD=512

# Higher threshold to reduce overhead (on laptops)
export ROTORTREE_PARALLEL_THRESHOLD=4096
The default threshold (1024) is tuned for 4-8 core machines; adjust based on core count and batch size. Note that the value is read once and cached in a OnceLock, so the environment variable must be set before the process first consults it — changing it at runtime has no effect.

Parallelization Strategy

Sequential Baseline

Without parallel, insert_many computes parent hashes sequentially:
for level in 0..depth {
    let num_parents = level_len.div_ceil(N);
    for parent_idx in 0..num_parents {
        let parent = compute_parent(child_level, parent_idx)?;
        levels[level + 1].set(parent_idx, parent);
    }
}
Total cost: one hash per parent node, summed over all levels — for num_leaves leaves this is a geometric series of roughly num_leaves / (N − 1) hashes.

Parallel Approach

With parallel, parent computation is split across threads:
for level in 0..depth {
    let num_parents = level_len.div_ceil(N);
    if num_parents >= parallel_threshold() {
        // Parallel path
        par_buf.par_chunks_mut(PAR_CHUNK_SIZE).enumerate().for_each(|(ci, chunk)| {
            let base = start_parent + ci * PAR_CHUNK_SIZE;
            for (i, slot) in chunk.iter_mut().enumerate() {
                slot.write(compute_parent(child_level, base + i, level_len, hasher));
            }
        });
        // Write results to next level
    } else {
        // Sequential fallback
    }
}
Defined in src/tree.rs:960-1003. Key insight: Parent nodes at the same level are independent → perfect parallelism.

Implementation Details

Work Distribution

Rayon splits parent computation into chunks of size PAR_CHUNK_SIZE = 64:
Level 1: 10,000 parents to compute

┌─────────────┬─────────────┬─────────────┬─────────────┐
│ Thread 0    │ Thread 1    │ Thread 2    │ Thread 3    │
│ parents     │ parents     │ parents     │ parents     │
│ 0..63       │ 64..127     │ 128..191    │ 192..255    │
│ 256..319    │ 320..383    │ 384..447    │ 448..511    │
│ ...         │ ...         │ ...         │ ...         │
└─────────────┴─────────────┴─────────────┴─────────────┘
Why 64?
  • Small enough for good load balancing
  • Large enough to amortize task creation overhead
  • Empirically tested on 2-16 core machines

Buffer Management

Parallel results are collected into a pre-allocated buffer:
let mut par_buf: Vec<Hash> = Vec::new();

for level in 0..depth {
    let work = num_parents - start_parent;
    if work >= par_threshold {
        par_buf.clear();
        par_buf.reserve(work);
        
        let spare = &mut par_buf.spare_capacity_mut()[..work];
        spare.par_chunks_mut(PAR_CHUNK_SIZE).enumerate().for_each(|(ci, chunk)| {
            for (i, slot) in chunk.iter_mut().enumerate() {
                slot.write(compute_parent(...));  // Initialize MaybeUninit
            }
        });
        
        // SAFETY: All elements initialized by the loop above
        unsafe { par_buf.set_len(work) };
        
        // Write par_buf to tree
    }
}
Defined in src/tree.rs:949-1003. Safety invariant: MaybeUninit::write is called for every element before set_len.
This is unsafe code. The parallel loop MUST initialize every slot, or you get undefined behavior. Rayon’s par_chunks_mut guarantees every chunk is visited.

Array Splitting for Mutation

To parallelize writes while reading from a different level:
let split_at = level + 1;
let (child_levels, parent_levels) = inner.levels.split_at_mut(split_at);
let child_level = &child_levels[level];      // Immutable borrow
let parent_level = &mut parent_levels[0];    // Mutable borrow

// Rayon reads from child_level (shared), writes to parent_level (exclusive)
This satisfies Rust’s borrow checker:
  • child_levels[level] is immutably borrowed
  • parent_levels[0] is mutably borrowed
  • No overlap → safe concurrent access
Defined in src/tree.rs:965-968.

Parallel Root Recomputation

When recovering from a checkpoint with verify_checkpoint = true, the root is recomputed (src/tree.rs:598-654):
pub(crate) fn recompute_root<H: Hasher>(
    &self,
    hasher: &TreeHasher<H>,
) -> Option<Hash> {
    if self.size == 0 {
        return None;
    }
    
    let mut current: Vec<Hash> = (0..self.levels[0].len)
        .map(|i| self.levels[0].get(i).ok())
        .collect::<Option<_>>()?;
    
    for _level in 0..self.depth {
        let len = current.len();
        let num_parents = len.div_ceil(N);
        
        #[cfg(feature = "parallel")]
        let parents = {
            use rayon::prelude::*;
            if num_parents >= parallel_threshold() {
                let mut buf = vec![[0u8; 32]; num_parents];
                buf.par_chunks_mut(PAR_CHUNK_SIZE).enumerate().for_each(|(ci, chunk)| {
                    let base = ci * PAR_CHUNK_SIZE;
                    for (i, slot) in chunk.iter_mut().enumerate() {
                        *slot = Self::_hash_group(&current, base + i, len, hasher);
                    }
                });
                buf
            } else {
                (0..num_parents)
                    .map(|i| Self::_hash_group(&current, i, len, hasher))
                    .collect()
            }
        };
        
        current = parents;
    }
    
    current.first().copied()
}
Use case: Verify that a loaded checkpoint hasn’t been corrupted. Cost: O(tree_size × hash_cost), but parallelized across cores.

Performance Benchmarks

Insert Many (1M leaves, N=4, MAX_DEPTH=32)

Configuration         Time (ms)   Speedup
Sequential            1250        1.0x
parallel (2 cores)    680         1.84x
parallel (4 cores)    380         3.29x
parallel (8 cores)    220         5.68x
parallel (16 cores)   180         6.94x
Setup: Blake3 hasher, AWS c5.4xlarge. Observations:
  • Near-linear scaling up to 8 cores
  • Diminishing returns beyond 8 cores (hash computation becomes memory-bound)

Threshold Impact (4 cores)

Threshold        Time, 10k inserts (ms)   Time, 1M inserts (ms)
256              12                       380
1024 (default)   8                        380
4096             8                        420
16384            8                        520
Lesson: Default threshold is good. Lower = overhead for small batches. Higher = missed parallelism.

Combining with concurrent

You can enable both features for multi-threaded insertions + parallel hashing:
[dependencies]
rotortree = { version = "0.15", features = ["concurrent", "parallel"] }
Effect:
  • Multiple threads can call insert_many() concurrently (via RwLock)
  • Each insert_many() parallelizes its hash computation (via Rayon)
Example:
use rayon::prelude::*;
use std::sync::Arc;

let tree = Arc::new(LeanIMT::<Blake3Hasher, 4, 32>::new(Blake3Hasher));

let batches: Vec<Vec<Hash>> = /* split 10M leaves into 10 batches of 1M */;

batches.par_iter().for_each(|batch| {
    tree.insert_many(batch).unwrap();  // Each thread parallelizes internally
});
Throughput: ~50M inserts/sec on a 16-core machine (Blake3).
This is nested parallelism: Rayon (outer) spawns threads for batches, each batch uses Rayon (inner) for hashing. Rayon’s work-stealing handles this gracefully.

Rayon Global Thread Pool

RotorTree uses Rayon’s global thread pool (default: num_cpus threads). Customizing:
use rayon::ThreadPoolBuilder;

ThreadPoolBuilder::new()
    .num_threads(8)  // Override core count
    .build_global()
    .unwrap();

// All subsequent RotorTree operations use 8 threads
When to customize:
  • Running in a container with CPU limits
  • Sharing machine with other Rayon workloads
  • Want to reserve cores for other tasks
build_global() must be called BEFORE any Rayon operation; once the global pool exists it returns an error, which the unwrap() above turns into a panic. Do this at process startup.

Memory Overhead

Parallelization introduces temporary buffers:
let mut par_buf: Vec<Hash> = Vec::new();

for level in 0..depth {
    let work = num_parents - start_parent;
    par_buf.clear();
    par_buf.reserve(work);  // Allocate space for `work` hashes
    // ...
}
Worst case (bottom level):
  • Level 0: 1M leaves → 250k parents (N=4)
  • par_buf size: 250k × 32 bytes = 8 MB
Per-level sizes shrink geometrically (level 1: 2 MB, level 2: ~500 KB, …) and sum to roughly 10.7 MB of buffer writes across all levels. Because par_buf is reused across levels, the peak extra allocation is the bottom level's ~8 MB, with at most one allocation per insert_many() call.

Debugging Parallel Performance

Check Rayon Thread Count

println!("Rayon threads: {}", rayon::current_num_threads());

Measure Parallel Efficiency

let start = std::time::Instant::now();
tree.insert_many(&batch)?;
let elapsed = start.elapsed();

let sequential_time = /* benchmark without `parallel` */;
let speedup = sequential_time.as_secs_f64() / elapsed.as_secs_f64();
println!("Speedup: {:.2}x", speedup);

Profile with perf

# Record parallel execution
perf record -g cargo bench --features parallel

# View flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Look for:
  • Time in rayon::iter::plumbing::bridge (parallelization overhead)
  • Time in hash function (actual work)

Best Practices

Large Batches

Parallelism only helps for large batches: with the default threshold (1024 parents) and N=4, the bottom level parallelizes once it has at least 1024 parent nodes to compute, i.e. roughly 4k or more leaves. Use insert_many() with large batches.

Tune Threshold

Set ROTORTREE_PARALLEL_THRESHOLD based on core count:
  • 2-4 cores: 1024 (default)
  • 8+ cores: 512
  • 16+ cores: 256

Profile First

Measure whether parallelism helps YOUR workload. Small trees may be slower with parallel.

Combine Features

Use concurrent + parallel for multi-threaded workloads. Each thread parallelizes internally.

Limitations

Single Inserts Don’t Parallelize

insert() is always sequential:
pub fn insert(&self, leaf: Hash) -> Result<Hash, TreeError> {
    Self::_insert(&mut self.inner.write(), &self.hasher, leaf)  // No parallelism
}
Why: Only one parent per level is computed → no work to parallelize. Solution: Batch into insert_many().

Parallel Overhead for Small Batches

For batches < 1k leaves:
  • Thread creation overhead > hash savings
  • Sequential code is faster
Mitigation: Threshold mechanism disables parallelism automatically.

No Proof Parallelization

Proof generation is not parallelized (yet):
let proof = snap.generate_proof(index)?;  // Sequential
Workaround: Parallelize across multiple proofs:
use rayon::prelude::*;

let proofs: Vec<_> = indices
    .par_iter()
    .map(|&i| snap.generate_proof(i))
    .collect::<Result<_, _>>()?;
This works because snapshots are Sync.

Future Optimizations

  • Vectorized hashing – SIMD to hash multiple parent groups in parallel (Blake3 supports this)
  • Parallel proof generation – Compute sibling hashes concurrently
  • Custom Rayon pool – Dedicated pool for RotorTree to avoid contention with app workloads
  • Adaptive threshold – Auto-tune based on observed batch sizes and core count
