Performance Optimization

RISC Zero’s zkVM is designed to act like a physical CPU, allowing you to use general-purpose optimization techniques alongside zkVM-specific strategies. This guide covers practical approaches to optimize guest program performance.

Understanding the zkVM

What Is the zkVM?

The zkVM is essentially a CPU implementation of the RISC-V architecture (specifically riscv32im). The key difference from physical CPUs is that it’s implemented with arithmetic circuits in software rather than silicon.

What Is a Cycle?

Proving times for the zkVM are directly related to the number of cycles in an execution. A “clock cycle” represents one tick of the CPU’s internal clock and the time it takes to execute a basic CPU operation.

General Optimization Techniques

Don’t Assume, Measure

Performance is complex. Don’t assume you know where the bottlenecks are—measure and experiment.
Optimizing a function that accounts for only 1% of execution time can improve overall runtime by at most 1%, no matter how fast you make it. This is known as Amdahl’s Law.
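Amdahl’s Law can be sketched numerically: accelerating a fraction `p` of execution time by a factor `s` bounds the overall speedup at `1 / ((1 - p) + p / s)`. A quick Rust illustration (the numbers are hypothetical):

```rust
/// Overall speedup when a fraction `p` of execution time
/// is accelerated by a factor `s` (Amdahl's Law).
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Making a 1% hotspot essentially free yields at most ~1.01x overall.
    println!("{:.4}", amdahl_speedup(0.01, 1e9)); // ~1.0101
    // Making a 50% hotspot 4x faster yields 1.6x overall.
    println!("{:.4}", amdahl_speedup(0.5, 4.0)); // 1.6000
}
```

This is why profiling first matters: the bound depends far more on `p` (how hot the code is) than on `s` (how much you speed it up).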

Measuring with Console Output

The simplest way to measure performance is using env::cycle_count() with eprintln!:
methods/guest/src/main.rs
use risc0_zkvm::guest::env;

fn my_operation_to_measure() {
    let start = env::cycle_count();
    
    // Potentially expensive or frequently called code
    // ...
    
    let end = env::cycle_count();
    eprintln!("my_operation_to_measure: {}", end - start);
}
You can analyze the output using tools like counts.

Profiling

For comprehensive performance analysis, use profiling tools. See the Profiling Guide for detailed instructions on using pprof to generate flamegraphs and analyze cycle counts.

Key Differences from Physical CPUs

Most RISC-V Operations Take Exactly One Cycle

In the zkVM, the relative difference between instructions is much smaller than on physical CPUs:
  • 1 cycle: Addition, comparison, jump, shift left, load, store, multiply
  • 2 cycles: Bitwise operations (AND, OR, XOR), division, remainder, shift right
On physical CPUs, division takes 15-40x longer than addition. In the zkVM, division only takes 2x longer. Choose simpler algorithms when possible.
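One consequence of this flat cost model: bit-twiddling tricks that replace division on physical CPUs often buy nothing in the zkVM. A small sketch with two equivalent functions (the modulus 8 is an arbitrary example):

```rust
// On physical CPUs, `x & 7` is far cheaper than `x % 8`.
// In the zkVM, bitwise AND and remainder both cost 2 cycles,
// so the "clever" version saves nothing—prefer the clearer one.
fn wrap_mod(x: u32) -> u32 {
    x % 8
}

fn wrap_and(x: u32) -> u32 {
    x & 7
}

fn main() {
    for x in 0..64 {
        assert_eq!(wrap_mod(x), wrap_and(x));
    }
    println!("equivalent");
}
```

When the simple expression and the micro-optimized one cost the same, write the simple one.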

Memory Access Costs One Cycle (Except Paging)

Memory loads and stores typically take exactly one cycle—extremely fast compared to physical CPUs where L1 cache takes 3-4 cycles and main memory takes 100-150 cycles.

Understanding Paging

Pages in the zkVM are 1 KB chunks of memory. The first time a page is accessed in a segment, it must be paged-in (loaded from the host with Merkle proof verification). Modified pages must be paged-out at segment end.
A page-in or page-out operation takes between 1,094 and 5,130 cycles (1,130 cycles on average).
Optimization strategies:
  • Reduce memory usage
  • Use sequential access patterns instead of random access
  • Condense the range of addresses accessed
  • Similar to optimizing for L1/L2 cache on physical CPUs
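The access-pattern point can be sketched in plain Rust. Both traversals below compute the same sum, but the row-major one touches each 1 KB page’s addresses consecutively before moving on, while the column-major one strides across pages and can trigger more page-ins near segment boundaries:

```rust
const N: usize = 256;

// Sequential (row-major) walk: consecutive addresses, each page
// touched in one contiguous burst.
fn sum_row_major(m: &[[u32; N]; N]) -> u64 {
    let mut acc = 0u64;
    for row in m.iter() {
        for &v in row.iter() {
            acc += v as u64;
        }
    }
    acc
}

// Strided (column-major) walk: jumps N * 4 bytes per step,
// revisiting pages repeatedly.
fn sum_col_major(m: &[[u32; N]; N]) -> u64 {
    let mut acc = 0u64;
    for j in 0..N {
        for i in 0..N {
            acc += m[i][j] as u64;
        }
    }
    acc
}

fn main() {
    let m = Box::new([[1u32; N]; N]);
    assert_eq!(sum_row_major(&m), sum_col_major(&m));
    println!("{}", sum_row_major(&m));
}
```

Within a single segment the page cost is paid once either way; the difference shows up when executions span multiple segments and pages must be re-proven.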

No Native Floating Point Operations

The zkVM does not implement RISC-V floating point instructions. All floating point operations are emulated in software, taking 60-140 cycles for basic operations.
When possible, use integers instead of floating point numbers. Consider fixed-point arithmetic for precision requirements.
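A minimal fixed-point sketch, using a Q16.16 format (values scaled by 2^16) so that all guest arithmetic stays in native integer instructions. The float conversions are shown only for readability; in practice you would ship already-scaled integers into the guest:

```rust
const FRAC_BITS: u32 = 16;

// Host-side helpers: convert to/from Q16.16. Done with f64 here only
// for clarity—inside the guest, stay in the integer domain.
fn to_fixed(x: f64) -> i64 {
    (x * (1i64 << FRAC_BITS) as f64) as i64
}

fn from_fixed(x: i64) -> f64 {
    x as f64 / (1i64 << FRAC_BITS) as f64
}

/// Multiply two Q16.16 values: multiply in full width, shift back.
/// Costs a handful of 1-2 cycle integer ops instead of 60-140 cycles
/// of float emulation.
fn fixed_mul(a: i64, b: i64) -> i64 {
    (a * b) >> FRAC_BITS
}

fn main() {
    let a = to_fixed(1.5);
    let b = to_fixed(2.25);
    // 1.5 * 2.25 = 3.375, exact because both inputs are dyadic.
    println!("{}", from_fixed(fixed_mul(a, b)));
}
```

Pick the number of fraction bits from your precision budget, and widen before multiplying so intermediate products cannot overflow.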

Unaligned Data Access Is Expensive

Memory is always read and stored as 32-bit (4-byte) words. Unaligned access (addresses not multiples of 4) is much more expensive:
  • Aligned read: 1 cycle
  • Unaligned read: 12 cycles
All allocations are aligned by default. If using structs with small fields (bool, u8, i16) that are accessed frequently, pay attention to field alignment. When slicing byte arrays, try to do so at word-aligned indices.
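A sketch of how field order affects layout (the struct names are illustrative). With `#[repr(C)]`, Rust keeps declaration order, so interleaving small and word-sized fields wastes padding; grouping the `u32` first packs the struct tighter, and reading byte buffers in 4-byte chunks keeps loads word-aligned:

```rust
use std::mem::size_of;

// Declaration order kept as written: 1 byte + 3 padding + 4 + 1 + 3 padding.
#[repr(C)]
struct Padded {
    flag: bool,
    word: u32,
    small: u8,
}

// Word-sized field first: 4 + 1 + 1 + 2 tail padding.
#[repr(C)]
struct Reordered {
    word: u32,
    flag: bool,
    small: u8,
}

fn main() {
    println!("Padded={}B Reordered={}B", size_of::<Padded>(), size_of::<Reordered>());

    // Word-aligned slicing: consume a byte buffer in 4-byte chunks so
    // every load starts at a multiple of 4.
    let data = [7u8; 64];
    let mut sum = 0u64;
    for chunk in data.chunks_exact(4) {
        sum += u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as u64;
    }
    println!("sum={}", sum);
}
```

(Default `repr(Rust)` may reorder fields for you; `#[repr(C)]` is used here to make the padding deterministic and visible.)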

No Pipelining or Instruction-Level Parallelism

The zkVM has a simple architecture with no:
  • Execution pipelines
  • Superscalar execution
  • Out-of-order execution
  • Speculative execution
Techniques like pre-fetching, branch avoidance, or instruction reordering have essentially no effect in the zkVM.

Data Input/Output Optimization

Reading Raw Bytes Efficiently

When working with raw bytes (not structs), use env::read_slice() or env::stdin().read_to_end() to avoid serialization overhead:
Guest code
use std::io::Read;
use risc0_zkvm::guest::env;

let mut input_bytes = Vec::<u8>::new();
env::stdin().read_to_end(&mut input_bytes).unwrap();
Host code
use risc0_zkvm::ExecutorEnv;

let input_bytes: Vec<u8> = b"INPUT DATA".to_vec();
let env = ExecutorEnv::builder()
    .write_slice(&input_bytes)
    .build()
    .unwrap();

Merklizing Large Inputs

If you only need part of the input data, consider splitting it into chunks and building a Merkle tree. The guest can:
  1. Receive the Merkle root as initial input
  2. Load chunks dynamically as needed
  3. Verify Merkle inclusion proofs for authenticity
See the Where’s Waldo example for implementation details.
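The verification step above can be sketched as follows. This toy uses `std`’s `DefaultHasher` as a stand-in hash and hypothetical function names; a real guest would use the SHA-256 precompile and the patterns from the Where’s Waldo example:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hashes (NOT cryptographic—illustration only).
fn hash_leaf(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

fn hash_pair(l: u64, r: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (l, r).hash(&mut h);
    h.finish()
}

/// Recompute the root from a chunk, its leaf index, and the sibling
/// hashes along the path; accept the chunk only if it matches the
/// committed root.
fn verify_inclusion(chunk: &[u8], mut index: usize, siblings: &[u64], root: u64) -> bool {
    let mut acc = hash_leaf(chunk);
    for &sib in siblings {
        acc = if index % 2 == 0 { hash_pair(acc, sib) } else { hash_pair(sib, acc) };
        index /= 2;
    }
    acc == root
}

fn main() {
    // Host side: split the input into chunks and build a 4-leaf tree.
    let chunks: [&[u8]; 4] = [b"a", b"b", b"c", b"d"];
    let leaves: Vec<u64> = chunks.iter().map(|c| hash_leaf(c)).collect();
    let n01 = hash_pair(leaves[0], leaves[1]);
    let n23 = hash_pair(leaves[2], leaves[3]);
    let root = hash_pair(n01, n23);

    // Guest side: receives `root` as input, then checks chunk 2 ("c")
    // against it using only two sibling hashes.
    assert!(verify_inclusion(b"c", 2, &[leaves[3], n01], root));
    assert!(!verify_inclusion(b"x", 2, &[leaves[3], n01], root));
    println!("ok");
}
```

The payoff: the guest pays paging and hashing costs only for the chunks it actually touches, plus a logarithmic number of sibling hashes per chunk.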

Cryptography Acceleration

Using Precompiles

RISC Zero provides accelerator circuits for cryptographic operations:
  • SHA-256: ~68 cycles per 64-byte block (6 cycles to initialize)
  • 256-bit modular multiply: ~10 cycles
See the Precompiles Guide for patched cryptographic crates and implementation details.
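Enabling a precompile is typically done by patching the corresponding crate in your Cargo.toml. A sketch of the pattern for `sha2` (the exact git tag varies by release—check the Precompiles Guide for the current one):

```toml
[patch.crates-io]
sha2 = { git = "https://github.com/risc0/RustCrypto-hashes", tag = "sha2-v0.10.6-risczero.0" }
```

With the patch in place, existing code that calls `sha2::Sha256` routes through the accelerator circuit without source changes.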

Concurrency and Async

The zkVM has one core and one thread of execution. Using async routines, locking, or atomic operations in the guest can only slow the program down.

Memory Prefetching Doesn’t Help

All memory operations are synchronous in the zkVM. Memory prefetching techniques used on physical CPUs provide no benefit and may hurt performance.

Quick Wins

1. Profile your application. Use profiling tools to identify where cycles are spent.
2. Experiment with compiler settings. Try different optimization levels and LTO settings in your Cargo.toml:

[profile.release]
lto = "thin"  # Sometimes faster than "fat" or true
opt-level = 2  # Try 2, 3, "s", or "z"
codegen-units = 1

3. Use appropriate data structures. When you need a map, use BTreeMap instead of HashMap in guest code.
4. Use precompiled cryptography. When hashing data, use the precompile implementation of SHA-256.
5. Eliminate unnecessary copying. Look for places where you’re copying or (de)serializing data unnecessarily.
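The BTreeMap recommendation can be sketched directly (the `balances` map is illustrative). `HashMap` spends cycles on SipHash for every lookup and needs a random seed from the host; `BTreeMap` uses only comparisons, which are cheap 1-cycle ops in the zkVM, and iterates in a deterministic order:

```rust
use std::collections::BTreeMap;

fn main() {
    // BTreeMap: comparison-based, no hashing, no RNG dependency.
    let mut balances: BTreeMap<&str, u64> = BTreeMap::new();
    balances.insert("alice", 100);
    balances.insert("bob", 250);
    *balances.entry("alice").or_insert(0) += 50;
    assert_eq!(balances["alice"], 150);

    // Iteration order is sorted by key, so execution traces are
    // reproducible across runs.
    for (name, amount) in &balances {
        println!("{name}: {amount}");
    }
}
```

The same reasoning applies to `BTreeSet` versus `HashSet`.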

RV32IM Operation Cycle Counts

Here’s a summary of cycle counts for common operations:
Operation Category | Examples          | Cycles
Arithmetic         | ADD, SUB, MUL     | 1
Control Flow       | JAL, BEQ, BNE     | 1
Memory (aligned)   | LW, SW            | 1*
Shifts             | SLL (left)        | 1
Shifts             | SRL, SRA (right)  | 2
Bitwise            | AND, OR, XOR      | 2
Division           | DIV, DIVU, REM    | 2
*Memory operations take 1 cycle if the page is already loaded, or 1,094-5,130 cycles for page-in/page-out operations.
