Performance Optimization
RISC Zero’s zkVM is designed to act like a physical CPU, allowing you to use general-purpose optimization techniques alongside zkVM-specific strategies. This guide covers practical approaches to optimize guest program performance.
Understanding the zkVM
What Is the zkVM?
The zkVM is essentially a CPU implementation of the RISC-V architecture (specifically riscv32im). The key difference from physical CPUs is that it’s implemented with arithmetic circuits in software rather than silicon.
What Is a Cycle?
Proving times for the zkVM are directly related to the number of cycles in an execution. A “clock cycle” represents one tick of the CPU’s internal clock, which is the time it takes to execute a basic CPU operation.
General Optimization Techniques
Don’t Assume, Measure
Performance is complex. Don’t assume you know what the bottlenecks are: measure and experiment.
Measuring with Console Output
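The simplest way to measure performance is to call env::cycle_count() in the guest (for example, in methods/guest/src/main.rs) before and after the code of interest, then print the difference in cycle counts with eprintln!.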
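As a minimal sketch of that measurement pattern, the helper below wraps a piece of work between two counter reads and prints the difference. The name measure_cycles is hypothetical, and the counter is passed in as a parameter so the helper can also run outside the zkVM. In a guest you would presumably pass risc0_zkvm::guest::env::cycle_count, assumed here to return a u64.

```rust
/// Runs `work` and returns its result together with the number of cycles
/// that elapsed according to `cycle_count`.
///
/// `cycle_count` is injected so the helper can be exercised off-target.
/// In a guest it would be `risc0_zkvm::guest::env::cycle_count`
/// (assumption: that function returns `u64`).
pub fn measure_cycles<T>(
    label: &str,
    cycle_count: impl Fn() -> u64,
    work: impl FnOnce() -> T,
) -> (T, u64) {
    let start = cycle_count();
    let result = work();
    // saturating_sub guards against a counter that goes backwards in tests.
    let elapsed = cycle_count().saturating_sub(start);
    // Print to stderr, as the text above suggests.
    eprintln!("{label}: {elapsed} cycles");
    (result, elapsed)
}
```

Measuring a difference rather than an absolute value keeps startup cycles out of the number. The two counter reads and the print still cost cycles themselves, so treat small deltas with caution.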
Profiling
For comprehensive performance analysis, use profiling tools. See the Profiling Guide for detailed instructions on using pprof to generate flamegraphs and analyze cycle counts.
Key Differences from Physical CPUs
Most RISC-V Operations Take Exactly One Cycle
In the zkVM, the relative difference between instructions is much smaller than on physical CPUs:
- 1 cycle: Addition, comparison, jump, shift left, load, store, multiply
- 2 cycles: Bitwise operations (AND, OR, XOR), division, remainder, shift right
Memory Access Costs One Cycle (Except Paging)
Memory loads and stores typically take exactly one cycle, which is extremely fast compared to physical CPUs, where an L1 cache access takes 3-4 cycles and main memory takes 100-150 cycles.
Understanding Paging
Pages in the zkVM are 1 KB chunks of memory. The first time a page is accessed in a segment, it must be paged-in (loaded from the host with Merkle proof verification). Modified pages must be paged-out at segment end.
A page-in or page-out operation takes between 1,094 and 5,130 cycles (1,130 cycles on average).
To reduce paging costs:
- Reduce memory usage
- Use sequential access patterns instead of random access
- Condense the range of addresses accessed
- Apply the same locality techniques used to optimize for L1/L2 cache on physical CPUs
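To make the effect of access patterns concrete, the sketch below counts how many distinct 1 KB pages a sequence of addresses touches and multiplies by the average page cost quoted above. The function names are hypothetical, and the model is deliberately rough: it covers a read-only pattern within a single segment and ignores any additional overhead of the paging mechanism itself.

```rust
use std::collections::BTreeSet;

/// Page size stated above: 1 KB.
pub const PAGE_SIZE: u32 = 1024;
/// Average page-in or page-out cost quoted above.
pub const AVG_PAGE_CYCLES: u64 = 1_130;

/// Counts the distinct pages touched by a sequence of byte addresses.
pub fn pages_touched(addresses: impl IntoIterator<Item = u32>) -> usize {
    let pages: BTreeSet<u32> = addresses.into_iter().map(|a| a / PAGE_SIZE).collect();
    pages.len()
}

/// Rough page-in estimate for a read-only access pattern in one segment.
pub fn estimated_page_in_cycles(addresses: impl IntoIterator<Item = u32>) -> u64 {
    pages_touched(addresses) as u64 * AVG_PAGE_CYCLES
}
```

Under these assumptions, reading 256 consecutive words starting at a page boundary touches one page (about 1,130 cycles of paging), while reading 256 words spaced 1 KB apart touches 256 pages (about 289,280 cycles), even though both patterns execute the same number of load instructions.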
No Native Floating Point Operations
The zkVM does not implement RISC-V floating point instructions. All floating point operations are emulated in software, taking 60-140 cycles for basic operations.
Unaligned Data Access Is Expensive
Memory is always read and stored as 32-bit (4-byte) words. Unaligned access (addresses not multiples of 4) is much more expensive:
- Aligned read: 1 cycle
- Unaligned read: 12 cycles
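The alignment rule above can be sketched as a small cost model. The helpers below are hypothetical names, and the cycle constants are the figures from the list above. They check whether an address is word-aligned, round an offset up to the next word boundary (useful when laying out fields in a byte buffer), and estimate the cost of reading a run of words.

```rust
pub const WORD_BYTES: u32 = 4;
pub const ALIGNED_READ_CYCLES: u64 = 1;
pub const UNALIGNED_READ_CYCLES: u64 = 12;

/// True when `addr` is a multiple of 4.
pub fn is_word_aligned(addr: u32) -> bool {
    addr % WORD_BYTES == 0
}

/// Rounds `offset` up to the next multiple of 4.
pub fn align_up(offset: u32) -> u32 {
    (offset + WORD_BYTES - 1) & !(WORD_BYTES - 1)
}

/// Estimated cost of one 32-bit read at `addr`, using the figures above.
pub fn word_read_cycles(addr: u32) -> u64 {
    if is_word_aligned(addr) {
        ALIGNED_READ_CYCLES
    } else {
        UNALIGNED_READ_CYCLES
    }
}

/// Estimated cost of reading `words` consecutive 32-bit values from `base`.
pub fn total_read_cycles(base: u32, words: u32) -> u64 {
    (0..words)
        .map(|i| word_read_cycles(base + i * WORD_BYTES))
        .sum()
}
```

With these numbers, 100 word reads starting at 0x1000 are estimated at 100 cycles, while the same reads starting at 0x1001 are estimated at 1,200 cycles. Padding a buffer so that word-sized fields start at align_up(offset) avoids that penalty.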
No Pipelining or Instruction-Level Parallelism
The zkVM has a simple architecture with no:
- Execution pipelines
- Superscalar execution
- Out-of-order execution
- Speculative execution
Techniques like pre-fetching, branch avoidance, or instruction reordering have essentially no effect in the zkVM.
Data Input/Output Optimization
Reading Raw Bytes Efficiently
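When working with raw bytes (not structs), use env::read_slice() or env::stdin().read_to_end() in the guest to avoid serialization overhead. On the host, supply the input as raw bytes as well (for example, with write_slice() on the ExecutorEnv builder) rather than serializing it.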
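A minimal sketch of the read_to_end() approach follows. It is written against the standard std::io::Read trait so it can run anywhere. The function name read_raw_input is hypothetical, and the sketch assumes that the reader returned by env::stdin() in a guest implements Read, as the read_to_end() call mentioned above implies.

```rust
use std::io::{self, Read};

/// Reads every remaining byte from `input` with no deserialization step.
///
/// In a guest, `input` would be `risc0_zkvm::guest::env::stdin()`
/// (assumption: it implements `std::io::Read`).
pub fn read_raw_input<R: Read>(mut input: R) -> io::Result<Vec<u8>> {
    let mut bytes = Vec::new();
    input.read_to_end(&mut bytes)?;
    Ok(bytes)
}
```

Because the bytes are copied as-is, the guest spends no cycles decoding a serialized structure. The trade-off is that the guest and host must agree on the byte layout themselves.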
Merklizing Large Inputs
If you only need part of the input data, consider splitting it into chunks and building a Merkle tree. The guest can:
- Receive the Merkle root as initial input
- Load chunks dynamically as needed
- Verify Merkle inclusion proofs for authenticity
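The steps above can be sketched as follows. This is an illustration under loud assumptions, not RISC Zero's API: all names are hypothetical, the chunk count must be a non-zero power of two, and the hash is a non-cryptographic FNV-1a stand-in chosen only so the example runs with the standard library. A real guest would use a cryptographic hash such as SHA-256.

```rust
/// Digest type for this sketch. A real tree would use a 32-byte digest.
pub type Digest = u64;

/// NOT cryptographic: FNV-1a, a placeholder so the sketch needs no crates.
pub fn toy_hash(data: &[u8]) -> Digest {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hashes two child digests into their parent digest.
fn hash_pair(left: Digest, right: Digest) -> Digest {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&left.to_le_bytes());
    buf[8..].copy_from_slice(&right.to_le_bytes());
    toy_hash(&buf)
}

/// Host side: builds every tree level, leaves first, root last.
/// Assumes a non-empty, power-of-two number of chunks.
pub fn build_levels(chunks: &[&[u8]]) -> Vec<Vec<Digest>> {
    let leaves: Vec<Digest> = chunks.iter().map(|c| toy_hash(c)).collect();
    let mut levels = vec![leaves];
    while levels.last().unwrap().len() > 1 {
        let next: Vec<Digest> = levels
            .last()
            .unwrap()
            .chunks(2)
            .map(|pair| hash_pair(pair[0], pair[1]))
            .collect();
        levels.push(next);
    }
    levels
}

/// Host side: sibling digests from the leaf level up to, but excluding, the root.
pub fn proof_for(levels: &[Vec<Digest>], mut index: usize) -> Vec<Digest> {
    let mut proof = Vec::new();
    for level in &levels[..levels.len() - 1] {
        proof.push(level[index ^ 1]);
        index /= 2;
    }
    proof
}

/// Guest side: recomputes the root from one chunk and its inclusion proof.
pub fn verify_chunk(root: Digest, chunk: &[u8], mut index: usize, proof: &[Digest]) -> bool {
    let mut node = toy_hash(chunk);
    for &sibling in proof {
        node = if index % 2 == 0 {
            hash_pair(node, sibling)
        } else {
            hash_pair(sibling, node)
        };
        index /= 2;
    }
    node == root
}
```

In this arrangement the guest only hashes the chunks it actually loads plus one digest per tree level for each of them, instead of reading the whole input.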
Cryptography Acceleration
Using Precompiles
RISC Zero provides accelerator circuits for cryptographic operations:
- SHA-256: ~68 cycles per 64-byte block (6 cycles to initialize)
- 256-bit modular multiply: ~10 cycles
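For a back-of-the-envelope budget, the SHA-256 figures above can be turned into a small estimator. The function names are hypothetical. The block count follows the standard SHA-256 padding rule (one 0x80 byte plus an 8-byte length, rounded up to a multiple of 64 bytes). Whether the accelerator is charged for padding blocks in exactly this way is an assumption, so treat the result as an approximation to check against measured cycle counts.

```rust
pub const SHA256_INIT_CYCLES: u64 = 6;
pub const SHA256_CYCLES_PER_BLOCK: u64 = 68;

/// Number of 64-byte blocks SHA-256 processes for a `len`-byte message,
/// including the mandatory padding (0x80 byte plus 8-byte length field).
pub fn sha256_blocks(len: u64) -> u64 {
    (len + 9 + 63) / 64
}

/// Approximate accelerator cost for hashing `len` bytes, using the
/// per-block and initialization figures quoted above.
pub fn estimated_sha256_cycles(len: u64) -> u64 {
    SHA256_INIT_CYCLES + SHA256_CYCLES_PER_BLOCK * sha256_blocks(len)
}
```

Under these assumptions, hashing 1 KB (17 blocks after padding) comes to roughly 6 + 68 × 17 = 1,162 cycles, which is about the average cost of a single page-in.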
Concurrency and Async
Memory Prefetching Doesn’t Help
All memory operations are synchronous in the zkVM. Memory prefetching techniques used on physical CPUs provide no benefit and may hurt performance.
Quick Wins
Use profiling tools to identify where cycles are spent.
Experiment with compiler settings in Cargo.toml:
```toml
[profile.release]
lto = "thin" # Sometimes faster than "fat" or true
opt-level = 2 # Try 2, 3, "s", or "z"
codegen-units = 1
```
When hashing data, use the precompile implementation of SHA-256.
RV32IM Operation Cycle Counts
Here’s a summary of cycle counts for common operations:

| Operation Category | Examples | Cycles |
|---|---|---|
| Arithmetic | ADD, SUB, MUL | 1 |
| Control Flow | JAL, BEQ, BNE | 1 |
| Memory (aligned) | LW, SW | 1* |
| Shifts | SLL (left) | 1 |
| Shifts | SRL, SRA (right) | 2 |
| Bitwise | AND, OR, XOR | 2 |
| Division | DIV, DIVU, REM | 2 |

(*) Excluding page-in and page-out costs; see Understanding Paging.
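The table above can be encoded as a lookup for rough estimates of an instruction mix. This is a sketch with hypothetical names. It covers only the operations listed in the table and ignores paging, unaligned access, and precompile calls, so a profiler remains the authoritative source.

```rust
/// Operation categories from the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op {
    Add,
    Sub,
    Mul,
    Jump,
    Branch,
    LoadAligned,
    StoreAligned,
    ShiftLeft,
    ShiftRight,
    And,
    Or,
    Xor,
    Div,
    Rem,
}

/// Cycle cost per operation, as listed in the table (paging excluded).
pub fn cycles(op: Op) -> u64 {
    match op {
        Op::Add
        | Op::Sub
        | Op::Mul
        | Op::Jump
        | Op::Branch
        | Op::LoadAligned
        | Op::StoreAligned
        | Op::ShiftLeft => 1,
        Op::ShiftRight | Op::And | Op::Or | Op::Xor | Op::Div | Op::Rem => 2,
    }
}

/// Estimated cycles for a mix given as (operation, count) pairs.
pub fn estimate(mix: &[(Op, u64)]) -> u64 {
    mix.iter().map(|&(op, count)| cycles(op) * count).sum()
}
```

One consequence of these numbers is that replacing a division by a power of two with a right shift saves nothing, since both cost 2 cycles, whereas on a physical CPU that substitution is a common win.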
Next Steps
- Learn about Profiling guest programs
- Explore Precompiles for cryptographic acceleration
- Understand GPU Acceleration for faster proving
- Read about Recursive Proving for scalability