Performance Optimization

RISC Zero’s zkVM is designed to act like a physical CPU, allowing you to use general-purpose optimization techniques alongside zkVM-specific strategies. This guide covers practical approaches to optimize guest program performance.

Understanding the zkVM

What Is the zkVM?

The zkVM is essentially a CPU implementation of the RISC-V architecture (specifically riscv32im). The key difference from physical CPUs is that it’s implemented with arithmetic circuits in software rather than silicon.

What Is a Cycle?

Proving times for the zkVM are directly related to the number of cycles in an execution. A “clock cycle” represents one tick of the CPU’s internal clock and the time it takes to execute a basic CPU operation.

General Optimization Techniques

Don’t Assume, Measure

Performance is complex. Don’t assume you know where the bottlenecks are—measure and experiment.
Optimizing a function that accounts for only 1% of execution time can improve overall runtime by at most 1%, no matter how fast you make it. This is known as Amdahl’s Law.
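Amdahl’s Law can be sketched numerically: accelerating a fraction `p` of execution time by a factor `s` bounds the overall speedup at `1 / ((1 - p) + p / s)`. A quick Rust illustration (the numbers are hypothetical):

```rust
/// Overall speedup when a fraction `p` of execution time
/// is accelerated by a factor `s` (Amdahl's Law).
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Making a 1% hotspot essentially free yields at most ~1.01x overall.
    println!("{:.4}", amdahl_speedup(0.01, 1e9)); // ~1.0101
    // Making a 50% hotspot 4x faster yields 1.6x overall.
    println!("{:.4}", amdahl_speedup(0.5, 4.0)); // 1.6000
}
```

This is why profiling first matters: the bound depends far more on `p` (how hot the code is) than on `s` (how much you speed it up).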

Measuring with Console Output

The simplest way to measure performance is using env::cycle_count() with eprintln!:
methods/guest/src/main.rs
use risc0_zkvm::guest::env;

fn my_operation_to_measure() {
    let start = env::cycle_count();
    
    // Potentially expensive or frequently called code
    // ...
    
    let end = env::cycle_count();
    eprintln!("my_operation_to_measure: {}", end - start);
}
You can analyze the output using tools like counts.

Profiling

For comprehensive performance analysis, use profiling tools. See the Profiling Guide for detailed instructions on using pprof to generate flamegraphs and analyze cycle counts.

Key Differences from Physical CPUs

Most RISC-V Operations Take Exactly One Cycle

In the zkVM, the relative difference between instructions is much smaller than on physical CPUs:
  • 1 cycle: Addition, comparison, jump, shift left, load, store, multiply
  • 2 cycles: Bitwise operations (AND, OR, XOR), division, remainder, shift right
On physical CPUs, division takes 15-40x longer than addition. In the zkVM, division only takes 2x longer. Choose simpler algorithms when possible.
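One consequence of this flat cost model: bit-twiddling tricks that replace division on physical CPUs often buy nothing in the zkVM. A small sketch with two equivalent functions (the modulus 8 is an arbitrary example):

```rust
// On physical CPUs, `x & 7` is far cheaper than `x % 8`.
// In the zkVM, bitwise AND and remainder both cost 2 cycles,
// so the "clever" version saves nothing—prefer the clearer one.
fn wrap_mod(x: u32) -> u32 {
    x % 8
}

fn wrap_and(x: u32) -> u32 {
    x & 7
}

fn main() {
    for x in 0..64 {
        assert_eq!(wrap_mod(x), wrap_and(x));
    }
    println!("equivalent");
}
```

When the simple expression and the micro-optimized one cost the same, write the simple one.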

Memory Access Costs One Cycle (Except Paging)

Memory loads and stores typically take exactly one cycle—extremely fast compared to physical CPUs where L1 cache takes 3-4 cycles and main memory takes 100-150 cycles.

Understanding Paging

Pages in the zkVM are 1 KB chunks of memory. The first time a page is accessed in a segment, it must be paged-in (loaded from the host with Merkle proof verification). Modified pages must be paged-out at segment end.
A page-in or page-out operation takes between 1,094 and 5,130 cycles (1,130 cycles on average).
Optimization strategies:
  • Reduce memory usage
  • Use sequential access patterns instead of random access
  • Condense the range of addresses accessed
  • Similar to optimizing for L1/L2 cache on physical CPUs
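The access-pattern point can be sketched in plain Rust. Both traversals below compute the same sum, but the row-major one touches each 1 KB page’s addresses consecutively before moving on, while the column-major one strides across pages and can trigger more page-ins near segment boundaries:

```rust
const N: usize = 256;

// Sequential (row-major) walk: consecutive addresses, each page
// touched in one contiguous burst.
fn sum_row_major(m: &[[u32; N]; N]) -> u64 {
    let mut acc = 0u64;
    for row in m.iter() {
        for &v in row.iter() {
            acc += v as u64;
        }
    }
    acc
}

// Strided (column-major) walk: jumps N * 4 bytes per step,
// revisiting pages repeatedly.
fn sum_col_major(m: &[[u32; N]; N]) -> u64 {
    let mut acc = 0u64;
    for j in 0..N {
        for i in 0..N {
            acc += m[i][j] as u64;
        }
    }
    acc
}

fn main() {
    let m = Box::new([[1u32; N]; N]);
    assert_eq!(sum_row_major(&m), sum_col_major(&m));
    println!("{}", sum_row_major(&m));
}
```

Within a single segment the page cost is paid once either way; the difference shows up when executions span multiple segments and pages must be re-proven.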

No Native Floating Point Operations

The zkVM does not implement RISC-V floating point instructions. All floating point operations are emulated in software, taking 60-140 cycles for basic operations.
When possible, use integers instead of floating point numbers. Consider fixed-point arithmetic for precision requirements.
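A minimal fixed-point sketch, using a Q16.16 format (values scaled by 2^16) so that all guest arithmetic stays in native integer instructions. The float conversions are shown only for readability; in practice you would ship already-scaled integers into the guest:

```rust
const FRAC_BITS: u32 = 16;

// Host-side helpers: convert to/from Q16.16. Done with f64 here only
// for clarity—inside the guest, stay in the integer domain.
fn to_fixed(x: f64) -> i64 {
    (x * (1i64 << FRAC_BITS) as f64) as i64
}

fn from_fixed(x: i64) -> f64 {
    x as f64 / (1i64 << FRAC_BITS) as f64
}

/// Multiply two Q16.16 values: multiply in full width, shift back.
/// Costs a handful of 1-2 cycle integer ops instead of 60-140 cycles
/// of float emulation.
fn fixed_mul(a: i64, b: i64) -> i64 {
    (a * b) >> FRAC_BITS
}

fn main() {
    let a = to_fixed(1.5);
    let b = to_fixed(2.25);
    // 1.5 * 2.25 = 3.375, exact because both inputs are dyadic.
    println!("{}", from_fixed(fixed_mul(a, b)));
}
```

Pick the number of fraction bits from your precision budget, and widen before multiplying so intermediate products cannot overflow.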

Unaligned Data Access Is Expensive

Memory is always read and stored as 32-bit (4-byte) words. Unaligned access (addresses not multiples of 4) is much more expensive:
  • Aligned read: 1 cycle
  • Unaligned read: 12 cycles
All allocations are aligned by default. If using structs with small fields (bool, u8, i16) that are accessed frequently, pay attention to field alignment. When slicing byte arrays, try to do so at word-aligned indices.
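A sketch of how field order affects layout (the struct names are illustrative). With `#[repr(C)]`, Rust keeps declaration order, so interleaving small and word-sized fields wastes padding; grouping the `u32` first packs the struct tighter, and reading byte buffers in 4-byte chunks keeps loads word-aligned:

```rust
use std::mem::size_of;

// Declaration order kept as written: 1 byte + 3 padding + 4 + 1 + 3 padding.
#[repr(C)]
struct Padded {
    flag: bool,
    word: u32,
    small: u8,
}

// Word-sized field first: 4 + 1 + 1 + 2 tail padding.
#[repr(C)]
struct Reordered {
    word: u32,
    flag: bool,
    small: u8,
}

fn main() {
    println!("Padded={}B Reordered={}B", size_of::<Padded>(), size_of::<Reordered>());

    // Word-aligned slicing: consume a byte buffer in 4-byte chunks so
    // every load starts at a multiple of 4.
    let data = [7u8; 64];
    let mut sum = 0u64;
    for chunk in data.chunks_exact(4) {
        sum += u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as u64;
    }
    println!("sum={}", sum);
}
```

(Default `repr(Rust)` may reorder fields for you; `#[repr(C)]` is used here to make the padding deterministic and visible.)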

No Pipelining or Instruction-Level Parallelism

The zkVM has a simple architecture with no:
  • Execution pipelines
  • Superscalar execution
  • Out-of-order execution
  • Speculative execution
Techniques like pre-fetching, branch avoidance, or instruction reordering have essentially no effect in the zkVM.

Data Input/Output Optimization

Reading Raw Bytes Efficiently

When working with raw bytes (not structs), use env::read_slice() or env::stdin().read_to_end() to avoid serialization overhead:
Guest code
use std::io::Read;
use risc0_zkvm::guest::env;

let mut input_bytes = Vec::<u8>::new();
env::stdin().read_to_end(&mut input_bytes).unwrap();
Host code
use risc0_zkvm::ExecutorEnv;

let input_bytes: Vec<u8> = b"INPUT DATA".to_vec();
let env = ExecutorEnv::builder()
    .write_slice(&input_bytes)
    .build()
    .unwrap();

Merklizing Large Inputs

If you only need part of the input data, consider splitting it into chunks and building a Merkle tree. The guest can:
  1. Receive the Merkle root as initial input
  2. Load chunks dynamically as needed
  3. Verify Merkle inclusion proofs for authenticity
See the Where’s Waldo example for implementation details.
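The verification step above can be sketched as follows. This toy uses `std`’s `DefaultHasher` as a stand-in hash and hypothetical function names; a real guest would use the SHA-256 precompile and the patterns from the Where’s Waldo example:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hashes (NOT cryptographic—illustration only).
fn hash_leaf(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

fn hash_pair(l: u64, r: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (l, r).hash(&mut h);
    h.finish()
}

/// Recompute the root from a chunk, its leaf index, and the sibling
/// hashes along the path; accept the chunk only if it matches the
/// committed root.
fn verify_inclusion(chunk: &[u8], mut index: usize, siblings: &[u64], root: u64) -> bool {
    let mut acc = hash_leaf(chunk);
    for &sib in siblings {
        acc = if index % 2 == 0 { hash_pair(acc, sib) } else { hash_pair(sib, acc) };
        index /= 2;
    }
    acc == root
}

fn main() {
    // Host side: split the input into chunks and build a 4-leaf tree.
    let chunks: [&[u8]; 4] = [b"a", b"b", b"c", b"d"];
    let leaves: Vec<u64> = chunks.iter().map(|c| hash_leaf(c)).collect();
    let n01 = hash_pair(leaves[0], leaves[1]);
    let n23 = hash_pair(leaves[2], leaves[3]);
    let root = hash_pair(n01, n23);

    // Guest side: receives `root` as input, then checks chunk 2 ("c")
    // against it using only two sibling hashes.
    assert!(verify_inclusion(b"c", 2, &[leaves[3], n01], root));
    assert!(!verify_inclusion(b"x", 2, &[leaves[3], n01], root));
    println!("ok");
}
```

The payoff: the guest pays paging and hashing costs only for the chunks it actually touches, plus a logarithmic number of sibling hashes per chunk.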

Cryptography Acceleration

Using Precompiles

RISC Zero provides accelerator circuits for cryptographic operations:
  • SHA-256: ~68 cycles per 64-byte block (6 cycles to initialize)
  • 256-bit modular multiply: ~10 cycles
See the Precompiles Guide for patched cryptographic crates and implementation details.
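Enabling a precompile is typically done by patching the corresponding crate in your Cargo.toml. A sketch of the pattern for `sha2` (the exact git tag varies by release—check the Precompiles Guide for the current one):

```toml
[patch.crates-io]
sha2 = { git = "https://github.com/risc0/RustCrypto-hashes", tag = "sha2-v0.10.6-risczero.0" }
```

With the patch in place, existing code that calls `sha2::Sha256` routes through the accelerator circuit without source changes.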

Concurrency and Async

The zkVM has one core and one thread of execution. Using async routines, locking, or atomic operations in the guest can only slow the program down.

Memory Prefetching Doesn’t Help

All memory operations are synchronous in the zkVM. Memory prefetching techniques used on physical CPUs provide no benefit and may hurt performance.

Quick Wins

1. Profile your application. Use profiling tools to identify where cycles are spent.
2. Experiment with compiler settings. Try different optimization levels and LTO settings in your Cargo.toml:

[profile.release]
lto = "thin"  # Sometimes faster than "fat" or true
opt-level = 2  # Try 2, 3, "s", or "z"
codegen-units = 1

3. Use appropriate data structures. When you need a map, use BTreeMap instead of HashMap in guest code.
4. Use precompiled cryptography. When hashing data, use the precompile implementation of SHA-256.
5. Eliminate unnecessary copying. Look for places where you’re copying or (de)serializing data unnecessarily.
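The BTreeMap recommendation can be sketched directly (the `balances` map is illustrative). `HashMap` spends cycles on SipHash for every lookup and needs a random seed from the host; `BTreeMap` uses only comparisons, which are cheap 1-cycle ops in the zkVM, and iterates in a deterministic order:

```rust
use std::collections::BTreeMap;

fn main() {
    // BTreeMap: comparison-based, no hashing, no RNG dependency.
    let mut balances: BTreeMap<&str, u64> = BTreeMap::new();
    balances.insert("alice", 100);
    balances.insert("bob", 250);
    *balances.entry("alice").or_insert(0) += 50;
    assert_eq!(balances["alice"], 150);

    // Iteration order is sorted by key, so execution traces are
    // reproducible across runs.
    for (name, amount) in &balances {
        println!("{name}: {amount}");
    }
}
```

The same reasoning applies to `BTreeSet` versus `HashSet`.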

RV32IM Operation Cycle Counts

Here’s a summary of cycle counts for common operations:
Operation Category | Examples          | Cycles
Arithmetic         | ADD, SUB, MUL     | 1
Control Flow       | JAL, BEQ, BNE     | 1
Memory (aligned)   | LW, SW            | 1*
Shifts             | SLL (left)        | 1
Shifts             | SRL, SRA (right)  | 2
Bitwise            | AND, OR, XOR      | 2
Division           | DIV, DIVU, REM    | 2
*Memory operations take 1 cycle if the page is already loaded, or 1,094-5,130 cycles for page-in/page-out operations.
