Small Value Optimizations (CompactPolynomial)

CompactPolynomial is a specialized polynomial type that stores coefficients as small scalars (u8, u16, u32, u64, u128, i64, i128) instead of full field elements. This optimization provides 10-100x memory reduction and significant speedups in the sumcheck inner loop.

Problem: Field Element Overhead

Jolt uses BN254’s scalar field (Fr), where each element is 32 bytes:

// Standard polynomial representation
struct DensePolynomial<F: JoltField> {
    coeffs: Vec<F>,  // Each F is 32 bytes (BN254 Fr)
}

// Memory usage for 2^20 coefficients:
// 1,048,576 * 32 bytes = 33.5 MB per polynomial

But many polynomials in Jolt have small integer coefficients:

Instruction flags: 0 or 1 (boolean)
Memory addresses: 0 to 2^32 - 1 (u32)
Register indices: 0 to 31 (u8)
Lookup indices: small integers

Solution: Compact Representation

CompactPolynomial<T, F> stores coefficients as small type T and lazily converts to field elements:

pub struct CompactPolynomial<T: SmallScalar, F: JoltField> {
    num_vars: usize,
    len: usize,
    coeffs: Vec<T>,              // Small scalars (1-16 bytes each)
    bound_coeffs: Vec<F>,        // Field elements (populated after binding)
}

Memory Savings

Coefficient Type	Bytes/Element	Memory for 2^20 coeffs	vs. Field Element
`bool`	1	1 MB	32x reduction
`u8`	1	1 MB	32x reduction
`u16`	2	2 MB	16x reduction
`u32`	4	4 MB	8x reduction
`u64`	8	8 MB	4x reduction
`u128` / `i128`	16	16 MB	2x reduction
Field element `F`	32	32 MB	1x (baseline)

Implementation Details

Small Scalar Trait

Location: jolt-core/src/utils/small_scalar.rs

pub trait SmallScalar: Copy + Send + Sync + PartialOrd {
    fn to_field<F: JoltField>(&self) -> F;
    
    // Optimized: (b - a) * r without intermediate field conversion
    fn diff_mul_field<F: JoltField>(&self, other: Self, r: F) -> F;
    
    // Optimized: small_scalar * field_element
    fn field_mul<F: JoltField>(&self, r: F) -> F;
}

Two-Phase Representation

CompactPolynomial has two states: 1. Initial State (coeffs only):

let poly = CompactPolynomial::from_coeffs(vec![1u8, 2, 3, 4, ...]);
// coeffs: [1, 2, 3, 4, ...] (u8 vector)
// bound_coeffs: [] (empty)

2. Bound State (after first bind):

poly.bind(r, BindingOrder::LowToHigh);
// coeffs: [1, 2, 3, 4, ...] (unchanged, but no longer used)
// bound_coeffs: [f1, f2, ...] (field elements)

After the first bind(), subsequent operations use bound_coeffs.

Optimized Binding

The first bind() converts small scalars to field elements using optimized arithmetic:

impl<T: SmallScalar, F: JoltField> PolynomialBinding<F> for CompactPolynomial<T, F> {
    fn bind(&mut self, r: F::Challenge, order: BindingOrder) {
        let n = self.len() / 2;
        if self.is_bound() {
            // Already bound: use field arithmetic
            // Standard bind: bound_coeffs[i] = bound_coeffs[2*i] + r * (bound_coeffs[2*i+1] - bound_coeffs[2*i])
        } else {
            // First bind: convert small scalars to field with optimizations
            self.bound_coeffs = (0..n)
                .map(|i| {
                    let a = self.coeffs[2 * i];
                    let b = self.coeffs[2 * i + 1];
                    match a.cmp(&b) {
                        Ordering::Equal => a.to_field(),
                        // a < b: Compute a + r * (b - a) without full field conversion
                        Ordering::Less => {
                            a.to_field::<F>() + b.diff_mul_field::<F>(a, r.into())
                        }
                        // a > b: Compute a - r * (a - b)
                        Ordering::Greater => {
                            a.to_field::<F>() - a.diff_mul_field::<F>(b, r.into())
                        }
                    }
                })
                .collect();
        }
        self.num_vars -= 1;
        self.len = n;
    }
}

Key Optimizations

1. Lazy Conversion

Coefficients stay as small scalars until binding:

// No field conversion during initialization
let poly = CompactPolynomial::from_coeffs(coeffs);

// Conversion happens during first bind
poly.bind(r, order);  // coeffs[T] → bound_coeffs[F]

2. Optimized Arithmetic

// Standard: Two field conversions + one field multiplication
let result = a.to_field() + r * (b.to_field() - a.to_field());

// Optimized: One conversion + one small-scalar multiplication
let result = a.to_field() + b.diff_mul_field(a, r);

3. Special Case Detection

match a.cmp(&b) {
    Ordering::Equal => a.to_field(),  // No multiplication needed!
    Ordering::Less => /* ... */,
    Ordering::Greater => /* ... */,
}

For boolean polynomials (many coefficients are equal), this eliminates ~50% of multiplications.

4. Parallel Binding

fn bind_parallel(&mut self, r: F::Challenge, order: BindingOrder) {
    // Use rayon for parallel processing
    self.bound_coeffs = (0..n)
        .into_par_iter()
        .map(|i| /* optimized bind operation */)
        .collect();
}

Parallelization threshold: polynomials with ≥ 512 * 32 / F::NUM_BYTES elements.

Evaluation Optimizations

Inside-Out Evaluation

Faster polynomial evaluation based on Fast Polynomial Evaluation:

pub fn inside_out_evaluate(&self, r: &[F]) -> F {
    const PARALLEL_THRESHOLD: usize = 16;
    if r.len() < PARALLEL_THRESHOLD {
        self.inside_out_serial(r)
    } else {
        self.inside_out_parallel(r)
    }
}

fn inside_out_serial(&self, r: &[F]) -> F {
    // Convert coeffs to field elements once
    let mut current: Vec<F> = self.coeffs.iter().map(|&c| c.to_field()).collect();
    let m = r.len();
    for i in (0..m).rev() {
        let stride = 1 << i;
        let r_val = r[m - 1 - i];
        for j in 0..stride {
            let f0 = current[j];
            let f1 = current[j + stride];
            let slope = f1 - f0;
            // Optimized: check for zero/one slope
            current[j] = if slope.is_zero() {
                f0
            } else if slope.is_one() {
                f0 + r_val
            } else {
                f0 + slope * r_val
            };
        }
    }
    current[0]
}

Speedup: ~2x vs. standard evaluation

Split EQ Evaluation

Optimized evaluation when combined with EQ polynomial:

pub fn split_eq_evaluate(&self, r_len: usize, eq_one: &[F], eq_two: &[F]) -> F {
    // Computes: Σᵢⱼ coeffs[i*|eq_two| + j] * eq_one[i] * eq_two[j]
    (0..eq_one.len())
        .into_par_iter()
        .map(|x1| {
            let partial_sum = (0..eq_two.len())
                .map(|x2| {
                    let idx = x1 * eq_two.len() + x2;
                    self.coeffs[idx].field_mul(eq_two[x2])  // Optimized small * field
                })
                .sum();
            OptimizedMul::mul_01_optimized(partial_sum, eq_one[x1])
        })
        .sum()
}

This avoids materializing the full EQ polynomial evaluation.

Usage in Jolt

MultilinearPolynomial Dispatch

Location: jolt-core/src/poly/multilinear_polynomial.rs

pub enum MultilinearPolynomial<F: JoltField> {
    LargeScalars(DensePolynomial<F>),      // Full field elements
    BoolScalars(CompactPolynomial<bool, F>),
    U8Scalars(CompactPolynomial<u8, F>),
    U16Scalars(CompactPolynomial<u16, F>),
    U32Scalars(CompactPolynomial<u32, F>),
    U64Scalars(CompactPolynomial<u64, F>),
    U128Scalars(CompactPolynomial<u128, F>),
    I64Scalars(CompactPolynomial<i64, F>),
    I128Scalars(CompactPolynomial<i128, F>),
    S128Scalars(CompactPolynomial<S128, F>),  // Signed 128-bit
    // ... specialized variants
}

Witness Polynomials

Many witness polynomials use compact representation:

// Instruction flags (boolean)
let flags_poly = MultilinearPolynomial::BoolScalars(
    CompactPolynomial::from_coeffs(flags)
);

// Memory addresses (u64)
let addr_poly = MultilinearPolynomial::U64Scalars(
    CompactPolynomial::from_coeffs(addresses)
);

// Register indices (u8)
let reg_poly = MultilinearPolynomial::U8Scalars(
    CompactPolynomial::from_coeffs(register_indices)
);

Performance Impact

Sumcheck Inner Loop

The sumcheck inner loop dominates prover time:

// Without CompactPolynomial:
for i in 0..poly.len() {
    sum += poly.coeffs[i] * eq_table[i];  // 32-byte load + field mul
}

// With CompactPolynomial:
for i in 0..poly.len() {
    sum += poly.coeffs[i].field_mul(eq_table[i]);  // 1-16 byte load + optimized mul
}

Impact:

Memory bandwidth: 2-32x reduction in bytes read
Cache efficiency: More coefficients fit in L1/L2 cache
Arithmetic: Faster small-scalar multiplication

Concrete Measurements

For a program with 2^20 cycles:

Polynomial	Type	Memory (Dense)	Memory (Compact)	Savings
Instruction flags	bool	32 MB	1 MB	32x
Memory addresses	u64	32 MB	8 MB	4x
Operand values	u64	32 MB	8 MB	4x
Total (10 polys)	—	320 MB	60 MB	5.3x

Prover speedup: 15-30% from reduced memory bandwidth and cache pressure.

Best Practices

When to Use CompactPolynomial

✅ Use CompactPolynomial when:

Coefficients are small integers (< 2^64)
Polynomial is large (> 2^16 coefficients)
Memory is constrained
Polynomial is bound multiple times in sumcheck

❌ Use DensePolynomial when:

Coefficients are arbitrary field elements
Polynomial is very small (< 2^10 coefficients)
Polynomial is evaluated but never bound

Choosing Scalar Type

Pick the smallest type that fits your values:

// Booleans (0/1)
CompactPolynomial::<bool, F>

// Small indices (0-255)
CompactPolynomial::<u8, F>

// Memory addresses (0 to 2^32-1)
CompactPolynomial::<u32, F>

// Full 64-bit values
CompactPolynomial::<u64, F>

// Signed values
CompactPolynomial::<i64, F>

Performance Checklist

Use smallest scalar type that fits values
Pre-allocate vectors for bound_coeffs using unsafe_allocate_zero_vec
Avoid clones in hot paths (use references)
Profile to verify memory savings translate to speedup

EQ Optimizations: Fast EQ polynomial evaluation
Batched Sumcheck: Protocol using polynomial binding
SharedRaPolynomials: Share EQ tables across polynomials

References

Implementation: jolt-core/src/poly/compact_polynomial.rs
Fast Polynomial Evaluation — Inside-out evaluation technique

Overview

Core Components

Protocols

Optimizations

Theory

Small Value Optimizations (CompactPolynomial)

Problem: Field Element Overhead

Solution: Compact Representation

Memory Savings

Implementation Details

Small Scalar Trait

Two-Phase Representation

Optimized Binding

Key Optimizations

1. Lazy Conversion

2. Optimized Arithmetic

3. Special Case Detection

4. Parallel Binding

Evaluation Optimizations

Inside-Out Evaluation

Split EQ Evaluation

Usage in Jolt

MultilinearPolynomial Dispatch

Witness Polynomials

Performance Impact

Sumcheck Inner Loop

Concrete Measurements

Best Practices

When to Use CompactPolynomial

Choosing Scalar Type

Performance Checklist

References

Build docs developers (and LLMs) love

Overview

Core Components

Protocols

Optimizations

Theory

​Problem: Field Element Overhead

​Solution: Compact Representation

​Memory Savings

​Implementation Details

​Small Scalar Trait

​Two-Phase Representation

​Optimized Binding

​Key Optimizations

​1. Lazy Conversion

​2. Optimized Arithmetic

​3. Special Case Detection

​4. Parallel Binding

​Evaluation Optimizations

​Inside-Out Evaluation

​Split EQ Evaluation

​Usage in Jolt

​MultilinearPolynomial Dispatch

​Witness Polynomials

​Performance Impact

​Sumcheck Inner Loop

​Concrete Measurements

​Best Practices

​When to Use CompactPolynomial

​Choosing Scalar Type

​Performance Checklist

​Related Optimizations

​References

Build docs developers (and LLMs) love

Problem: Field Element Overhead

Solution: Compact Representation

Memory Savings

Implementation Details

Small Scalar Trait

Two-Phase Representation

Optimized Binding

Key Optimizations

1. Lazy Conversion

2. Optimized Arithmetic

3. Special Case Detection

4. Parallel Binding

Evaluation Optimizations

Inside-Out Evaluation

Split EQ Evaluation

Usage in Jolt

MultilinearPolynomial Dispatch

Witness Polynomials

Performance Impact

Sumcheck Inner Loop

Concrete Measurements

Best Practices

When to Use CompactPolynomial

Choosing Scalar Type

Performance Checklist

Related Optimizations

References