# Benchmark suite

The `benchmarks/` directory contains a comprehensive benchmark suite comparing Python and Walrus performance across various computational patterns.
## Running benchmarks

### Run all benchmarks

Navigate to the `benchmarks/` directory and execute the benchmark script. The script will automatically build Walrus in release mode if needed.
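The automatic build is the standard Cargo release build; to perform it manually ahead of time:

```shell
# Build the Walrus interpreter in release mode
# (this is what the benchmark script does for you if needed)
cargo build --release
```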
### Review results

The script outputs timing comparisons for each benchmark, showing which language is faster and by what factor.
## Run individual benchmarks

You can run specific benchmarks separately.

Python:

```sh
python3 benchmarks/01_quicksort.py
```

Walrus (from the project root, using the built binary):

```sh
./target/release/walrus benchmarks/01_quicksort.walrus
```
## Benchmark categories

### Sorting algorithms

| Benchmark | Description | Scale |
|---|---|---|
| `01_quicksort` | Quicksort algorithm | 5,000 elements |
| `02_bubble_sort` | O(n²) bubble sort | 1,000 elements |
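As an illustration of the pattern being measured, here is a minimal Python sketch of a quicksort benchmark at this scale (the actual `01_quicksort.py` may differ in details):

```python
import random

def quicksort(arr):
    # Simple recursive quicksort (not in-place), heavy on allocation
    # and function calls, which is what this benchmark stresses.
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    mid = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + mid + quicksort(right)

random.seed(42)  # fixed seed so runs are comparable
data = [random.randint(0, 100_000) for _ in range(5_000)]
result = quicksort(data)
print(result == sorted(data))  # sanity check: True
```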
### Recursion tests

| Benchmark | Description | Scale |
|---|---|---|
| `03_fibonacci_recursive` | Naive recursive Fibonacci | fib(30), exponential time |
| `04_fibonacci_iterative` | Iterative Fibonacci | 100,000 iterations |
| `13_factorial` | Recursive factorial | factorial(12), 100k calls |
| `15_ackermann` | Ackermann function | A(3,7), extreme recursion |
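The naive recursive Fibonacci is the classic function-call stress test; a minimal Python sketch of what `03_fibonacci_recursive` presumably computes (the actual file may differ):

```python
def fib(n):
    # Deliberately naive: roughly O(phi^n) calls, so fib(30)
    # makes over a million recursive calls.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040
```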
### Memory and GC stress

| Benchmark | Description | Scale |
|---|---|---|
| `05_gc_stress` | Object allocation/deallocation | 50,000 temporary objects |
| `06_gc_linked_structures` | Linked list creation | 1,000 lists × depth 100 |
| `16_tree_traversal` | Binary tree operations | depth 15 |
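The GC stress pattern is short-lived object churn; a minimal Python sketch of the kind of loop `05_gc_stress` runs (names here are illustrative, not the benchmark's actual code):

```python
class Node:
    # Tiny heap object; each instance becomes garbage almost immediately.
    def __init__(self, value):
        self.value = value

def gc_stress(n):
    total = 0
    for i in range(n):
        obj = Node(i)       # allocate a temporary object
        total += obj.value  # obj is unreachable after this iteration
    return total

print(gc_stress(50_000))
```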
### Loops

| Benchmark | Description | Scale |
|---|---|---|
| `07_iteration_heavy` | Simple counting loop | 1,000,000 iterations |
| `08_nested_loops` | Triple nested loops | 100³ = 1M iterations |
| `17_while_loop` | While loop counting | 1,000,000 iterations |
### Data structures

| Benchmark | Description | Scale |
|---|---|---|
| `09_list_operations` | List append, access, iteration | 10,000 elements |
| `10_dict_operations` | Dictionary insert and lookup | 10,000 entries |
| `11_string_concat` | String concatenation | 10,000 characters |
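For the dictionary benchmark, the measured pattern is insert-then-lookup; a minimal Python sketch (the actual `10_dict_operations.py` may differ):

```python
def dict_bench(n):
    d = {}
    for i in range(n):
        d[f"key{i}"] = i       # insert phase: n string-keyed entries
    total = 0
    for i in range(n):
        total += d[f"key{i}"]  # lookup phase: hash + probe per key
    return total

print(dict_bench(10_000))
```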
### Algorithms

| Benchmark | Description | Scale |
|---|---|---|
| `12_prime_sieve` | Sieve of Eratosthenes | up to 50,000 |
| `14_matrix_multiply` | Matrix multiplication | 50×50 matrices |
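A minimal Python sketch of the Sieve of Eratosthenes at the benchmark's scale (the actual `12_prime_sieve.py` may differ):

```python
def sieve(limit):
    # Classic sieve: mark every composite up to `limit`,
    # then count the survivors.
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, limit + 1, i):
                is_prime[j] = False
    return sum(is_prime)

print(sieve(50_000))  # number of primes up to 50,000
```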
### Language features

| Benchmark | Description | Scale |
|---|---|---|
| `18_closure_stress` | Function call overhead | 50,000 calls |
| `19_arithmetic_heavy` | Heavy arithmetic operations | 1,000,000 iterations |
| `20_method_dispatch` | Struct/class method calls | 100,000 iterations |
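Method dispatch benchmarks measure the per-call cost of resolving and invoking a method on an object; a minimal Python sketch of the pattern (class and method names here are illustrative):

```python
class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        # Each call pays attribute lookup + method dispatch overhead,
        # which is exactly what this benchmark measures.
        self.n += 1

c = Counter()
for _ in range(100_000):
    c.increment()
print(c.n)  # 100000
```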
## Interpreting results

Lower execution times are better. The benchmark script displays:

- Execution time for each language
- Comparative speedup factor
- Color-coded results (green for Walrus wins, red for Python wins)
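The speedup factor is the ratio of the two times; a minimal sketch of the computation (a hypothetical helper, not the script's actual code):

```python
def speedup(python_time, walrus_time):
    # Factor by which Walrus beats (>1) or trails (<1) Python.
    return python_time / walrus_time

# Using the sample output's timings:
print(f"Walrus is {speedup(0.45, 0.32):.2f}x faster")
```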
Sample output:

```
==========================================
Benchmark: 01_quicksort
==========================================
--- Python ---
Time: 0.45s
--- Walrus ---
Time: 0.32s
Walrus is 1.41x faster
```
## Factors affecting results

Performance results may vary based on several factors:

- **Python version:** Python 3.11+ includes significant interpreter optimizations (the specializing adaptive interpreter) that improve performance
- **System hardware:** CPU, RAM, and system load affect benchmarks
- **Compilation options:** ensure Walrus is built with the `--release` flag
- **Startup time:** both interpreters include initialization overhead
- **Compilation time:** Walrus parsing and bytecode generation is included in its timing
## JIT benchmark comparison

For JIT-enabled builds, you can compare interpreter vs JIT performance:

```sh
# Build with JIT support
cargo build --release --features jit

# Run without JIT
./target/release/walrus benchmarks/07_iteration_heavy.walrus

# Run with JIT enabled
./target/release/walrus benchmarks/07_iteration_heavy.walrus --jit

# Show JIT statistics
./target/release/walrus benchmarks/07_iteration_heavy.walrus --jit --jit-stats
```
Expected JIT speedup (hot loops):

| Pattern | Interpreter | JIT | Speedup |
|---|---|---|---|
| 10K iterations × sum(0..1000) | 0.68s | 0.01s | ~68× |
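The hot-loop pattern in the table can be sketched in Python (note that Walrus's `0..1000` is exclusive, like `range(1000)`; this is an illustration, not the benchmark's actual source):

```python
def hot_loop():
    # 10K outer iterations, each summing 0..999: a tight numeric loop
    # that a tracing/method JIT can compile to fast native code.
    total = 0
    for _ in range(10_000):
        s = 0
        for j in range(1_000):
            s += j
        total += s
    return total

print(hot_loop())
```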
## Notes on fairness

- Both languages run equivalent algorithms for fair comparison
- Syntax differences are minimal (e.g., `0..n` vs `range(n)`)
- Memory usage is not directly measured, only GC stress patterns
- Python startup time and Walrus compilation time are both included
## Syntax differences

| Feature | Python | Walrus |
|---|---|---|
| Range loop | `for i in range(n):` | `for i in 0..n {` |
| Function def | `def foo(x):` | `fn foo : x {` |
| None/null | `None` | `void` |
| Print | `print(x)` | `println(x)` |
| F-string | `f"x={x}"` | `f"x={x}"` |
## Contributing benchmarks

When adding new benchmarks:

1. Create both `.walrus` and `.py` files with the same basename
2. Ensure the algorithms are equivalent
3. Include an appropriate scale for meaningful timing
4. Add descriptions to the benchmark README
5. Test that both implementations produce correct results

See Contributing for more details.