Prerequisites
Before you begin, ensure you have:
- Python 3.8+ installed on your system
- Basic understanding of performance optimization concepts
- A text editor or IDE for Python development
This challenge uses only the Python standard library; no external dependencies are required.
Installation
Verify your setup
Run the baseline tests to ensure everything is working:
You should see output showing the baseline performance:
The baseline implementation takes 147,734 cycles. Your goal is to reduce this!
Running the Baseline Test
The main test runs a tree traversal simulation on a custom VLIW SIMD architecture. Here’s what happens:
Test Parameters
- Forest height: 10 (creates a binary tree with 2,047 nodes)
- Rounds: 16 (iterations through the tree)
- Batch size: 256 (parallel traversals)
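The parameters above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not part of the challenge code: the constant and function names are hypothetical, and the step count assumes that each round advances every traversal in the batch by exactly one step.

```python
# Test parameters as listed above (names are illustrative, not from problem.py).
FOREST_HEIGHT = 10
ROUNDS = 16
BATCH_SIZE = 256

def node_count(height: int) -> int:
    """Nodes in a perfect binary tree with the root at level 0 and leaves at level `height`."""
    return 2 ** (height + 1) - 1

total_nodes = node_count(FOREST_HEIGHT)
# Assumption: one tree step per traversal per round.
total_steps = ROUNDS * BATCH_SIZE

print(total_nodes)  # 2047
print(total_steps)  # 4096
```

A height of 10 gives 11 levels, which is where the 2,047 node figure comes from.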
Your First Optimization
Let’s make a simple optimization to understand the workflow. The baseline uses a scalar implementation, so let’s explore vectorization.
Understanding the Current Implementation
The build_kernel method in the KernelBuilder class generates instructions for a scalar ALU implementation:
Optimization Strategy
The simulator supports SIMD operations with VLEN = 8, allowing you to process 8 elements simultaneously using:
- vload/vstore: Vector load/store operations
- valu: Vector ALU operations
- vbroadcast: Broadcast scalar to vector
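To build intuition for what the operations listed above do, here is a plain-Python emulation. It is only a conceptual sketch: the function signatures and the list-based memory are assumptions made for illustration, and they are not the simulator's instruction format or the KernelBuilder API.

```python
import operator

VLEN = 8          # vector width stated in the text
BATCH_SIZE = 256  # batch size from the test parameters

# Hypothetical stand-ins that mimic the semantics of each operation on a Python list.
def vload(mem, addr):
    """Read VLEN consecutive words starting at addr."""
    return mem[addr:addr + VLEN]

def vstore(mem, addr, vec):
    """Write VLEN words starting at addr."""
    mem[addr:addr + VLEN] = vec

def vbroadcast(scalar):
    """Copy one scalar into every lane of a vector."""
    return [scalar] * VLEN

def valu(op, a, b):
    """Apply a binary operation lane by lane."""
    return [op(x, y) for x, y in zip(a, b)]

mem = list(range(16)) + [0] * VLEN  # 24 words of toy memory
lanes = vload(mem, 0)               # lanes 0..7
factor = vbroadcast(3)
vstore(mem, 16, valu(operator.mul, lanes, factor))

print(mem[16:24])            # [0, 3, 6, 9, 12, 15, 18, 21]
print(BATCH_SIZE // VLEN)    # 32 vector-sized groups cover the whole batch
```

One vector operation here replaces eight scalar ones, which is why a batch of 256 traversals can in principle be covered by 32 vector-wide groups.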
Validating Your Results
Always validate your optimizations using the submission tests:
Check for test modifications
Ensure you haven’t accidentally modified the test files:
This should be empty. LLMs have been known to modify tests to make the problem easier!
Run submission tests
Execute the official validation tests:
This runs correctness tests and evaluates your performance against benchmarks:
Check which thresholds you pass
The submission tests include multiple performance thresholds:
- 147,734 cycles: Baseline (starting point)
- 18,532 cycles: Updated take-home starting point (7.97x faster)
- 2,164 cycles: Claude Opus 4 after many hours
- 1,790 cycles: Claude Opus 4.5 casual session (approximately the best human 2-hour performance)
- 1,487 cycles: Claude Opus 4.5 after 11.5 hours
- 1,363 cycles: Claude Opus 4.5 improved harness
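To relate a measured cycle count to the thresholds above, a small helper like the following may be handy. It is a sketch with hypothetical names, and it assumes that passing a threshold means finishing in strictly fewer cycles; the submission tests remain the authority on what counts as a pass.

```python
# Cycle thresholds copied from the list above.
BASELINE = 147_734
THRESHOLDS = [147_734, 18_532, 2_164, 1_790, 1_487, 1_363]

def speedup(cycles: int, baseline: int = BASELINE) -> float:
    """How many times faster than the baseline a given cycle count is."""
    return baseline / cycles

def thresholds_passed(cycles: int) -> list:
    """Thresholds beaten, assuming 'pass' means strictly fewer cycles."""
    return [t for t in THRESHOLDS if cycles < t]

print(f"{speedup(18_532):.2f}x")   # 7.97x
print(thresholds_passed(2_000))    # [147734, 18532, 2164]
```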
Debug Workflow
For detailed debugging, use the trace visualization, which shows:
- Instruction execution per cycle
- Engine utilization (ALU, load, store, flow)
- Scratch space variable changes
- Performance bottlenecks
The trace hot-reloads automatically when you re-run tests, making it ideal for iterative debugging.
Understanding the Architecture
The simulator models a VLIW (Very Long Instruction Word) SIMD architecture with:
- Multiple engines executing in parallel per cycle
- Slot limits per engine (e.g., 12 ALU slots, 2 load slots, 2 store slots)
- Scratch space (1,536 32-bit words) serving as registers
- Vector operations processing 8 elements (VLEN=8) at once
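The slot limits suggest a quick way to estimate how well a kernel could pack. The sketch below computes a lower bound on cycles from per-engine operation counts, using the example limits quoted above. The dictionary name, the function, and the workload numbers are all hypothetical, and the bound ignores data dependencies, so a real schedule may need more cycles.

```python
import math

# Example per-cycle slot limits quoted in the text (illustrative name).
EXAMPLE_SLOT_LIMITS = {"alu": 12, "load": 2, "store": 2}

def min_cycles(op_counts, slot_limits=EXAMPLE_SLOT_LIMITS):
    """Lower bound on cycles, assuming every op is independent of the others.

    Each engine needs ceil(ops / slots) cycles; engines run in parallel,
    so the busiest engine sets the bound.
    """
    return max(
        (math.ceil(n / slot_limits[engine]) for engine, n in op_counts.items()),
        default=0,
    )

# Made-up workload: 240 ALU ops, 64 loads, 32 stores.
work = {"alu": 240, "load": 64, "store": 32}
print(min_cycles(work))  # 32, because loads need 64 / 2 = 32 cycles
```

In this made-up workload the load engine is the bottleneck even though there are far more ALU operations, which is the kind of imbalance the engine utilization view in the trace can help expose.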
problem.py:
Next Steps
Now that you have the baseline running, explore these optimization strategies:
Architecture Deep Dive
Learn about the VLIW SIMD simulator and instruction set
Kernel Development
Master the KernelBuilder API and optimization techniques
Performance Benchmarks
See what performance levels are achievable
Debugging Guide
Use trace visualization and debugging tools effectively
Stuck? Review the reference implementations in problem.py:
- reference_kernel(): High-level Python implementation
- reference_kernel2(): Flat memory implementation matching your kernel