build_kernel() method in the KernelBuilder class to minimize cycle count on a simulated machine.
Objective
Primary Goal
Optimize KernelBuilder.build_kernel() to execute in the minimum number of cycles as measured by the frozen simulator in tests/submission_tests.py.
What You’re Optimizing
The kernel implements a batch tree traversal with cryptographic hashing.
The Algorithm
For each round and each item in the batch:
- Load the current tree index and value from memory
- Read the node value from the forest at that index
- XOR the input value with the node value
- Hash the result through multiple stages (HASH_STAGES)
- Compute the next tree index based on the hash (left or right child)
- Wrap the index if it exceeds the tree bounds
- Store the updated index and value back to memory
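The steps above can be sketched in plain Python as follows. This is only an illustrative sketch, not the reference kernel: the flat `forest`, `indices`, and `values` lists, the `toy_hash` mixing function with its `TOY_STAGES` constants, and the child-selection rule (even hash goes left, odd goes right) are assumptions made for the example. The actual hash stages come from HASH_STAGES in the repository.

```python
# Illustrative sketch only. toy_hash and TOY_STAGES are hypothetical stand-ins
# for the real HASH_STAGES, and the child-selection rule is an assumption.
MASK32 = 0xFFFFFFFF
TOY_STAGES = [(0x9E3779B9, 13), (0x85EBCA6B, 7), (0xC2B2AE35, 16)]


def toy_hash(x: int) -> int:
    """Mix a 32-bit value through several add / xor-shift stages."""
    for const, shift in TOY_STAGES:
        x = (x + const) & MASK32
        x ^= x >> shift
    return x


def run_rounds(forest, indices, values, rounds):
    """Apply the traversal to every batch item, updating the lists in place."""
    n_nodes = len(forest)
    for _ in range(rounds):
        for i in range(len(indices)):
            idx, val = indices[i], values[i]            # load index and value
            val = toy_hash(val ^ forest[idx])           # XOR with node, then hash
            idx = 2 * idx + (1 if val % 2 == 0 else 2)  # left or right child
            if idx >= n_nodes:                          # wrap past the last node
                idx = 0
            indices[i], values[i] = idx, val            # store back


forest = [(i * 2654435761) & MASK32 for i in range(7)]  # tiny 7-node tree
indices, values = [0, 0, 0, 0], [1, 2, 3, 4]
run_rounds(forest, indices, values, rounds=5)
print(indices, values)
```

Every item in the batch is independent of the others within a round, which is what makes the loop body a candidate for vector operations and instruction bundling.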
What You Can Modify
- You have full freedom to modify KernelBuilder.build_kernel()
- You can change instruction sequences, memory access patterns, and algorithms
- You can use any available machine features (VLIW, vector operations, etc.)
- You can add helper methods to the KernelBuilder class
What You Cannot Modify
Rules and Constraints
Machine Architecture
You’re compiling for a simulated VLIW machine with:
- Multiple execution engines (ALU, load, store, flow, debug)
- Configurable instruction bundling
- Specific cycle costs per operation
- Limited scratch space (SCRATCH_SIZE)
- Slot limits per engine (SLOT_LIMITS)
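To make the bundling constraint concrete, here is a minimal sketch of packing operations into VLIW bundles under per-engine slot limits. The values in `EXAMPLE_SLOT_LIMITS`, the `(engine, name)` operation format, and the `pack_bundles` helper are all hypothetical; the real limits live in SLOT_LIMITS and the real instruction format is defined by the simulator. The sketch also assumes the operations are mutually independent, so it ignores the data dependencies a real scheduler has to respect.

```python
from collections import Counter

# Hypothetical per-engine slot limits; the real values are in SLOT_LIMITS.
EXAMPLE_SLOT_LIMITS = {"alu": 2, "load": 1, "store": 1, "flow": 1}


def pack_bundles(ops, slot_limits):
    """Greedily pack independent (engine, name) ops into bundles.

    Each bundle holds at most slot_limits[engine] ops per engine. An op goes
    into the first bundle that still has a free slot for its engine.
    """
    bundles = []  # list of bundles, each a list of ops
    used = []     # parallel list: slots used per engine in each bundle
    for engine, name in ops:
        if slot_limits.get(engine, 0) < 1:
            raise ValueError(f"no slots for engine {engine!r}")
        for bundle, counts in zip(bundles, used):
            if counts[engine] < slot_limits[engine]:
                bundle.append((engine, name))
                counts[engine] += 1
                break
        else:
            # No existing bundle had room, so open a new one.
            bundles.append([(engine, name)])
            used.append(Counter({engine: 1}))
    return bundles


ops = [("load", "ld_a"), ("load", "ld_b"), ("alu", "xor"),
       ("alu", "add"), ("alu", "shift"), ("store", "st_a")]
bundles = pack_bundles(ops, EXAMPLE_SLOT_LIMITS)
for cycle, bundle in enumerate(bundles):
    print(cycle, bundle)
```

With these example limits, the six operations fit into two bundles rather than six single-operation bundles.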
Correctness Requirements
Your optimized kernel must:
- Produce identical output to the reference kernel for all test cases
- Pass the correctness test in tests/submission_tests.py
- Use only valid instructions supported by the simulator
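A simple way to hold an optimization to the "identical output" requirement is differential testing: run the reference and the candidate on the same seeded random inputs and compare the results exactly. The sketch below uses two hypothetical stand-in functions, `reference_step` and `optimized_step`, rather than the real kernels; the authoritative check remains the one in tests/submission_tests.py.

```python
import random


def reference_step(values, node_values):
    """Straightforward scalar loop: XOR each value with its node value."""
    out = []
    for v, n in zip(values, node_values):
        out.append(v ^ n)
    return out


def optimized_step(values, node_values):
    """Hypothetical 'optimized' variant that must match the reference."""
    return [v ^ n for v, n in zip(values, node_values)]


def check_equivalent(reference, candidate, trials=100, size=16, seed=0):
    """Return True if both functions agree on every seeded random input."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = [rng.getrandbits(32) for _ in range(size)]
        nodes = [rng.getrandbits(32) for _ in range(size)]
        if reference(values, nodes) != candidate(values, nodes):
            return False
    return True


print(check_equivalent(reference_step, optimized_step))
```

Fixing the seed keeps failures reproducible, which matters when a mismatch only shows up for particular inputs.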
Validation Process
Validate your submission using tests/submission_tests.py.
The testing harness in perf_takehome.py includes debug validation with pause instructions that must match the reference kernel’s yields. The submission harness ignores these debug features.
The LLM Cheating Problem
Common Cheating Patterns
Example: A model might:
- Notice multicore support in problem.py
- Implement multicore as an “optimization”
- Notice N_CORES = 1 prevents speedup
- “Fix” the core count to get artificial speedup
Multicore is intentionally disabled in this version. Don’t try to enable it!
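One way to catch this kind of tampering is to fingerprint the frozen files before and after an agent runs, then compare the digests. The `fingerprint_dir` helper below is a hypothetical sketch, not part of the repository. It is demonstrated on a temporary directory so it runs anywhere, but it could be pointed at the frozen files in the same way.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint_dir(root):
    """Return a SHA-256 digest over the relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    frozen = Path(tmp) / "frozen.py"  # hypothetical stand-in for a frozen file
    frozen.write_text("N_CORES = 1\n")
    before = fingerprint_dir(tmp)
    unchanged = fingerprint_dir(tmp) == before

    frozen.write_text("N_CORES = 8\n")  # simulated tampering
    tampered = fingerprint_dir(tmp) != before

print(unchanged, tampered)
```

A digest mismatch does not say what changed, only that something did, so it works best as a gate before trusting a reported cycle count.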
Best Practices for AI-Assisted Development
If using an AI agent:
Instruct it not to modify tests/
Explicitly tell your AI not to change anything in the tests/ folder
Performance Test
The main performance benchmark:
- forest_height: 10 (creates a binary tree with 1,023 nodes)
- rounds: 16 (number of full batch iterations)
- batch_size: 256 (items processed per round)
Baseline Performance
147,734 cycles - The unoptimized scalar implementation
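For a sense of scale, the baseline can be normalized by the amount of work in the benchmark. The short calculation below uses only the figures quoted above; the `speedup` helper and the 10,000-cycle input are hypothetical illustrations, not targets taken from the benchmark.

```python
ROUNDS = 16
BATCH_SIZE = 256
BASELINE_CYCLES = 147_734

# One "item-step" is one batch item processed for one round.
item_steps = ROUNDS * BATCH_SIZE
cycles_per_step = BASELINE_CYCLES / item_steps


def speedup(cycles):
    """Speedup of a kernel taking `cycles` relative to the scalar baseline."""
    return BASELINE_CYCLES / cycles


print(item_steps)                  # 4096
print(round(cycles_per_step, 2))   # 36.07
print(round(speedup(10_000), 1))   # 14.8 for a hypothetical 10,000-cycle kernel
```

So the scalar baseline spends roughly 36 cycles per item per round, which is the budget that bundling and vectorization can reduce.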
Debugging Tools
The repository includes several debugging aids:
Trace Visualization
Debug Instructions
The kernel supports debug instructions:
Next Steps
View Benchmarks
See what performance levels to aim for
Study the Code
Dive into the source code and start optimizing