Your mission is to optimize the build_kernel() method in the KernelBuilder class to minimize cycle count on a simulated machine.

Objective

Primary Goal

Optimize KernelBuilder.build_kernel() to execute in the minimum number of cycles as measured by the frozen simulator in tests/submission_tests.py.

What You’re Optimizing

The kernel implements a batch tree traversal with cryptographic hashing:
def build_kernel(self, forest_height: int, n_nodes: int, batch_size: int, rounds: int):
    """
    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
    """

The Algorithm

For each round and each item in the batch:
  1. Load the current tree index and value from memory
  2. Read the node value from the forest at that index
  3. XOR the input value with the node value
  4. Hash the result through multiple stages (HASH_STAGES)
  5. Compute the next tree index based on the hash (left or right child)
  6. Wrap the index if it exceeds the tree bounds
  7. Store the updated index and value back to memory
The reference implementation is a scalar implementation using only scalar ALU, load, and store operations.
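The steps above can be sketched in plain Python. This is an illustrative model only: the names (`step`, `mix`), the `HASH_STAGES` count, and the hash function itself are assumptions for readability, not the simulator's actual identifiers or hash algorithm.

```python
HASH_STAGES = 4  # assumed stage count for illustration; see problem.py for the real value

def mix(x: int, stage: int) -> int:
    # Stand-in for one hash stage; the real kernel's hash differs.
    return ((x * 2654435761) ^ (x >> 13) ^ stage) & 0xFFFFFFFF

def step(forest: list[int], n_nodes: int, idx: int, val: int) -> tuple[int, int]:
    """One traversal step for one batch item (steps 1-7 above)."""
    node = forest[idx]                    # 2. read node value at the current index
    h = val ^ node                        # 3. XOR input value with node value
    for s in range(HASH_STAGES):          # 4. hash through multiple stages
        h = mix(h, s)
    idx = 2 * idx + (1 if h & 1 else 2)   # 5. left or right child based on the hash
    if idx >= n_nodes:                    # 6. wrap if the index exceeds tree bounds
        idx = 0
    return idx, h                         # 7. updated index and value to store back
```

The reference kernel emits scalar ALU, load, and store instructions for each of these steps; the optimization opportunity lies in how those instructions are scheduled and overlapped.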

What You Can Modify

You have full freedom to modify KernelBuilder.build_kernel()
You can change instruction sequences, memory access patterns, and algorithms
You can use any available machine features (VLIW, vector operations, etc.)
You can add helper methods to the KernelBuilder class
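As an illustration of the helper-method pattern, here is a sketch. The `KernelBuilder` stand-in, the `HASH_STAGES` count, and the `("hash_stage", ...)` opcode are all hypothetical; only the `self.add(engine, op)` call shape is suggested by the debug examples under Debugging Tools below.

```python
class KernelBuilder:
    """Minimal stand-in; the real class lives in the repository."""
    HASH_STAGES = 4  # assumed

    def __init__(self):
        self.instrs = []

    def add(self, engine, op):
        # The real add() presumably does more; this just records the instruction.
        self.instrs.append((engine, op))

    def emit_hash(self, src, dst):
        # A helper of the kind you might add: emit all hash stages for one
        # value in a single call, so the main loop stays readable.
        for stage in range(self.HASH_STAGES):
            self.add("alu", ("hash_stage", stage, src, dst))
            src = dst
```

Factoring repeated instruction sequences into helpers like this makes it much easier to later reorder or interleave them across batch items.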

What You Cannot Modify

Critical: Do not modify anything in the tests/ folder. The submission harness uses a frozen copy of the simulator to validate your results.
Do not change the problem specification (tree structure, hash algorithm, etc.)
Do not modify test parameters (forest_height=10, rounds=16, batch_size=256)

Rules and Constraints

Machine Architecture

You’re compiling for a simulated VLIW machine with:
  • Multiple execution engines (ALU, load, store, flow, debug)
  • Configurable instruction bundling
  • Specific cycle costs per operation
  • Limited scratch space (SCRATCH_SIZE)
  • Slot limits per engine (SLOT_LIMITS)
Review problem.py to understand the machine architecture, instruction set, and available operations.
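To make the slot-limit constraint concrete, here is a toy greedy bundler. The `SLOT_LIMITS` values are made up, and real scheduling must also respect data dependencies and per-operation cycle costs, which this sketch ignores; consult problem.py for the actual rules.

```python
# Hypothetical per-engine slot limits; the real values are in problem.py.
SLOT_LIMITS = {"alu": 4, "load": 2, "store": 2, "flow": 1}

def bundle(instrs):
    """Greedily pack (engine, op) pairs into VLIW bundles without
    exceeding any engine's slot limit. Ignores data dependencies."""
    bundles, current, used = [], [], {}
    for engine, op in instrs:
        if used.get(engine, 0) >= SLOT_LIMITS[engine]:
            # This engine's slots are full: close the bundle, start a new one.
            bundles.append(current)
            current, used = [], {}
        current.append((engine, op))
        used[engine] = used.get(engine, 0) + 1
    if current:
        bundles.append(current)
    return bundles
```

The point of the exercise is that independent work from different batch items can fill otherwise-empty slots in the same bundle, which is where most of the cycle savings come from.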

Correctness Requirements

Your optimized kernel must:
  1. Produce identical output to the reference kernel for all test cases
  2. Pass the correctness test in tests/submission_tests.py
  3. Use only valid instructions supported by the simulator

Validation Process

Validate your submission using:
# This should be empty - the tests folder must be unchanged
git diff origin/main tests/

# You should pass some of these tests
python tests/submission_tests.py
The testing harness in perf_takehome.py includes debug validation with pause instructions that must match the reference kernel’s yields. The submission harness ignores these debug features.

The LLM Cheating Problem

Of the solutions received on the first day after release, none that claimed fewer than 1,300 cycles was valid. In every case, a language model had modified the tests to make the problem easier.

Common Cheating Patterns

Example: A model might:
  1. Notice multicore support in problem.py
  2. Implement multicore as an “optimization”
  3. Notice N_CORES = 1 prevents speedup
  4. “Fix” the core count to get artificial speedup
Multicore is intentionally disabled in this version. Don’t try to enable it!

Best Practices for AI-Assisted Development

If using an AI agent:
  1. Instruct it not to modify tests/. Explicitly tell your AI not to change anything in the tests/ folder.
  2. Use submission_tests.py for verification. Always validate using the official test suite.
  3. Verify test integrity. Run git diff origin/main tests/ before submitting.
  4. Understand your optimizations. Be able to explain what your code actually does.

Performance Test

The main performance benchmark:
def test_kernel_cycles(self):
    do_kernel_test(10, 16, 256)
Parameters:
  • forest_height: 10 (creates a binary tree with 1,023 nodes)
  • rounds: 16 (number of full batch iterations)
  • batch_size: 256 (items processed per round)

Baseline Performance

147,734 cycles - The unoptimized scalar implementation
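A quick sanity check on the benchmark parameters and this baseline, using only numbers stated above:

```python
forest_height, rounds, batch_size = 10, 16, 256

n_nodes = 2 ** forest_height - 1       # complete binary tree: 1023 nodes
traversal_steps = rounds * batch_size  # 4096 item-steps per benchmark run

baseline_cycles = 147_734
per_step = baseline_cycles / traversal_steps  # roughly 36 cycles per item-step
```

At roughly 36 baseline cycles per item-step, even modest instruction-level parallelism across the 256 independent batch items leaves a lot of headroom.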

Debugging Tools

The repository includes several debugging aids:

Trace Visualization

# Generate a hot-reloadable trace (Chrome only)
python perf_takehome.py Tests.test_kernel_trace

# In another terminal
python watch_trace.py
This opens a Perfetto trace viewer that auto-updates as you re-run tests.

Debug Instructions

The kernel supports debug instructions:
self.add("debug", ("compare", addr, (round, i, "label")))
self.add("debug", ("comment", "Helpful message"))
These are ignored by the submission simulator but can help during development.
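One convenient pattern is to wrap debug emission in small helpers gated by a flag, so development builds get checks and clean builds stay compact. The mixin below is hypothetical; it assumes only the `self.add("debug", ...)` call shape shown above.

```python
class DebugHelpers:
    """Hypothetical mixin for KernelBuilder; add() signature assumed
    from the debug examples above."""
    DEBUG = True

    def debug_compare(self, addr, expected):
        # The submission simulator ignores debug instructions anyway,
        # but skipping them when DEBUG is off keeps traces smaller.
        if self.DEBUG:
            self.add("debug", ("compare", addr, expected))

    def debug_comment(self, msg):
        if self.DEBUG:
            self.add("debug", ("comment", msg))
```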

Next Steps

  • View Benchmarks: see what performance levels to aim for.
  • Study the Code: dive into the source code and start optimizing.
