
Overview

Validation ensures your optimized kernel produces correct results and meets performance targets. The challenge provides multiple layers of validation, from unit tests to submission benchmarks.
Correctness before performance. Always ensure your kernel is correct before optimizing for speed.

Running Unit Tests

All Tests

Run all tests in perf_takehome.py:
python perf_takehome.py

Specific Tests

Run individual test methods:
python perf_takehome.py Tests.test_kernel_cycles

Available Tests

perf_takehome.py:228-258
class Tests(unittest.TestCase):
    def test_ref_kernels(self):
        """Test the reference kernels against each other"""
        
    def test_kernel_trace(self):
        # Full-scale example for performance testing
        do_kernel_test(10, 16, 256, trace=True, prints=False)
    
    def test_kernel_cycles(self):
        do_kernel_test(10, 16, 256)
The commented-out test_kernel_correctness test can be uncommented for additional debugging with various parameter combinations.

Submission Tests

The official validation for your submission uses tests/submission_tests.py:
python tests/submission_tests.py
From the instructions:
perf_takehome.py:13-14
Validate your results using `python tests/submission_tests.py` without modifying
anything in the tests/ folder.

Correctness Tests

submission_tests.py:57-60
class CorrectnessTests(unittest.TestCase):
    def test_kernel_correctness(self):
        for i in range(8):
            do_kernel_test(10, 16, 256)
Runs 8 randomized correctness tests with the full-scale problem parameters.

Speed Tests

The submission includes progressive speed thresholds:
submission_tests.py:76-116
class SpeedTests(unittest.TestCase):
    def test_kernel_speedup(self):
        assert cycles() < BASELINE  # 147734 cycles
    
    def test_kernel_updated_starting_point(self):
        assert cycles() < 18532
    
    def test_opus4_many_hours(self):
        assert cycles() < 2164
    
    def test_opus45_casual(self):
        assert cycles() < 1790  # Best human performance in 2 hours
    
    def test_opus45_2hr(self):
        assert cycles() < 1579
You don’t need to pass every speed test to succeed: the thresholds get progressively harder, so impressiveness isn’t linear in the number of tests passed.
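For context, the thresholds correspond to the following speedups over the 147734-cycle baseline. This is computed directly from the numbers quoted above:

```python
# Speedups implied by each threshold, relative to the baseline
# cycle count quoted in test_kernel_speedup.
BASELINE = 147734
thresholds = {
    "updated_starting_point": 18532,
    "opus4_many_hours": 2164,
    "opus45_casual": 1790,
    "opus45_2hr": 1579,
}
speedups = {name: BASELINE / c for name, c in thresholds.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
```

So the hardest threshold demands roughly a 93x speedup over the naive baseline.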

Debug Engine Instructions

The simulator includes a debug engine for validating intermediate values against the reference implementation.

Compare Instructions

Check scalar values during execution:
perf_takehome.py:140-144
body.append(("debug", ("compare", tmp_idx, (round, i, "idx"))))
body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
body.append(("load", ("load", tmp_val, tmp_addr)))
body.append(("debug", ("compare", tmp_val, (round, i, "val"))))
The debug engine compares your kernel’s value against the reference:
problem.py:366-374
if name == "debug":
    if not self.enable_debug:
        continue
    for slot in slots:
        if slot[0] == "compare":
            loc, key = slot[1], slot[2]
            ref = self.value_trace[key]
            res = core.scratch[loc]
            assert res == ref, f"{res} != {ref} for {key} at pc={core.pc}"
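The mechanics can be reduced to a self-contained toy. The names below mirror the simulator (`value_trace`, `scratch`), but this is a sketch of the idea, not the real classes:

```python
# Toy model of the scalar compare: the reference run records values
# into a trace keyed by (round, i, name); the debug engine then checks
# the kernel's scratch value against that trace at the compare point.
value_trace = {(0, 3, "val"): 42}    # recorded by the reference kernel

scratch = [0] * 16
tmp_val = 5                          # scratch slot holding the result
scratch[tmp_val] = 42                # value computed by your kernel

loc, key = tmp_val, (0, 3, "val")
ref = value_trace[key]
res = scratch[loc]
assert res == ref, f"{res} != {ref} for {key}"
```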

Vector Compare

For SIMD operations, use vcompare:
problem.py:375-381
elif slot[0] == "vcompare":
    loc, keys = slot[1], slot[2]
    ref = [self.value_trace[key] for key in keys]
    res = core.scratch[loc : loc + VLEN]
    assert res == ref, (
        f"{res} != {ref} for {keys} at pc={core.pc} loc={loc}"
    )

Debug Comments

Add comments to trace output for clarity:
perf_takehome.py:124
self.add("debug", ("comment", "Starting loop"))
Debug instructions are ignored by the submission simulator. They only affect validation during development.

The Pause/Yield Mechanism

The pause mechanism synchronizes your kernel with the reference implementation for step-by-step validation.

How It Works

Reference kernel uses yield to pause at checkpoints:
problem.py:535-568
def reference_kernel2(mem: list[int], trace: dict[Any, int] = {}):
    # ... initialization ...
    yield mem  # Initial state
    for h in range(rounds):
        for i in range(batch_size):
            # ... computation ...
    yield mem  # Final state
Your kernel uses pause instructions to match:
perf_takehome.py:121-123
self.add("flow", ("pause",))
self.add("debug", ("comment", "Starting loop"))
Validation loop steps through both:
perf_takehome.py:206-221
for i, ref_mem in enumerate(reference_kernel2(mem, value_trace)):
    machine.run()  # Run until next pause
    inp_values_p = ref_mem[6]
    assert (
        machine.mem[inp_values_p : inp_values_p + len(inp.values)]
        == ref_mem[inp_values_p : inp_values_p + len(inp.values)]
    ), f"Incorrect result on round {i}"
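The pattern boils down to lockstep iteration: the reference generator yields shared state at each checkpoint, and the harness runs the machine to its next pause before comparing. A minimal self-contained sketch (the "machine" here is faked by applying the same update, where the real harness calls `machine.run()`):

```python
# Minimal sketch of the yield/pause lockstep pattern.
def reference(mem):
    yield mem            # initial state, before any round
    for i in range(len(mem)):
        mem[i] += 1      # one "round" of work
        yield mem        # checkpoint after each round

machine_mem = [0, 0, 0]
for i, ref_mem in enumerate(reference([0, 0, 0])):
    # the real harness calls machine.run() to advance to the next
    # pause; here we mimic that by applying the same per-round update
    if i > 0:
        machine_mem[i - 1] += 1
    assert machine_mem == ref_mem, f"Incorrect result on round {i}"
```

If the kernel emits too few or too many pauses, the comparison happens against the wrong checkpoint, which is why the pause count must match the reference's yields.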
Important: Pause instructions must match the reference kernel’s yields during development testing, but are ignored by the submission simulator.

Controlling Pause Behavior

problem.py:116-117
self.enable_pause = True
self.enable_debug = True
Submission tests disable both:
submission_tests.py:41-42
machine.enable_pause = False
machine.enable_debug = False

Comparing with Reference Kernels

Two reference implementations are provided:

reference_kernel

Python implementation operating on Tree and Input objects

reference_kernel2

Flat memory implementation that matches the simulator’s memory layout

Using reference_kernel2 for Validation

perf_takehome.py:229-242
def test_ref_kernels(self):
    """Test the reference kernels against each other"""
    random.seed(123)
    for i in range(10):
        f = Tree.generate(4)
        inp = Input.generate(f, 10, 6)
        mem = build_mem_image(f, inp)
        reference_kernel(f, inp)
        for _ in reference_kernel2(mem, {}):
            pass
        assert inp.indices == mem[mem[5] : mem[5] + len(inp.indices)]
        assert inp.values == mem[mem[6] : mem[6] + len(inp.values)]

Common Validation Errors

Your kernel’s output doesn’t match the reference at a specific iteration. Debug steps:
  1. Add debug compare instructions at intermediate steps
  2. Run with prints=True to see values
  3. Use trace visualization to inspect scratch variables
A debug compare instruction failed. The error shows:
  • res: Your computed value
  • ref: Expected value from reference
  • key: Which variable/step failed
  • pc: Program counter (instruction number)
This pinpoints exactly where your computation diverged.
perf_takehome.py:67
assert self.scratch_ptr <= SCRATCH_SIZE, "Out of scratch space"
You’ve allocated more than 1536 words of scratch memory. Optimize scratch usage or reuse variables.
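One way to reuse variables is to put a small free-list on top of the bump pointer, so released temporaries are handed out again instead of growing `scratch_ptr`. A minimal sketch under that assumption (`ScratchAlloc` is illustrative, not the challenge’s API):

```python
# Sketch: reuse freed scratch slots via a free-list instead of
# always bumping the pointer.
SCRATCH_SIZE = 1536

class ScratchAlloc:
    def __init__(self):
        self.ptr = 0      # high-water mark
        self.free = []    # slots released for reuse

    def alloc(self):
        if self.free:
            return self.free.pop()
        slot = self.ptr
        self.ptr += 1
        assert self.ptr <= SCRATCH_SIZE, "Out of scratch space"
        return slot

    def release(self, slot):
        self.free.append(slot)

a = ScratchAlloc()
t0 = a.alloc()
a.release(t0)     # temporary no longer needed
t1 = a.alloc()    # reuses slot 0 instead of growing ptr
```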
problem.py:383
assert len(slots) <= SLOT_LIMITS[name]
An instruction tries to use more slots than available for that engine. Check SLOT_LIMITS and pack instructions properly.
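The check amounts to a per-engine cap on how many slots a single instruction bundle may carry. A toy version with made-up limits (the real values live in `SLOT_LIMITS` in the challenge code):

```python
# Toy slot-limit check: each engine accepts at most a fixed number
# of slots per bundle. Limits here are illustrative, not the real ones.
SLOT_LIMITS = {"alu": 4, "load": 2, "debug": 2}

def check_bundle(name, slots):
    assert len(slots) <= SLOT_LIMITS[name], f"too many {name} slots"

check_bundle("alu", [("+", 0, 1, 2), ("^", 3, 4, 5)])   # fits
```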

Correctness vs Performance

1. Start with correctness

Ensure your kernel passes all correctness tests before optimizing.

2. Establish baseline

Run test_kernel_cycles to get your initial cycle count:
perf_takehome.py:223-225
print("CYCLES: ", machine.cycle)
print("Speedup over baseline: ", BASELINE / machine.cycle)

3. Optimize incrementally

Make one optimization at a time and re-validate after each change.

4. Test submission

Regularly run python tests/submission_tests.py to ensure you haven’t broken correctness.

Real Validation Examples

Example 1: Checking Hash Computation

perf_takehome.py:150-152
body.append(("alu", ("^", tmp_val, tmp_val, tmp_node_val)))
body.extend(self.build_hash(tmp_val, tmp1, tmp2, round, i))
body.append(("debug", ("compare", tmp_val, (round, i, "hashed_val"))))
This validates the hash result after XOR and hashing.

Example 2: Verifying Index Wrapping

perf_takehome.py:160-163
body.append(("alu", ("<", tmp1, tmp_idx, self.scratch["n_nodes"])))
body.append(("flow", ("select", tmp_idx, tmp1, tmp_idx, zero_const)))
body.append(("debug", ("compare", tmp_idx, (round, i, "wrapped_idx"))))
Ensures the index correctly wraps to 0 when it exceeds n_nodes.
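A pure-Python equivalent of the branch-free wrap above: the `<` ALU op produces a 0/1 flag, and `select` picks either the index or the zero constant based on it:

```python
# Branch-free index wrap: keep idx while idx < n_nodes, else reset to 0.
def wrap_idx(idx, n_nodes):
    in_range = 1 if idx < n_nodes else 0   # alu "<" -> tmp1
    return idx if in_range else 0          # flow "select" with zero_const

assert wrap_idx(7, 8) == 7   # in range, kept
assert wrap_idx(8, 8) == 0   # out of range, wrapped to 0
```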

Example 3: Full Validation Loop

perf_takehome.py:183-221
machine = Machine(
    mem,
    kb.instrs,
    kb.debug_info(),
    n_cores=N_CORES,
    value_trace=value_trace,
    trace=trace,
)
machine.prints = prints
for i, ref_mem in enumerate(reference_kernel2(mem, value_trace)):
    machine.run()
    inp_values_p = ref_mem[6]
    if prints:
        print(machine.mem[inp_values_p : inp_values_p + len(inp.values)])
        print(ref_mem[inp_values_p : inp_values_p + len(inp.values)])
    assert (
        machine.mem[inp_values_p : inp_values_p + len(inp.values)]
        == ref_mem[inp_values_p : inp_values_p + len(inp.values)]
    ), f"Incorrect result on round {i}"

Debugging with Prints

Enable detailed output during validation:
do_kernel_test(10, 16, 256, trace=False, prints=True)
This prints:
  • Scratch variable states
  • Instruction details
  • Memory contents at validation points
Use prints=True when debugging specific computation errors. The output can be verbose but invaluable for understanding what’s happening.

Best Practices

Test Early, Test Often

Run validation after every significant change

Use Debug Instructions

Add compare instructions at critical computation steps

Understand Failures

Don’t just fix errors — understand why they occurred

Keep Reference Sync

Maintain pause/yield alignment during development
