
Overview

The KernelBuilder class provides a high-level API for constructing kernel programs that run on Anthropic’s custom VLIW SIMD simulator. It manages scratch space allocation, constant values, and instruction generation.
The KernelBuilder abstracts away low-level details of instruction packing and scratch space management, letting you focus on algorithm implementation.

Class Structure

perf_takehome.py:40-46
class KernelBuilder:
    def __init__(self):
        self.instrs = []           # Generated instruction list
        self.scratch = {}          # Name -> address mapping
        self.scratch_debug = {}    # Debug information
        self.scratch_ptr = 0       # Current allocation pointer
        self.const_map = {}        # Constant value cache

Key Methods

build()

Converts a list of engine/slot tuples into instruction bundles.
perf_takehome.py:51-56
def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
    # Simple slot packing that just uses one slot per instruction bundle
    instrs = []
    for engine, slot in slots:
        instrs.append({engine: [slot]})
    return instrs
Parameters:
  • slots - List of (engine, slot_tuple) pairs
  • vliw - Intended to enable VLIW packing; note that the baseline implementation above ignores this flag entirely
Returns: List of instruction dictionaries
The baseline implementation generates one instruction per bundle. Optimization opportunity: pack multiple independent operations into single bundles to exploit VLIW parallelism.
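As a starting point, a greedy packer could merge consecutive slots into one bundle until an engine repeats. The sketch below is illustrative only and deliberately simplified: it ignores data dependencies between slots, which a correct packer must respect.

```python
def pack_greedy(slots):
    """Hypothetical greedy packer -- NOT dependency-aware.

    Merges consecutive (engine, slot) pairs into one bundle until an
    engine would be used twice in the same cycle, then starts a new
    bundle.
    """
    bundles = []
    current = {}
    for engine, slot in slots:
        if engine in current:
            bundles.append(current)  # engine conflict: close this bundle
            current = {}
        current.setdefault(engine, []).append(slot)
    if current:
        bundles.append(current)
    return bundles
```

A real implementation would also track which scratch addresses each slot reads and writes, and only merge slots whose operands do not conflict.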
add()

Adds a single instruction bundle to the program.
perf_takehome.py:58-59
def add(self, engine, slot):
    self.instrs.append({engine: [slot]})
Usage Example:
kb.add("load", ("const", tmp1, 42))
kb.add("alu", ("+", result, tmp1, tmp2))
alloc_scratch()

Allocates contiguous scratch space for variables or vectors.
perf_takehome.py:61-68
def alloc_scratch(self, name=None, length=1):
    addr = self.scratch_ptr
    if name is not None:
        self.scratch[name] = addr
        self.scratch_debug[addr] = (name, length)
    self.scratch_ptr += length
    assert self.scratch_ptr <= SCRATCH_SIZE, "Out of scratch space"
    return addr
Parameters:
  • name - Optional variable name for debugging
  • length - Number of 32-bit words to allocate
Returns: Starting address of allocated region
Scratch space is limited to 1536 words total. Plan your allocations carefully.
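A usage sketch of the bump-pointer allocation above. A minimal stand-in class is used here so the example is self-contained; the real method lives on KernelBuilder.

```python
SCRATCH_SIZE = 1536  # scratch limit stated above

class MiniBuilder:
    """Stand-in reproducing alloc_scratch's bump-pointer behavior."""
    def __init__(self):
        self.scratch = {}
        self.scratch_ptr = 0

    def alloc_scratch(self, name=None, length=1):
        addr = self.scratch_ptr
        if name is not None:
            self.scratch[name] = addr
        self.scratch_ptr += length
        assert self.scratch_ptr <= SCRATCH_SIZE, "Out of scratch space"
        return addr

kb = MiniBuilder()
tmp = kb.alloc_scratch("tmp")       # scalar word at address 0
vec = kb.alloc_scratch("vec", 8)    # 8-word vector starting at address 1
```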
scratch_const()

Loads a constant into scratch space, reusing existing constants.
perf_takehome.py:70-75
def scratch_const(self, val, name=None):
    if val not in self.const_map:
        addr = self.alloc_scratch(name)
        self.add("load", ("const", addr, val))
        self.const_map[val] = addr
    return self.const_map[val]
Why use this: Automatically deduplicates constants to save scratch space and load instructions.
Example:
zero = self.scratch_const(0, "zero")
one = self.scratch_const(1, "one")
# Second call returns cached address
another_zero = self.scratch_const(0)  # Same address as first zero

The build_kernel() Entry Point

Your main optimization target:
perf_takehome.py:88-94
def build_kernel(
    self, forest_height: int, n_nodes: int, batch_size: int, rounds: int
):
    """
    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
    """
Signature:
  • forest_height - Height of the binary tree
  • n_nodes - Total nodes in the tree (2^(forest_height+1) - 1)
  • batch_size - Number of parallel inputs to process
  • rounds - Number of tree traversal iterations
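The n_nodes relation above is the usual full-binary-tree node count. For instance (forest_height = 10 is an illustrative value, not from the source):

```python
forest_height = 10  # illustrative value
n_nodes = 2 ** (forest_height + 1) - 1  # full binary tree of this height
print(n_nodes)  # 2047
```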

Your Goal

Rewrite this method to generate optimized instructions that minimize cycle count while maintaining correctness.

Scratch Space Management

The baseline kernel allocates scratch variables systematically:
perf_takehome.py:95-109
# Temporary registers
tmp1 = self.alloc_scratch("tmp1")
tmp2 = self.alloc_scratch("tmp2")
tmp3 = self.alloc_scratch("tmp3")

# Input parameter storage
init_vars = [
    "rounds", "n_nodes", "batch_size", "forest_height",
    "forest_values_p", "inp_indices_p", "inp_values_p"
]
for v in init_vars:
    self.alloc_scratch(v, 1)

# Load parameters from memory header
for i, v in enumerate(init_vars):
    self.add("load", ("const", tmp1, i))
    self.add("load", ("load", self.scratch[v], tmp1))
For SIMD optimization, allocate vector-sized scratch regions (length=8) and use vload/vstore instructions.
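A sketch of what vector slots might look like. The `vload`/`vstore` opcode names come from the text above, but the operand order is an assumption to verify against the simulator's instruction definitions.

```python
VLEN = 8  # vector width matching the length=8 regions above

def vector_copy_slots(dst_addr_reg, src_addr_reg, vec_scratch):
    """Hypothetical slots copying VLEN words via one vload/vstore pair.

    Operand order is assumed; check the simulator before relying on it.
    """
    return [
        ("load",  ("vload",  vec_scratch, src_addr_reg)),   # 8 words in
        ("store", ("vstore", dst_addr_reg, vec_scratch)),   # 8 words out
    ]
```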

Building Instruction Sequences

The baseline builds a list of slots, then converts them to instructions:
perf_takehome.py:126-170
body = []  # array of slots

# Allocate working registers
tmp_idx = self.alloc_scratch("tmp_idx")
tmp_val = self.alloc_scratch("tmp_val")
tmp_node_val = self.alloc_scratch("tmp_node_val")
tmp_addr = self.alloc_scratch("tmp_addr")

for round in range(rounds):
    for i in range(batch_size):
        i_const = self.scratch_const(i)
        # Load index: idx = mem[inp_indices_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
        body.append(("load", ("load", tmp_idx, tmp_addr)))
        # ... more operations ...

# Convert slots to instruction bundles
body_instrs = self.build(body)
self.instrs.extend(body_instrs)

Hash Function Integration

The build_hash() helper generates instructions for the hash stages:
perf_takehome.py:77-86
def build_hash(self, val_hash_addr, tmp1, tmp2, round, i):
    slots = []
    for hi, (op1, val1, op2, op3, val3) in enumerate(HASH_STAGES):
        slots.append(("alu", (op1, tmp1, val_hash_addr, self.scratch_const(val1))))
        slots.append(("alu", (op3, tmp2, val_hash_addr, self.scratch_const(val3))))
        slots.append(("alu", (op2, val_hash_addr, tmp1, tmp2)))
        slots.append(("debug", ("compare", val_hash_addr, (round, i, "hash_stage", hi))))
    return slots
HASH_STAGES contains 6 stages, each generating 4 slots (24 slots total per hash). This is a prime target for VLIW packing.
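Within each stage, the `tmp1` and `tmp2` slots read only `val_hash_addr` and constants, so they are mutually independent, while the combining op depends on both. A per-stage packing sketch (assuming the ALU engine accepts two slots per bundle, which you should verify against the simulator):

```python
def pack_hash_stage(stage_slots):
    """Bundle one hash stage's 4 slots into 3 bundles.

    stage_slots: the 4 consecutive (engine, slot) pairs that
    build_hash emits for a single stage.
    """
    (e1, s1), (e2, s2), (e3, s3), (e4, s4) = stage_slots
    assert e1 == e2 == "alu"
    return [
        {e1: [s1, s2]},  # tmp1 and tmp2: independent, share a bundle
        {e3: [s3]},      # combine: depends on both tmp1 and tmp2
        {e4: [s4]},      # debug compare
    ]
```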

Complete Baseline Example

The baseline kernel processes one input at a time:
perf_takehome.py:138-169
# zero_const, one_const, and two_const below are scratch_const() addresses
# allocated before this loop (not shown in this excerpt)
for round in range(rounds):
    for i in range(batch_size):
        i_const = self.scratch_const(i)
        
        # idx = mem[inp_indices_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
        body.append(("load", ("load", tmp_idx, tmp_addr)))
        
        # val = mem[inp_values_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
        body.append(("load", ("load", tmp_val, tmp_addr)))
        
        # node_val = mem[forest_values_p + idx]
        body.append(("alu", ("+", tmp_addr, self.scratch["forest_values_p"], tmp_idx)))
        body.append(("load", ("load", tmp_node_val, tmp_addr)))
        
        # val = myhash(val ^ node_val)
        body.append(("alu", ("^", tmp_val, tmp_val, tmp_node_val)))
        body.extend(self.build_hash(tmp_val, tmp1, tmp2, round, i))
        
        # idx = 2*idx + (1 if val % 2 == 0 else 2)
        body.append(("alu", ("%", tmp1, tmp_val, two_const)))
        body.append(("alu", ("==", tmp1, tmp1, zero_const)))
        body.append(("flow", ("select", tmp3, tmp1, one_const, two_const)))
        body.append(("alu", ("*", tmp_idx, tmp_idx, two_const)))
        body.append(("alu", ("+", tmp_idx, tmp_idx, tmp3)))
        
        # Wrap: idx = 0 if idx >= n_nodes else idx
        body.append(("alu", ("<", tmp1, tmp_idx, self.scratch["n_nodes"])))
        body.append(("flow", ("select", tmp_idx, tmp1, tmp_idx, zero_const)))
        
        # Write back
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
        body.append(("store", ("store", tmp_addr, tmp_idx)))
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
        body.append(("store", ("store", tmp_addr, tmp_val)))
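Because every batch element's dependency chain runs through the same four temporaries, consecutive iterations serialize. One common fix is to give each of several in-flight elements its own temp bank so their chains can interleave across bundles. A sketch (the UNROLL factor and register names here are illustrative, not from the source):

```python
UNROLL = 4  # illustrative unroll factor

def alloc_lane_temps(alloc_scratch, unroll=UNROLL):
    """Allocate a private temp bank per in-flight batch element."""
    return [
        {
            "idx": alloc_scratch(f"tmp_idx_{u}"),
            "val": alloc_scratch(f"tmp_val_{u}"),
            "node_val": alloc_scratch(f"tmp_node_val_{u}"),
            "addr": alloc_scratch(f"tmp_addr_{u}"),
        }
        for u in range(unroll)
    ]
```

With separate banks, the loads and ALU ops of element i+1 no longer conflict with those of element i and become candidates for the same bundles.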

Next Steps

  • Optimization Strategies - Learn techniques to reduce cycle count
  • Reference Implementations - Understand the expected behavior
