
Overview

The KernelBuilder class provides a high-level API for constructing kernel programs that run on Anthropic’s custom VLIW SIMD simulator. It manages scratch space allocation, constant values, and instruction generation.
The KernelBuilder abstracts away low-level details of instruction packing and scratch space management, letting you focus on algorithm implementation.

Class Structure

perf_takehome.py:40-46
class KernelBuilder:
    def __init__(self):
        self.instrs = []           # Generated instruction list
        self.scratch = {}          # Name -> address mapping
        self.scratch_debug = {}    # Debug information
        self.scratch_ptr = 0       # Current allocation pointer
        self.const_map = {}        # Constant value cache

Key Methods

build()

Converts a list of engine/slot tuples into instruction bundles.
perf_takehome.py:51-56
def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
    # Simple slot packing that just uses one slot per instruction bundle
    instrs = []
    for engine, slot in slots:
        instrs.append({engine: [slot]})
    return instrs
Parameters:
  • slots - List of (engine, slot_tuple) pairs
  • vliw - Intended to enable VLIW packing; note that the baseline implementation above ignores this flag entirely
Returns: List of instruction dictionaries
The baseline implementation generates one instruction per bundle. Optimization opportunity: pack multiple independent operations into single bundles to exploit VLIW parallelism.
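As a starting point, a greedy packer could merge consecutive slots into one bundle until an engine repeats. The sketch below is illustrative only and deliberately simplified: it ignores data dependencies between slots, which a correct packer must respect.

```python
def pack_greedy(slots):
    """Hypothetical greedy packer -- NOT dependency-aware.

    Merges consecutive (engine, slot) pairs into one bundle until an
    engine would be used twice in the same cycle, then starts a new
    bundle.
    """
    bundles = []
    current = {}
    for engine, slot in slots:
        if engine in current:
            bundles.append(current)  # engine conflict: close this bundle
            current = {}
        current.setdefault(engine, []).append(slot)
    if current:
        bundles.append(current)
    return bundles
```

A real implementation would also track which scratch addresses each slot reads and writes, and only merge slots whose operands do not conflict.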
add()

Adds a single instruction bundle to the program.
perf_takehome.py:58-59
def add(self, engine, slot):
    self.instrs.append({engine: [slot]})
Usage Example:
kb.add("load", ("const", tmp1, 42))
kb.add("alu", ("+", result, tmp1, tmp2))
alloc_scratch()

Allocates contiguous scratch space for variables or vectors.
perf_takehome.py:61-68
def alloc_scratch(self, name=None, length=1):
    addr = self.scratch_ptr
    if name is not None:
        self.scratch[name] = addr
        self.scratch_debug[addr] = (name, length)
    self.scratch_ptr += length
    assert self.scratch_ptr <= SCRATCH_SIZE, "Out of scratch space"
    return addr
Parameters:
  • name - Optional variable name for debugging
  • length - Number of 32-bit words to allocate
Returns: Starting address of allocated region
Scratch space is limited to 1536 words total. Plan your allocations carefully.
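A usage sketch of the bump-pointer allocation above. A minimal stand-in class is used here so the example is self-contained; the real method lives on KernelBuilder.

```python
SCRATCH_SIZE = 1536  # scratch limit stated above

class MiniBuilder:
    """Stand-in reproducing alloc_scratch's bump-pointer behavior."""
    def __init__(self):
        self.scratch = {}
        self.scratch_ptr = 0

    def alloc_scratch(self, name=None, length=1):
        addr = self.scratch_ptr
        if name is not None:
            self.scratch[name] = addr
        self.scratch_ptr += length
        assert self.scratch_ptr <= SCRATCH_SIZE, "Out of scratch space"
        return addr

kb = MiniBuilder()
tmp = kb.alloc_scratch("tmp")       # scalar word at address 0
vec = kb.alloc_scratch("vec", 8)    # 8-word vector starting at address 1
```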
scratch_const()

Loads a constant into scratch space, reusing existing constants.
perf_takehome.py:70-75
def scratch_const(self, val, name=None):
    if val not in self.const_map:
        addr = self.alloc_scratch(name)
        self.add("load", ("const", addr, val))
        self.const_map[val] = addr
    return self.const_map[val]
Why use this: Automatically deduplicates constants to save scratch space and load instructions.
Example:
zero = self.scratch_const(0, "zero")
one = self.scratch_const(1, "one")
# Second call returns cached address
another_zero = self.scratch_const(0)  # Same address as first zero

The build_kernel() Entry Point

Your main optimization target:
perf_takehome.py:88-94
def build_kernel(
    self, forest_height: int, n_nodes: int, batch_size: int, rounds: int
):
    """
    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
    """
Signature:
  • forest_height - Height of the binary tree
  • n_nodes - Total nodes in the tree (2^(forest_height+1) - 1)
  • batch_size - Number of parallel inputs to process
  • rounds - Number of tree traversal iterations
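The n_nodes relation above is the usual full-binary-tree node count. For instance (forest_height = 10 is an illustrative value, not from the source):

```python
forest_height = 10  # illustrative value
n_nodes = 2 ** (forest_height + 1) - 1  # full binary tree of this height
print(n_nodes)  # 2047
```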

Your Goal

Rewrite this method to generate optimized instructions that minimize cycle count while maintaining correctness.

Scratch Space Management

The baseline kernel allocates scratch variables systematically:
perf_takehome.py:95-109
# Temporary registers
tmp1 = self.alloc_scratch("tmp1")
tmp2 = self.alloc_scratch("tmp2")
tmp3 = self.alloc_scratch("tmp3")

# Input parameter storage
init_vars = [
    "rounds", "n_nodes", "batch_size", "forest_height",
    "forest_values_p", "inp_indices_p", "inp_values_p"
]
for v in init_vars:
    self.alloc_scratch(v, 1)

# Load parameters from memory header
for i, v in enumerate(init_vars):
    self.add("load", ("const", tmp1, i))
    self.add("load", ("load", self.scratch[v], tmp1))
For SIMD optimization, allocate vector-sized scratch regions (length=8) and use vload/vstore instructions.
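A sketch of what vector slots might look like. The `vload`/`vstore` opcode names come from the text above, but the operand order is an assumption to verify against the simulator's instruction definitions.

```python
VLEN = 8  # vector width matching the length=8 regions above

def vector_copy_slots(dst_addr_reg, src_addr_reg, vec_scratch):
    """Hypothetical slots copying VLEN words via one vload/vstore pair.

    Operand order is assumed; check the simulator before relying on it.
    """
    return [
        ("load",  ("vload",  vec_scratch, src_addr_reg)),   # 8 words in
        ("store", ("vstore", dst_addr_reg, vec_scratch)),   # 8 words out
    ]
```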

Building Instruction Sequences

The baseline builds a list of slots, then converts them to instructions:
perf_takehome.py:126-170
body = []  # array of slots

# Allocate working registers
tmp_idx = self.alloc_scratch("tmp_idx")
tmp_val = self.alloc_scratch("tmp_val")
tmp_node_val = self.alloc_scratch("tmp_node_val")
tmp_addr = self.alloc_scratch("tmp_addr")

for round in range(rounds):
    for i in range(batch_size):
        i_const = self.scratch_const(i)
        # Load index: idx = mem[inp_indices_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
        body.append(("load", ("load", tmp_idx, tmp_addr)))
        # ... more operations ...

# Convert slots to instruction bundles
body_instrs = self.build(body)
self.instrs.extend(body_instrs)

Hash Function Integration

The build_hash() helper generates instructions for the hash stages:
perf_takehome.py:77-86
def build_hash(self, val_hash_addr, tmp1, tmp2, round, i):
    slots = []
    for hi, (op1, val1, op2, op3, val3) in enumerate(HASH_STAGES):
        slots.append(("alu", (op1, tmp1, val_hash_addr, self.scratch_const(val1))))
        slots.append(("alu", (op3, tmp2, val_hash_addr, self.scratch_const(val3))))
        slots.append(("alu", (op2, val_hash_addr, tmp1, tmp2)))
        slots.append(("debug", ("compare", val_hash_addr, (round, i, "hash_stage", hi))))
    return slots
HASH_STAGES contains 6 stages, each generating 4 slots (24 slots total per hash). This is a prime target for VLIW packing.
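Within each stage, the `tmp1` and `tmp2` slots read only `val_hash_addr` and constants, so they are mutually independent, while the combining op depends on both. A per-stage packing sketch (assuming the ALU engine accepts two slots per bundle, which you should verify against the simulator):

```python
def pack_hash_stage(stage_slots):
    """Bundle one hash stage's 4 slots into 3 bundles.

    stage_slots: the 4 consecutive (engine, slot) pairs that
    build_hash emits for a single stage.
    """
    (e1, s1), (e2, s2), (e3, s3), (e4, s4) = stage_slots
    assert e1 == e2 == "alu"
    return [
        {e1: [s1, s2]},  # tmp1 and tmp2: independent, share a bundle
        {e3: [s3]},      # combine: depends on both tmp1 and tmp2
        {e4: [s4]},      # debug compare
    ]
```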

Complete Baseline Example

The baseline kernel processes one input at a time:
perf_takehome.py:138-169
# zero_const, one_const, and two_const below are scratch_const() addresses
# allocated before this loop (not shown in this excerpt)
for round in range(rounds):
    for i in range(batch_size):
        i_const = self.scratch_const(i)
        
        # idx = mem[inp_indices_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
        body.append(("load", ("load", tmp_idx, tmp_addr)))
        
        # val = mem[inp_values_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
        body.append(("load", ("load", tmp_val, tmp_addr)))
        
        # node_val = mem[forest_values_p + idx]
        body.append(("alu", ("+", tmp_addr, self.scratch["forest_values_p"], tmp_idx)))
        body.append(("load", ("load", tmp_node_val, tmp_addr)))
        
        # val = myhash(val ^ node_val)
        body.append(("alu", ("^", tmp_val, tmp_val, tmp_node_val)))
        body.extend(self.build_hash(tmp_val, tmp1, tmp2, round, i))
        
        # idx = 2*idx + (1 if val % 2 == 0 else 2)
        body.append(("alu", ("%", tmp1, tmp_val, two_const)))
        body.append(("alu", ("==", tmp1, tmp1, zero_const)))
        body.append(("flow", ("select", tmp3, tmp1, one_const, two_const)))
        body.append(("alu", ("*", tmp_idx, tmp_idx, two_const)))
        body.append(("alu", ("+", tmp_idx, tmp_idx, tmp3)))
        
        # Wrap: idx = 0 if idx >= n_nodes else idx
        body.append(("alu", ("<", tmp1, tmp_idx, self.scratch["n_nodes"])))
        body.append(("flow", ("select", tmp_idx, tmp1, tmp_idx, zero_const)))
        
        # Write back
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
        body.append(("store", ("store", tmp_addr, tmp_idx)))
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
        body.append(("store", ("store", tmp_addr, tmp_val)))
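Because every batch element's dependency chain runs through the same four temporaries, consecutive iterations serialize. One common fix is to give each of several in-flight elements its own temp bank so their chains can interleave across bundles. A sketch (the UNROLL factor and register names here are illustrative, not from the source):

```python
UNROLL = 4  # illustrative unroll factor

def alloc_lane_temps(alloc_scratch, unroll=UNROLL):
    """Allocate a private temp bank per in-flight batch element."""
    return [
        {
            "idx": alloc_scratch(f"tmp_idx_{u}"),
            "val": alloc_scratch(f"tmp_val_{u}"),
            "node_val": alloc_scratch(f"tmp_node_val_{u}"),
            "addr": alloc_scratch(f"tmp_addr_{u}"),
        }
        for u in range(unroll)
    ]
```

With separate banks, the loads and ALU ops of element i+1 no longer conflict with those of element i and become candidates for the same bundles.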

Next Steps

  • Optimization Strategies - Learn techniques to reduce cycle count
  • Reference Implementations - Understand the expected behavior
