KernelBuilder provides a high-level API for constructing instruction sequences that run on the Machine simulator. It manages scratch space allocation and provides helper methods for common operations.

Class: KernelBuilder

The KernelBuilder class helps you construct programs for the custom VLIW architecture by managing instruction generation, scratch space allocation, and constant pooling.

Constructor

kb = KernelBuilder()
Initializes a new kernel builder with an empty instruction list and empty scratch space. Internal state:
  • instrs: List of instruction bundles
  • scratch: Dictionary mapping variable names to scratch addresses
  • scratch_debug: Dictionary for debug information
  • scratch_ptr: Current scratch space allocation pointer
  • const_map: Map of constants to their scratch addresses

Methods

build()

Packs engine slots into instruction bundles.
Parameters:
  • slots (list[tuple[Engine, tuple]], required): List of (engine, slot) pairs to pack into instructions
  • vliw (bool, default: False): Enable VLIW packing optimizations (currently unused)
Returns:
  • list[Instruction]: List of instruction bundles, where each bundle is a dict mapping engines to slot lists
perf_takehome.py:51-56
def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
    # Simple slot packing that just uses one slot per instruction bundle
    instrs = []
    for engine, slot in slots:
        instrs.append({engine: [slot]})
    return instrs
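The baseline packing can be reproduced standalone (engine names are plain strings here, standing in for the Engine enum values):

```python
# Standalone sketch of the baseline packing: one slot per instruction bundle.
slots = [
    ("alu", ("+", 1, 0, 0)),   # scratch[1] = scratch[0] + scratch[0]
    ("load", ("load", 2, 1)),  # scratch[2] = mem[scratch[1]]
]
instrs = [{engine: [slot]} for engine, slot in slots]
assert instrs == [
    {"alu": [("+", 1, 0, 0)]},
    {"load": [("load", 2, 1)]},
]
```

Because each bundle carries exactly one slot, this packing leaves every other engine slot in the cycle empty, which is the main inefficiency the optimization challenge targets.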

add()

Appends a single instruction to the kernel.
Parameters:
  • engine (Engine, required): The execution engine ("alu", "load", "store", "flow", or "debug")
  • slot (tuple, required): The instruction slot tuple (format depends on engine)
perf_takehome.py:58-59
def add(self, engine, slot):
    self.instrs.append({engine: [slot]})
Example:
kb = KernelBuilder()
kb.add("load", ("const", 0, 42))  # Load constant 42 into scratch[0]
kb.add("alu", ("+", 1, 0, 0))    # scratch[1] = scratch[0] + scratch[0]

alloc_scratch()

Allocates scratch space for variables.
Parameters:
  • name (str | None, default: None): Optional variable name for debugging
  • length (int, default: 1): Number of scratch words to allocate (use 8 for vectors)
Returns:
  • int: The scratch address of the allocated space
perf_takehome.py:61-68
def alloc_scratch(self, name=None, length=1):
    addr = self.scratch_ptr
    if name is not None:
        self.scratch[name] = addr
        self.scratch_debug[addr] = (name, length)
    self.scratch_ptr += length
    assert self.scratch_ptr <= SCRATCH_SIZE, "Out of scratch space"
    return addr
Example:
tmp = kb.alloc_scratch("tmp", 1)      # Allocate scalar
vec = kb.alloc_scratch("vec", 8)      # Allocate vector (VLEN=8)

scratch_const()

Allocates and loads a constant into scratch space with automatic deduplication.
Parameters:
  • val (int, required): The constant value to load
  • name (str | None, default: None): Optional name for the constant
Returns:
  • int: Scratch address containing the constant
perf_takehome.py:70-75
def scratch_const(self, val, name=None):
    if val not in self.const_map:
        addr = self.alloc_scratch(name)
        self.add("load", ("const", addr, val))
        self.const_map[val] = addr
    return self.const_map[val]
Constants are deduplicated automatically: calling scratch_const(42) multiple times always returns the same scratch address.
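The pooling behavior can be modeled standalone, without the rest of the builder:

```python
# Minimal standalone model of scratch_const()'s constant pooling.
class ConstPool:
    def __init__(self):
        self.const_map = {}   # value -> scratch address
        self.scratch_ptr = 0
        self.loads = []       # emitted ("load", ("const", addr, val)) slots

    def scratch_const(self, val):
        if val not in self.const_map:
            addr = self.scratch_ptr
            self.scratch_ptr += 1
            self.loads.append(("load", ("const", addr, val)))
            self.const_map[val] = addr
        return self.const_map[val]

pool = ConstPool()
a = pool.scratch_const(42)
b = pool.scratch_const(42)   # deduplicated: no second load emitted
assert a == b and len(pool.loads) == 1
```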

build_hash()

Generates instruction slots for the hash function computation.
Parameters:
  • val_hash_addr (int, required): Scratch address containing the value to hash (also stores the result)
  • tmp1 (int, required): Scratch address for temporary storage
  • tmp2 (int, required): Scratch address for temporary storage
  • round (int, required): Current round number (for debug tracing)
  • i (int, required): Batch index (for debug tracing)
Returns:
  • list[tuple]: List of (engine, slot) tuples implementing the hash stages
perf_takehome.py:77-86
def build_hash(self, val_hash_addr, tmp1, tmp2, round, i):
    slots = []

    for hi, (op1, val1, op2, op3, val3) in enumerate(HASH_STAGES):
        slots.append(("alu", (op1, tmp1, val_hash_addr, self.scratch_const(val1))))
        slots.append(("alu", (op3, tmp2, val_hash_addr, self.scratch_const(val3))))
        slots.append(("alu", (op2, val_hash_addr, tmp1, tmp2)))
        slots.append(("debug", ("compare", val_hash_addr, (round, i, "hash_stage", hi))))

    return slots
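Each HASH_STAGES entry expands to exactly four slots: two independent ALU ops (which could later share a cycle), one combining ALU op, and a debug compare. The expansion can be checked with a hypothetical stage table (the real constants live in the problem code; only the tuple shape matters here):

```python
# Hypothetical HASH_STAGES entries; only the (op1, val1, op2, op3, val3)
# shape matters for this sketch.
HASH_STAGES = [
    ("+", 0x9E3779B9, "^", ">>", 16),
    ("*", 0x85EBCA6B, "^", ">>", 13),
]

def hash_slots(val_addr, tmp1, tmp2, const_addr_of):
    slots = []
    for hi, (op1, val1, op2, op3, val3) in enumerate(HASH_STAGES):
        # tmp1 and tmp2 are computed from val_addr independently...
        slots.append(("alu", (op1, tmp1, val_addr, const_addr_of(val1))))
        slots.append(("alu", (op3, tmp2, val_addr, const_addr_of(val3))))
        # ...then combined back into val_addr.
        slots.append(("alu", (op2, val_addr, tmp1, tmp2)))
        slots.append(("debug", ("compare", val_addr, (0, 0, "hash_stage", hi))))
    return slots

slots = hash_slots(5, 0, 1, lambda v: 10)
assert len(slots) == 4 * len(HASH_STAGES)
```

Note the dependency structure: within a stage, the first two ALU ops are independent of each other, but the third depends on both, and each stage depends on the previous one through val_addr.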

build_kernel()

This is the main method you'll optimize for the performance challenge.

Generates the complete kernel for tree traversal; the scalar baseline below uses only scalar ALU and load/store operations.
Parameters:
  • forest_height (int, required): Height of the binary tree
  • n_nodes (int, required): Total number of nodes in the tree (2^(height+1) - 1)
  • batch_size (int, required): Number of parallel traversals
  • rounds (int, required): Number of traversal rounds to perform
perf_takehome.py:88-174
def build_kernel(
    self, forest_height: int, n_nodes: int, batch_size: int, rounds: int
):
    """
    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
    """
    tmp1 = self.alloc_scratch("tmp1")
    tmp2 = self.alloc_scratch("tmp2")
    tmp3 = self.alloc_scratch("tmp3")
    # Scratch space addresses
    init_vars = [
        "rounds",
        "n_nodes",
        "batch_size",
        "forest_height",
        "forest_values_p",
        "inp_indices_p",
        "inp_values_p",
    ]
    for v in init_vars:
        self.alloc_scratch(v, 1)
    for i, v in enumerate(init_vars):
        self.add("load", ("const", tmp1, i))
        self.add("load", ("load", self.scratch[v], tmp1))

    zero_const = self.scratch_const(0)
    one_const = self.scratch_const(1)
    two_const = self.scratch_const(2)

    self.add("flow", ("pause",))
    self.add("debug", ("comment", "Starting loop"))

    body = []  # array of slots

    # Scalar scratch registers
    tmp_idx = self.alloc_scratch("tmp_idx")
    tmp_val = self.alloc_scratch("tmp_val")
    tmp_node_val = self.alloc_scratch("tmp_node_val")
    tmp_addr = self.alloc_scratch("tmp_addr")

    for round in range(rounds):
        for i in range(batch_size):
            i_const = self.scratch_const(i)
            # idx = mem[inp_indices_p + i]
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
            body.append(("load", ("load", tmp_idx, tmp_addr)))
            body.append(("debug", ("compare", tmp_idx, (round, i, "idx"))))
            # val = mem[inp_values_p + i]
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
            body.append(("load", ("load", tmp_val, tmp_addr)))
            body.append(("debug", ("compare", tmp_val, (round, i, "val"))))
            # node_val = mem[forest_values_p + idx]
            body.append(("alu", ("+", tmp_addr, self.scratch["forest_values_p"], tmp_idx)))
            body.append(("load", ("load", tmp_node_val, tmp_addr)))
            body.append(("debug", ("compare", tmp_node_val, (round, i, "node_val"))))
            # val = myhash(val ^ node_val)
            body.append(("alu", ("^", tmp_val, tmp_val, tmp_node_val)))
            body.extend(self.build_hash(tmp_val, tmp1, tmp2, round, i))
            body.append(("debug", ("compare", tmp_val, (round, i, "hashed_val"))))
            # idx = 2*idx + (1 if val % 2 == 0 else 2)
            body.append(("alu", ("%", tmp1, tmp_val, two_const)))
            body.append(("alu", ("==", tmp1, tmp1, zero_const)))
            body.append(("flow", ("select", tmp3, tmp1, one_const, two_const)))
            body.append(("alu", ("*", tmp_idx, tmp_idx, two_const)))
            body.append(("alu", ("+", tmp_idx, tmp_idx, tmp3)))
            body.append(("debug", ("compare", tmp_idx, (round, i, "next_idx"))))
            # idx = 0 if idx >= n_nodes else idx
            body.append(("alu", ("<", tmp1, tmp_idx, self.scratch["n_nodes"])))
            body.append(("flow", ("select", tmp_idx, tmp1, tmp_idx, zero_const)))
            body.append(("debug", ("compare", tmp_idx, (round, i, "wrapped_idx"))))
            # mem[inp_indices_p + i] = idx
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
            body.append(("store", ("store", tmp_addr, tmp_idx)))
            # mem[inp_values_p + i] = val
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
            body.append(("store", ("store", tmp_addr, tmp_val)))

    body_instrs = self.build(body)
    self.instrs.extend(body_instrs)
    # Required to match with the yield in reference_kernel2
    self.instrs.append({"flow": [("pause",)]})
Algorithm:
  1. Load problem parameters from memory header
  2. For each round and batch item:
    • Load index and value from input arrays
    • Load tree node value
    • Hash the XOR of input and node values
    • Compute next index based on hash parity
    • Wrap around if index exceeds tree bounds
    • Store updated values back to memory
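The per-item update in steps 2 onward can be written as plain Python (myhash is a stand-in for the HASH_STAGES pipeline; an identity function is used below just to exercise the control flow):

```python
def step(idx, val, forest_values, n_nodes, myhash):
    # One batch-item update, mirroring the slot sequence emitted above.
    node_val = forest_values[idx]
    val = myhash(val ^ node_val)
    # Go to the left child if the hash is even, the right child otherwise.
    idx = 2 * idx + (1 if val % 2 == 0 else 2)
    if idx >= n_nodes:          # fell off the tree: wrap back to the root
        idx = 0
    return idx, val

# With an identity "hash": val = 3 ^ 5 = 6 (even), so idx = 2*0 + 1 = 1.
idx, val = step(0, 3, [5, 0, 0, 0, 0, 0, 0], 7, lambda x: x)
assert (idx, val) == (1, 6)
```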

Helper Methods

debug_info()

Returns debug information for the Machine simulator.
Returns:
  • DebugInfo: Debug information including scratch variable name mappings
perf_takehome.py:48-49
def debug_info(self):
    return DebugInfo(scratch_map=self.scratch_debug)

Complete Example

from problem import Tree, Input, Machine, build_mem_image
from perf_takehome import KernelBuilder
import random

# Generate test data
random.seed(123)
forest = Tree.generate(height=10)
inp = Input.generate(forest, batch_size=256, rounds=16)
mem = build_mem_image(forest, inp)

# Build kernel
kb = KernelBuilder()
kb.build_kernel(
    forest_height=forest.height,
    n_nodes=len(forest.values),
    batch_size=len(inp.indices),
    rounds=inp.rounds
)

# Run on simulator
machine = Machine(
    mem_dump=mem,
    program=kb.instrs,
    debug_info=kb.debug_info(),
    trace=False
)
machine.run()

print(f"Total cycles: {machine.cycle}")

Optimization Tips

Vectorization

Use valu, vload, and vstore instructions to process 8 elements at once (VLEN=8).
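The payoff of a single vector slot can be modeled on a flat scratch array (the exact vload/valu/vstore slot layouts are an assumption to verify against the simulator; the point is that one slot does eight lanes of work):

```python
VLEN = 8
scratch = [0] * 32

# Suppose vector operands live at scratch[0:8] and scratch[8:16], with the
# result going to scratch[16:24]. A single hypothetical
# ("valu", ("+", 16, 0, 8)) slot would then compute what eight scalar
# ("alu", ("+", ...)) slots compute in the baseline:
scratch[0:8] = list(range(8))
scratch[8:16] = [10] * 8
scratch[16:24] = [scratch[0 + k] + scratch[8 + k] for k in range(VLEN)]
assert scratch[16:24] == [10, 11, 12, 13, 14, 15, 16, 17]
```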

Instruction-Level Parallelism

Pack multiple independent operations into the same instruction bundle using available slots (12 ALU, 6 VALU, 2 load, 2 store per cycle).
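Instead of the baseline's one slot per bundle, independent slots can be hand-packed into one bundle dict (engine names as strings, matching the bundle format build() produces):

```python
# One cycle's bundle: two independent ALU ops plus a load, packed together.
# The ALU ops read and write disjoint scratch addresses, so nothing forces
# them into separate cycles.
bundle = {
    "alu": [
        ("+", 10, 0, 1),   # scratch[10] = scratch[0] + scratch[1]
        ("^", 11, 2, 3),   # scratch[11] = scratch[2] ^ scratch[3]
    ],
    "load": [
        ("load", 12, 4),   # scratch[12] = mem[scratch[4]]
    ],
}
# Stay within the per-cycle slot limits quoted above.
assert len(bundle["alu"]) <= 12 and len(bundle["load"]) <= 2
```

A packing pass that merges independent slots from the flat slot list into bundles like this, instead of build()'s one-slot-per-bundle baseline, is the core of the VLIW optimization.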

Loop Unrolling

Unroll loops to expose more parallelism and reduce control flow overhead.
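A sketch of unrolling the batch loop by two: each copy gets its own temporaries (emit_item is a hypothetical stand-in for the per-item slot generation in build_kernel()), so the two copies share no scratch addresses and their slots can later be packed side by side:

```python
def emit_item(i, tmps):
    # Hypothetical per-item slot generation; only the dependency structure
    # matters here — each copy touches only its own temporaries.
    t_idx, t_val = tmps
    return [
        ("alu", ("+", t_idx, 100, i)),
        ("alu", ("+", t_val, 200, i)),
    ]

body = []
for i in range(0, 4, 2):                  # unroll factor 2
    body += emit_item(i, (10, 11))        # copy A's temporaries
    body += emit_item(i + 1, (12, 13))    # copy B's temporaries
assert len(body) == 8
```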

Memory Access Patterns

Optimize memory access patterns for better cache behavior and to maximize use of vector loads/stores.
The baseline implementation in build_kernel() achieves 147,734 cycles. Your goal is to optimize this by leveraging the VLIW and SIMD capabilities of the architecture.
