KernelBuilder provides a high-level API for constructing instruction sequences that run on the Machine simulator. It manages scratch space allocation and provides helper methods for common operations.

Class: KernelBuilder

The KernelBuilder class helps you construct programs for the custom VLIW architecture by managing instruction generation, scratch space allocation, and constant pooling.

Constructor

kb = KernelBuilder()
Initializes a new kernel builder with an empty instruction list and empty scratch space. Internal state:
  • instrs: List of instruction bundles
  • scratch: Dictionary mapping variable names to scratch addresses
  • scratch_debug: Dictionary for debug information
  • scratch_ptr: Current scratch space allocation pointer
  • const_map: Map of constants to their scratch addresses

Methods

build()

Packs engine slots into instruction bundles.
Parameters:
  • slots (list[tuple[Engine, tuple]], required): List of (engine, slot) pairs to pack into instructions
  • vliw (bool, default: False): Enable VLIW packing optimizations (currently unused)
Returns:
  • list[Instruction]: List of instruction bundles, where each bundle is a dict mapping engines to slot lists
perf_takehome.py:51-56
def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
    # Simple slot packing that just uses one slot per instruction bundle
    instrs = []
    for engine, slot in slots:
        instrs.append({engine: [slot]})
    return instrs
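The baseline packing can be reproduced standalone (engine names are plain strings here, standing in for the Engine enum values):

```python
# Standalone sketch of the baseline packing: one slot per instruction bundle.
slots = [
    ("alu", ("+", 1, 0, 0)),   # scratch[1] = scratch[0] + scratch[0]
    ("load", ("load", 2, 1)),  # scratch[2] = mem[scratch[1]]
]
instrs = [{engine: [slot]} for engine, slot in slots]
assert instrs == [
    {"alu": [("+", 1, 0, 0)]},
    {"load": [("load", 2, 1)]},
]
```

Because each bundle carries exactly one slot, this packing leaves every other engine slot in the cycle empty, which is the main inefficiency the optimization challenge targets.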

add()

Appends a single instruction to the kernel.
Parameters:
  • engine (Engine, required): The execution engine ("alu", "load", "store", "flow", or "debug")
  • slot (tuple, required): The instruction slot tuple (format depends on engine)
perf_takehome.py:58-59
def add(self, engine, slot):
    self.instrs.append({engine: [slot]})
Example:
kb = KernelBuilder()
kb.add("load", ("const", 0, 42))  # Load constant 42 into scratch[0]
kb.add("alu", ("+", 1, 0, 0))    # scratch[1] = scratch[0] + scratch[0]

alloc_scratch()

Allocates scratch space for variables.
Parameters:
  • name (str | None, default: None): Optional variable name for debugging
  • length (int, default: 1): Number of scratch words to allocate (use 8 for vectors)
Returns:
  • int: The scratch address of the allocated space
perf_takehome.py:61-68
def alloc_scratch(self, name=None, length=1):
    addr = self.scratch_ptr
    if name is not None:
        self.scratch[name] = addr
        self.scratch_debug[addr] = (name, length)
    self.scratch_ptr += length
    assert self.scratch_ptr <= SCRATCH_SIZE, "Out of scratch space"
    return addr
Example:
tmp = kb.alloc_scratch("tmp", 1)      # Allocate scalar
vec = kb.alloc_scratch("vec", 8)      # Allocate vector (VLEN=8)

scratch_const()

Allocates and loads a constant into scratch space with automatic deduplication.
Parameters:
  • val (int, required): The constant value to load
  • name (str | None, default: None): Optional name for the constant
Returns:
  • int: Scratch address containing the constant
perf_takehome.py:70-75
def scratch_const(self, val, name=None):
    if val not in self.const_map:
        addr = self.alloc_scratch(name)
        self.add("load", ("const", addr, val))
        self.const_map[val] = addr
    return self.const_map[val]
Constants are deduplicated automatically: calling scratch_const(42) multiple times always returns the same scratch address.
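The pooling behavior can be modeled standalone, without the rest of the builder:

```python
# Minimal standalone model of scratch_const()'s constant pooling.
class ConstPool:
    def __init__(self):
        self.const_map = {}   # value -> scratch address
        self.scratch_ptr = 0
        self.loads = []       # emitted ("load", ("const", addr, val)) slots

    def scratch_const(self, val):
        if val not in self.const_map:
            addr = self.scratch_ptr
            self.scratch_ptr += 1
            self.loads.append(("load", ("const", addr, val)))
            self.const_map[val] = addr
        return self.const_map[val]

pool = ConstPool()
a = pool.scratch_const(42)
b = pool.scratch_const(42)   # deduplicated: no second load emitted
assert a == b and len(pool.loads) == 1
```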

build_hash()

Generates instruction slots for the hash function computation.
Parameters:
  • val_hash_addr (int, required): Scratch address containing the value to hash (also stores the result)
  • tmp1 (int, required): Scratch address for temporary storage
  • tmp2 (int, required): Scratch address for temporary storage
  • round (int, required): Current round number (for debug tracing)
  • i (int, required): Batch index (for debug tracing)
Returns:
  • list[tuple]: List of (engine, slot) tuples implementing the hash stages
perf_takehome.py:77-86
def build_hash(self, val_hash_addr, tmp1, tmp2, round, i):
    slots = []

    for hi, (op1, val1, op2, op3, val3) in enumerate(HASH_STAGES):
        slots.append(("alu", (op1, tmp1, val_hash_addr, self.scratch_const(val1))))
        slots.append(("alu", (op3, tmp2, val_hash_addr, self.scratch_const(val3))))
        slots.append(("alu", (op2, val_hash_addr, tmp1, tmp2)))
        slots.append(("debug", ("compare", val_hash_addr, (round, i, "hash_stage", hi))))

    return slots
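Each HASH_STAGES entry expands to exactly four slots: two independent ALU ops (which could later share a cycle), one combining ALU op, and a debug compare. The expansion can be checked with a hypothetical stage table (the real constants live in the problem code; only the tuple shape matters here):

```python
# Hypothetical HASH_STAGES entries; only the (op1, val1, op2, op3, val3)
# shape matters for this sketch.
HASH_STAGES = [
    ("+", 0x9E3779B9, "^", ">>", 16),
    ("*", 0x85EBCA6B, "^", ">>", 13),
]

def hash_slots(val_addr, tmp1, tmp2, const_addr_of):
    slots = []
    for hi, (op1, val1, op2, op3, val3) in enumerate(HASH_STAGES):
        # tmp1 and tmp2 are computed from val_addr independently...
        slots.append(("alu", (op1, tmp1, val_addr, const_addr_of(val1))))
        slots.append(("alu", (op3, tmp2, val_addr, const_addr_of(val3))))
        # ...then combined back into val_addr.
        slots.append(("alu", (op2, val_addr, tmp1, tmp2)))
        slots.append(("debug", ("compare", val_addr, (0, 0, "hash_stage", hi))))
    return slots

slots = hash_slots(5, 0, 1, lambda v: 10)
assert len(slots) == 4 * len(HASH_STAGES)
```

Note the dependency structure: within a stage, the first two ALU ops are independent of each other, but the third depends on both, and each stage depends on the previous one through val_addr.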

build_kernel()

This is the main method you'll optimize for the performance challenge.

Generates the complete kernel for tree traversal; the scalar baseline below uses only scalar ALU and load/store operations.
Parameters:
  • forest_height (int, required): Height of the binary tree
  • n_nodes (int, required): Total number of nodes in the tree (2^(height+1) - 1)
  • batch_size (int, required): Number of parallel traversals
  • rounds (int, required): Number of traversal rounds to perform
perf_takehome.py:88-174
def build_kernel(
    self, forest_height: int, n_nodes: int, batch_size: int, rounds: int
):
    """
    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
    """
    tmp1 = self.alloc_scratch("tmp1")
    tmp2 = self.alloc_scratch("tmp2")
    tmp3 = self.alloc_scratch("tmp3")
    # Scratch space addresses
    init_vars = [
        "rounds",
        "n_nodes",
        "batch_size",
        "forest_height",
        "forest_values_p",
        "inp_indices_p",
        "inp_values_p",
    ]
    for v in init_vars:
        self.alloc_scratch(v, 1)
    for i, v in enumerate(init_vars):
        self.add("load", ("const", tmp1, i))
        self.add("load", ("load", self.scratch[v], tmp1))

    zero_const = self.scratch_const(0)
    one_const = self.scratch_const(1)
    two_const = self.scratch_const(2)

    self.add("flow", ("pause",))
    self.add("debug", ("comment", "Starting loop"))

    body = []  # array of slots

    # Scalar scratch registers
    tmp_idx = self.alloc_scratch("tmp_idx")
    tmp_val = self.alloc_scratch("tmp_val")
    tmp_node_val = self.alloc_scratch("tmp_node_val")
    tmp_addr = self.alloc_scratch("tmp_addr")

    for round in range(rounds):
        for i in range(batch_size):
            i_const = self.scratch_const(i)
            # idx = mem[inp_indices_p + i]
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
            body.append(("load", ("load", tmp_idx, tmp_addr)))
            body.append(("debug", ("compare", tmp_idx, (round, i, "idx"))))
            # val = mem[inp_values_p + i]
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
            body.append(("load", ("load", tmp_val, tmp_addr)))
            body.append(("debug", ("compare", tmp_val, (round, i, "val"))))
            # node_val = mem[forest_values_p + idx]
            body.append(("alu", ("+", tmp_addr, self.scratch["forest_values_p"], tmp_idx)))
            body.append(("load", ("load", tmp_node_val, tmp_addr)))
            body.append(("debug", ("compare", tmp_node_val, (round, i, "node_val"))))
            # val = myhash(val ^ node_val)
            body.append(("alu", ("^", tmp_val, tmp_val, tmp_node_val)))
            body.extend(self.build_hash(tmp_val, tmp1, tmp2, round, i))
            body.append(("debug", ("compare", tmp_val, (round, i, "hashed_val"))))
            # idx = 2*idx + (1 if val % 2 == 0 else 2)
            body.append(("alu", ("%", tmp1, tmp_val, two_const)))
            body.append(("alu", ("==", tmp1, tmp1, zero_const)))
            body.append(("flow", ("select", tmp3, tmp1, one_const, two_const)))
            body.append(("alu", ("*", tmp_idx, tmp_idx, two_const)))
            body.append(("alu", ("+", tmp_idx, tmp_idx, tmp3)))
            body.append(("debug", ("compare", tmp_idx, (round, i, "next_idx"))))
            # idx = 0 if idx >= n_nodes else idx
            body.append(("alu", ("<", tmp1, tmp_idx, self.scratch["n_nodes"])))
            body.append(("flow", ("select", tmp_idx, tmp1, tmp_idx, zero_const)))
            body.append(("debug", ("compare", tmp_idx, (round, i, "wrapped_idx"))))
            # mem[inp_indices_p + i] = idx
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
            body.append(("store", ("store", tmp_addr, tmp_idx)))
            # mem[inp_values_p + i] = val
            body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
            body.append(("store", ("store", tmp_addr, tmp_val)))

    body_instrs = self.build(body)
    self.instrs.extend(body_instrs)
    # Required to match with the yield in reference_kernel2
    self.instrs.append({"flow": [("pause",)]})
Algorithm:
  1. Load problem parameters from memory header
  2. For each round and batch item:
    • Load index and value from input arrays
    • Load tree node value
    • Hash the XOR of input and node values
    • Compute next index based on hash parity
    • Wrap around if index exceeds tree bounds
    • Store updated values back to memory
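The per-item update in steps 2 onward can be written as plain Python (myhash is a stand-in for the HASH_STAGES pipeline; an identity function is used below just to exercise the control flow):

```python
def step(idx, val, forest_values, n_nodes, myhash):
    # One batch-item update, mirroring the slot sequence emitted above.
    node_val = forest_values[idx]
    val = myhash(val ^ node_val)
    # Go to the left child if the hash is even, the right child otherwise.
    idx = 2 * idx + (1 if val % 2 == 0 else 2)
    if idx >= n_nodes:          # fell off the tree: wrap back to the root
        idx = 0
    return idx, val

# With an identity "hash": val = 3 ^ 5 = 6 (even), so idx = 2*0 + 1 = 1.
idx, val = step(0, 3, [5, 0, 0, 0, 0, 0, 0], 7, lambda x: x)
assert (idx, val) == (1, 6)
```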

Helper Methods

debug_info()

Returns debug information for the Machine simulator.
Returns:
  • DebugInfo: Debug information including scratch variable name mappings
perf_takehome.py:48-49
def debug_info(self):
    return DebugInfo(scratch_map=self.scratch_debug)

Complete Example

from problem import Tree, Input, Machine, build_mem_image
from perf_takehome import KernelBuilder
import random

# Generate test data
random.seed(123)
forest = Tree.generate(height=10)
inp = Input.generate(forest, batch_size=256, rounds=16)
mem = build_mem_image(forest, inp)

# Build kernel
kb = KernelBuilder()
kb.build_kernel(
    forest_height=forest.height,
    n_nodes=len(forest.values),
    batch_size=len(inp.indices),
    rounds=inp.rounds
)

# Run on simulator
machine = Machine(
    mem_dump=mem,
    program=kb.instrs,
    debug_info=kb.debug_info(),
    trace=False
)
machine.run()

print(f"Total cycles: {machine.cycle}")

Optimization Tips

Vectorization

Use valu, vload, and vstore instructions to process 8 elements at once (VLEN=8).
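The payoff of a single vector slot can be modeled on a flat scratch array (the exact vload/valu/vstore slot layouts are an assumption to verify against the simulator; the point is that one slot does eight lanes of work):

```python
VLEN = 8
scratch = [0] * 32

# Suppose vector operands live at scratch[0:8] and scratch[8:16], with the
# result going to scratch[16:24]. A single hypothetical
# ("valu", ("+", 16, 0, 8)) slot would then compute what eight scalar
# ("alu", ("+", ...)) slots compute in the baseline:
scratch[0:8] = list(range(8))
scratch[8:16] = [10] * 8
scratch[16:24] = [scratch[0 + k] + scratch[8 + k] for k in range(VLEN)]
assert scratch[16:24] == [10, 11, 12, 13, 14, 15, 16, 17]
```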

Instruction-Level Parallelism

Pack multiple independent operations into the same instruction bundle using available slots (12 ALU, 6 VALU, 2 load, 2 store per cycle).
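Instead of the baseline's one slot per bundle, independent slots can be hand-packed into one bundle dict (engine names as strings, matching the bundle format build() produces):

```python
# One cycle's bundle: two independent ALU ops plus a load, packed together.
# The ALU ops read and write disjoint scratch addresses, so nothing forces
# them into separate cycles.
bundle = {
    "alu": [
        ("+", 10, 0, 1),   # scratch[10] = scratch[0] + scratch[1]
        ("^", 11, 2, 3),   # scratch[11] = scratch[2] ^ scratch[3]
    ],
    "load": [
        ("load", 12, 4),   # scratch[12] = mem[scratch[4]]
    ],
}
# Stay within the per-cycle slot limits quoted above.
assert len(bundle["alu"]) <= 12 and len(bundle["load"]) <= 2
```

A packing pass that merges independent slots from the flat slot list into bundles like this, instead of build()'s one-slot-per-bundle baseline, is the core of the VLIW optimization.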

Loop Unrolling

Unroll loops to expose more parallelism and reduce control flow overhead.
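A sketch of unrolling the batch loop by two: each copy gets its own temporaries (emit_item is a hypothetical stand-in for the per-item slot generation in build_kernel()), so the two copies share no scratch addresses and their slots can later be packed side by side:

```python
def emit_item(i, tmps):
    # Hypothetical per-item slot generation; only the dependency structure
    # matters here — each copy touches only its own temporaries.
    t_idx, t_val = tmps
    return [
        ("alu", ("+", t_idx, 100, i)),
        ("alu", ("+", t_val, 200, i)),
    ]

body = []
for i in range(0, 4, 2):                  # unroll factor 2
    body += emit_item(i, (10, 11))        # copy A's temporaries
    body += emit_item(i + 1, (12, 13))    # copy B's temporaries
assert len(body) == 8
```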

Memory Access Patterns

Optimize memory access patterns for better cache behavior and to maximize use of vector loads/stores.
The baseline implementation in build_kernel() achieves 147,734 cycles. Your goal is to optimize this by leveraging the VLIW and SIMD capabilities of the architecture.
