Overview
The KernelBuilder class provides a high-level API for constructing kernel programs that run on Anthropic’s custom VLIW SIMD simulator. It manages scratch space allocation, constant values, and instruction generation.
The KernelBuilder abstracts away low-level details of instruction packing and scratch space management, letting you focus on algorithm implementation.
Class Structure
perf_takehome.py:40-46
Key Methods
build() - Pack Instructions into VLIW Bundles
Converts a list of engine/slot tuples into instruction bundles.
perf_takehome.py:51-56
Parameters:
slots - List of (engine, slot_tuple) pairs
vliw - Enable VLIW packing (baseline uses False)
The baseline implementation generates one instruction per bundle. Optimization opportunity: pack multiple independent operations into single bundles to exploit VLIW parallelism.
add() - Append Single Instruction
Adds a single instruction bundle to the program.
Usage Example:
perf_takehome.py:58-59
alloc_scratch() - Allocate Scratch Memory
Allocates contiguous scratch space for variables or vectors.
perf_takehome.py:61-68
Parameters:
name - Optional variable name for debugging
length - Number of 32-bit words to allocate
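Contiguous allocation of this kind is commonly implemented as a bump allocator. The sketch below shows the idea on its own, under stated assumptions: the ScratchAllocator class, the SCRATCH_SIZE value, and the default length of 1 are hypothetical and are not taken from perf_takehome.py.

```python
SCRATCH_SIZE = 16  # hypothetical capacity in 32-bit words


class ScratchAllocator:
    """Toy bump allocator: hands out consecutive addresses and never frees."""

    def __init__(self):
        self.next_free = 0
        self.names = {}  # name -> base address, kept only for debugging

    def alloc_scratch(self, name=None, length=1):
        if self.next_free + length > SCRATCH_SIZE:
            raise MemoryError("out of scratch space")
        addr = self.next_free
        if name is not None:
            self.names[name] = addr
        self.next_free += length
        return addr


alloc = ScratchAllocator()
idx_addr = alloc.alloc_scratch("idx")            # one word
vec_addr = alloc.alloc_scratch("vec", length=8)  # eight contiguous words
tmp_addr = alloc.alloc_scratch()                 # anonymous temporary
```

Since nothing is ever freed, every allocation permanently consumes scratch space, which is worth keeping in mind when adding vector temporaries.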
scratch_const() - Load Constant Values
Loads a constant into scratch space, reusing existing constants.
Why use this: Automatically deduplicates constants to save scratch space and load instructions.
Example:
perf_takehome.py:70-75
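The deduplication described above amounts to a cache keyed by the constant's value. Here is a minimal standalone sketch of that pattern. The ConstPool class and the ("const", addr, value) load slot are hypothetical stand-ins, not the actual KernelBuilder internals.

```python
class ConstPool:
    """Toy constant pool: each distinct value gets one address and one load."""

    def __init__(self):
        self.next_free = 0
        self.const_addr = {}  # value -> scratch address
        self.instrs = []      # emitted load bundles

    def scratch_const(self, value):
        if value not in self.const_addr:
            addr = self.next_free
            self.next_free += 1
            # Hypothetical load slot that writes the constant into scratch.
            self.instrs.append({"load": [("const", addr, value)]})
            self.const_addr[value] = addr
        return self.const_addr[value]


pool = ConstPool()
a = pool.scratch_const(7)
b = pool.scratch_const(42)
c = pool.scratch_const(7)  # reuses the address allocated for the first 7
```

Only two load bundles are emitted for the three calls, because the second request for 7 hits the cache.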
The build_kernel() Entry Point
Your main optimization target:
perf_takehome.py:88-94
forest_height - Height of the binary tree
n_nodes - Total nodes in the tree (2^(height+1) - 1)
batch_size - Number of parallel inputs to process
rounds - Number of tree traversal iterations
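The relationship between forest_height and n_nodes can be checked with a few lines of arithmetic. The helper name n_nodes_for_height is hypothetical and exists only for this illustration.

```python
def n_nodes_for_height(forest_height):
    """Node count of a perfect binary tree: 2^(height+1) - 1."""
    return 2 ** (forest_height + 1) - 1


single_root = n_nodes_for_height(0)  # a height-0 tree is just the root
small_tree = n_nodes_for_height(3)   # 1 + 2 + 4 + 8 nodes
```

Under this formula the height counts edges from the root to a leaf, so a tree of height h has h + 1 levels.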
Your Goal
Rewrite this method to generate optimized instructions that minimize cycle count while maintaining correctness.
Scratch Space Management
The baseline kernel allocates scratch variables systematically:
perf_takehome.py:95-109
Building Instruction Sequences
The baseline builds a list of slots, then converts them to instructions:
perf_takehome.py:126-170
Hash Function Integration
The build_hash() helper generates instructions for the hash stages:
perf_takehome.py:77-86
HASH_STAGES contains 6 stages, each generating 4 slots (24 slots total per hash). This is a prime target for VLIW packing.
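To get a feel for how much packing could matter here, the slot counts above can be turned into a rough bundle estimate. This is a back-of-the-envelope sketch only: hash_bundles is a hypothetical helper, and slots_per_bundle is an idealized average packing factor that ignores the dependencies between slots within a stage.

```python
N_STAGES = 6         # from the text: HASH_STAGES contains 6 stages
SLOTS_PER_STAGE = 4  # from the text: each stage generates 4 slots


def hash_bundles(n_hashes, slots_per_bundle=1):
    """Bundles needed for n_hashes hashes at a given average packing factor."""
    total_slots = n_hashes * N_STAGES * SLOTS_PER_STAGE
    return -(-total_slots // slots_per_bundle)  # ceiling division


baseline = hash_bundles(1)                     # one slot per bundle
idealized = hash_bundles(1, slots_per_bundle=4)
```

The idealized figure is an upper bound on the benefit. Real gains depend on how many slots are truly independent and on the per-engine slot limits of the simulator.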
Complete Baseline Example
The baseline kernel processes one input at a time:
perf_takehome.py:138-169
Next Steps
Optimization Strategies
Learn techniques to reduce cycle count
Reference Implementations
Understand the expected behavior