KernelBuilder provides a high-level API for constructing instruction sequences that run on the Machine simulator. It manages scratch space allocation and provides helper methods for common operations.
Class: KernelBuilder
TheKernelBuilder class helps you construct programs for the custom VLIW architecture by managing instruction generation, scratch space allocation, and constant pooling.
Constructor
instrs: List of instruction bundlesscratch: Dictionary mapping variable names to scratch addressesscratch_debug: Dictionary for debug informationscratch_ptr: Current scratch space allocation pointerconst_map: Map of constants to their scratch addresses
Methods
build()
Packs engine slots into instruction bundles.List of (engine, slot) pairs to pack into instructions
Enable VLIW packing optimizations (currently unused)
List of instruction bundles, where each bundle is a dict mapping engines to slot lists
perf_takehome.py:51-56
add()
Appends a single instruction to the kernel.The execution engine (“alu”, “load”, “store”, “flow”, or “debug”)
The instruction slot tuple (format depends on engine)
perf_takehome.py:58-59
alloc_scratch()
Allocates scratch space for variables.Optional variable name for debugging
Number of scratch words to allocate (use 8 for vectors)
The scratch address of the allocated space
perf_takehome.py:61-68
scratch_const()
Allocates and loads a constant into scratch space with automatic deduplication.The constant value to load
Optional name for the constant
Scratch address containing the constant
perf_takehome.py:70-75
Constants are automatically deduplicated - calling
scratch_const(42) multiple times returns the same address.build_hash()
Generates instruction slots for the hash function computation.Scratch address containing value to hash (also stores result)
Scratch address for temporary storage
Scratch address for temporary storage
Current round number (for debug tracing)
Batch index (for debug tracing)
List of (engine, slot) tuples implementing the hash stages
perf_takehome.py:77-86
build_kernel()
This is the main method you’ll optimize for the performance challenge.
Height of the binary tree
Total number of nodes in the tree (2^(height+1) - 1)
Number of parallel traversals
Number of traversal rounds to perform
perf_takehome.py:88-174
- Load problem parameters from memory header
- For each round and batch item:
- Load index and value from input arrays
- Load tree node value
- Hash the XOR of input and node values
- Compute next index based on hash parity
- Wrap around if index exceeds tree bounds
- Store updated values back to memory
Helper Methods
debug_info()
Returns debug information for the Machine simulator.Debug information including scratch variable name mappings
perf_takehome.py:48-49
Complete Example
Optimization Tips
Vectorization
Use
valu, vload, and vstore instructions to process 8 elements at once (VLEN=8).Instruction-Level Parallelism
Pack multiple independent operations into the same instruction bundle using available slots (12 ALU, 6 VALU, 2 load, 2 store per cycle).
Loop Unrolling
Unroll loops to expose more parallelism and reduce control flow overhead.
Memory Access Patterns
Optimize memory access patterns for better cache behavior and to maximize use of vector loads/stores.