Overview
The performance challenge provides two reference implementations and a hash function that define the expected behavior. Your optimized kernel must produce identical results.The reference kernels are not meant to be fast—they’re clear, correct specifications of what your kernel should compute.
reference_kernel() - High-Level Python
The simplest specification of the algorithm:problem.py:467-484
Algorithm Breakdown
Tree Traversal Logic
Tree Traversal Logic
Data Structure: Implicit perfect binary treeWhy it wraps: After reaching a leaf node, the next index would exceed the tree size, so we reset to the root.
- Node 0 is the root
- Node
ihas children at2*i + 1(left) and2*i + 2(right) - Total nodes:
2^(height+1) - 1
problem.py:479-482
Input Processing
Input Processing
Batch Processing:Inputs:
problem.py:477-478
inp.indices- Current position of each input in the treeinp.values- Current value for each inputinp.rounds- Number of traversal iterations
Hash and XOR
Hash and XOR
problem.py:480
- Read the tree node value at current index
- XOR with the input value
- Hash the result using
myhash()
reference_kernel2() - Memory-Based with Tracing
A lower-level version that operates on flat memory and records intermediate values:problem.py:535-568
Memory Layout
Thebuild_mem_image() function creates a flat memory image:
problem.py:487-513
Tracing for Debugging
reference_kernel2() populates a trace dictionary with intermediate values:
problem.py:551-562
perf_takehome.py:140
myhash() - The Hash Function
A 32-bit hash function with 6 stages:problem.py:449-464
HASH_STAGES Configuration
problem.py:439-446
(op1, val1, op2, op3, val3)
Stage computation:
Hash in KernelBuilder
Thebuild_hash() method generates instructions for this:
perf_takehome.py:77-86
- 6 stages × 4 slots = 24 slots per hash
- Each input hashes once per round:
batch_size * rounds * 24slots - For test case (256 inputs, 16 rounds): 98,304 hash slots!
Optimization Target
The hash function dominates runtime. VLIW packing here yields massive gains:
- Baseline: 24 cycles per hash (one instruction per slot)
- Optimized: ~8 cycles per hash (pack 3 operations per bundle)
- 3x speedup just from hash optimization!
Correctness Validation
The testing harness compares your kernel against the references:perf_takehome.py:205-221
- After initial setup (first
pauseinstruction) - After complete computation (final
pause) - At every
debug compareinstruction (if enabled)
Using the References
Understand the Algorithm
Read
reference_kernel() to grasp the high-level logic:- What operations happen per input?
- What’s the data flow?
- Where’s the parallelism?
Study the Memory Layout
Examine
build_mem_image() and reference_kernel2():- Where are tree values stored?
- How are input arrays organized?
- What’s the header structure?
Trace Your Kernel
Run with tracing enabled to compare instruction-by-instruction:Debug instructions validate against the trace at every step.
Key Takeaways
Algorithm
Tree traversal with hash-based navigation, fully specified by
reference_kernel()Memory
Flat layout defined by
build_mem_image(), accessed via pointers in headerHash Function
6-stage computation defined by HASH_STAGES, major optimization target
Validation
Debug instructions compare against
reference_kernel2() traceNext Steps
Apply Optimization Strategies
Now that you understand the algorithm, learn how to make it fast.