
Understanding the Baseline

The baseline implementation in build_kernel() processes inputs one at a time using scalar operations with no instruction packing:
# Baseline characteristics:
- One instruction per bundle (no VLIW parallelism)
- One input processed per iteration (no SIMD vectorization)
- Sequential memory operations (no prefetching)
- Unrolled loops generated at compile time

# Baseline performance: 147,734 cycles

The Goal

Optimize from the baseline of 147,734 cycles down as far as possible using VLIW, SIMD, and clever scheduling.

Key Optimization Techniques

VLIW Instruction Packing

What: Pack multiple independent operations into the same instruction bundle.

Why: The architecture has multiple execution engines that can run in parallel:
  • 12 ALU slots
  • 6 VALU (vector ALU) slots
  • 2 load slots
  • 2 store slots
  • 1 flow control slot
Baseline Issue:
perf_takehome.py:51-56
def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
    # Simple slot packing that just uses one slot per instruction bundle
    instrs = []
    for engine, slot in slots:
        instrs.append({engine: [slot]})  # One slot per bundle!
    return instrs
Optimization Approach:
# Instead of:
{"alu": [(op1)]}
{"alu": [(op2)]}
{"load": [(load1)]}

# Pack independent operations:
{"alu": [(op1), (op2), (op3)], "load": [(load1), (load2)]}
Dependency Constraint: Operations in the same bundle cannot depend on each other's results. Within a bundle, all reads happen before all writes, so a slot never sees a value written by another slot in the same bundle.
Example from Hash Function:
perf_takehome.py:77-86
# These hash operations have dependencies:
slots.append(("alu", (op1, tmp1, val_hash_addr, const1)))  # tmp1 = val ^ const
slots.append(("alu", (op3, tmp2, val_hash_addr, const2)))  # tmp2 = val << 12
slots.append(("alu", (op2, val_hash_addr, tmp1, tmp2)))     # val = tmp1 + tmp2

# First two can be packed (independent), third must be separate:
{"alu": [(op1, tmp1, ...), (op3, tmp2, ...)]}
{"alu": [(op2, val, tmp1, tmp2)]}
Impact: Can reduce instruction count by 3-5x in dense computation sections.
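As a concrete illustration, the greedy packing described above can be sketched as a small helper over the same (engine, slot) lists the builder uses. This is a hypothetical sketch, not part of perf_takehome.py: it assumes each slot is (op, dest, src1, src2) with scratch addresses as ints, and SLOT_LIMITS mirrors the engine caps listed earlier.

```python
# Hypothetical per-engine caps, mirroring the slot counts listed above.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def pack_bundles(slots):
    """Greedily pack (engine, slot) pairs into VLIW bundles.

    A slot joins the current bundle only if its engine has a free
    position and none of its sources were written by an op already
    in the bundle (the dependency constraint above).
    """
    bundles = []
    cur = {}         # engine -> list of slots in the current bundle
    written = set()  # scratch addresses written by the current bundle

    def flush():
        nonlocal cur, written
        if cur:
            bundles.append(cur)
        cur, written = {}, set()

    for engine, slot in slots:
        dest, srcs = slot[1], slot[2:]
        full = len(cur.get(engine, [])) >= SLOT_LIMITS[engine]
        depends = any(s in written for s in srcs)
        if full or depends:
            flush()  # close the bundle and start a new one
        cur.setdefault(engine, []).append(slot)
        written.add(dest)
    flush()
    return bundles
```

Run on the hash example above, the first two ALU ops land in one bundle and the dependent third op starts a new one.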
SIMD Vectorization

What: Process multiple inputs simultaneously using vector instructions.

Why: The architecture supports vectors of length VLEN = 8.

Baseline Issue:
perf_takehome.py:134-136
for round in range(rounds):
    for i in range(batch_size):  # Process ONE input at a time
        # ... scalar operations ...
Optimization Approach:
# Process 8 inputs at once:
for round in range(rounds):
    for i in range(0, batch_size, 8):  # Step by VLEN
        # Use vload, valu, vstore for 8-wide operations
Vector Instructions:
  • vload - Load 8 consecutive memory words
  • vstore - Store 8 consecutive memory words
  • valu - Vector arithmetic (8 operations in parallel)
  • vbroadcast - Copy scalar to all 8 vector lanes
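Putting these together, one hash stage from the scalar example above could be emitted 8-wide. This is an illustrative sketch under stated assumptions: all scratch names are made up, and it assumes valu slots mirror the alu slot shape but name 8-word scratch regions, with vbroadcast issued on the vector engine.

```python
def emit_vector_hash_stage(slots, val_vec, tmp1_vec, tmp2_vec,
                           c1_scalar, c1_vec, shift_scalar, shift_vec):
    """Append one 8-wide hash stage (illustrative names and encoding).

    Mirrors the scalar pattern tmp1 = val ^ c1; tmp2 = val << s;
    val = tmp1 + tmp2, but on 8-word vector scratch regions.
    """
    # Broadcast the scalar constants into full vectors first
    # (assumes vbroadcast runs in a valu slot).
    slots.append(("valu", ("vbroadcast", c1_vec, c1_scalar)))
    slots.append(("valu", ("vbroadcast", shift_vec, shift_scalar)))
    # Same three-op pattern as the scalar hash, 8 lanes at a time.
    slots.append(("valu", ("^", tmp1_vec, val_vec, c1_vec)))
    slots.append(("valu", ("<<", tmp2_vec, val_vec, shift_vec)))
    slots.append(("valu", ("+", val_vec, tmp1_vec, tmp2_vec)))
    return slots
```

If the broadcast constants are loop-invariant, the two vbroadcast slots can be hoisted out of the round loop entirely.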
Memory Layout Consideration:
# Scalar loads (baseline):
body.append(("alu", ("+", tmp_addr, base, i_const)))
body.append(("load", ("load", tmp_val, tmp_addr)))

# Vector loads (optimized):
# Requires: base + i points to 8 consecutive values
body.append(("alu", ("+", tmp_addr, base, i_const)))
body.append(("load", ("vload", tmp_val_vec, tmp_addr)))  # tmp_val_vec is 8 words
Allocate vector scratch space: vec_addr = self.alloc_scratch("vec_tmp", VLEN)
Challenge: The tree traversal has data-dependent memory accesses:
perf_takehome.py:146-148
# node_val = mem[forest_values_p + idx]
# idx is different for each input!
body.append(("alu", ("+", tmp_addr, self.scratch["forest_values_p"], tmp_idx)))
body.append(("load", ("load", tmp_node_val, tmp_addr)))
Solution Strategies:
  1. Use scalar loads in a loop for the 8 vector elements
  2. Reorganize data layout for contiguous access (if possible)
  3. Accept gather/scatter overhead for this bottleneck
Impact: Up to 8x throughput improvement for vectorizable operations.
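Solution strategy 1 (a scalar gather) can be sketched as slot emission following the address-then-load pattern quoted above. All names here are illustrative, not the takehome's API.

```python
VLEN = 8

def emit_gather(slots, forest_values_p, idx_vec, gather_vec, tmp_addr):
    """Append slots doing gather_vec[lane] = mem[forest_values_p + idx_vec[lane]].

    idx_vec and gather_vec are base scratch addresses of 8-word regions;
    forest_values_p and tmp_addr are scalar scratch addresses.
    """
    for lane in range(VLEN):
        # address = forest_values_p + idx, with idx read from the lane's slot
        slots.append(("alu", ("+", tmp_addr, forest_values_p, idx_vec + lane)))
        slots.append(("load", ("load", gather_vec + lane, tmp_addr)))
    return slots
```

Note that reusing one tmp_addr serializes the pairs; with 8 distinct address temporaries, the address computations could be packed into ALU slots and the loads issued two per cycle.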
Loop Unrolling

What: Replicate the loop body multiple times to reduce loop overhead and expose more parallelism.

Baseline Already Does This:
perf_takehome.py:134-170
for round in range(rounds):        # Unrolled at compile time
    for i in range(batch_size):    # Unrolled at compile time
        # ... operations ...
The Python loops generate static instruction sequences, so there is no runtime loop overhead.

Further Unrolling Strategies:
  1. Process multiple vectors per iteration:
    for i in range(0, batch_size, 16):  # Process 16 inputs (2 vectors)
        # ... operations for inputs i:i+8 ...
        # ... operations for inputs i+8:i+16 ...
    
    Benefit: Exposes more independent operations for VLIW packing.
  2. Unroll hash stages:
    perf_takehome.py:80-85
    for hi, (op1, val1, op2, op3, val3) in enumerate(HASH_STAGES):
        # 6 iterations, each generates 3 ALU ops
    
    Already unrolled, but you could manually schedule all 18 ALU operations for better packing.
Code Size Tradeoff: Excessive unrolling increases instruction count, which can hurt cache behavior (though the simulator doesn’t model instruction cache).
Memory Access Optimization

What: Minimize memory latency and maximize throughput.

Key Insights:
  • The simulator has 2 load slots and 2 store slots per cycle
  • Memory operations complete in the same cycle (no modeled latency)
  • Loads and stores can be packed with computation
Baseline Pattern:
perf_takehome.py:138-145
# Load idx
body.append(("alu", ("+", tmp_addr, base, offset)))  # Compute address
body.append(("load", ("load", tmp_idx, tmp_addr)))   # Load value

# Load val
body.append(("alu", ("+", tmp_addr, base2, offset)))
body.append(("load", ("load", tmp_val, tmp_addr)))
Optimization 1: Pack Address Computation
# Compute both addresses in parallel:
{"alu": [
    ("+", tmp_addr1, base1, offset),
    ("+", tmp_addr2, base2, offset)
]}
# Then load both:
{"load": [
    ("load", tmp_idx, tmp_addr1),
    ("load", tmp_val, tmp_addr2)
]}
Optimization 2: Vector Loads
# Load 8 values in one instruction:
{"load": [("vload", vec_dst, addr_scalar)]}
Optimization 3: Prefetch (Limited Benefit)

Since loads complete immediately, prefetching only helps if you can hide address computation latency.
Store Buffering: Delay stores to the end of computation when possible. This frees up store slots during the critical path.
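Store buffering can be sketched as a simple reordering pass over the slot list. This is a hypothetical helper, and it is only safe when no later load in the list reads a location being stored; a real pass would check that before moving anything.

```python
def defer_stores(slots):
    """Move all store slots to the end of the slot list (illustrative).

    CAUTION: only valid if no subsequent load reads a stored address;
    this sketch does not verify that.
    """
    compute, stores = [], []
    for engine, slot in slots:
        (stores if engine == "store" else compute).append((engine, slot))
    return compute + stores
```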
Instruction Scheduling

What: Reorder operations to maximize VLIW slot utilization.

Goal: Keep all execution engines busy every cycle.

Dependency Graph Example:
A = load(addr1)     [load engine]
B = load(addr2)     [load engine]
C = A + B           [alu engine, depends on A and B]
D = C * 2           [alu engine, depends on C]
E = load(addr3)     [load engine]
F = E + 1           [alu engine, depends on E]
Bad Schedule (Baseline Style):
Cycle 0: {"load": [(load A)]}
Cycle 1: {"load": [(load B)]}
Cycle 2: {"alu": [(C = A + B)]}
Cycle 3: {"alu": [(D = C * 2)]}
Cycle 4: {"load": [(load E)]}
Cycle 5: {"alu": [(F = E + 1)]}
Total: 6 cycles
Good Schedule (VLIW + Reorder):
Cycle 0: {"load": [(load A), (load B)]}
Cycle 1: {"alu": [(C = A + B)], "load": [(load E)]}
Cycle 2: {"alu": [(D = C * 2), (F = E + 1)]}
Total: 3 cycles (2x speedup!)
Strategy:
  1. Build a dependency graph
  2. Schedule independent operations into the same cycle
  3. Respect engine slot limits
  4. Minimize critical path length
Tools:
  • List scheduling algorithms
  • Topological sort of dependency DAG
  • Manual analysis for hot loops
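The strategy above can be sketched as a minimal list scheduler: topologically walk the dependency DAG, filling each cycle with ready ops up to the engine slot limits. Op names and the SLOT_LIMITS table are illustrative; on the A-F example it reproduces the 3-cycle schedule shown earlier.

```python
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def list_schedule(ops):
    """List-schedule ops given as (name, engine, set_of_dep_names).

    Returns a list of cycles, each a dict engine -> [names issued
    that cycle]. Results become visible the following cycle.
    """
    remaining = {name: (engine, set(deps)) for name, engine, deps in ops}
    done = set()
    cycles = []
    while remaining:
        cycle = {}
        # Ops whose dependencies have all completed are ready.
        ready = [n for n, (e, d) in remaining.items() if d <= done]
        scheduled = []
        for n in ready:
            engine = remaining[n][0]
            if len(cycle.get(engine, [])) < SLOT_LIMITS[engine]:
                cycle.setdefault(engine, []).append(n)
                scheduled.append(n)
        if not scheduled:
            raise ValueError("dependency cycle in op graph")
        for n in scheduled:
            del remaining[n]
        done |= set(scheduled)
        cycles.append(cycle)
    return cycles
```

A production scheduler would also prioritize ops on the critical path (list scheduling with a priority function) instead of taking ready ops in input order.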

Common Pitfalls

1. Read-After-Write Hazards

Operations in the same bundle cannot have dependencies:
# WRONG: tmp is written and read in same bundle
{"alu": [
    ("+", tmp, a, b),    # Write tmp
    ("*", result, tmp, c) # Read tmp - NOT ALLOWED!
]}

# CORRECT: Split into two cycles
{"alu": [("+", tmp, a, b)]}
{"alu": [("*", result, tmp, c)]}

2. Vector Alignment Issues

Vector operations assume contiguous memory:
# OK: Indices are stored contiguously
vload(vec_indices, inp_indices_p + round_offset)

# PROBLEM: Tree values are accessed via computed indices
# Cannot vectorize: mem[forest_values_p + idx] for different idx values

3. Scratch Space Exhaustion

Vector operations consume 8x scratch space:
SCRATCH_SIZE = 1536  # Total available

# Baseline scalars:
tmp_val = self.alloc_scratch("tmp_val", 1)  # 1 word

# Vector version:
vec_val = self.alloc_scratch("vec_val", 8)  # 8 words!

# Processing 256 inputs with 8-wide vectors needs careful planning
Solution: Reuse scratch space across loop iterations.
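One way to implement that reuse is a small free-list allocator layered over bump allocation. ScratchPool is a hypothetical helper for illustration, not the takehome's alloc_scratch:

```python
SCRATCH_SIZE = 1536  # total scratch words, per the limit above

class ScratchPool:
    """Bump allocator that hands back released regions of the same size."""

    def __init__(self):
        self.next = 0
        self.free = {}  # size -> list of freed base addresses

    def alloc(self, size):
        if self.free.get(size):
            return self.free[size].pop()  # reuse a freed region
        base = self.next
        self.next += size
        assert self.next <= SCRATCH_SIZE, "scratch exhausted"
        return base

    def release(self, base, size):
        self.free.setdefault(size, []).append(base)
```

Releasing a vector temporary at the end of one loop iteration lets the next iteration allocate the same 8 words instead of consuming fresh scratch.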

4. Premature Optimization

Profile before optimizing:
# Enable tracing to see where cycles are spent:
do_kernel_test(10, 16, 256, trace=True)
# Then run: python watch_trace.py
Focus on:
  • The innermost loop (most cycles)
  • Hash function (6 stages, 18 ALU operations per input)
  • Load bottlenecks

Optimization Workflow

Step 1: Measure Baseline

Run python perf_takehome.py Tests.test_kernel_cycles to get your starting point.
CYCLES: 147734
Speedup over baseline: 1.0
Step 2: Trace Analysis

Generate a performance trace:
python perf_takehome.py Tests.test_kernel_trace
python watch_trace.py
Look for:
  • Empty execution slots (wasted parallelism)
  • Long serial chains (critical path)
  • Repeated patterns (unrolling opportunities)
Step 3: Apply One Technique

Start with the highest-impact optimization:
  • Quick win: VLIW packing in hash function (3x speedup potential)
  • Medium: Vectorize main loop body (8x throughput)
  • Advanced: Custom scheduling for the entire kernel
Step 4: Validate Correctness

Always run correctness tests after changes:
python tests/submission_tests.py
The debug engine compares your results against the reference at every step.
Step 5: Iterate

Measure → Trace → Optimize → Validate → Repeat

Expected Performance Tiers

Bronze

2-3x speedup: Basic VLIW packing in the hash function

Silver

5-10x speedup: SIMD vectorization + instruction packing

Gold

15x+ speedup: Aggressive scheduling, loop transformations, full slot utilization

Next Steps

Study Reference Implementations

Understand the expected behavior by examining reference_kernel() and reference_kernel2().
