Get hands-on with the challenge by running the baseline implementation and making your first optimization.

Prerequisites

Before you begin, ensure you have:
  • Python 3.8+ installed on your system
  • Basic understanding of performance optimization concepts
  • A text editor or IDE for Python development
This challenge uses only the Python standard library - no external dependencies required!

Installation

1. Clone the repository

Clone or download the challenge repository to your local machine:
git clone <repository-url>
cd performance-takehome
2. Verify your setup

Run the baseline tests to ensure everything is working:
python perf_takehome.py Tests.test_kernel_cycles
You should see output showing the baseline performance:
forest_height=10, rounds=16, batch_size=256
CYCLES:  147734
Speedup over baseline:  1.0
The baseline implementation achieves 147,734 cycles - your goal is to optimize this!
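The speedup figure is just the fixed baseline cycle count divided by your measured cycles, mirroring the arithmetic in do_kernel_test. A quick sketch (the speedup helper below is ours for illustration, not part of the repo):

```python
# BASELINE mirrors the constant used by do_kernel_test's printout.
BASELINE = 147_734

def speedup(cycles: int) -> float:
    """Speedup over the unoptimized baseline."""
    return BASELINE / cycles

# The baseline run, by definition, scores exactly 1.0.
print(speedup(147_734))  # → 1.0
```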
3. Explore the codebase

Familiarize yourself with the key files:
  • perf_takehome.py - Main kernel builder and test harness
  • problem.py - Simulator and reference implementations
  • tests/submission_tests.py - Validation tests for your solution

Running the Baseline Test

The main test runs a tree traversal simulation on a custom VLIW SIMD architecture. Here’s what happens:
# From perf_takehome.py - the baseline test
def do_kernel_test(
    forest_height: int,
    rounds: int,
    batch_size: int,
    seed: int = 123,
    trace: bool = False,
    prints: bool = False,
):
    print(f"{forest_height=}, {rounds=}, {batch_size=}")
    random.seed(seed)
    forest = Tree.generate(forest_height)
    inp = Input.generate(forest, batch_size, rounds)
    mem = build_mem_image(forest, inp)

    kb = KernelBuilder()
    kb.build_kernel(forest.height, len(forest.values), len(inp.indices), rounds)
    
    machine = Machine(mem, kb.instrs, kb.debug_info(), n_cores=N_CORES)
    machine.run()
    
    print("CYCLES: ", machine.cycle)
    print("Speedup over baseline: ", BASELINE / machine.cycle)
    return machine.cycle

Test Parameters

  • Forest height: 10 (creates a binary tree with 2,047 nodes)
  • Rounds: 16 (iterations through the tree)
  • Batch size: 256 (parallel traversals)
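The node count in the first bullet is just the size of a complete binary tree: a tree of height h has 2^(h+1) - 1 nodes (assuming Tree.generate builds a complete tree, which the 2,047 figure implies):

```python
# Nodes in a complete binary tree: 1 + 2 + 4 + ... + 2**h = 2**(h+1) - 1.
def n_nodes(height: int) -> int:
    return 2 ** (height + 1) - 1

print(n_nodes(10))  # → 2047, matching the default forest_height=10
```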

Your First Optimization

Let’s make a simple optimization to understand the workflow. The baseline uses a scalar implementation - let’s explore vectorization.

Understanding the Current Implementation

The build_kernel method in the KernelBuilder class generates instructions for a scalar ALU implementation:
# From perf_takehome.py:88-94
def build_kernel(
    self, forest_height: int, n_nodes: int, batch_size: int, rounds: int
):
    """
    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
    """
The kernel processes each batch element sequentially:
# From perf_takehome.py:134-169
for round in range(rounds):
    for i in range(batch_size):
        # Load index and value
        # idx = mem[inp_indices_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_indices_p"], i_const)))
        body.append(("load", ("load", tmp_idx, tmp_addr)))
        
        # val = mem[inp_values_p + i]
        body.append(("alu", ("+", tmp_addr, self.scratch["inp_values_p"], i_const)))
        body.append(("load", ("load", tmp_val, tmp_addr)))
        
        # Process hash and tree traversal...
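Before optimizing, it helps to see why this is slow: the instruction body above is emitted once per (round, element) pair, so the instruction stream scales with rounds × batch_size. With the default parameters:

```python
# Every (round, element) pair replays the full per-element sequence
# (address compute, load, hash, traversal, store), so even a handful
# of instructions per element adds up quickly.
rounds, batch_size = 16, 256
iterations = rounds * batch_size
print(iterations)  # → 4096 per-element instruction sequences
```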

Optimization Strategy

The simulator supports SIMD operations with VLEN = 8, allowing you to process 8 elements simultaneously using:
  • vload / vstore - Vector load/store operations
  • valu - Vector ALU operations
  • vbroadcast - Broadcast scalar to vector
Start with small optimizations and validate frequently. The simulator is complex, and an incorrect optimization can silently produce wrong results.
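As a rough illustration of the payoff, here is a sketch of stepping through the batch VLEN elements at a time. This is not the real KernelBuilder API - the operand layout of vload below is an assumption to be checked against problem.py:

```python
# Sketch only: the operand layout of "vload" is assumed, not taken
# from problem.py - verify the real instruction signatures first.
VLEN = 8          # vector width supported by the simulator
batch_size = 256  # default test parameter

body = []
for i in range(0, batch_size, VLEN):
    # One vector load replaces 8 scalar address-compute + load pairs.
    body.append(("vload", ("vload", "v_idx", "inp_indices_p", i)))
    body.append(("vload", ("vload", "v_val", "inp_values_p", i)))
    # ... vectorized hash and traversal via valu / vbroadcast ...

# 256 scalar iterations per round collapse to 32 vector iterations.
print(len(body) // 2)  # → 32
```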

Validating Your Results

Always validate your optimizations using the submission tests:
1. Check for test modifications

Ensure you haven’t accidentally modified the test files:
git diff origin/main tests/
This should be empty. LLMs have been known to modify tests to make the problem easier!
2. Run submission tests

Execute the official validation tests:
python tests/submission_tests.py
This runs correctness tests and evaluates your performance against benchmarks:
Testing forest_height=10, rounds=16, batch_size=256
CYCLES:  <your_cycle_count>
Speedup over baseline:  <your_speedup>
3. Check which thresholds you pass

The submission tests include multiple performance thresholds:
  • 147,734 cycles: Baseline (starting point)
  • 18,532 cycles: Updated take-home starting point (7.97x faster)
  • 2,164 cycles: Claude Opus 4 after many hours
  • 1,790 cycles: Claude Opus 4.5 casual session (best human 2hr performance)
  • 1,487 cycles: Claude Opus 4.5 after 11.5 hours
  • 1,363 cycles: Claude Opus 4.5 improved harness
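Each threshold's speedup follows directly from dividing the baseline by its cycle count; a quick check (cycle counts copied from the list above):

```python
# Reproduce the "Nx faster" figures from the threshold list.
BASELINE = 147_734
thresholds = [18_532, 2_164, 1_790, 1_487, 1_363]
for t in thresholds:
    print(f"{t:>6} cycles -> {BASELINE / t:.2f}x over baseline")
```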

Debug Workflow

For detailed debugging, use the trace visualization:
# Run test with tracing enabled
python perf_takehome.py Tests.test_kernel_trace

# In another terminal, start the trace viewer (Chrome only)
python watch_trace.py
This opens an interactive Perfetto trace showing:
  • Instruction execution per cycle
  • Engine utilization (ALU, load, store, flow)
  • Scratch space variable changes
  • Performance bottlenecks
The trace hot-reloads automatically when you re-run tests, making it ideal for iterative debugging.

Understanding the Architecture

The simulator models a VLIW (Very Long Instruction Word) SIMD architecture with:
  • Multiple engines executing in parallel per cycle
  • Slot limits per engine (e.g., 12 ALU slots, 2 load slots, 2 store slots)
  • Scratch space (1,536 32-bit words) serving as registers
  • Vector operations processing 8 elements (VLEN=8) at once
Engine limits from problem.py:
SLOT_LIMITS = {
    "alu": 12,      # Scalar arithmetic/logic operations
    "valu": 6,      # Vector arithmetic/logic operations
    "load": 2,      # Memory load operations
    "store": 2,     # Memory store operations
    "flow": 1,      # Control flow operations
    "debug": 64,    # Debug instructions (ignored in submission)
}
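These limits imply a simple per-engine throughput bound: an engine with k slots retires at most k operations per cycle, so a kernel needs at least ceil(ops / k) cycles for each engine's total operation count. A sketch (the op counts below are hypothetical, not measured, and the bound ignores data dependencies):

```python
import math

SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def cycle_lower_bound(op_counts: dict) -> int:
    """Per-engine throughput bound: max over engines of
    ceil(ops / slots). A real kernel can only be slower."""
    return max(math.ceil(n / SLOT_LIMITS[e]) for e, n in op_counts.items())

# Hypothetical mix: 4096 loads alone force at least 4096/2 = 2048 cycles,
# even though 6000 valu ops would fit in 1000 cycles.
print(cycle_lower_bound({"load": 4096, "valu": 6000}))  # → 2048
```

Bounds like this are useful for judging how close an optimization is to the hardware limit before reaching for the trace viewer.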

Next Steps

Now that you have the baseline running, explore these optimization strategies:

Architecture Deep Dive

Learn about the VLIW SIMD simulator and instruction set

Kernel Development

Master the KernelBuilder API and optimization techniques

Performance Benchmarks

See what performance levels are achievable

Debugging Guide

Use trace visualization and debugging tools effectively
Stuck? Review the reference implementations in problem.py:
  • reference_kernel() - High-level Python implementation
  • reference_kernel2() - Flat memory implementation matching your kernel
