Instruction Format
Instructions are dictionaries mapping engine names to lists of operation tuples (see problem.py).
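As a minimal sketch of that shape, the snippet below groups (engine, operation) pairs into one instruction dictionary. The helper name `make_bundle` and the scratch addresses are hypothetical, chosen only for illustration; they are not taken from problem.py.

```python
from collections import defaultdict

def make_bundle(slots):
    """Group (engine, operation) pairs into one instruction dictionary.

    make_bundle is a hypothetical helper for illustration only.
    """
    bundle = defaultdict(list)
    for engine, op in slots:
        bundle[engine].append(op)
    return dict(bundle)

# Scratch addresses below are arbitrary placeholders.
instruction = make_bundle([
    ("alu", ("+", 10, 1, 2)),    # scratch[10] = scratch[1] + scratch[2]
    ("alu", ("*", 11, 3, 4)),    # scratch[11] = scratch[3] * scratch[4]
    ("load", ("load", 12, 5)),   # load into scratch[12] via the address in scratch[5]
])
```

The resulting dictionary has one key per engine used, each holding the operation tuples that engine executes in that instruction.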
General rule: Except for `const`, `jump`, and `store`, the first operand is the destination and subsequent operands are sources.
Engine Slot Limits
Each engine has a hardware limit on how many slots it can execute in parallel per cycle (see problem.py).
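The per-engine limits listed on this page can be collected into one table and checked against a bundle. This is a sketch only: the names `SLOT_LIMITS` and `check_bundle` are assumptions for illustration, and the numbers are the "Slots per cycle" values given in each engine section.

```python
# Slots per cycle, as stated in the engine sections of this page.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1, "debug": 64}

def check_bundle(bundle):
    """Return the engines whose slot count exceeds the limit (hypothetical helper)."""
    return [
        engine
        for engine, ops in bundle.items()
        if len(ops) > SLOT_LIMITS[engine]
    ]

# The same placeholder operation is repeated here purely to count slots.
ok = {"alu": [("+", 10, 1, 2)] * 12, "flow": [("halt",)]}
too_many = {"load": [("load", 20, 5)] * 3}
```

Here `check_bundle(ok)` returns an empty list, while `check_bundle(too_many)` reports the load engine because three loads exceed its two slots.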
ALU Engine (Scalar)
Slots per cycle: 12
Format: `("alu", (operation, dest, src1, src2))`
All scalar operations on 32-bit integers with modular arithmetic (wrap at 2³²).
Arithmetic Operations
| Operation | Syntax | Description | Example |
|---|---|---|---|
| `+` | `("+", dest, a, b)` | Addition (modulo 2³²) | `scratch[dest] = (scratch[a] + scratch[b]) % 2³²` |
| `-` | `("-", dest, a, b)` | Subtraction (modulo 2³²) | `scratch[dest] = (scratch[a] - scratch[b]) % 2³²` |
| `*` | `("*", dest, a, b)` | Multiplication (modulo 2³²) | `scratch[dest] = (scratch[a] * scratch[b]) % 2³²` |
| `//` | `("//", dest, a, b)` | Integer division | `scratch[dest] = scratch[a] // scratch[b]` |
| `cdiv` | `("cdiv", dest, a, b)` | Ceiling division | `scratch[dest] = (scratch[a] + scratch[b] - 1) // scratch[b]` |
| `%` | `("%", dest, a, b)` | Modulo | `scratch[dest] = scratch[a] % scratch[b]` |
Bitwise Operations
| Operation | Syntax | Description | Example |
|---|---|---|---|
| `^` | `("^", dest, a, b)` | XOR | `scratch[dest] = scratch[a] ^ scratch[b]` |
| `&` | `("&", dest, a, b)` | AND | `scratch[dest] = scratch[a] & scratch[b]` |
| `\|` | `("\|", dest, a, b)` | OR | `scratch[dest] = scratch[a] \| scratch[b]` |
| `<<` | `("<<", dest, a, b)` | Left shift | `scratch[dest] = scratch[a] << scratch[b]` |
| `>>` | `(">>", dest, a, b)` | Right shift (logical) | `scratch[dest] = scratch[a] >> scratch[b]` |
Comparison Operations
| Operation | Syntax | Description | Returns |
|---|---|---|---|
| `<` | `("<", dest, a, b)` | Less than | 1 if `scratch[a] < scratch[b]`, else 0 |
| `==` | `("==", dest, a, b)` | Equality | 1 if `scratch[a] == scratch[b]`, else 0 |
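The three tables above can be read as one small evaluator. The sketch below is not the simulator's code; `alu_step` and the dict-based scratch are assumptions for illustration. It reduces every result modulo 2³², following the statement that all scalar operations wrap, and it does not guard against division by zero.

```python
MASK = 2**32  # results wrap at 2**32

# One Python function per ALU operation from the tables above.
ALU_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "//": lambda a, b: a // b,
    "cdiv": lambda a, b: (a + b - 1) // b,
    "%": lambda a, b: a % b,
    "^": lambda a, b: a ^ b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "<<": lambda a, b: a << b,
    ">>": lambda a, b: a >> b,
    "<": lambda a, b: int(a < b),
    "==": lambda a, b: int(a == b),
}

def alu_step(scratch, op):
    """Apply one (operation, dest, a, b) tuple to a scratch dict in place."""
    name, dest, a, b = op
    scratch[dest] = ALU_OPS[name](scratch[a], scratch[b]) % MASK

scratch = {0: 0xFFFFFFFF, 1: 1, 2: 7, 3: 2}
alu_step(scratch, ("+", 10, 0, 1))     # 0xFFFFFFFF + 1 wraps around
alu_step(scratch, ("-", 11, 1, 3))     # 1 - 2 wraps below zero
alu_step(scratch, ("cdiv", 12, 2, 3))  # ceiling of 7 / 2
alu_step(scratch, ("<", 13, 3, 2))     # compares scratch[3] with scratch[2]
```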
VALU Engine (Vector)
Slots per cycle: 6
Vector length: 8 (VLEN = 8)
Format: `("valu", (operation, dest, src1, src2, ...))`
Vector operations process 8 consecutive scratch addresses in parallel.
Binary Vector Operations
Most ALU operations have vector equivalents:
| Operation | Syntax | Description |
|---|---|---|
| `+` | `("+", dest, a, b)` | Vector addition |
| `-` | `("-", dest, a, b)` | Vector subtraction |
| `*` | `("*", dest, a, b)` | Vector multiplication |
| `//` | `("//", dest, a, b)` | Vector division |
| `%` | `("%", dest, a, b)` | Vector modulo |
| `^` | `("^", dest, a, b)` | Vector XOR |
| `&` | `("&", dest, a, b)` | Vector AND |
| `\|` | `("\|", dest, a, b)` | Vector OR |
| `<<` | `("<<", dest, a, b)` | Vector left shift |
| `>>` | `(">>", dest, a, b)` | Vector right shift |
| `<` | `("<", dest, a, b)` | Vector less than (returns 0/1 per element) |
| `==` | `("==", dest, a, b)` | Vector equality (returns 0/1 per element) |
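A sketch of how one vector slot might be interpreted, assuming that `dest`, `a`, and `b` are the base addresses of VLEN consecutive scratch words and that each lane wraps modulo 2³² like the scalar ALU. `valu_binary` is a hypothetical helper, and only a few operations are included to keep it short.

```python
import operator

VLEN = 8
MASK = 2**32

# A subset of the binary operations listed above.
VALU_OPS = {
    "+": operator.add,
    "*": operator.mul,
    "^": operator.xor,
    "<": lambda x, y: int(x < y),
}

def valu_binary(scratch, op):
    """Apply (operation, dest, a, b) lane by lane over VLEN scratch words."""
    name, dest, a, b = op
    # Read all lanes before writing so overlapping operands behave predictably.
    results = [VALU_OPS[name](scratch[a + i], scratch[b + i]) % MASK for i in range(VLEN)]
    scratch[dest:dest + VLEN] = results

scratch = [0] * 32
scratch[0:8] = [1, 2, 3, 4, 5, 6, 7, 8]
scratch[8:16] = [10] * 8
valu_binary(scratch, ("+", 16, 0, 8))   # lanes 16..23 = lanes 0..7 + lanes 8..15
valu_binary(scratch, ("<", 24, 0, 16))  # lanes 24..31 = 0/1 comparison results
```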
Special Vector Operations
vbroadcast - Scalar to Vector
Syntax: `("vbroadcast", dest, src)`
Broadcasts a scalar value to all vector elements.
multiply_add - Fused Multiply-Add
Syntax: `("multiply_add", dest, a, b, c)`
Computes (a * b) + c for each vector element. This is a fused operation: it does the work of two vector operations in one slot.
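The two special operations can be sketched in Python as follows. This assumes `src` is a single scratch address, that `dest`, `a`, `b`, and `c` are base addresses of VLEN-word vectors, and that results wrap modulo 2³²; the function names are hypothetical.

```python
VLEN = 8
MASK = 2**32

def vbroadcast(scratch, dest, src):
    """Copy the scalar at scratch[src] into all VLEN lanes starting at dest."""
    value = scratch[src]
    for i in range(VLEN):
        scratch[dest + i] = value

def multiply_add(scratch, dest, a, b, c):
    """Per lane: dest = (a * b) + c, wrapped to 32 bits."""
    results = [
        (scratch[a + i] * scratch[b + i] + scratch[c + i]) % MASK
        for i in range(VLEN)
    ]
    for i, value in enumerate(results):
        scratch[dest + i] = value

scratch = [0] * 40
scratch[0] = 3                        # scalar to broadcast
scratch[8:16] = list(range(8))        # vector a = 0..7
scratch[24:32] = [100] * 8            # vector c
vbroadcast(scratch, 16, 0)            # vector b = 3 in every lane
multiply_add(scratch, 32, 8, 16, 24)  # a * b + c
```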
Load Engine
Slots per cycle: 2
Format: `("load", (operation, ...))`
Handles reading from memory and loading constants.
Scalar Load Operations
- load
- const
- load_offset
Syntax for `load`: `("load", dest, addr)`
Loads from the memory address stored in scratch.
Vector Load Operations
vload - Load Vector
Syntax: `("vload", dest, addr)`
Loads 8 contiguous memory words.
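A sketch of the two loads documented above, assuming that `addr` names a scratch word holding the memory address and that `vload` fills VLEN consecutive scratch words starting at `dest`. Memory and scratch are modelled as plain Python lists, and the function names are hypothetical.

```python
VLEN = 8

def load(scratch, mem, dest, addr):
    """scratch[dest] = mem[address held in scratch[addr]]."""
    scratch[dest] = mem[scratch[addr]]

def vload(scratch, mem, dest, addr):
    """Copy VLEN contiguous memory words into scratch[dest:dest + VLEN]."""
    base = scratch[addr]
    for i in range(VLEN):
        scratch[dest + i] = mem[base + i]

mem = [v * 10 for v in range(32)]  # mem[k] == 10 * k
scratch = [0] * 16
scratch[0] = 4                     # pointer to mem[4]
load(scratch, mem, 1, 0)           # one word
vload(scratch, mem, 8, 0)          # eight words starting at mem[4]
```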
Store Engine
Slots per cycle: 2
Format: `("store", (operation, ...))`
Handles writing to memory.
- store
- vstore
Syntax for `store`: `("store", addr, src)`
Stores to the memory address stored in scratch. Unlike most operations, the first operand is `addr`, not a destination.
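The operand order is easy to get wrong, so here is a sketch of a scalar store, assuming `addr` is a scratch word that holds the target memory address and `src` is the scratch word holding the value. The function and the list-based memory model are hypothetical.

```python
def store(scratch, mem, addr, src):
    """mem[address held in scratch[addr]] = scratch[src]."""
    mem[scratch[addr]] = scratch[src]

mem = [0] * 16
scratch = [0] * 8
scratch[2] = 5     # pointer: write to mem[5]
scratch[3] = 99    # value to write
store(scratch, mem, 2, 3)  # operand order is (addr, src)
```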
Flow Engine
Slots per cycle: 1
Format: `("flow", (operation, ...))`
Handles control flow, conditional operations, and special instructions.
Conditional Operations
- select
- vselect
Syntax for `select`: `("select", dest, cond, a, b)`
Scalar conditional select (ternary operator).
Arithmetic with Immediate
add_imm - Add Immediate
Syntax: `("add_imm", dest, a, imm)`
Adds an immediate value without loading a constant. This saves a load slot compared to loading a constant and using an ALU add.
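A sketch of `select` and `add_imm` as described above. It assumes `cond`, `a`, and `b` are scratch addresses, that `select` picks `a` when the condition is non-zero, and that `imm` is a literal integer embedded in the instruction. These readings and the function names are assumptions, not taken from problem.py.

```python
MASK = 2**32

def select(scratch, dest, cond, a, b):
    """scratch[dest] = scratch[a] if scratch[cond] is non-zero, else scratch[b]."""
    scratch[dest] = scratch[a] if scratch[cond] != 0 else scratch[b]

def add_imm(scratch, dest, a, imm):
    """scratch[dest] = scratch[a] + imm, with imm taken from the instruction itself."""
    scratch[dest] = (scratch[a] + imm) % MASK

scratch = [0] * 8
scratch[0], scratch[1], scratch[2] = 1, 111, 222
select(scratch, 3, 0, 1, 2)   # condition is 1, so the first source is chosen
add_imm(scratch, 4, 3, 5)     # no constant had to be loaded into scratch
```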
Branch Operations
jump - Unconditional Jump
Syntax: `("jump", addr)`
Jumps to an instruction address.
Note: `addr` is a direct instruction address, not a scratch address.
cond_jump - Conditional Jump
Syntax: `("cond_jump", cond, addr)`
Jumps if the condition is non-zero.
cond_jump_rel - Conditional Relative Jump
Syntax: `("cond_jump_rel", cond, offset)`
Jumps relative to the current PC.
Example use: an offset of -10 loops back by 10 instructions.
jump_indirect - Indirect Jump
Syntax: `("jump_indirect", addr)`
Jumps to the address stored in scratch.
Use case: Computed jumps, function pointers, dispatch tables.
Control Operations
- halt
- pause
- trace_write
Syntax for `halt`: `("halt",)`
Stops core execution. Use at the end of your program.
Core Information
coreid - Get Core ID
Syntax: `("coreid", dest)`
Gets the current core ID.
Use case: Multi-core programs where each core needs its ID for work partitioning.
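The branch operations described earlier in this section can be summarised as a next-PC function. This is a sketch under assumptions: `cond` is read as a scratch address, `jump_indirect` reads its target from `scratch[addr]`, and the relative offset is applied to the PC of the branch instruction itself. The simulator may apply the offset to an already-incremented PC instead, so check problem.py before relying on exact offsets. `next_pc` is a hypothetical helper.

```python
def next_pc(pc, op, scratch):
    """Return the PC after executing one flow operation located at `pc`."""
    name = op[0]
    if name == "jump":
        return op[1]                      # direct instruction address
    if name == "cond_jump":
        _, cond, addr = op
        return addr if scratch[cond] != 0 else pc + 1
    if name == "cond_jump_rel":
        _, cond, offset = op
        # Assumption: offset is relative to the branch's own PC.
        return pc + offset if scratch[cond] != 0 else pc + 1
    if name == "jump_indirect":
        return scratch[op[1]]             # target read from scratch
    return pc + 1                         # any other operation falls through

# scratch[0] is a true condition, scratch[1] a false one, scratch[2] a jump target.
scratch = {0: 1, 1: 0, 2: 40}
```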
Debug Engine
Slots per cycle: 64
Cycle cost: 0 (debug operations are free)
Format: `("debug", (operation, ...))`
Debug Operations
- compare
- vcompare
- comment
Syntax for `compare`: `("compare", loc, key)`
Asserts that a scalar value matches the reference trace.
Debug instructions are disabled in the submission harness (via `enable_debug=False`), so you can leave them in your code without performance impact.
Real-World Example: Hash Function
The reference kernel's implementation of a hash stage is in perf_takehome.py.
Optimization Tips
Pack instructions into bundles
The reference kernel creates one instruction per slot. Pack multiple slots into a single bundle instead.
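A sketch of the idea, assuming the slots passed in are mutually independent (no slot reads a value that another slot in the same bundle writes) and using the per-engine limits from this page. `pack_slots` is a hypothetical helper; a real scheduler would also need dependency tracking.

```python
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1, "debug": 64}

def pack_slots(slots):
    """Greedily pack independent (engine, operation) pairs into bundles."""
    bundles = []
    for engine, op in slots:
        # Place the slot in the first bundle that still has room for this engine.
        for bundle in bundles:
            if len(bundle.get(engine, [])) < SLOT_LIMITS[engine]:
                bundle.setdefault(engine, []).append(op)
                break
        else:
            bundles.append({engine: [op]})
    return bundles

# Three independent loads and two independent adds (addresses are placeholders).
slots = [("load", ("load", 10 + i, i)) for i in range(3)]
slots += [("alu", ("+", 20 + i, i, i)) for i in range(2)]

unpacked = [{engine: [op]} for engine, op in slots]  # one instruction per slot
packed = pack_slots(slots)
```

With these five slots, the one-per-slot form needs five instructions, while the packed form needs two: the load engine's two-slot limit forces the third load into a second bundle.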
Vectorize when possible
Replace scalar loops with vector operations.
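A back-of-the-envelope sketch of why this helps, using only the slot limits and vector length stated on this page. It counts the minimum number of cycles needed to issue one element-wise operation over `n` elements, ignoring loads, stores, and dependencies. `min_cycles_scalar` and `min_cycles_vector` are hypothetical helpers.

```python
import math

ALU_SLOTS = 12   # scalar slots per cycle
VALU_SLOTS = 6   # vector slots per cycle
VLEN = 8         # elements per vector slot

def min_cycles_scalar(n):
    """Cycles to issue n scalar operations at ALU_SLOTS per cycle."""
    return math.ceil(n / ALU_SLOTS)

def min_cycles_vector(n):
    """Cycles to issue ceil(n / VLEN) vector operations at VALU_SLOTS per cycle."""
    return math.ceil(math.ceil(n / VLEN) / VALU_SLOTS)

n = 96
scalar_cycles = min_cycles_scalar(n)  # 96 operations, 12 per cycle
vector_cycles = min_cycles_vector(n)  # 12 vector operations, 6 per cycle
```

At full occupancy the vector engine covers 6 × 8 = 48 elements per cycle against 12 for the scalar engine, which is where the difference comes from.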
Reuse constants
Use `scratch_const()` to avoid reloading the same constant.
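The caching idea behind the constant-reuse tip can be sketched as follows. `ConstPool` is a hypothetical stand-in, not the `scratch_const()` implementation from the kernel code: it hands out a scratch address per distinct value and emits a `const` load only the first time a value is requested. The `("const", dest, value)` operand order is an assumption.

```python
class ConstPool:
    """Hypothetical constant cache: one scratch word and one load per value."""

    def __init__(self, first_free_addr):
        self.next_addr = first_free_addr
        self.addr_of = {}   # value -> scratch address
        self.loads = []     # load-engine slots emitted so far

    def get(self, value):
        if value not in self.addr_of:
            addr = self.next_addr
            self.next_addr += 1
            self.addr_of[value] = addr
            # Assumed operand order for the const operation.
            self.loads.append(("load", ("const", addr, value)))
        return self.addr_of[value]

pool = ConstPool(first_free_addr=100)
a = pool.get(12)
b = pool.get(12)   # reuses the address, no second load
c = pool.get(7)
```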
Use multiply_add when possible
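One way to apply this tip is a peephole pass over a list of VALU operations. The sketch below is an illustration under assumptions: `fuse_multiply_add` is hypothetical, it only looks at adjacent pairs, and it assumes the multiply's destination is a temporary that nothing else reads afterwards and that operand vectors do not partially overlap.

```python
def fuse_multiply_add(ops):
    """Rewrite ("*", t, a, b) followed by ("+", d, t, c) as one multiply_add."""
    fused = []
    i = 0
    while i < len(ops):
        op = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if op[0] == "*" and nxt is not None and nxt[0] == "+":
            _, tmp, a, b = op
            _, dest, x, y = nxt
            # Exactly one addend must be the product; the other becomes c.
            if (x == tmp) != (y == tmp):
                c = y if x == tmp else x
                fused.append(("multiply_add", dest, a, b, c))
                i += 2
                continue
        fused.append(op)
        i += 1
    return fused

ops = [("*", 32, 0, 8), ("+", 40, 32, 16), ("^", 48, 40, 0)]
fused = fuse_multiply_add(ops)
```

Because the arithmetic wraps modulo 2³², reducing the product before the addition or only once at the end gives the same lanes, so the fused form computes the same values in one slot instead of two.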
Minimize memory access
Keep frequently used data in scratch space.
Next Steps
- Simulator Overview: Learn about the execution model
- Building Your Kernel: Start optimizing your implementation