Machine

The Machine class simulates a custom Very Large Instruction Word (VLIW) architecture with SIMD capabilities. It executes programs built by KernelBuilder and provides tracing and debugging features.

Class: Machine

A cycle-accurate simulator for a custom VLIW SIMD processor with multiple execution engines.

Architecture Overview

VLIW (Very Large Instruction Word):

Multiple execution engines operate in parallel each cycle
Each engine can execute multiple “slots” per cycle
Slot limits defined in SLOT_LIMITS:
- alu: 12 slots (scalar arithmetic/logic)
- valu: 6 slots (vector arithmetic/logic)
- load: 2 slots (memory reads)
- store: 2 slots (memory writes)
- flow: 1 slot (control flow)
- debug: 64 slots (debug operations)

SIMD (Single Instruction, Multiple Data):

Vector operations process VLEN=8 elements simultaneously
Vector instructions: vload, vstore, valu, vselect, vbroadcast
Vectors occupy consecutive scratch addresses

Memory Model:

32-bit word addressable memory
Scratch space acts as registers (1536 words by default)
All writes take effect at end of cycle (no intra-cycle dependencies)

Constructor

machine = Machine(
    mem_dump,
    program,
    debug_info,
    n_cores=1,
    scratch_size=1536,
    trace=False,
    value_trace={}
)

mem_dump

list[int]

required

Initial memory contents (32-bit words)

program

list[Instruction]

required

List of instruction bundles to execute. Each instruction is a dict mapping engine names to lists of slots.

debug_info

DebugInfo

required

Debug information from KernelBuilder (scratch variable names)

n_cores

int

default:"1"

Number of processor cores (current version uses 1)

scratch_size

int

default:"1536"

Size of scratch space per core in 32-bit words

trace

bool

default:"False"

Enable Perfetto trace generation for visualization

value_trace

dict[Any, int]

default:"{}"

Dictionary for debug value tracking (used by debug engine)

problem.py:97-122

def __init__(
    self,
    mem_dump: list[int],
    program: list[Instruction],
    debug_info: DebugInfo,
    n_cores: int = 1,
    scratch_size: int = SCRATCH_SIZE,
    trace: bool = False,
    value_trace: dict[Any, int] = {},
):
    self.cores = [
        Core(id=i, scratch=[0] * scratch_size, trace_buf=[]) for i in range(n_cores)
    ]
    self.mem = copy(mem_dump)
    self.program = program
    self.debug_info = debug_info
    self.value_trace = value_trace
    self.prints = False
    self.cycle = 0
    self.enable_pause = True
    self.enable_debug = True
    if trace:
        self.setup_trace()
    else:
        self.trace = None

Execution Methods

run()

Executes the program until all cores halt or pause.

problem.py:197-217

def run(self):
    for core in self.cores:
        if core.state == CoreState.PAUSED:
            core.state = CoreState.RUNNING
    while any(c.state == CoreState.RUNNING for c in self.cores):
        has_non_debug = False
        for core in self.cores:
            if core.state != CoreState.RUNNING:
                continue
            if core.pc >= len(self.program):
                core.state = CoreState.STOPPED
                continue
            instr = self.program[core.pc]
            if self.prints:
                self.print_step(instr, core)
            core.pc += 1
            self.step(instr, core)
            if any(name != "debug" for name in instr.keys()):
                has_non_debug = True
        if has_non_debug:
            self.cycle += 1

The cycle counter only increments for instructions with non-debug operations. This is what the performance test measures.

step()

Executes a single instruction bundle across all engines.

instr

Instruction

required

Instruction bundle (dict mapping engines to slot lists)

core

Core

required

Core to execute on

problem.py:352-397

def step(self, instr: Instruction, core):
    """
    Execute all the slots in each engine for a single instruction bundle
    """
    ENGINE_FNS = {
        "alu": self.alu,
        "valu": self.valu,
        "load": self.load,
        "store": self.store,
        "flow": self.flow,
    }
    self.scratch_write = {}
    self.mem_write = {}
    for name, slots in instr.items():
        if name == "debug":
            if not self.enable_debug:
                continue
            for slot in slots:
                if slot[0] == "compare":
                    loc, key = slot[1], slot[2]
                    ref = self.value_trace[key]
                    res = core.scratch[loc]
                    assert res == ref, f"{res} != {ref} for {key} at pc={core.pc}"
                elif slot[0] == "vcompare":
                    loc, keys = slot[1], slot[2]
                    ref = [self.value_trace[key] for key in keys]
                    res = core.scratch[loc : loc + VLEN]
                    assert res == ref, (
                        f"{res} != {ref} for {keys} at pc={core.pc} loc={loc}"
                    )
            continue
        assert len(slots) <= SLOT_LIMITS[name]
        for i, slot in enumerate(slots):
            if self.trace is not None:
                self.trace_slot(core, slot, name, i)
            ENGINE_FNS[name](core, *slot)
    for addr, val in self.scratch_write.items():
        core.scratch[addr] = val
    for addr, val in self.mem_write.items():
        self.mem[addr] = val

    if self.trace:
        self.trace_post_step(instr, core)

    del self.scratch_write
    del self.mem_write

All writes (scratch and memory) take effect at the end of the cycle. Reading and writing the same location in one instruction uses the old value.

setup_trace()

Initializes Perfetto trace generation for visualization.

problem.py:151-196

def setup_trace(self):
    """
    The simulator generates traces in Chrome's Trace Event Format for
    visualization in Perfetto (or chrome://tracing if you prefer it). See
    the bottom of the file for info about how to use this.

    See the format docs in case you want to add more info to the trace:
    https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview
    """
    self.trace = open("trace.json", "w")
    self.trace.write("[")
    # ... trace initialization code ...

Run python perf_takehome.py Tests.test_kernel_trace then python watch_trace.py to view execution traces in Perfetto. This is the recommended debugging workflow.

Engine Methods

Each execution engine implements specific operations. All addresses refer to scratch space unless otherwise noted.

alu()

Scalar arithmetic and logic operations.

core

Core

required

Core executing the operation

str

required

Operation: +, -, *, //, cdiv, ^, &, |, <<, >>, %, <, ==

dest

int

required

Destination scratch address

int

required

First operand scratch address

int

required

Second operand scratch address

problem.py:219-252

def alu(self, core, op, dest, a1, a2):
    a1 = core.scratch[a1]
    a2 = core.scratch[a2]
    match op:
        case "+":
            res = a1 + a2
        case "-":
            res = a1 - a2
        case "*":
            res = a1 * a2
        case "//":
            res = a1 // a2
        case "cdiv":
            res = cdiv(a1, a2)
        case "^":
            res = a1 ^ a2
        case "&":
            res = a1 & a2
        case "|":
            res = a1 | a2
        case "<<":
            res = a1 << a2
        case ">>":
            res = a1 >> a2
        case "%":
            res = a1 % a2
        case "<":
            res = int(a1 < a2)
        case "==":
            res = int(a1 == a2)
        case _:
            raise NotImplementedError(f"Unknown alu op {op}")
    res = res % (2**32)
    self.scratch_write[dest] = res

Example:

{"alu": [("*", 2, 0, 1), ("+", 3, 2, 0)]}  # scratch[2] = scratch[0] * scratch[1]
                                             # scratch[3] = scratch[2] + scratch[0]

valu()

Vector arithmetic and logic operations (operates on VLEN=8 elements).

core

Core

required

Core executing the operation

slot

tuple

required

Vector operation slot

Supported operations:

("vbroadcast", dest, src): Broadcast scalar to vector
("multiply_add", dest, a, b, c): dest[i] = a[i] * b[i] + c[i]
(op, dest, a, b): Element-wise ALU operation

problem.py:254-267

def valu(self, core, *slot):
    match slot:
        case ("vbroadcast", dest, src):
            for i in range(VLEN):
                self.scratch_write[dest + i] = core.scratch[src]
        case ("multiply_add", dest, a, b, c):
            for i in range(VLEN):
                mul = (core.scratch[a + i] * core.scratch[b + i]) % (2**32)
                self.scratch_write[dest + i] = (mul + core.scratch[c + i]) % (2**32)
        case (op, dest, a1, a2):
            for i in range(VLEN):
                self.alu(core, op, dest + i, a1 + i, a2 + i)
        case _:
            raise NotImplementedError(f"Unknown valu op {slot}")

Example:

{"valu": [("vbroadcast", 8, 0), ("*", 16, 8, 24)]}  # Broadcast scratch[0] to scratch[8:16]
                                                      # scratch[16:24] = scratch[8:16] * scratch[24:32]

load()

Memory and constant loading operations. Supported operations:

("load", dest, addr): scratch[dest] = mem[scratch[addr]]
("load_offset", dest, addr, offset): scratch[dest+offset] = mem[scratch[addr+offset]]
("vload", dest, addr): Load 8 contiguous words
("const", dest, val): scratch[dest] = val

problem.py:269-286

def load(self, core, *slot):
    match slot:
        case ("load", dest, addr):
            self.scratch_write[dest] = self.mem[core.scratch[addr]]
        case ("load_offset", dest, addr, offset):
            self.scratch_write[dest + offset] = self.mem[
                core.scratch[addr + offset]
            ]
        case ("vload", dest, addr):  # addr is a scalar
            addr = core.scratch[addr]
            for vi in range(VLEN):
                self.scratch_write[dest + vi] = self.mem[addr + vi]
        case ("const", dest, val):
            self.scratch_write[dest] = (val) % (2**32)
        case _:
            raise NotImplementedError(f"Unknown load op {slot}")

store()

Memory write operations. Supported operations:

("store", addr, src): mem[scratch[addr]] = scratch[src]
("vstore", addr, src): Store 8 contiguous words

problem.py:288-298

def store(self, core, *slot):
    match slot:
        case ("store", addr, src):
            addr = core.scratch[addr]
            self.mem_write[addr] = core.scratch[src]
        case ("vstore", addr, src):  # addr is a scalar
            addr = core.scratch[addr]
            for vi in range(VLEN):
                self.mem_write[addr + vi] = core.scratch[src + vi]
        case _:
            raise NotImplementedError(f"Unknown store op {slot}")

flow()

Control flow and special operations. Supported operations:

("select", dest, cond, a, b): dest = a if cond != 0 else b
("vselect", dest, cond, a, b): Vector select
("add_imm", dest, a, imm): dest = a + imm (immediate)
("halt",): Stop execution
("pause",): Pause until next run() call
("jump", addr): Jump to instruction address
("cond_jump", cond, addr): Conditional jump
("cond_jump_rel", cond, offset): Relative conditional jump
("jump_indirect", addr): pc = scratch[addr]
("coreid", dest): dest = core.id
("trace_write", val): Append to trace buffer

problem.py:300-335

def flow(self, core, *slot):
    match slot:
        case ("select", dest, cond, a, b):
            self.scratch_write[dest] = (
                core.scratch[a] if core.scratch[cond] != 0 else core.scratch[b]
            )
        case ("add_imm", dest, a, imm):
            self.scratch_write[dest] = (core.scratch[a] + imm) % (2**32)
        case ("vselect", dest, cond, a, b):
            for vi in range(VLEN):
                self.scratch_write[dest + vi] = (
                    core.scratch[a + vi]
                    if core.scratch[cond + vi] != 0
                    else core.scratch[b + vi]
                )
        case ("halt",):
            core.state = CoreState.STOPPED
        case ("pause",):
            if self.enable_pause:
                core.state = CoreState.PAUSED
        # ... more flow operations ...

Core Class

Represents a single processor core.

int

Core identifier

scratch

list[int]

Scratch space (register file)

trace_buf

list[int]

Trace buffer for debug output

int

default:"0"

Program counter

state

CoreState

default:"RUNNING"

Current execution state

problem.py:24-30

@dataclass
class Core:
    id: int
    scratch: list[int]
    trace_buf: list[int]
    pc: int = 0
    state: CoreState = CoreState.RUNNING

CoreState Enum

Execution states for processor cores.

RUNNING

CoreState

Core is actively executing instructions

PAUSED

CoreState

Core is paused (hit a pause instruction)

STOPPED

CoreState

Core has halted or reached end of program

problem.py:18-22

class CoreState(Enum):
    RUNNING = 1
    PAUSED = 2
    STOPPED = 3

Example: Instruction Format

program = [
    # Load constants
    {"load": [("const", 0, 10), ("const", 1, 20)]},
    
    # Parallel ALU operations
    {"alu": [
        ("+", 2, 0, 1),   # scratch[2] = scratch[0] + scratch[1]
        ("*", 3, 0, 1),   # scratch[3] = scratch[0] * scratch[1]
    ]},
    
    # Vector broadcast and operation
    {"valu": [
        ("vbroadcast", 8, 0),  # Broadcast scratch[0] to scratch[8:16]
        ("+", 16, 8, 24),      # Vector add
    ]},
    
    # Memory operations
    {"load": [("vload", 32, 2)],
     "store": [("vstore", 3, 16)]},
    
    # Control flow
    {"flow": [("halt",)]}
]

machine = Machine(mem, program, debug_info)
machine.run()
print(f"Cycles: {machine.cycle}")

Performance Characteristics

Scalar ALU

12 operations per cycle

Vector ALU

6 operations per cycle (48 elements with VLEN=8)

Memory Bandwidth

2 loads + 2 stores per cycle (or 16 loads + 16 stores with vectors)

Control Flow

1 flow operation per cycle

Maximize instruction-level parallelism by packing multiple independent operations into each cycle. Use vector instructions to process 8x more data per cycle.

Core Classes

Instructions

Testing

Class: Machine

Architecture Overview

Constructor

Execution Methods

run()

step()

setup_trace()

Engine Methods

alu()

valu()

load()

store()

flow()

Core Class

CoreState Enum

Example: Instruction Format

Performance Characteristics

Scalar ALU

Vector ALU

Memory Bandwidth

Control Flow

Build docs developers (and LLMs) love

Core Classes

Instructions

Testing

​Class: Machine

​Architecture Overview

​Constructor

​Execution Methods

​run()

​step()

​setup_trace()

​Engine Methods

​alu()

​valu()

​load()

​store()

​flow()

​Core Class

​CoreState Enum

​Example: Instruction Format

​Performance Characteristics

Scalar ALU

Vector ALU

Memory Bandwidth

Control Flow

Build docs developers (and LLMs) love

Class: Machine

Architecture Overview

Constructor

Execution Methods

run()

step()

setup_trace()

Engine Methods

alu()

valu()

load()

store()

flow()

Core Class

CoreState Enum

Example: Instruction Format

Performance Characteristics