What is VLIW?
VLIW (Very Long Instruction Word) is a parallel execution model where multiple operations execute simultaneously within a single instruction cycle.
In this architecture, each “instruction” is actually an instruction bundle containing operations for multiple engines:
# Example instruction bundle - all operations execute in parallel
instr = {
    "valu": [("*", 4, 0, 0), ("+", 8, 4, 0)],
    "load": [("load", 16, 17)]
}
This single instruction executes:
Two vector ALU operations
One load operation
All happening in the same cycle.
Engine Slot Limits
Each engine can execute multiple “slots” per cycle, limited by hardware constraints:
SLOT_LIMITS = {
    "alu": 12,    # 12 scalar ALU operations per cycle
    "valu": 6,    # 6 vector ALU operations per cycle
    "load": 2,    # 2 load operations per cycle
    "store": 2,   # 2 store operations per cycle
    "flow": 1,    # 1 control flow operation per cycle
    "debug": 64,  # 64 debug operations per cycle (free)
}
The simulator enforces these limits with assertions. Exceeding a slot limit will cause a runtime error.
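To make the limit check concrete, here is a minimal sketch of how such an assertion might look. `validate_bundle` is a hypothetical helper, not part of the simulator's API; the `SLOT_LIMITS` table and bundle format are copied from above.

```python
# Sketch: validate an instruction bundle against the engine slot limits.
# validate_bundle is a hypothetical helper for illustration only.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2,
               "store": 2, "flow": 1, "debug": 64}

def validate_bundle(instr):
    for engine, slots in instr.items():
        assert engine in SLOT_LIMITS, f"unknown engine: {engine}"
        assert len(slots) <= SLOT_LIMITS[engine], (
            f"{engine}: {len(slots)} slots exceeds limit {SLOT_LIMITS[engine]}")

# The bundle from the example above fits well within the limits:
validate_bundle({"valu": [("*", 4, 0, 0), ("+", 8, 4, 0)],
                 "load": [("load", 16, 17)]})
```

A bundle with three `load` slots would trip the assertion, since only two loads may issue per cycle.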
Maximizing Parallelism
To optimize performance, pack as many operations as possible into each instruction bundle:
Poor packing (6 cycles):

# Each operation in its own instruction
[
    {"alu": [("+", 0, 1, 2)]},
    {"alu": [("-", 3, 4, 5)]},
    {"alu": [("*", 6, 7, 8)]},
    {"load": [("load", 9, 10)]},
    {"load": [("load", 11, 12)]},
    {"store": [("store", 13, 14)]}
]

Good packing (1 cycle):

# All operations in one instruction bundle
[{
    "alu": [
        ("+", 0, 1, 2),
        ("-", 3, 4, 5),
        ("*", 6, 7, 8)
    ],
    "load": [
        ("load", 9, 10),
        ("load", 11, 12)
    ],
    "store": [("store", 13, 14)]
}]
The well-packed version executes 6× faster!
What is SIMD?
SIMD (Single Instruction Multiple Data) allows one instruction to operate on multiple data elements simultaneously.
Vector Length
VLEN = 8  # All vector operations work on 8 elements
Vector operations process 8 consecutive 32-bit words in scratch space:
# Vector add: adds 8 pairs of elements in parallel
("valu", ("+", dest, a, b))

# Equivalent to 8 scalar operations:
# dest[0] = a[0] + b[0]
# dest[1] = a[1] + b[1]
# ...
# dest[7] = a[7] + b[7]
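The semantics can be sketched as a pure-Python model, assuming scratch space is a flat array of 32-bit words and the operands are scratch offsets (a simplification of the simulator's internals):

```python
# Sketch: toy model of the 8-lane vector add over a flat scratch space.
VLEN = 8

def valu_add(scratch, dest, a, b):
    # One valu "+" slot: 8 independent lane-wise additions, modulo 2^32.
    for i in range(VLEN):
        scratch[dest + i] = (scratch[a + i] + scratch[b + i]) & 0xFFFFFFFF

scratch = list(range(32))      # toy scratch space: scratch[i] = i
valu_add(scratch, 16, 0, 8)    # scratch[16..23] = scratch[0..7] + scratch[8..15]
print(scratch[16:24])          # [8, 10, 12, 14, 16, 18, 20, 22]
```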
Vector vs Scalar Operations
Scalar (alu)

Operations: Single 32-bit word operations
Slots: 12 per cycle
Example:

# Multiply two scalars
("alu", ("*", dest, a, b))
# dest ← a × b (single multiplication)

Use for: Control logic, addresses, loop counters, single values

Vector (valu)

Operations: 8-element vector operations
Slots: 6 per cycle
Example:

# Multiply two vectors
("valu", ("*", dest, a, b))
# dest[i] ← a[i] × b[i] for i in 0..7 (8 multiplications)

Use for: Data-parallel computations, batch processing
One valu operation does the work of 8 scalar operations using just 1 slot!
Vector Operation Examples
Broadcasting
Copy a scalar value to all elements of a vector:
# vbroadcast: dest[i] = src for all i in 0..7
("valu", ("vbroadcast", dest, src))

# Implementation:
for i in range(VLEN):
    self.scratch_write[dest + i] = core.scratch[src]
Vector Memory Operations
Load/store 8 contiguous elements:
# Load 8 consecutive memory locations
("load", ("vload", dest, addr))
# dest[i] = mem[addr + i] for i in 0..7

# Store 8 consecutive memory locations
("store", ("vstore", addr, src))
# mem[addr + i] = src[i] for i in 0..7
Vector load/store only supports contiguous memory access. For non-contiguous (strided or gathered) access, you must use scalar loads.
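The cost of that fallback is worth seeing. A sketch, assuming the slot format used above and address registers that already hold the strided addresses (`gather_slots` is a hypothetical helper, not a simulator primitive):

```python
# Sketch: gathering 8 non-contiguous elements with scalar loads.
# Each element needs its own ("load", dest, addr_reg) slot, so at
# 2 load slots per cycle a gather of 8 elements costs at least 4
# cycles, versus 1 cycle for a single vload of contiguous data.
def gather_slots(dest, addr_regs):
    return [("load", ("load", dest + i, r)) for i, r in enumerate(addr_regs)]

slots = gather_slots(dest=0, addr_regs=[8, 9, 10, 11, 12, 13, 14, 15])
print(len(slots))  # 8 scalar loads -> at least 4 cycles at 2 load slots/cycle
```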
Fused Multiply-Add
A special high-performance operation:
# multiply_add: dest[i] = (a[i] * b[i]) + c[i]
("valu", ("multiply_add", dest, a, b, c))
# Does the work of 2 vector ops in 1 slot!
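The "2 ops in 1 slot" claim can be checked with a toy lane-wise model (a sketch of the semantics, not the simulator's implementation):

```python
# Sketch: multiply_add produces the same lanes as a "*" followed by a "+".
def vmul(a, b):
    return [(x * y) & 0xFFFFFFFF for x, y in zip(a, b)]

def vadd(a, b):
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]

def multiply_add(a, b, c):
    # One fused slot: dest[i] = (a[i] * b[i]) + c[i]
    return [((x * y) + z) & 0xFFFFFFFF for x, y, z in zip(a, b, c)]

a, b, c = list(range(8)), list(range(8, 16)), list(range(16, 24))
assert multiply_add(a, b, c) == vadd(vmul(a, b), c)  # same result, half the valu slots
```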
Instruction Packing Example
Here’s how the reference kernel builds an instruction from slots:
class KernelBuilder:
    def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
        # Simple slot packing: one slot per instruction bundle
        instrs = []
        for engine, slot in slots:
            instrs.append({engine: [slot]})
        return instrs
This naive implementation creates one instruction per slot. For better performance, you should pack multiple slots into each instruction:
# Instead of:
slots = [
    ("alu", ("+", 0, 1, 2)),
    ("alu", ("-", 3, 4, 5)),
    ("load", ("load", 6, 7))
]
instructions = [{"alu": [("+", 0, 1, 2)]},
                {"alu": [("-", 3, 4, 5)]},
                {"load": [("load", 6, 7)]}]

# Do this:
instructions = [{
    "alu": [
        ("+", 0, 1, 2),
        ("-", 3, 4, 5)
    ],
    "load": [("load", 6, 7)]
}]
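One way to automate this is a greedy packer that fills each bundle up to the slot limits. This is a minimal sketch: it assumes all slots are independent, whereas a real packer must also respect data dependencies between operations.

```python
# Sketch: greedy VLIW packing up to the engine slot limits.
# Assumes independent slots; data dependencies are ignored here.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2,
               "store": 2, "flow": 1, "debug": 64}

def pack(slots):
    instrs = [{}]
    for engine, slot in slots:
        bundle = instrs[-1]
        if len(bundle.get(engine, [])) >= SLOT_LIMITS[engine]:
            bundle = {}          # engine is full: start a new bundle
            instrs.append(bundle)
        bundle.setdefault(engine, []).append(slot)
    return instrs

slots = [("alu", ("+", 0, 1, 2)), ("alu", ("-", 3, 4, 5)),
         ("load", ("load", 6, 7)), ("load", ("load", 8, 9)),
         ("load", ("load", 10, 11))]
print(len(pack(slots)))  # 2: the 3rd load overflows the 2-slot load limit
```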
Throughput Comparison
Operation Type | Slots/Cycle | Elements/Slot | Throughput
Scalar ALU     | 12          | 1             | 12 ops/cycle
Vector ALU     | 6           | 8             | 48 ops/cycle
Scalar Load    | 2           | 1             | 2 loads/cycle
Vector Load    | 2           | 8             | 16 loads/cycle
Vector operations provide 4× the ALU throughput (and 8× the load throughput) for data-parallel work!
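The table is just slots/cycle × elements/slot, which a few lines of arithmetic confirm:

```python
# Sketch: deriving the throughput table as slots/cycle * elements/slot.
VLEN = 8
throughput = {
    "scalar_alu": 12 * 1,      # 12 ops/cycle
    "vector_alu": 6 * VLEN,    # 48 ops/cycle
    "scalar_load": 2 * 1,      # 2 loads/cycle
    "vector_load": 2 * VLEN,   # 16 loads/cycle
}
print(throughput["vector_alu"] // throughput["scalar_alu"])    # 4x ALU speedup
print(throughput["vector_load"] // throughput["scalar_load"])  # 8x load speedup
```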
When to Use Each
Use scalar operations for...
Computing memory addresses
Loop counters and conditions
Branch decisions
Single values that don’t have parallel analogs
Operations that can’t be vectorized (dependencies)
Use vector operations for...
Processing batches of independent data
Element-wise array operations
Loading/storing contiguous memory blocks
Parallel hash computations
Any operation that can be expressed as “do the same thing to N items”
Next Steps
Memory Model: learn about memory layout and scratch space
Instruction Set: complete reference for all operations