Control unit

The control unit decodes instructions and generates control signals for all other TPU components. It implements a simple but complete instruction set architecture (ISA) for controlling matrix operations and neural network training.

Module interface

module control_unit (
    input logic [87:0] instruction,  // 88-bit instruction word
    
    // 1-bit control signals
    output logic sys_switch_in,
    output logic ub_rd_start_in,
    output logic ub_rd_transpose,
    output logic ub_wr_host_valid_in_1,
    output logic ub_wr_host_valid_in_2,
    
    // Multi-bit control fields
    output logic [1:0] ub_rd_col_size,
    output logic [7:0] ub_rd_row_size,
    output logic [1:0] ub_rd_addr_in,
    output logic [2:0] ub_ptr_sel,
    output logic [15:0] ub_wr_host_data_in_1,
    output logic [15:0] ub_wr_host_data_in_2,
    output logic [3:0] vpu_data_pathway,
    output logic [15:0] inv_batch_size_times_two_in,
    output logic [15:0] vpu_leak_factor_in
);

Source: control_unit.sv:4-34

The control unit is purely combinational: it decodes the instruction word into output signals without any sequential logic or state.

Instruction format

The ISA uses 88-bit instructions. The README documents the width as 94 bits, but the implementation uses 88:

Bits [87:0] - Complete instruction word

Each instruction directly encodes all control signals needed for one operation.

Bit field allocation

Bits [0-4]: Single-bit control signals (5 bits)

assign sys_switch_in = instruction[0];
assign ub_rd_start_in = instruction[1];
assign ub_rd_transpose = instruction[2];
assign ub_wr_host_valid_in_1 = instruction[3];
assign ub_wr_host_valid_in_2 = instruction[4];

Source: control_unit.sv:38-42

Bit	Signal	Function
0	`sys_switch_in`	Activate preloaded weights in systolic array
1	`ub_rd_start_in`	Start a unified buffer read operation
2	`ub_rd_transpose`	Transpose matrix during UB read
3	`ub_wr_host_valid_in_1`	Host write valid for column 1
4	`ub_wr_host_valid_in_2`	Host write valid for column 2

Bits [5-6]: Column size (2 bits)

assign ub_rd_col_size = instruction[6:5];

Source: control_unit.sv:45

Specifies the number of columns to read (0-3):

2'b00 = 0 columns
2'b01 = 1 column
2'b10 = 2 columns
2'b11 = 3 columns

Bits [7-14]: Row size (8 bits)

assign ub_rd_row_size = instruction[14:7];

Source: control_unit.sv:48

Specifies the number of rows to read (0-255).

Bits [15-16]: Read address (2 bits)

assign ub_rd_addr_in = instruction[16:15];

Source: control_unit.sv:51

Starting address in the unified buffer (0-3 in this implementation).

The address field is only 2 bits in the control unit implementation, limiting addresses to 0-3, but the unified buffer interface expects 16 bits. This appears to be a mismatch between the control unit and other modules.

Bits [17-19]: Pointer select (3 bits)

assign ub_ptr_sel = instruction[19:17];

Source: control_unit.sv:54

Selects which unified buffer read pointer to configure:

3'b000 = Input data pointer
3'b001 = Weight data pointer
3'b010 = Bias pointer
3'b011 = Y (target) pointer
3'b100 = H (activation) pointer
3'b101 = Gradient bias pointer
3'b110 = Gradient weight pointer
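
The pointer encodings above map naturally onto an enumeration on the host side. The following is a minimal Python sketch, not code from the repository: the class name `UbPtrSel`, its member names, and both helper functions are hypothetical. Only the numeric values and the bit position [19:17] come from the list above. The value 3'b111 is not listed, so the sketch rejects it.

```python
from enum import IntEnum


class UbPtrSel(IntEnum):
    """Hypothetical names for the ub_ptr_sel encodings listed above."""
    INPUT = 0b000
    WEIGHT = 0b001
    BIAS = 0b010
    Y_TARGET = 0b011
    H_ACTIVATION = 0b100
    GRAD_BIAS = 0b101
    GRAD_WEIGHT = 0b110


PTR_SEL_LSB = 17  # ub_ptr_sel occupies instruction bits [19:17]


def ptr_sel_bits(sel: UbPtrSel) -> int:
    """Return the selector shifted into its position in the instruction word."""
    return int(sel) << PTR_SEL_LSB


def ptr_sel_from_word(word: int) -> UbPtrSel:
    """Extract bits [19:17]; raises ValueError for the unlisted value 0b111."""
    return UbPtrSel((word >> PTR_SEL_LSB) & 0b111)
```

Using an enumeration rather than bare integers keeps the unlisted encoding from slipping into a generated instruction unnoticed.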

Bits [20-35]: Host write data 1 (16 bits)

assign ub_wr_host_data_in_1 = instruction[35:20];

Source: control_unit.sv:57

Data word to write to unified buffer column 1 when ub_wr_host_valid_in_1 is set.

Bits [36-51]: Host write data 2 (16 bits)

assign ub_wr_host_data_in_2 = instruction[51:36];

Source: control_unit.sv:60

Data word to write to unified buffer column 2 when ub_wr_host_valid_in_2 is set.

Bits [52-55]: VPU data pathway (4 bits)

assign vpu_data_pathway = instruction[55:52];

Source: control_unit.sv:63

Controls which VPU modules are active:

Bit [3]: Bias module
Bit [2]: Leaky ReLU module
Bit [1]: Loss derivative module
Bit [0]: Leaky ReLU derivative module

Common values:

4'b1100 = Forward pass (bias + Leaky ReLU)
4'b1111 = Transition (all modules)
4'b0001 = Backward pass (Leaky ReLU derivative only)
4'b0000 = No operation
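
Because each pathway bit enables one module, the common values above are bitwise ORs of the per-module bits. A minimal Python sketch of that relationship follows. The names `VpuPath`, `FORWARD`, `TRANSITION`, `BACKWARD`, and `active_modules` are hypothetical and chosen for illustration; only the bit assignments come from the list above.

```python
from enum import IntFlag


class VpuPath(IntFlag):
    """Hypothetical names for the four vpu_data_pathway bits."""
    LEAKY_RELU_DERIV = 0b0001  # bit 0
    LOSS_DERIV = 0b0010        # bit 1
    LEAKY_RELU = 0b0100        # bit 2
    BIAS = 0b1000              # bit 3


# The common values are combinations of the single-module bits.
FORWARD = VpuPath.BIAS | VpuPath.LEAKY_RELU
TRANSITION = (VpuPath.BIAS | VpuPath.LEAKY_RELU
              | VpuPath.LOSS_DERIV | VpuPath.LEAKY_RELU_DERIV)
BACKWARD = VpuPath.LEAKY_RELU_DERIV

# Checked from bit 3 down to bit 0, matching the order of the list above.
_MODULES = (VpuPath.BIAS, VpuPath.LEAKY_RELU,
            VpuPath.LOSS_DERIV, VpuPath.LEAKY_RELU_DERIV)


def active_modules(pathway: int) -> list[str]:
    """Return the names of the modules enabled by a 4-bit pathway value."""
    return [m.name for m in _MODULES if pathway & m]
```

For example, `active_modules(0b1100)` returns the bias and Leaky ReLU module names, which matches the forward-pass value.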

Bits [56-71]: Inverse batch size × 2 (16 bits)

assign inv_batch_size_times_two_in = instruction[71:56];

Source: control_unit.sv:66

Precomputed constant for the MSE loss gradient: 2 / batch_size in Q8.8 fixed-point format. For example, with batch_size = 32:

2/32 = 0.0625, and 0.0625 × 256 = 16 = 0x0010 in Q8.8
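
The conversion can be sketched in Python as a scale by 2^8 followed by rounding. The function names `to_q8_8` and `inv_batch_size_times_two` are hypothetical. The sketch also assumes round-to-nearest and limits values to the range 0 to 0x7FFF so the result is valid whether the hardware treats the field as signed or unsigned, which is not stated here.

```python
def to_q8_8(value: float) -> int:
    """Convert a non-negative real number to a 16-bit Q8.8 word.

    Q8.8 has 8 fractional bits, so the raw word is value * 2**8.
    Python's round() rounds exact .5 ties to the nearest even integer.
    """
    raw = round(value * 256)
    if not 0 <= raw <= 0x7FFF:  # assumption: stay within the signed range
        raise ValueError(f"{value} is out of range for this Q8.8 sketch")
    return raw


def inv_batch_size_times_two(batch_size: int) -> int:
    """Return 2 / batch_size as a Q8.8 word for instruction bits [71:56]."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return to_q8_8(2 / batch_size)
```

With `batch_size = 32` this yields 16, the 0x0010 value from the example above.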

Bits [72-87]: VPU leak factor (16 bits)

assign vpu_leak_factor_in = instruction[87:72];

Source: control_unit.sv:69

Leaky ReLU leak factor (α) in Q8.8 fixed-point format. For example, with α = 0.01:

0.01 × 256 = 2.56, which rounds to 3 = 0x0003 in Q8.8
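
Q8.8 has a resolution of 1/256, so small leak factors are quantized coarsely. The Python sketch below converts a value to Q8.8 and back to show what the hardware would actually receive. The function name `q8_8_round_trip` is hypothetical, and round-to-nearest is an assumption.

```python
def q8_8_round_trip(value: float) -> tuple[int, float]:
    """Return (raw Q8.8 word, the real value that word represents)."""
    raw = round(value * 256)  # 8 fractional bits
    return raw, raw / 256


raw, effective = q8_8_round_trip(0.01)
relative_error = abs(effective - 0.01) / 0.01

print(f"raw word       : 0x{raw:04x}")
print(f"effective alpha: {effective}")
print(f"relative error : {relative_error:.1%}")
```

For α = 0.01 the stored word is 3, which represents 3/256 = 0.01171875, so the effective leak factor is about 17% larger than requested. That is worth keeping in mind when comparing hardware results against a floating-point reference.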

Instruction encoding

Complete bit layout

┌─────────────────────────────────────────────────────────────────────────────────────────┐
│ Bit 87                                                                            Bit 0 │
├────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┤
│ Leak   │ Inv BS │ VPU    │ Data 2 │ Data 1 │ Ptr    │ Addr   │ Row    │ Col    │ Flags  │
│ Factor │ ×2     │ Pathway│        │        │ Sel    │        │ Size   │ Size   │        │
│ 16 bit │ 16 bit │ 4 bit  │ 16 bit │ 16 bit │ 3 bit  │ 2 bit  │ 8 bit  │ 2 bit  │ 5 bit  │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
  87-72    71-56    55-52    51-36    35-20    19-17    16-15    14-7     6-5      4-0
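
The layout above can be captured as a table of (least significant bit, width) pairs, from which both packing and unpacking follow. The sketch below is a host-side Python illustration, not code from the repository. `FIELDS`, `encode`, and `decode` are hypothetical names; the field names and bit positions are taken from the assign statements in control_unit.sv as quoted earlier.

```python
# Field name -> (least significant bit, width in bits).
FIELDS = {
    "sys_switch_in": (0, 1),
    "ub_rd_start_in": (1, 1),
    "ub_rd_transpose": (2, 1),
    "ub_wr_host_valid_in_1": (3, 1),
    "ub_wr_host_valid_in_2": (4, 1),
    "ub_rd_col_size": (5, 2),
    "ub_rd_row_size": (7, 8),
    "ub_rd_addr_in": (15, 2),
    "ub_ptr_sel": (17, 3),
    "ub_wr_host_data_in_1": (20, 16),
    "ub_wr_host_data_in_2": (36, 16),
    "vpu_data_pathway": (52, 4),
    "inv_batch_size_times_two_in": (56, 16),
    "vpu_leak_factor_in": (72, 16),
}


def encode(**values: int) -> int:
    """Pack named fields into an 88-bit word; omitted fields are zero."""
    word = 0
    for name, value in values.items():
        lsb, width = FIELDS[name]  # KeyError for an unknown field name
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lsb
    return word


def decode(word: int) -> dict[str, int]:
    """Mirror the combinational decode: slice every field out of the word."""
    return {
        name: (word >> lsb) & ((1 << width) - 1)
        for name, (lsb, width) in FIELDS.items()
    }
```

The range check in `encode` catches values that would silently spill into a neighboring field, such as an address above 3 in the 2-bit `ub_rd_addr_in` field.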

Example instructions

Load weights from host

Write weight values 0x1000 and 0x2000 to unified buffer:

instruction = {
    16'h0000,  // [87:72] leak_factor (unused)
    16'h0000,  // [71:56] inv_batch_size (unused)
    4'b0000,   // [55:52] vpu_pathway (unused)
    16'h2000,  // [51:36] host_data_2
    16'h1000,  // [35:20] host_data_1
    3'b000,    // [19:17] ptr_sel (unused)
    2'b00,     // [16:15] addr (unused)
    8'h00,     // [14:7]  row_size (unused)
    2'b00,     // [6:5]   col_size (unused)
    1'b1,      // [4]     host_valid_2 = 1
    1'b1,      // [3]     host_valid_1 = 1
    1'b0,      // [2]     transpose = 0
    1'b0,      // [1]     rd_start = 0
    1'b0       // [0]     switch = 0
};

Start systolic array read

Read 2×2 input matrix from address 0 (no transpose):

instruction = {
    16'h0000,  // [87:72] leak_factor
    16'h0000,  // [71:56] inv_batch_size
    4'b0000,   // [55:52] vpu_pathway
    16'h0000,  // [51:36] host_data_2
    16'h0000,  // [35:20] host_data_1
    3'b000,    // [19:17] ptr_sel = 0 (input pointer)
    2'b00,     // [16:15] addr = 0
    8'h02,     // [14:7]  row_size = 2
    2'b10,     // [6:5]   col_size = 2
    1'b0,      // [4]     host_valid_2 = 0
    1'b0,      // [3]     host_valid_1 = 0
    1'b0,      // [2]     transpose = 0
    1'b1,      // [1]     rd_start = 1
    1'b0       // [0]     switch = 0
};
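
As a cross-check of the concatenation above, the same word can be computed with plain shifts, since every field other than rd_start, col_size, and row_size is zero. This Python sketch is illustrative only; `as_sv_literal` is a hypothetical helper that formats the result as an 88-bit SystemVerilog hex literal (22 hex digits).

```python
# Non-zero field values from the read example; all other fields are zero.
rd_start = 1   # bit 1
col_size = 2   # bits [6:5]
row_size = 2   # bits [14:7]

# 1 << 1 = 2, 2 << 5 = 64, 2 << 7 = 256, so the word is 2 + 64 + 256 = 322.
word = (rd_start << 1) | (col_size << 5) | (row_size << 7)


def as_sv_literal(value: int) -> str:
    """Format an 88-bit word as a zero-padded SystemVerilog hex literal."""
    if not 0 <= value < (1 << 88):
        raise ValueError("value does not fit in 88 bits")
    return f"88'h{value:022x}"


print(hex(word))            # 0x142
print(as_sv_literal(word))
```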

Activate weights

Switch systolic array to use preloaded weights:

instruction = {
    83'b0,     // [87:5]  all other fields zero
    1'b0,      // [4]     host_valid_2 = 0
    1'b0,      // [3]     host_valid_1 = 0
    1'b0,      // [2]     transpose = 0
    1'b0,      // [1]     rd_start = 0
    1'b1       // [0]     switch = 1
};

Instruction sequencing

Instructions are loaded from an instruction buffer in testbenches:

# From test_tpu.py
instructions = [
    load_weight_instruction,
    switch_weights_instruction,
    read_input_instruction,
    configure_vpu_instruction,
    # ...
]
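
How the testbench stores and feeds these instructions is not shown here. As one possible approach, an 88-bit instruction fits exactly in 11 bytes, so a sequence can be serialized into a flat byte buffer. The Python sketch below is an assumption-laden illustration: `pack_program`, `unpack_program`, and the big-endian byte order are hypothetical choices, not the format used by tests/test_tpu.py.

```python
INSTRUCTION_BITS = 88
INSTRUCTION_BYTES = INSTRUCTION_BITS // 8  # 11 bytes per instruction


def pack_program(instructions: list[int]) -> bytes:
    """Serialize 88-bit instruction words, in order, 11 bytes each."""
    out = bytearray()
    for index, word in enumerate(instructions):
        if not 0 <= word < (1 << INSTRUCTION_BITS):
            raise ValueError(f"instruction {index} does not fit in 88 bits")
        out += word.to_bytes(INSTRUCTION_BYTES, "big")  # assumed byte order
    return bytes(out)


def unpack_program(blob: bytes) -> list[int]:
    """Inverse of pack_program."""
    if len(blob) % INSTRUCTION_BYTES:
        raise ValueError("buffer length is not a multiple of 11 bytes")
    return [
        int.from_bytes(blob[i:i + INSTRUCTION_BYTES], "big")
        for i in range(0, len(blob), INSTRUCTION_BYTES)
    ]
```

A fixed 11-byte stride keeps instruction N at byte offset 11 × N, which makes it easy to inspect a single instruction in a dump.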

See tests/test_tpu.py for complete instruction sequences implementing forward and backward passes.

Design philosophy

VLIW-style encoding

The instruction format resembles Very Long Instruction Word (VLIW) architectures:

Each instruction is very wide (88 bits)
All control signals encoded directly
No instruction decode complexity
Single-cycle decode (combinational)

Advantages

Simple hardware: No state machines or complex decode logic
Deterministic timing: One instruction = one operation
Flexible control: Can configure all units simultaneously
Easy debugging: Instructions are human-readable bit patterns

Tradeoffs

Large instruction size: 88 bits per instruction
Low code density: Many bits unused in each instruction
No instruction reuse: No loops or subroutines in hardware
Host-dependent: Requires external instruction generation

The large instruction width is acceptable because instructions are generated by software and stored off-chip. The simplicity of hardware decode is more important than code density for this architecture.

Future improvements

From the README, future work includes:

Compiler: Automatic instruction generation from high-level operations
```
# High-level API (future)
tpu.load_weights(W1, address=8)
tpu.forward_pass(X, W1, b1)
```
This would compile to a sequence of 88-bit instructions.
Extended addressing: Increase address field width for larger buffers
Compressed encoding: Add instruction compression for repeated patterns

Related documentation

See the instruction set documentation for:

Complete ISA reference
Field encoding details
Example instruction sequences
Assembly format (if implemented)
