Systolic array

The systolic array is a 2D grid of processing elements arranged to perform efficient matrix multiplication. Data flows rhythmically through the array in a systolic pattern, similar to how blood flows through the heart.

Module interface

module systolic #(
    parameter int SYSTOLIC_ARRAY_WIDTH = 2
)(
    input logic clk,
    input logic rst,

    // Input signals from left side of systolic array
    input logic [15:0] sys_data_in_11,
    input logic [15:0] sys_data_in_21,
    input logic sys_start,    // start signal

    output logic [15:0] sys_data_out_21,
    output logic [15:0] sys_data_out_22,
    output wire sys_valid_out_21,
    output wire sys_valid_out_22,

    // Input signals from top of systolic array
    input logic [15:0] sys_weight_in_11,
    input logic [15:0] sys_weight_in_12,
    input logic sys_accept_w_1,   // accept weight column 1
    input logic sys_accept_w_2,   // accept weight column 2

    input logic sys_switch_in,    // switch signal

    input logic [15:0] ub_rd_col_size_in,
    input logic ub_rd_col_size_valid_in
);

Source: systolic.sv:5-31

Array topology

The Tiny TPU implements a 2×2 systolic array with four processing elements:

        Weight In 11    Weight In 12
             ↓               ↓
           ┌───┐           ┌───┐
Data 11 →  │PE │  Data →   │PE │
           │11 │           │12 │
           └───┘           └───┘
             ↓               ↓
           PSum            PSum
             ↓               ↓
           ┌───┐           ┌───┐
Data 21 →  │PE │  Data →   │PE │
           │21 │           │22 │
           └───┘           └───┘
             ↓               ↓
          Out 21          Out 22

PE instantiation

The array instantiates four PE modules. PE11, the top-left element, is wired as follows:

pe pe11 (
    .clk(clk),
    .rst(rst),
    .pe_enabled(pe_enabled[0]),
    .pe_valid_in(sys_start),
    .pe_valid_out(pe_valid_out_11),
    .pe_accept_w_in(sys_accept_w_1),
    .pe_switch_in(sys_switch_in),
    .pe_switch_out(pe_switch_out_11),
    .pe_input_in(sys_data_in_11),
    .pe_psum_in(16'b0),              // Top row: no partial sum input
    .pe_weight_in(sys_weight_in_11),
    .pe_input_out(pe_input_out_11),
    .pe_psum_out(pe_psum_out_11),
    .pe_weight_out(pe_weight_out_11)
);

Source: systolic.sv:56-74

The top row PEs receive pe_psum_in = 16'b0 since there are no PEs above them. Partial sums start accumulating from zero.

Data flow patterns

Horizontal flow (activations)

Input activations flow from left to right across rows:

sys_data_in_11 enters PE11
PE11 forwards to PE12 via pe_input_out_11
sys_data_in_21 enters PE21
PE21 forwards to PE22 via pe_input_out_21

Each PE delays the data by one clock cycle, creating a staggered pattern.

Vertical flow (partial sums)

Partial sums flow from top to bottom down columns:

PE11 outputs pe_psum_out_11 (first partial sum)
pe_psum_out_11 enters PE21 as pe_psum_in
PE21 adds its contribution and outputs sys_data_out_21

This accumulation implements the dot product:

sys_data_out_21 = (data_11 × weight_11) + (data_21 × weight_21)
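The accumulation down one column can be sketched as a small Python reference model. This is an illustration only: `column_output` is a hypothetical helper, values are plain integers rather than the 16-bit words the ports carry, and timing is ignored.

```python
def column_output(inputs, weights, psum_in=0):
    """Accumulate a partial sum down one column of PEs, top to bottom."""
    psum = psum_in  # top-row PEs receive a zero partial sum
    for x, w in zip(inputs, weights):
        psum += x * w  # each PE adds input * weight to the incoming sum
    return psum


# Column 1 of the 2x2 array: PE11 first, then PE21.
print(column_output(inputs=[1, 2], weights=[5, 7]))  # 19
```

With two PEs per column the loop runs twice, which is exactly the two-term dot product shown above.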

Vertical flow (weights)

During weight loading, weights propagate from top to bottom:

sys_weight_in_11 enters PE11
PE11 forwards to PE21 via pe_weight_out_11
Both PEs store the weight in their inactive registers

pe pe21 (
    // ...
    .pe_weight_in(pe_weight_out_11),  // Weight from PE above
    // ...
);

Source: systolic.sv:110

Diagonal flow (switch signal)

The sys_switch_in signal propagates diagonally (top-left to bottom-right):

PE11 receives sys_switch_in
  ↓
PE12 and PE21 receive pe_switch_out_11
  ↓
PE22 receives pe_switch_out_12

This ensures all PEs activate their weights in the correct sequence.
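A rough way to see the resulting order is to compute the cycle offset at which each PE sees the switch. The sketch below assumes every PE registers its switch output for exactly one cycle, which this page does not confirm; `switch_arrival` is a hypothetical helper, not part of the design.

```python
def switch_arrival(rows=2, cols=2):
    """Cycle offset at which each PE sees the switch signal.

    Assumption: each hop through a PE costs one clock cycle, so a PE in
    row r, column c (0-indexed) sees the switch r + c cycles after PE11.
    """
    arrival = {}
    for r in range(rows):
        for c in range(cols):
            arrival[f"PE{r + 1}{c + 1}"] = r + c
    return arrival


print(switch_arrival())  # {'PE11': 0, 'PE12': 1, 'PE21': 1, 'PE22': 2}
```

Under that assumption PE12 and PE21 switch together one cycle after PE11, and PE22 follows one cycle later, which matches the three steps listed above.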

Weight management

Weight loading

Weights are loaded column-by-column using separate control signals:

input logic sys_accept_w_1,  // Enable weight loading for column 1 (PE11, PE21)
input logic sys_accept_w_2,  // Enable weight loading for column 2 (PE12, PE22)

Column 1 weights flow through:

sys_weight_in_11 → PE11 → PE21

Column 2 weights flow through:

sys_weight_in_12 → PE12 → PE22

Weight activation

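After loading, assert sys_switch_in to activate the newly loaded weights. Activation is not simultaneous across the array: the switch reaches the PEs in the diagonal order described under "Diagonal flow (switch signal)".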

input logic sys_switch_in;  // Copies weight from shadow buffer to active buffer

The switch signal propagates through the array:

PE11: Direct from input
PE12, PE21: From PE11’s pe_switch_out
PE22: From PE12’s pe_switch_out

Source: systolic.sv:85-86, systolic.sv:125
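The load-then-switch behavior can be modeled with two registers per PE. The class below is a hypothetical sketch, not the PE's actual RTL: the names `WeightRegs`, `shadow`, and `active` are illustrative, and the behavior when a load and a switch coincide is an assumption (both sample the values from before the clock edge).

```python
class WeightRegs:
    """Hypothetical model of one PE's shadow and active weight registers."""

    def __init__(self):
        self.shadow = 0  # written while the column's accept signal is high
        self.active = 0  # the weight the multiplier actually uses

    def clock(self, accept_w=False, weight_in=0, switch=False):
        # Both updates read the pre-edge values, as registers would.
        next_active = self.shadow if switch else self.active
        next_shadow = weight_in if accept_w else self.shadow
        self.active, self.shadow = next_active, next_shadow


regs = WeightRegs()
regs.clock(accept_w=True, weight_in=5)   # load: shadow=5, active unchanged
regs.clock(switch=True)                  # activate: active=5
regs.clock(accept_w=True, weight_in=9)   # preload next weight while 5 is in use
print(regs.shadow, regs.active)          # 9 5
```

Keeping the two registers separate is what lets the next set of weights be loaded while the current set is still being used for computation.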

Column enable control

The array supports dynamic column disabling for matrices narrower than the array width:

logic [1:0] pe_enabled;  // Bit mask for enabled columns

always@(posedge clk or posedge rst) begin
    if(rst) begin
        pe_enabled <= '0;
    end else begin
        if(ub_rd_col_size_valid_in) begin
            pe_enabled <= (1 << ub_rd_col_size_in) - 1;
        end
    end
end

Source: systolic.sv:136-144

Examples:

ub_rd_col_size_in = 0: pe_enabled = 2'b00 (no columns)
ub_rd_col_size_in = 1: pe_enabled = 2'b01 (column 1 only, bit 0)
ub_rd_col_size_in = 2: pe_enabled = 2'b11 (both columns)

Column enabling allows the same hardware to efficiently handle matrices of different widths without wasting computation or producing incorrect results.
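The mask arithmetic is easy to check outside the simulator. The helper below mirrors the expression `(1 << ub_rd_col_size_in) - 1` and truncates the result to the two bits of pe_enabled; `pe_enabled_mask` and its `width` argument are hypothetical names used only for this illustration.

```python
def pe_enabled_mask(col_size, width=2):
    """Return (1 << col_size) - 1, truncated to `width` bits."""
    return ((1 << col_size) - 1) & ((1 << width) - 1)


for n in range(3):
    print(n, format(pe_enabled_mask(n), "02b"))
# 0 00
# 1 01
# 2 11
```

Each increment of the column size sets one more low-order bit, so enabled columns are always contiguous starting from bit 0.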

Timing and staggering

Why staggering?

For correct matrix multiplication, inputs must arrive at each PE at the right time. Consider computing C = A × B:

A = [a11 a12]    B = [b11 b12]    C = [c11 c12]
    [a21 a22]        [b21 b22]        [c21 c22]

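To compute c11 = a11×b11 + a12×b21 in column 1, with PE11 holding b11 and PE21 holding b21:

PE11 computes a11×b11 at time T
PE21 adds a12×b21 to that partial sum at time T+1

The partial sum takes one cycle to move down a row, and inputs take one cycle to move across a column:

PE11 receives a11 at time T
PE12 receives a11 at time T+1

Solution: delay each row's input stream by one cycle more than the row above it, so that every input meets its partial sum in the same cycle.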

Input staggering

The unified buffer staggers inputs automatically:

Row 1 inputs start at time T
Row 2 inputs start at time T+1
Row 3 inputs start at time T+2
…

This ensures each PE receives the correct input at the correct time.

The unified buffer must implement proper staggering logic. See the unified buffer documentation for details on how rd_input_time_counter controls this.
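As an illustration of the staggered pattern only (not of how the unified buffer or rd_input_time_counter is implemented), the following sketch delays row r by r cycles. The function name `stagger` and the choice to pad idle slots with zeros are assumptions; real hardware may simply hold the valid signal low instead.

```python
def stagger(rows, pad=0):
    """Delay row r by r cycles, padding so all streams have equal length."""
    n = len(rows)
    return [[pad] * r + list(row) + [pad] * (n - 1 - r)
            for r, row in enumerate(rows)]


streams = stagger([[10, 11], [20, 21]])
for stream in streams:
    print(stream)
# [10, 11, 0]
# [0, 20, 21]
```

Reading the result column by column gives what enters the left edge of the array on each cycle: only row 1 is active on the first cycle, and row 2 joins one cycle later.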

Valid signal propagation

Valid signals propagate through the array to indicate when outputs are meaningful:

wire pe_valid_out_11;  // PE11 → PE12 and PE21
wire pe_valid_out_12;  // PE12 → PE22

output wire sys_valid_out_21;  // From PE21
output wire sys_valid_out_22;  // From PE22

Source: systolic.sv:50-51

Valid signals follow the same paths as data:

Horizontally: PE11 → PE12
Vertically: PE11 → PE21 → output
Horizontally then vertically: PE12 → PE22 → output

Matrix multiplication example

Computing a 2×2 matrix multiplication:

C = A × B
where A = [1 2], B = [5 6]
          [3 4]      [7 8]

Setup phase (cycle 0-1):

Load weights (B, one column of B per array column, so that partial sums flowing down a column accumulate one column of C):
  PE11 ← 5, PE12 ← 6
  PE21 ← 7, PE22 ← 8
Assert sys_switch_in to activate

Computation phase:

Cycle 2: Input a11=1 enters PE11
Cycle 3: Input a11=1 enters PE12, a21=3 enters PE11, a12=2 enters PE21
Cycle 4: Input a21=3 enters PE12, a22=4 enters PE21, a12=2 enters PE22
Cycle 5: Input a22=4 enters PE22; results are emerging from PE21 and PE22

Output phase:

sys_data_out_21 = (1×5) + (2×7) = 19  (c11)
sys_data_out_22 = (1×6) + (2×8) = 22  (c12)
(next cycle)
sys_data_out_21 = (3×5) + (4×7) = 43  (c21)
sys_data_out_22 = (3×6) + (4×8) = 50  (c22)
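The example can be reproduced with a small cycle-by-cycle model. This is a sketch under stated assumptions, not a translation of systolic.sv: every PE registers its activation, partial sum, and valid outputs for one cycle; weights are already active; values are plain integers; array row r is fed column r of A; and PE(r, c) holds b(r, c). The function `simulate` and its arguments are hypothetical names.

```python
def simulate(weights, row_streams):
    """Cycle-by-cycle model of an n x n weight-stationary systolic array.

    weights[r][c]:  active weight held by the PE in row r, column c
    row_streams[r]: unstaggered inputs for array row r (equal lengths)
    Returns one list per column with the values leaving the bottom row.
    """
    n = len(weights)
    length = len(row_streams[0])

    def grid(fill):
        return [[fill] * n for _ in range(n)]

    x_out, p_out, v_out = grid(0), grid(0), grid(False)  # PE output registers
    outputs = [[] for _ in range(n)]

    for t in range(length + 2 * n):
        # Bottom-row registers hold results computed on earlier clock edges.
        for c in range(n):
            if v_out[n - 1][c]:
                outputs[c].append(p_out[n - 1][c])

        nx, npsum, nv = grid(0), grid(0), grid(False)
        for r in range(n):
            for c in range(n):
                if c == 0:
                    k = t - r  # row r is fed r cycles late (staggering)
                    x_in = row_streams[r][k] if 0 <= k < length else 0
                else:
                    x_in = x_out[r][c - 1]  # activation from the left
                p_in = p_out[r - 1][c] if r > 0 else 0  # partial sum from above
                if r > 0:
                    v_in = v_out[r - 1][c]      # valid from the PE above
                elif c > 0:
                    v_in = v_out[0][c - 1]      # top row: valid from the left
                else:
                    v_in = 0 <= t < length      # PE11: driven by the start signal
                nx[r][c] = x_in
                npsum[r][c] = p_in + x_in * weights[r][c]
                nv[r][c] = v_in
        x_out, p_out, v_out = nx, npsum, nv
    return outputs


cols = simulate(weights=[[5, 6], [7, 8]], row_streams=[[1, 3], [2, 4]])
print(cols)  # [[19, 43], [22, 50]]
```

The first list is what leaves column 1 (c11, then c21) and the second is what leaves column 2 (c12, then c22). In this model each column-2 result appears one cycle after the matching column-1 result, because activations need one cycle to cross a PE.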

Scalability

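The current implementation uses SYSTOLIC_ARRAY_WIDTH = 2. The ports and PE instances are written out individually, so changing the parameter alone does not resize the array, but the design can scale: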

parameter int SYSTOLIC_ARRAY_WIDTH = 2  // 2×2 = 4 PEs

To scale to larger arrays:

Increase SYSTOLIC_ARRAY_WIDTH parameter
Add more PE instantiations
Update interconnect wiring
Adjust unified buffer dimensions

Scaling to 256×256 (65,536 PEs) is mentioned as a future goal, which would provide massive parallelism for neural network acceleration.
