The systolic array module implements a 2×2 grid of processing elements (PEs) that perform matrix multiplication using a systolic dataflow pattern. Data flows through the array in a wave-like fashion, with inputs entering from the left and weights from the top.
Module declaration
    module systolic #(
        parameter int SYSTOLIC_ARRAY_WIDTH = 2
    )(
        input logic clk,
        input logic rst,

        // Left inputs
        input logic [15:0] sys_data_in_11,
        input logic [15:0] sys_data_in_21,
        input logic sys_start,

        // Right outputs
        output logic [15:0] sys_data_out_21,
        output logic [15:0] sys_data_out_22,
        output wire sys_valid_out_21,
        output wire sys_valid_out_22,

        // Top inputs
        input logic [15:0] sys_weight_in_11,
        input logic [15:0] sys_weight_in_12,
        input logic sys_accept_w_1,
        input logic sys_accept_w_2,
        input logic sys_switch_in,

        // Column enable
        input logic [15:0] ub_rd_col_size_in,
        input logic ub_rd_col_size_valid_in
    );
Parameters
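| Parameter | Type | Default | Description |
|---|---|---|---|
| SYSTOLIC_ARRAY_WIDTH | int | 2 | Width of the systolic array (number of PEs per row/column) |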
Input ports
Left edge inputs (data)
| Port | Width | Description |
|---|---|---|
| sys_data_in_11 | [15:0] | Input data for row 1 (enters PE at position [1,1]) |
| sys_data_in_21 | [15:0] | Input data for row 2 (enters PE at position [2,1]) |
| sys_start | 1 | Start signal (valid) for input data, propagates left-to-right in row 1 |
Top edge inputs (weights)
| Port | Width | Description |
|---|---|---|
| sys_weight_in_11 | [15:0] | Weight input for column 1 (enters PE at position [1,1]) |
| sys_weight_in_12 | [15:0] | Weight input for column 2 (enters PE at position [1,2]) |
| sys_accept_w_1 | 1 | Accept weight signal for column 1, propagates top-to-bottom |
| sys_accept_w_2 | 1 | Accept weight signal for column 2, propagates top-to-bottom |
Control signals
| Port | Width | Description |
|---|---|---|
| sys_switch_in | 1 | Switch signal to activate preloaded weights, propagates diagonally |
| ub_rd_col_size_in | [15:0] | Number of columns to enable (1 or 2) |
| ub_rd_col_size_valid_in | 1 | Valid signal for column size |
Output ports
Bottom edge outputs (partial sums)
| Port | Width | Description |
|---|---|---|
| sys_data_out_21 | [15:0] | Accumulated result from PE [2,1] (bottom-left) |
| sys_data_out_22 | [15:0] | Accumulated result from PE [2,2] (bottom-right) |
| sys_valid_out_21 | 1 | Valid signal for sys_data_out_21 |
| sys_valid_out_22 | 1 | Valid signal for sys_data_out_22 |
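The port list above can be wired up as in the following minimal instantiation sketch. The enclosing module name `systolic_inst_sketch` and the instance name `dut` are hypothetical; only the parameter and port names are taken from the module declaration. It assumes `systolic.sv` and the PE source it depends on are compiled alongside it.

```systemverilog
// Hypothetical wrapper for illustration only; not part of the repository.
module systolic_inst_sketch;
    logic        clk, rst;

    // Left edge: input data and start/valid
    logic [15:0] sys_data_in_11, sys_data_in_21;
    logic        sys_start;

    // Top edge: weights and weight-load control
    logic [15:0] sys_weight_in_11, sys_weight_in_12;
    logic        sys_accept_w_1, sys_accept_w_2;
    logic        sys_switch_in;

    // Column enable
    logic [15:0] ub_rd_col_size_in;
    logic        ub_rd_col_size_valid_in;

    // Outputs from the bottom row of PEs
    logic [15:0] sys_data_out_21, sys_data_out_22;
    wire         sys_valid_out_21, sys_valid_out_22;

    // Local signal names match the port names, so the implicit .* connection applies.
    systolic #(
        .SYSTOLIC_ARRAY_WIDTH(2)
    ) dut (.*);
endmodule
```

Using `.*` keeps the sketch short; explicit named connections would work equally well and make mismatches easier to spot in review.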
Architecture
PE grid layout
                   weight_in_11  weight_in_12
                         ↓             ↓
    data_in_11 → [PE 1,1] ───→ [PE 1,2]
                         ↓             ↓
    data_in_21 → [PE 2,1] ───→ [PE 2,2]
                         ↓             ↓
                    data_out_21   data_out_22
Dataflow pattern
- Inputs flow from left to right across each row
- Weights flow from top to bottom down each column
- Partial sums flow from top to bottom down each column
- Valid signals propagate with the data
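The dataflow pattern can be illustrated with a small behavioral sketch. Everything in it is an assumption made for illustration: `pe_sketch` is a hypothetical, simplified PE that holds a fixed weight, uses plain unsigned integer arithmetic, and omits the valid, weight-loading, and switch logic. It is not the repository's `pe` module. The testbench wires four of these into a 2×2 grid and feeds row 2 one cycle after row 1.

```systemverilog
// Hypothetical simplified PE: forwards its input to the right and its
// multiply-accumulate result downward, both through one register stage.
module pe_sketch (
    input  logic        clk,
    input  logic        rst,
    input  logic [15:0] weight,    // stationary weight (assumed already active)
    input  logic [15:0] input_in,  // input from the left
    input  logic [15:0] psum_in,   // partial sum from above
    output logic [15:0] input_out, // input passed to the right
    output logic [15:0] psum_out   // partial sum passed downward
);
    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            input_out <= '0;
            psum_out  <= '0;
        end else begin
            input_out <= input_in;
            psum_out  <= psum_in + input_in * weight;
        end
    end
endmodule

module dataflow_sketch_tb;
    logic clk = 0, rst = 1;
    logic [15:0] x1 = '0, x2 = '0;        // left-edge inputs for rows 1 and 2
    logic [15:0] in11, in12, in21, in22;  // inputs forwarded along each row
    logic [15:0] ps11, ps12, out1, out2;  // partial sums flowing down each column

    always #5 clk = ~clk;

    // Example weights: row 1 = {1, 2}, row 2 = {3, 4}
    pe_sketch pe11 (.clk(clk), .rst(rst), .weight(16'd1), .input_in(x1),
                    .psum_in(16'd0), .input_out(in11), .psum_out(ps11));
    pe_sketch pe12 (.clk(clk), .rst(rst), .weight(16'd2), .input_in(in11),
                    .psum_in(16'd0), .input_out(in12), .psum_out(ps12));
    pe_sketch pe21 (.clk(clk), .rst(rst), .weight(16'd3), .input_in(x2),
                    .psum_in(ps11), .input_out(in21), .psum_out(out1));
    pe_sketch pe22 (.clk(clk), .rst(rst), .weight(16'd4), .input_in(in21),
                    .psum_in(ps12), .input_out(in22), .psum_out(out2));

    initial begin
        @(negedge clk) rst = 0;
        x1 = 16'd5;                 // row 1 input enters first
        @(negedge clk) x1 = '0;
        x2 = 16'd6;                 // row 2 input enters one cycle later
        @(negedge clk) x2 = '0;
        // Column 1 result after two clock edges: 5*1 + 6*3 = 23
        assert (out1 == 16'd23) else $error("column 1 mismatch: %0d", out1);
        @(negedge clk);
        // Column 2 result after three clock edges: 5*2 + 6*4 = 34
        assert (out2 == 16'd34) else $error("column 2 mismatch: %0d", out2);
        $display("dataflow sketch finished");
        $finish;
    end
endmodule
```

In this sketch the column 2 result arrives one cycle after column 1 because the input has to cross PE [1,1] before it reaches PE [1,2].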
PE interconnections
From https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/systolic.sv:56-134:
    // PE [1,1] - top-left
    pe pe11 (
        .pe_psum_in(16'b0),               // Top row starts with 0
        .pe_input_in(sys_data_in_11),     // Input from left edge
        .pe_valid_in(sys_start),          // Start signal
        .pe_weight_in(sys_weight_in_11),  // Weight from top edge
        .pe_input_out(pe_input_out_11),   // → PE [1,2]
        .pe_psum_out(pe_psum_out_11),     // → PE [2,1]
        .pe_weight_out(pe_weight_out_11)  // → PE [2,1]
    );

    // PE [2,1] - bottom-left
    pe pe21 (
        .pe_psum_in(pe_psum_out_11),      // Accumulate from PE [1,1]
        .pe_weight_in(pe_weight_out_11),  // Weight from PE [1,1]
        .pe_psum_out(sys_data_out_21)     // → Output
    );

    // Similar for PE [1,2] and PE [2,2]
Operation modes
Weight preloading
- Assert sys_accept_w_1 and/or sys_accept_w_2
- Drive weights on the sys_weight_in_* ports
- Weights load into shadow buffers column-by-column
- Weights propagate down each column to all PEs
Weight activation
- Assert sys_switch_in high for one cycle
- All PEs switch from shadow to active weight registers
- Switch signal propagates diagonally through the array
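The preload-then-activate scheme is a form of double buffering, sketched below under stated assumptions. `weight_buffer_sketch` and its port names are hypothetical and model a single weight slot, not the repository's PE. The sketch registers the switch, so the active weight changes on the following clock edge, whereas the timing list on this page describes the actual switch as combinational.

```systemverilog
// Hypothetical double-buffered weight slot: a shadow register that is loaded
// in the background and an active register used for computation.
module weight_buffer_sketch (
    input  logic        clk,
    input  logic        rst,
    input  logic        accept_w,      // load the shadow register
    input  logic        switch_in,     // copy shadow into active
    input  logic [15:0] weight_in,
    output logic [15:0] weight_shadow,
    output logic [15:0] weight_active
);
    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            weight_shadow <= '0;
            weight_active <= '0;
        end else begin
            if (accept_w)  weight_shadow <= weight_in;
            if (switch_in) weight_active <= weight_shadow;
        end
    end
endmodule

module weight_buffer_sketch_tb;
    logic clk = 0, rst = 1;
    logic accept_w = 0, switch_in = 0;
    logic [15:0] weight_in = '0;
    logic [15:0] weight_shadow, weight_active;

    always #5 clk = ~clk;

    weight_buffer_sketch dut (.*);

    initial begin
        @(negedge clk) rst = 0;
        // Preload: only the shadow register changes.
        accept_w  = 1;
        weight_in = 16'd7;
        @(negedge clk) accept_w = 0;
        assert (weight_shadow == 16'd7 && weight_active == 16'd0)
            else $error("preload should not disturb the active weight");
        // Activate: a one-cycle switch pulse moves the shadow value into use.
        switch_in = 1;
        @(negedge clk) switch_in = 0;
        assert (weight_active == 16'd7)
            else $error("switch should activate the preloaded weight");
        $display("weight buffer sketch finished");
        $finish;
    end
endmodule
```

Keeping two registers lets the next set of weights load while the current set is still being used, so computation only pauses for the switch itself.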
Matrix multiplication
- Drive input activations on the sys_data_in_* ports
- Assert sys_start to begin computation
- Results appear at sys_data_out_* after the propagation delay
- For a 2×2 array, the output appears after 3 clock cycles
Dynamic column sizing
The array supports disabling columns for smaller matrices:
    always @(posedge clk or posedge rst) begin
        if (rst) begin
            pe_enabled <= '0; // reset value assumed here: all columns disabled
        end else if (ub_rd_col_size_valid_in) begin
            pe_enabled <= (1 << ub_rd_col_size_in) - 1;
        end
    end
- ub_rd_col_size_in = 1: Only column 1 enabled (pe_enabled = 2'b01)
- ub_rd_col_size_in = 2: Both columns enabled (pe_enabled = 2'b11)
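The mask arithmetic can be checked in isolation with a small sketch. The function `col_mask` and the testbench name are hypothetical; the expression is the one from the excerpt above, and the 2-bit result width is an assumption based on the pe_enabled values listed.

```systemverilog
module col_mask_sketch_tb;
    // Hypothetical helper: (1 << n) - 1 sets the n least significant bits,
    // one bit per enabled column. The result is truncated to 2 bits.
    function automatic logic [1:0] col_mask(input logic [15:0] col_size);
        logic [31:0] full;
        full = (32'd1 << col_size) - 32'd1;
        return full[1:0];
    endfunction

    initial begin
        assert (col_mask(16'd1) == 2'b01) else $error("size 1 should enable column 1 only");
        assert (col_mask(16'd2) == 2'b11) else $error("size 2 should enable both columns");
        // By the same arithmetic, a size of 0 yields an all-zero mask.
        assert (col_mask(16'd0) == 2'b00) else $error("size 0 should enable no columns");
        $display("column mask sketch finished");
        $finish;
    end
endmodule
```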
Timing example
For a 2×2 matrix multiplication A × B:
| Cycle | Input | PE activity | Output |
|---|---|---|---|
| 0 | A[0,0] | PE11: A[0,0]×B[0,0] | - |
| 1 | A[1,0], A[0,1] | PE11: A[0,1]×B[0,1]; PE21: A[1,0]×B[0,0] | - |
| 2 | A[1,1] | PE21: A[1,0]×B[0,1]; PE22: A[1,1]×B[1,1] | C[0,0] |
| 3 | - | - | C[1,0], C[1,1] |
Signal propagation delays
- Input to output latency: 3 clock cycles (for 2×2 array)
- Weight loading: 1 cycle per row
- Weight switching: Combinational (0 cycles)
Related modules
- PE: Processing element implementation
- TPU: Top-level integration
- Unified Buffer: Data source
Testing
See test files: