The Processing Element (PE) is the basic computational building block of the systolic array. Each PE performs multiply-accumulate (MAC) operations using a weight stored in local registers and input data flowing from the west.

Module declaration

module pe #(
    parameter int DATA_WIDTH = 16
) (
    input logic clk,
    input logic rst,
    // North ports
    input logic signed [15:0] pe_psum_in,
    input logic signed [15:0] pe_weight_in,
    input logic pe_accept_w_in,
    // West ports
    input logic signed [15:0] pe_input_in,
    input logic pe_valid_in,
    input logic pe_switch_in,
    input logic pe_enabled,
    // South ports
    output logic signed [15:0] pe_psum_out,
    output logic signed [15:0] pe_weight_out,
    // East ports
    output logic signed [15:0] pe_input_out,
    output logic pe_valid_out,
    output logic pe_switch_out
);

Parameters

DATA_WIDTH (int, default: 16)
Bit width for data paths. Currently unused in the implementation: the port declarations hard-code [15:0] rather than referencing this parameter.

Input ports

North ports (from PE above or top of array)

Port            Width          Description
pe_psum_in      signed [15:0]  Partial sum input from north
pe_weight_in    signed [15:0]  Weight value to load into shadow register
pe_accept_w_in  1              Accept weight signal - loads pe_weight_in into shadow register

West ports (from PE to the left or left edge of array)

Port          Width          Description
pe_input_in   signed [15:0]  Input activation value
pe_valid_in   1              Valid signal indicating input data is ready
pe_switch_in  1              Switch signal to copy shadow weight to active register
pe_enabled    1              Enable signal - when low, PE outputs zeros

Output ports

South ports (to PE below or bottom of array)

Port           Width          Description
pe_psum_out    signed [15:0]  Partial sum output = pe_psum_in + (pe_input_in × weight_reg_active)
pe_weight_out  signed [15:0]  Weight value forwarded to PE below

East ports (to PE to the right or right edge of array)

Port           Width          Description
pe_input_out   signed [15:0]  Input value forwarded to next PE
pe_valid_out   1              Valid signal forwarded to next PE
pe_switch_out  1              Switch signal forwarded to next PE

Architecture

Weight buffering

The PE uses a double-buffering scheme for weights:
  • Active register (weight_reg_active): Used for current computation
  • Shadow register (weight_reg_inactive): Holds next set of weights
This allows preloading weights while computation proceeds with the current weights.
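As a rough behavioral sketch of this scheme (a Python software model, not the RTL), the class below keeps a shadow value and an active value. The class and method names are hypothetical; only weight_reg_active and weight_reg_inactive come from the description above.

```python
class WeightBuffer:
    """Behavioral model of a double-buffered weight register pair (hypothetical helper)."""

    def __init__(self):
        self.weight_reg_active = 0    # used by the current computation
        self.weight_reg_inactive = 0  # shadow register holding the next weight

    def load_shadow(self, weight):
        # Models pe_accept_w_in: only the shadow register changes.
        self.weight_reg_inactive = weight

    def switch(self):
        # Models pe_switch_in: the shadow value becomes the active weight.
        self.weight_reg_active = self.weight_reg_inactive


buf = WeightBuffer()
buf.load_shadow(3)
buf.switch()                          # weight 3 is now active
buf.load_shadow(7)                    # preload the next weight...
still_active = buf.weight_reg_active  # ...while 3 is still in use
buf.switch()
now_active = buf.weight_reg_active    # 7 takes over only after the switch
```

Loading the shadow register never disturbs the active weight, which is what lets the next set of weights stream in during computation.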

MAC operation

The PE performs a multiply-accumulate operation:
  1. Multiply: mult_out = pe_input_in × weight_reg_active
  2. Accumulate: mac_out = mult_out + pe_psum_in
  3. Output: pe_psum_out = mac_out (when pe_valid_in is high)
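The three steps can be sketched in Python as follows. This is an illustrative model that uses plain integers and ignores fixed-point scaling, 16-bit overflow, and clock timing; the function names are hypothetical. Chaining the partial sum from north to south through a column of PEs yields a dot product.

```python
def mac(pe_input_in, weight_reg_active, pe_psum_in):
    """One PE's multiply-accumulate, using plain integers for clarity."""
    mult_out = pe_input_in * weight_reg_active  # step 1: multiply
    mac_out = mult_out + pe_psum_in             # step 2: accumulate
    return mac_out                              # step 3: value driven on pe_psum_out


def column_psum(inputs, weights):
    """Chain PEs north to south: each pe_psum_out feeds the next pe_psum_in."""
    psum = 0  # the top PE receives a zero partial sum
    for x, w in zip(inputs, weights):
        psum = mac(x, w, psum)
    return psum


result = column_psum([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6
```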

Fixed-point arithmetic

All operations use Q8.8 fixed-point format (8 integer bits, 8 fractional bits): each value is a signed 16-bit word whose raw integer v represents the real number v / 256.
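A minimal Python sketch of Q8.8 arithmetic is shown below. The helper names are hypothetical, and two behaviors are assumptions not stated on this page: results wrap on 16-bit overflow, and the product is truncated by an arithmetic right shift. The actual hardware may round or saturate differently.

```python
FRAC_BITS = 8  # Q8.8: 8 fractional bits


def wrap16(value):
    """Wrap an integer into the signed 16-bit range (assumed overflow behavior)."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def to_q8_8(x):
    """Encode a real number as a raw Q8.8 integer."""
    return wrap16(round(x * (1 << FRAC_BITS)))


def from_q8_8(raw):
    """Decode a raw Q8.8 integer back to a float."""
    return raw / (1 << FRAC_BITS)


def q8_8_mul(a, b):
    """Multiply two Q8.8 values; the Q16.16 product is shifted back to Q8.8."""
    return wrap16((a * b) >> FRAC_BITS)  # assumed truncation via arithmetic shift


def q8_8_mac(x, w, psum):
    """Fixed-point multiply-accumulate on raw Q8.8 integers."""
    return wrap16(q8_8_mul(x, w) + psum)


raw = q8_8_mac(to_q8_8(1.5), to_q8_8(2.0), to_q8_8(0.25))
value = from_q8_8(raw)  # 1.5 * 2.0 + 0.25
```

The shift after the multiply matters: multiplying two values that each carry 8 fractional bits produces 16 fractional bits, so the product must be scaled back before it is added to the partial sum.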

Signal flow diagram

        North
          ↓
      [psum_in]
     [weight_in]
   [accept_w_in]
          |
 West → [PE] → East
 [input]  |  [input]
 [valid]  |  [valid]
[switch]  |  [switch]
          ↓
       [psum_out]
     [weight_out]
        South

Operation sequence

Weight loading phase

  1. Assert pe_accept_w_in high
  2. Drive pe_weight_in with weight value
  3. On clock edge, weight loads into weight_reg_inactive
  4. Weight propagates to pe_weight_out for PE below

Weight switching phase

  1. Assert pe_switch_in high
  2. Combinationally (same cycle), weight_reg_active ← weight_reg_inactive
  3. Switch signal propagates east

Computation phase

  1. Drive pe_input_in with activation value
  2. Assert pe_valid_in high
  3. On clock edge:
    • PE computes MAC operation
    • pe_psum_out = pe_psum_in + (pe_input_in × weight_reg_active)
    • Input and valid signals propagate east
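The three phases can be walked through with a small cycle-level model. This is a Python sketch under stated assumptions, not the RTL: it uses plain integers instead of Q8.8, omits rst and pe_enabled, clears pe_psum_out when pe_valid_in is low, and applies a switch before a load if both arrive in the same cycle. The class name PE and the method clock are hypothetical.

```python
class PE:
    """Cycle-level behavioral sketch of one PE (integers, no fixed-point scaling)."""

    def __init__(self):
        self.weight_reg_active = 0
        self.weight_reg_inactive = 0
        # Registered outputs, updated once per clock edge.
        self.pe_psum_out = 0
        self.pe_weight_out = 0
        self.pe_input_out = 0
        self.pe_valid_out = 0
        self.pe_switch_out = 0

    def clock(self, pe_psum_in=0, pe_weight_in=0, pe_accept_w_in=0,
              pe_input_in=0, pe_valid_in=0, pe_switch_in=0):
        """Advance one clock edge with the given input values."""
        # Weight switching: modeled as taking effect in the same cycle.
        if pe_switch_in:
            self.weight_reg_active = self.weight_reg_inactive
        # Computation: the MAC result is registered when the input is valid.
        if pe_valid_in:
            self.pe_psum_out = pe_psum_in + pe_input_in * self.weight_reg_active
            self.pe_input_out = pe_input_in
        else:
            # Assumption: the partial sum output clears without a valid input.
            self.pe_psum_out = 0
        # Weight loading: the shadow register captures pe_weight_in.
        if pe_accept_w_in:
            self.weight_reg_inactive = pe_weight_in
            self.pe_weight_out = pe_weight_in
        # Control signals are forwarded east.
        self.pe_valid_out = pe_valid_in
        self.pe_switch_out = pe_switch_in


pe = PE()
pe.clock(pe_weight_in=5, pe_accept_w_in=1)             # weight loading phase
loaded = (pe.weight_reg_inactive, pe.weight_reg_active, pe.pe_weight_out)
pe.clock(pe_switch_in=1)                               # weight switching phase
pe.clock(pe_input_in=3, pe_valid_in=1, pe_psum_in=10)  # computation phase
```

After the first call the weight sits only in the shadow register, and it reaches the MAC only once the switch has been applied.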

Example instantiation

From src/systolic.sv, lines 56-74 (https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/systolic.sv):
pe pe11 (
    .clk(clk),
    .rst(rst),
    .pe_enabled(pe_enabled[0]),
    .pe_valid_in(sys_start),
    .pe_valid_out(pe_valid_out_11),
    .pe_accept_w_in(sys_accept_w_1),
    .pe_switch_in(sys_switch_in),
    .pe_switch_out(pe_switch_out_11),
    .pe_input_in(sys_data_in_11),
    .pe_psum_in(16'b0),
    .pe_weight_in(sys_weight_in_11),
    .pe_input_out(pe_input_out_11),
    .pe_psum_out(pe_psum_out_11),
    .pe_weight_out(pe_weight_out_11)
);

Timing behavior

  • Weight loading: Registered (sequential, 1 cycle)
  • Weight switching: Combinational (0 cycles)
  • MAC operation: Registered (sequential, 1 cycle)
  • Signal propagation: Registered (sequential, 1 cycle)
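Because signal propagation is registered, a value presented at the west edge moves one PE further east on each clock edge. The Python sketch below illustrates that skew for a single row; the function name simulate_row and the use of None for "no data yet" are illustrative choices, not part of the design.

```python
def simulate_row(inputs, num_pes):
    """Shift values east through num_pes registered stages, one stage per cycle.

    Returns one entry per cycle: the pe_input_out of every PE after that
    cycle's clock edge (None marks a stage that holds no data).
    """
    stages = [None] * num_pes
    history = []
    for value in inputs:
        # Each PE registers what its west neighbor presented before the edge.
        stages = [value] + stages[:-1]
        history.append(list(stages))
    return history


history = simulate_row(["a", "b", "c", None, None], num_pes=3)
# history[0] is ["a", None, None]; "a" reaches the third PE's output at history[2]
```

In this model a value fed in at cycle t appears on the output of PE k (counting from 0) after the edge at cycle t + k, which is why inputs to a systolic array are typically staggered across rows.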

Testing

See test files:
