Module interface
Computational core
Multiply-accumulate operation
The PE performs the fundamental multiply-accumulate (MAC) operation every clock cycle: mac_out = (pe_input_in × weight_reg_active) + pe_psum_in. The fixed-point multiplier and adder operate combinationally within the same clock cycle, and the result is registered on the next clock edge.
Weight storage
Dual-register design
Each PE contains two weight registers to support weight preloading:
- Active register (weight_reg_active): Used in MAC operations
- Inactive register (weight_reg_inactive): Receives new weights in the background
Weight loading mechanism
Weights are loaded in two phases:
- Preload phase: Weights flow through the array when pe_accept_w_in is high
- Switch phase: When pe_switch_in goes high, inactive weights become active
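The two phases can be illustrated with a small behavioral sketch in Python. This is not the RTL: the input name pe_weight_in is hypothetical (the weight port is not named here), and the model assumes that when both control signals are high on the same edge, the old inactive weight is promoted while the new one is captured.

```python
class WeightRegs:
    """Behavioral sketch of the PE's two weight registers (not the RTL)."""

    def __init__(self):
        self.weight_reg_active = 0
        self.weight_reg_inactive = 0

    def clock(self, pe_weight_in, pe_accept_w_in, pe_switch_in):
        """Advance one clock edge and return the weight the MAC would use."""
        if pe_switch_in:
            # Switch phase: the preloaded weight becomes the active one.
            self.weight_reg_active = self.weight_reg_inactive
        if pe_accept_w_in:
            # Preload phase: capture the weight arriving from the north.
            # pe_weight_in is a hypothetical port name used for illustration.
            self.weight_reg_inactive = pe_weight_in
        return self.weight_reg_active


regs = WeightRegs()
regs.clock(pe_weight_in=5, pe_accept_w_in=True, pe_switch_in=False)  # preload
print(regs.clock(pe_weight_in=0, pe_accept_w_in=False, pe_switch_in=True))  # 5
```

Because the active register is untouched during preload, computation with the old weight can continue while the next weight is being loaded.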
Data flow directions
The PE has four directional interfaces corresponding to compass directions:
North (input)
- Partial sums flow down from the PE above
- Weights propagate down during preload phase
- Accept signal enables weight reception
West (input)
- Input activations flow right across the array
- Valid signal indicates valid data
- Switch signal triggers weight activation
- Enable signal controls PE operation
South (output)
- Computed partial sums flow down to the next PE
- Weights continue propagating during preload
East (output)
- Input activations continue flowing right
- Valid signal propagates with data
- Switch signal propagates diagonally
Control signals
pe_enabled
Controls whether the PE participates in computation:
- Used to disable columns when matrix width < array width
- Set based on ub_rd_col_size from the unified buffer
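As a rough illustration of how per-column enables might be derived, the sketch below assumes that the first ub_rd_col_size columns are active and the rest are disabled. The helper name column_enables and the array_width parameter are hypothetical, and the exact mapping in the hardware may differ.

```python
def column_enables(ub_rd_col_size, array_width):
    """Return one pe_enabled flag per column (assumed mapping, see lead-in)."""
    # Columns at or beyond the matrix width carry no data and are disabled.
    return [col < ub_rd_col_size for col in range(array_width)]


print(column_enables(ub_rd_col_size=2, array_width=4))
# [True, True, False, False]
```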
pe_valid_in
Indicates when input data is valid:
- Only perform MAC when pe_valid_in is high
- Gates the output partial sum and propagated input
pe_switch_in
Triggers the weight register swap:
- Combinational logic for same-cycle activation
- Allows inputs to load on the same cycle as switch
- Propagates diagonally through the array (top-left to bottom-right)
Pipeline behavior
The PE operates with a one-cycle latency:
- Cycle N: Inputs arrive at PE
  - pe_input_in contains activation value
  - pe_psum_in contains partial sum from above
  - pe_valid_in indicates valid data
- Cycle N (combinational): MAC operation occurs
  - Multiplication: pe_input_in × weight_reg_active
  - Addition: mult_out + pe_psum_in
  - Result available at mac_out
- Cycle N+1: Outputs registered
  - pe_psum_out ← mac_out
  - pe_input_out ← pe_input_in
  - pe_valid_out ← pe_valid_in
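The timing above can be mimicked with a minimal Python model in which one call to clock() stands for one clock edge. It is a sketch, not the RTL: it uses plain integers instead of Q8.8 values, takes the active weight as already loaded, and assumes that a low pe_valid_in forces the registered outputs to zero (the text only says the outputs are gated).

```python
class PE:
    """One-cycle-latency behavioral model of the PE data path (not the RTL)."""

    def __init__(self, weight):
        # Assumed to be preloaded and switched in before computation starts.
        self.weight_reg_active = weight
        # Registered outputs, all zero after reset.
        self.pe_psum_out = 0
        self.pe_input_out = 0
        self.pe_valid_out = False

    def clock(self, pe_input_in, pe_psum_in, pe_valid_in):
        # Cycle N (combinational): multiply, then add the incoming partial sum.
        mult_out = pe_input_in * self.weight_reg_active
        mac_out = mult_out + pe_psum_in
        # Clock edge into cycle N+1: register the results.
        # Assumption: invalid data zeroes the outputs rather than holding them.
        self.pe_psum_out = mac_out if pe_valid_in else 0
        self.pe_input_out = pe_input_in if pe_valid_in else 0
        self.pe_valid_out = bool(pe_valid_in)


pe = PE(weight=3)
pe.clock(pe_input_in=2, pe_psum_in=10, pe_valid_in=True)
print(pe.pe_psum_out, pe.pe_input_out, pe.pe_valid_out)  # 16 2 True
```

The outputs only change inside clock(), which is what gives the model its one-cycle latency: values presented in cycle N are visible on the outputs in cycle N+1.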
Fixed-point arithmetic
All PE arithmetic uses Q8.8 fixed-point format:
- Sign bit: 1 bit
- Integer part: 7 bits
- Fractional part: 8 bits
fxp_mul and fxp_add modules handle:
- Proper bit alignment
- Overflow detection (unused in current design)
- Rounding to maintain precision
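To make the format concrete, here is a hedged Python sketch of Q8.8 arithmetic on raw 16-bit integers. The function names fxp_mul and fxp_add mirror the module names, but the bodies are assumptions: two's-complement encoding, round-half-up after multiplication, and silent wraparound on overflow (consistent with overflow detection being unused). The helpers wrap16, to_q88, and from_q88 are hypothetical.

```python
FRAC_BITS = 8   # fractional bits in Q8.8
WIDTH = 16      # total word width


def wrap16(x):
    """Wrap an integer to signed 16-bit two's complement (assumed overflow)."""
    x &= (1 << WIDTH) - 1
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x


def to_q88(value):
    """Encode a real number as a raw Q8.8 integer."""
    return wrap16(round(value * (1 << FRAC_BITS)))


def from_q88(raw):
    """Decode a raw Q8.8 integer back to a float."""
    return raw / (1 << FRAC_BITS)


def fxp_add(a, b):
    """Add two raw Q8.8 values; the binary points are already aligned."""
    return wrap16(a + b)


def fxp_mul(a, b):
    """Multiply two raw Q8.8 values and realign to 8 fractional bits."""
    product = a * b  # 16 fractional bits at this point
    # Add half an LSB before shifting so the result is rounded, not truncated.
    return wrap16((product + (1 << (FRAC_BITS - 1))) >> FRAC_BITS)


mac = fxp_add(fxp_mul(to_q88(1.5), to_q88(-2.25)), to_q88(0.5))
print(from_q88(mac))  # -2.875
```

Under these assumptions the representable range is -128.0 to just under 128.0 in steps of 1/256.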
Usage example
In a 2×2 systolic array computing matrix multiplication C = A × B:
- Setup: Load weights (B matrix values) into inactive registers
- Activate: Assert pe_switch_in to activate weights
- Compute: Stream A matrix values horizontally
  - Each PE multiplies its input by its weight
  - Partial sums accumulate vertically
- Output: Final sums emerge from bottom of array
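The data flow in these steps can be checked with a short functional sketch in Python. It deliberately ignores cycle-level skew and fixed-point effects, and it assumes that the PE at row r, column c holds B[r][c] and that element A[i][r] enters row r from the west. The function name systolic_matmul is hypothetical.

```python
def systolic_matmul(A, B):
    """Weight-stationary dataflow sketch: PE (r, c) holds weight B[r][c]."""
    rows, cols = len(B), len(B[0])
    C = []
    for a_row in A:              # one row of A streams in from the west
        psums = [0] * cols       # partial sums enter the top row as zero
        for r in range(rows):    # partial sums travel down the array
            for c in range(cols):
                # Each PE multiplies its input by its weight and accumulates.
                psums[c] += a_row[r] * B[r][c]
        C.append(psums)          # final sums emerge from the bottom row
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```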
Reset behavior
On reset (rst = 1), all registers clear:
- Both weight registers → 16'b0
- All outputs → 0
- Valid signals → 0