The vector processing unit (VPU) performs element-wise operations on vectors after they exit the systolic array. The VPU implements a configurable pipeline of activation functions, loss derivatives, and bias addition.
## Module interface

```systemverilog
module vpu (
    input  logic               clk,
    input  logic               rst,
    input  logic        [3:0]  vpu_data_pathway,  // Module selection control
    // Inputs from systolic array
    input  logic signed [15:0] vpu_data_in_1,
    input  logic signed [15:0] vpu_data_in_2,
    input  logic               vpu_valid_in_1,
    input  logic               vpu_valid_in_2,
    // Inputs from unified buffer
    input  logic signed [15:0] bias_scalar_in_1,
    input  logic signed [15:0] bias_scalar_in_2,
    input  logic signed [15:0] lr_leak_factor_in,
    input  logic signed [15:0] Y_in_1,
    input  logic signed [15:0] Y_in_2,
    input  logic signed [15:0] inv_batch_size_times_two_in,
    input  logic signed [15:0] H_in_1,
    input  logic signed [15:0] H_in_2,
    // Outputs to unified buffer
    output logic signed [15:0] vpu_data_out_1,
    output logic signed [15:0] vpu_data_out_2,
    output logic               vpu_valid_out_1,
    output logic               vpu_valid_out_2
);
```

Source: `vpu.sv:19-46`
## Pipeline modules

The VPU contains four pipelined modules that can be selectively activated:
### 1. Bias addition

Adds a bias scalar to each element:

```
Z = z + bias
```

Implemented by the `bias_parent` module, which contains two `bias_child` instances (one per column):

```systemverilog
bias_parent bias_parent_inst (
    .clk(clk),
    .rst(rst),
    .bias_sys_data_in_1(bias_data_1_in),
    .bias_sys_data_in_2(bias_data_2_in),
    .bias_sys_valid_in_1(bias_valid_1_in),
    .bias_sys_valid_in_2(bias_valid_2_in),
    .bias_scalar_in_1(bias_scalar_in_1),   // From UB
    .bias_scalar_in_2(bias_scalar_in_2),   // From UB
    .bias_Z_valid_out_1(bias_valid_1_out),
    .bias_Z_valid_out_2(bias_valid_2_out),
    .bias_z_data_out_1(bias_z_data_out_1),
    .bias_z_data_out_2(bias_z_data_out_2)
);
```

Source: `vpu.sv:115-130`
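As an illustration (not the RTL itself), the per-column bias add can be modeled in Python. The 16-bit wrapping behavior below is an assumption based on the 16-bit signed ports; this excerpt doesn't show the module's actual overflow handling.

```python
def wrap16(x: int) -> int:
    """Interpret an integer as a signed 16-bit two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def bias_add(z: int, bias_scalar: int) -> int:
    """One column of the bias stage: Z = systolic output + bias scalar."""
    return wrap16(z + bias_scalar)

print(bias_add(100, 25))    # 125
print(bias_add(32767, 1))   # wraps to -32768
```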
### 2. Leaky ReLU activation

Applies the leaky ReLU activation function:

```
H(z) = z          if z > 0
     = z × leak   if z ≤ 0
```

Implemented by the `leaky_relu_parent` module:

```systemverilog
leaky_relu_parent leaky_relu_parent_inst (
    .clk(clk),
    .rst(rst),
    .lr_data_1_in(lr_data_1_in),
    .lr_data_2_in(lr_data_2_in),
    .lr_valid_1_in(lr_valid_1_in),
    .lr_valid_2_in(lr_valid_2_in),
    .lr_leak_factor_in(lr_leak_factor_in), // From UB
    .lr_data_1_out(lr_data_1_out),
    .lr_data_2_out(lr_data_2_out),
    .lr_valid_1_out(lr_valid_1_out),
    .lr_valid_2_out(lr_valid_2_out)
);
```

Source: `vpu.sv:133-148`
The leak factor is typically a small positive value (e.g., 0.01) that allows a small gradient for negative inputs, preventing “dead neurons” in the network.
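A behavioral sketch of this stage in Python; the fixed-point format is an assumption (the excerpt doesn't state where the binary point sits), so Q8.8 is used here for illustration:

```python
FRAC_BITS = 8  # assumed Q8.8 fixed point; not specified in the RTL excerpt

def leaky_relu(z: int, leak_q: int) -> int:
    """H(z) = z if z > 0, else z * leak (fixed-point multiply, then rescale)."""
    if z > 0:
        return z
    return (z * leak_q) >> FRAC_BITS  # Python's >> floors, preserving the sign

leak_q = round(0.01 * (1 << FRAC_BITS))  # leak ~= 0.01 -> 3 in Q8.8
print(leaky_relu(512, leak_q))   # positive input passes through: 512
print(leaky_relu(-512, leak_q))  # -512 * 3 >> 8 = -6
```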
### 3. Loss derivative (MSE)

Computes the derivative of the mean squared error loss:

```
∂L/∂H = (2 / batch_size) × (H − Y)
```

Implemented by the `loss_parent` module:

```systemverilog
loss_parent loss_parent_inst (
    .clk(clk),
    .rst(rst),
    .H_1_in(loss_data_1_in),
    .H_2_in(loss_data_2_in),
    .valid_1_in(loss_valid_1_in),
    .valid_2_in(loss_valid_2_in),
    .Y_1_in(Y_in_1),                       // From UB
    .Y_2_in(Y_in_2),                       // From UB
    .inv_batch_size_times_two_in(inv_batch_size_times_two_in),
    .gradient_1_out(loss_data_1_out),
    .gradient_2_out(loss_data_2_out),
    .valid_1_out(loss_valid_1_out),
    .valid_2_out(loss_valid_2_out)
);
```

Source: `vpu.sv:150-166`
The module is named `loss_parent`, but it computes the loss derivative rather than the loss value itself; this is noted in the code comments at line 150.
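In the same hedged Python model, this stage reduces to one multiply per element. Note that `2/batch_size` is precomputed outside the VPU and delivered through `inv_batch_size_times_two_in`; Q8.8 is again an assumed format:

```python
FRAC_BITS = 8  # assumed Q8.8 fixed point

def mse_gradient(h: int, y: int, inv_batch_times_two_q: int) -> int:
    """dL/dH = (2 / batch_size) * (H - Y), with the reciprocal supplied as a constant."""
    return ((h - y) * inv_batch_times_two_q) >> FRAC_BITS

# batch_size = 4 -> 2/4 = 0.5 -> 128 in Q8.8
print(mse_gradient(h=768, y=256, inv_batch_times_two_q=128))  # (512 * 128) >> 8 = 256
```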
### 4. Leaky ReLU derivative

Computes the derivative of leaky ReLU:

```
∂H/∂Z = 1      if H > 0
      = leak   if H ≤ 0
```

and multiplies it by the gradient arriving from the next layer.

Implemented by the `leaky_relu_derivative_parent` module:

```systemverilog
leaky_relu_derivative_parent leaky_relu_derivative_parent_inst (
    .clk(clk),
    .rst(rst),
    .lr_d_data_1_in(lr_d_data_1_in),
    .lr_d_data_2_in(lr_d_data_2_in),
    .lr_d_valid_1_in(lr_d_valid_1_in),
    .lr_d_valid_2_in(lr_d_valid_2_in),
    .lr_d_H_1_in(lr_d_H_in_1),             // From UB or cached
    .lr_d_H_2_in(lr_d_H_in_2),             // From UB or cached
    .lr_leak_factor_in(lr_leak_factor_in),
    .lr_d_data_1_out(lr_d_data_1_out),
    .lr_d_data_2_out(lr_d_data_2_out),
    .lr_d_valid_1_out(lr_d_valid_1_out),
    .lr_d_valid_2_out(lr_d_valid_2_out)
);
```

Source: `vpu.sv:168-185`
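The backward activation stage can be sketched the same way: the sign of the cached (or buffered) H value selects between passing the upstream gradient through and multiplying it by the leak factor (Q8.8 assumed, as above):

```python
FRAC_BITS = 8  # assumed Q8.8 fixed point

def leaky_relu_backward(upstream_grad: int, h: int, leak_q: int) -> int:
    """dL/dZ = dL/dH * (1 if H > 0 else leak)."""
    if h > 0:
        return upstream_grad                      # derivative is 1
    return (upstream_grad * leak_q) >> FRAC_BITS  # derivative is the leak factor

leak_q = 3  # ~= 0.01 in Q8.8
print(leaky_relu_backward(256, h=100, leak_q=leak_q))   # 256
print(leaky_relu_backward(256, h=-100, leak_q=leak_q))  # (256 * 3) >> 8 = 3
```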
## Data pathways

The 4-bit `vpu_data_pathway` control signal selects which modules are active:

| Bit | Module enabled        |
|-----|-----------------------|
| [3] | Bias                  |
| [2] | Leaky ReLU            |
| [1] | Loss derivative       |
| [0] | Leaky ReLU derivative |

Source: `vpu.sv:10-17`
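The bit assignments can be mirrored in a small Python helper (the names here are illustrative, not taken from the RTL):

```python
def decode_pathway(path: int) -> dict:
    """Map the 4-bit pathway value to per-module enables (bit 3 = bias ... bit 0 = LR derivative)."""
    return {
        "bias":          bool(path & 0b1000),
        "leaky_relu":    bool(path & 0b0100),
        "loss_deriv":    bool(path & 0b0010),
        "lr_derivative": bool(path & 0b0001),
    }

print(decode_pathway(0b1100))  # forward pass: bias and leaky ReLU enabled
```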
## Pathway configurations

### Forward pass pathway (4'b1100)

```
Systolic Array → Bias → Leaky ReLU → Output
```

Used for hidden-layer computations during forward propagation.

```systemverilog
vpu_data_pathway = 4'b1100; // Bias + Leaky ReLU
```

### Transition pathway (4'b1111)

```
Systolic Array → Bias → Leaky ReLU → Loss → LR Derivative → Output
                             ↓
                      (cache H matrix)
```

Used for the output layer, computing both the forward pass and the loss gradient.

```systemverilog
vpu_data_pathway = 4'b1111; // All modules
```

Special behavior: caches the H matrix output from leaky ReLU for use in the derivative computation.

### Backward pass pathway (4'b0001)

```
Systolic Array → Leaky ReLU Derivative → Output
```

Used for hidden-layer gradients during backpropagation.

```systemverilog
vpu_data_pathway = 4'b0001; // LR derivative only
```

### No operation (4'b0000)

```
Systolic Array → (bypassed) → Output
```

Passes data through unchanged.
## Routing logic

The VPU uses combinational logic to route data between modules based on the pathway control.

### Bias module routing

```systemverilog
if (vpu_data_pathway[3]) begin
    // Connect VPU inputs to the bias module
    bias_data_1_in  = vpu_data_in_1;
    bias_data_2_in  = vpu_data_in_2;
    bias_valid_1_in = vpu_valid_in_1;
    bias_valid_2_in = vpu_valid_in_2;
    // Connect bias outputs to intermediate values
    b_to_lr_data_in_1 = bias_z_data_out_1;
    b_to_lr_data_in_2 = bias_z_data_out_2;
    // ...
end else begin
    // Bypass the bias module
    b_to_lr_data_in_1 = vpu_data_in_1;
    b_to_lr_data_in_2 = vpu_data_in_2;
    // ...
end
```

Source: `vpu.sv:213-237`

This pattern repeats for each module, creating a flexible pipeline in which any combination of modules can be activated or bypassed.
## H matrix caching

During the transition pathway, the VPU caches the H matrix (post-activation values) for use in computing activation derivatives:

```systemverilog
if (vpu_data_pathway[1]) begin // Loss module active
    // Cache and use the 'last H matrix'
    last_H_data_1_in = lr_data_1_out;
    last_H_data_2_in = lr_data_2_out;
    lr_d_H_in_1      = last_H_data_1_out;
    lr_d_H_in_2      = last_H_data_2_out;
end else begin
    // Use the H matrix from the unified buffer
    lr_d_H_in_1 = H_in_1;
    lr_d_H_in_2 = H_in_2;
end
```

Source: `vpu.sv:281-302`

The cache itself is implemented with sequential logic:

```systemverilog
always @(posedge clk or posedge rst) begin
    if (rst) begin
        last_H_data_1_out <= 0;
        last_H_data_2_out <= 0;
    end else begin
        if (vpu_data_pathway[1]) begin
            last_H_data_1_out <= last_H_data_1_in;
            last_H_data_2_out <= last_H_data_2_in;
        end
    end
end
```

Source: `vpu.sv:333-347`
Caching the H matrix in the VPU eliminates the need to write it to the unified buffer and read it back, saving memory bandwidth during the critical transition from forward to backward pass.
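The register's load-when-enabled, hold-otherwise behavior can be modeled in a few lines of Python (a sketch of one column's cache, not the RTL):

```python
class HCache:
    """Model of one column's H-matrix cache: loads only when pathway bit 1 is set."""
    def __init__(self):
        self.value = 0  # reset state

    def clock(self, pathway: int, h_in: int) -> int:
        if pathway & 0b0010:  # loss-derivative stage active -> capture H
            self.value = h_in
        return self.value     # otherwise the register holds its value

cache = HCache()
cache.clock(0b1111, h_in=42)   # transition pathway: H is captured
print(cache.clock(0b0001, 0))  # backward pass: cached 42 is still available
```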
## Reset behavior

On reset, the outputs and all internal routing signals are set to zero:

```systemverilog
always @(*) begin
    if (rst) begin
        vpu_data_out_1  = 16'b0;
        vpu_data_out_2  = 16'b0;
        vpu_valid_out_1 = 1'b0;
        vpu_valid_out_2 = 1'b0;
        // All internal wires zeroed...
    end
end
```

Source: `vpu.sv:187-211`
## Dual-column processing

The VPU processes two columns simultaneously:

- Column 1: `*_data_in_1`, `*_data_out_1`, `*_valid_in_1`, `*_valid_out_1`
- Column 2: `*_data_in_2`, `*_data_out_2`, `*_valid_in_2`, `*_valid_out_2`

This matches the 2×2 systolic array width, allowing the bottom two PEs' outputs to be processed in parallel.
## Parent-child module pattern

Each VPU function uses a parent-child pattern:

- **Parent module**: instantiates two child modules (one per column)
- **Child module**: implements the actual computation

Example for bias addition:

```systemverilog
module bias_parent (
    // Dual-column interface
);

    bias_child column_1 (
        .bias_scalar_in(bias_scalar_in_1),
        .bias_sys_data_in(bias_sys_data_in_1),
        // ...
    );

    bias_child column_2 (
        .bias_scalar_in(bias_scalar_in_2),
        .bias_sys_data_in(bias_sys_data_in_2),
        // ...
    );

endmodule
```

Source: `bias_parent.sv:26-44`

This pattern provides:

- Clean separation of concerns
- Easy replication for additional columns
- Independent operation per column
## Usage in neural network training

### Forward pass

- **Hidden layer** (`4'b1100`, bias + leaky ReLU):

  ```
  Z = W×X + b
  H = LeakyReLU(Z)
  ```

- **Output layer** (`4'b1111`, bias + leaky ReLU + loss + derivative):

  ```
  Z = W×H + b
  Y_hat = LeakyReLU(Z)
  ∂L/∂Y_hat = (2/N) × (Y_hat − Y)
  ∂L/∂Z = ∂L/∂Y_hat × LeakyReLU′(Z)
  ```

### Backward pass

- **Hidden layers** (`4'b0001`, derivative only):

  ```
  ∂L/∂H = Wᵀ × ∂L/∂Z_next
  ∂L/∂Z = ∂L/∂H × LeakyReLU′(Z)
  ```

- **Weight gradients**: computed in the systolic array
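Putting the per-layer equations together, a scalar float-domain walk-through (batch size 1, made-up weights, no fixed point) shows the order in which the output-layer stages fire:

```python
LEAK = 0.01  # example leak factor

def forward(w: float, x: float, b: float):
    """One scalar neuron: bias add then leaky ReLU, as in pathway 4'b1100."""
    z = w * x + b
    h = z if z > 0 else LEAK * z
    return z, h

# Output layer with target y = 1.0 (the transition pathway, 4'b1111)
z, y_hat = forward(w=0.5, x=2.0, b=-2.0)         # z = -1.0, y_hat = -0.01
dL_dyhat = 2.0 * (y_hat - 1.0)                   # (2/N) * (Y_hat - Y), N = 1
dL_dz = dL_dyhat * (1.0 if y_hat > 0 else LEAK)  # chain through LeakyReLU'
print(z, y_hat, dL_dz)
```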