The unified buffer (UB) is the central memory system in the Tiny TPU. It stores all matrices, vectors, and intermediate values needed for neural network training, providing dual-port read/write access to support concurrent operations.
Module interface
module unified_buffer #(
parameter int UNIFIED_BUFFER_WIDTH = 128,
parameter int SYSTOLIC_ARRAY_WIDTH = 2
)(
input logic clk,
input logic rst,
// Write ports from VPU to UB
input logic [15:0] ub_wr_data_in [SYSTOLIC_ARRAY_WIDTH],
input logic ub_wr_valid_in [SYSTOLIC_ARRAY_WIDTH],
// Write ports from host to UB (for loading parameters)
input logic [15:0] ub_wr_host_data_in [SYSTOLIC_ARRAY_WIDTH],
input logic ub_wr_host_valid_in [SYSTOLIC_ARRAY_WIDTH],
// Read instruction inputs
input logic ub_rd_start_in,
input logic ub_rd_transpose,
input logic [8:0] ub_ptr_select,
input logic [15:0] ub_rd_addr_in,
input logic [15:0] ub_rd_row_size,
input logic [15:0] ub_rd_col_size,
// Learning rate input
input logic [15:0] learning_rate_in,
// Read ports to various destinations...
);
Source: unified_buffer.sv:6-60
Memory organization
Storage capacity
The unified buffer is backed by a single one-dimensional array of 16-bit words:
logic [15:0] ub_memory [0:UNIFIED_BUFFER_WIDTH-1];
With UNIFIED_BUFFER_WIDTH = 128:
- Total capacity: 128 entries × 16 bits = 2,048 bits (256 bytes)
- Each entry: 16-bit signed fixed-point (Q8.8)
Source: unified_buffer.sv:62
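The capacity figures and the Q8.8 format above can be checked with a small Python model. This is only a sketch: the helper names (`capacity_bits`, `to_q8_8`, `from_q8_8`) are hypothetical and do not appear in the RTL, and saturating on overflow is an assumption (the hardware may wrap instead).

```python
UNIFIED_BUFFER_WIDTH = 128  # number of entries, as in the module parameter
ENTRY_BITS = 16             # each entry is one 16-bit word
FRACTION_BITS = 8           # Q8.8: 8 integer bits, 8 fractional bits


def capacity_bits(entries=UNIFIED_BUFFER_WIDTH, entry_bits=ENTRY_BITS):
    """Total storage in bits."""
    return entries * entry_bits


def to_q8_8(value):
    """Encode a float as a 16-bit two's-complement Q8.8 word."""
    scaled = int(round(value * (1 << FRACTION_BITS)))
    # Assumption: saturate to the signed 16-bit range rather than wrap.
    scaled = max(-32768, min(32767, scaled))
    return scaled & 0xFFFF


def from_q8_8(word):
    """Decode a 16-bit Q8.8 word back to a float."""
    signed = word - 0x10000 if word & 0x8000 else word
    return signed / (1 << FRACTION_BITS)
```

For example, `to_q8_8(1.5)` yields `0x0180` and `from_q8_8(0xFF80)` yields `-0.5`. The smallest representable step is 1/256.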
Stored data types
The unified buffer stores all data needed for training:
- Input matrices (X) - Training batch activations
- Weight matrices (W) - Layer parameters
- Bias vectors (b) - Layer biases
- Activation values (H) - Post-activation outputs for backprop
- Target values (Y) - Ground truth labels
- Hyperparameters:
  - Activation leak factors
  - Inverse batch size constants
- Intermediate gradients - During backpropagation
Matrices are stored in row-major format. For a 2-column matrix, column 0 values sit at even offsets from the matrix's base address and column 1 values at odd offsets.
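The row-major rule can be written as a one-line address calculation. The function below is a hypothetical helper for illustration, not part of the RTL. `base` is the address of the matrix's first element.

```python
def element_address(base, row, col, num_cols):
    """Address of element (row, col) in a row-major matrix starting at base."""
    return base + row * num_cols + col


# For a 2-column matrix, column 0 lands on even offsets and column 1 on odd ones.
offsets_col0 = [element_address(0, r, 0, 2) for r in range(4)]
offsets_col1 = [element_address(0, r, 1, 2) for r in range(4)]
```

Here `offsets_col0` is `[0, 2, 4, 6]` and `offsets_col1` is `[1, 3, 5, 7]`.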
Write operations
Write from VPU
The VPU writes computation results back to the buffer:
for (int i = SYSTOLIC_ARRAY_WIDTH-1; i >= 0; i--) begin
if (ub_wr_valid_in[i]) begin
ub_memory[wr_ptr] <= ub_wr_data_in[i];
wr_ptr = wr_ptr + 1;
end
end
Source: unified_buffer.sv:344-351
The loop decrements (i--) to maintain row-major storage order when writing multi-column data.
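The effect of the descending loop on storage order can be sketched with a behavioral model in Python. It assumes the blocking pointer update behaves like a sequential increment within one clock cycle. The function name `vpu_write` is hypothetical.

```python
def vpu_write(memory, wr_ptr, data_in, valid_in):
    """Model one cycle of the write loop; returns the updated write pointer."""
    # Walk the lanes from the highest index down to 0, as the RTL loop does.
    for i in reversed(range(len(data_in))):
        if valid_in[i]:
            memory[wr_ptr] = data_in[i]
            wr_ptr += 1
    return wr_ptr


memory = [0] * 8
ptr = vpu_write(memory, 0, [0xA, 0xB], [True, True])
```

After this call `memory[0:2]` is `[0xB, 0xA]` and `ptr` is 2. The highest-indexed valid lane lands at the lower address, and lanes whose valid bit is low do not consume an address.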
Write from host
The host can load initial parameters (weights, biases, inputs):
for (int i = SYSTOLIC_ARRAY_WIDTH-1; i >= 0; i--) begin
if (ub_wr_host_valid_in[i]) begin
ub_memory[wr_ptr] <= ub_wr_host_data_in[i];
wr_ptr = wr_ptr + 1;
end
end
Source: unified_buffer.sv:348-351
Write pointer
A single write pointer (wr_ptr) tracks the next write location. It auto-increments after each write, so the control unit must sequence writes carefully to avoid overwriting data that is still needed.
Read operations
The unified buffer supports seven simultaneous read pointers, each serving a different consumer:
logic [15:0] rd_input_ptr; // 0: Input data to systolic array
logic [15:0] rd_weight_ptr; // 1: Weights to systolic array
logic [15:0] rd_bias_ptr; // 2: Bias values to VPU
logic [15:0] rd_Y_ptr; // 3: Target values to VPU
logic [15:0] rd_H_ptr; // 4: Activation values to VPU
logic [15:0] rd_grad_bias_ptr; // 5: Bias gradients to grad descent
logic [15:0] rd_grad_weight_ptr; // 6: Weight gradients to grad descent
Source: unified_buffer.sv:75-117
Reads are initiated by setting control signals:
input logic ub_rd_start_in, // Start read operation
input logic ub_rd_transpose, // Transpose during read
input logic [8:0] ub_ptr_select, // Which pointer to use (0-6)
input logic [15:0] ub_rd_addr_in, // Starting address
input logic [15:0] ub_rd_row_size, // Number of rows
input logic [15:0] ub_rd_col_size, // Number of columns
Source: unified_buffer.sv:22-27
Pointer selection
The ub_ptr_select signal determines which read operation to configure:
always_comb begin
if (ub_rd_start_in) begin
case (ub_ptr_select)
0: begin // Input data pointer
rd_input_transpose = ub_rd_transpose;
rd_input_ptr = ub_rd_addr_in;
// ...
end
1: begin // Weight data pointer
rd_weight_transpose = ub_rd_transpose;
// ...
end
2: begin // Bias pointer
rd_bias_ptr = ub_rd_addr_in;
// ...
end
// Cases 3-6 for Y, H, grad_bias, grad_weight...
endcase
end
end
Source: unified_buffer.sv:168-244
Transpose support
The unified buffer can transpose matrices on the fly during reads. For the input pointer (pointer 0), this amounts to swapping the row and column sizes:
if(ub_rd_transpose) begin
// Switch columns and rows
rd_input_row_size = ub_rd_col_size;
rd_input_col_size = ub_rd_row_size;
end else begin
rd_input_row_size = ub_rd_row_size;
rd_input_col_size = ub_rd_col_size;
end
Source: unified_buffer.sv:176-182
Weight transpose (pointer 1)
Weight reading is more complex due to systolic array requirements:
if(ub_rd_transpose) begin
rd_weight_row_size = ub_rd_col_size;
rd_weight_col_size = ub_rd_row_size;
rd_weight_ptr = ub_rd_addr_in + ub_rd_col_size - 1; // Start at bottom-right
ub_rd_col_size_out = ub_rd_row_size;
end else begin
rd_weight_row_size = ub_rd_row_size;
rd_weight_col_size = ub_rd_col_size;
rd_weight_ptr = ub_rd_addr_in + ub_rd_row_size*ub_rd_col_size - ub_rd_col_size;
ub_rd_col_size_out = ub_rd_col_size;
end
rd_weight_skip_size = ub_rd_col_size + 1;
Source: unified_buffer.sv:187-202
Weights are read in reverse order (bottom-up, right-to-left) to match the systolic array’s data flow requirements. The rd_weight_skip_size determines the stride between elements.
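The start-pointer arithmetic from the snippet above can be mirrored in Python to see which address a weight read begins at. This is a sketch of the address setup only. It does not model how the pointer advances afterwards, and the function name and returned dictionary are hypothetical.

```python
def weight_read_setup(addr, rows, cols, transpose):
    """Mirror the weight-pointer setup for a rows x cols matrix stored at addr."""
    if transpose:
        row_size, col_size = cols, rows
        start_ptr = addr + cols - 1            # last element of the first stored row
    else:
        row_size, col_size = rows, cols
        start_ptr = addr + rows * cols - cols  # first element of the last stored row
    return {
        "row_size": row_size,
        "col_size": col_size,
        "start_ptr": start_ptr,
        "skip_size": cols + 1,                 # same formula in both branches
    }


normal = weight_read_setup(addr=8, rows=2, cols=2, transpose=False)
transposed = weight_read_setup(addr=8, rows=2, cols=2, transpose=True)
```

For a 2×2 matrix at address 8, the non-transposed read starts at address 10 and the transposed read at address 9, both with a skip size of 3.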
Staggered delivery
To support systolic computation, the unified buffer staggers data delivery using time counters:
logic [15:0] rd_input_time_counter;
if (rd_input_time_counter + 1 < rd_input_row_size + rd_input_col_size) begin
for (int i = 0; i < SYSTOLIC_ARRAY_WIDTH; i++) begin
if(rd_input_time_counter >= i &&
rd_input_time_counter < rd_input_row_size + i &&
i < rd_input_col_size) begin
ub_rd_input_valid_out[i] <= 1'b1;
ub_rd_input_data_out[i] <= ub_memory[rd_input_ptr];
rd_input_ptr = rd_input_ptr + 1;
end else begin
ub_rd_input_valid_out[i] <= 1'b0;
end
end
rd_input_time_counter <= rd_input_time_counter + 1;
end
Source: unified_buffer.sv:371-397
Staggering example
For a matrix with 3 rows and 2 columns (rd_input_row_size = 3, rd_input_col_size = 2):
Time 0: Column 0 gets data, Column 1 idle
Time 1: Column 0 gets data, Column 1 gets data
Time 2: Column 0 gets data, Column 1 gets data
Time 3: Column 0 idle, Column 1 gets data
This creates the diagonal wave pattern needed for systolic computation.
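The valid pattern can be reproduced by evaluating the same conditions as the RTL loop in Python. This is a behavioral sketch that models only the valid flags, not the data or the pointer. The function name `stagger_schedule` is hypothetical.

```python
def stagger_schedule(row_size, col_size, array_width=2):
    """Return, per time step, which columns receive valid data."""
    schedule = []
    t = 0
    # Same termination condition as the RTL time counter.
    while t + 1 < row_size + col_size:
        valid = [
            (t >= i) and (t < row_size + i) and (i < col_size)
            for i in range(array_width)
        ]
        schedule.append(valid)
        t += 1
    return schedule


three_rows = stagger_schedule(row_size=3, col_size=2)
two_rows = stagger_schedule(row_size=2, col_size=2)
```

With 3 rows and 2 columns the model gives four steps: column 0 is valid at times 0 to 2 and column 1 at times 1 to 3. With 2 rows and 2 columns it gives three steps. In general the delivery takes `row_size + col_size - 1` cycles.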
Gradient descent integration
The unified buffer contains embedded gradient descent modules:
generate
for (i=0; i<SYSTOLIC_ARRAY_WIDTH; i++) begin : gradient_descent_gen
gradient_descent gradient_descent_inst (
.clk(clk),
.rst(rst),
.lr_in(learning_rate_in),
.grad_in(ub_wr_data_in[i]),
.value_old_in(value_old_in[i]),
.grad_descent_valid_in(grad_descent_valid_in[i]),
.grad_bias_or_weight(grad_bias_or_weight),
.value_updated_out(value_updated_out[i]),
.grad_descent_done_out(grad_descent_done_out[i])
);
end
endgenerate
Source: unified_buffer.sv:132-146
Update mechanism
When gradient descent completes:
if (grad_descent_done_out[i]) begin
ub_memory[grad_descent_ptr] <= value_updated_out[i];
grad_descent_ptr = grad_descent_ptr + 1;
end
This allows in-place parameter updates:
W_new = W_old - learning_rate × ∂L/∂W
Source: unified_buffer.sv:356-361
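A rough Python model of the update rule in Q8.8 arithmetic follows. The internals of the gradient_descent module are not shown here, so the truncation via arithmetic right shift and the 16-bit wraparound are assumptions. The model also ignores the grad_bias_or_weight flag. All function names are hypothetical.

```python
FRACTION_BITS = 8  # Q8.8


def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement integer."""
    return word - 0x10000 if word & 0x8000 else word


def sgd_update_q8_8(value_old, grad, lr):
    """Compute value_old - lr * grad on 16-bit Q8.8 words."""
    product = to_signed16(grad) * to_signed16(lr)  # Q16.16 intermediate
    step = product >> FRACTION_BITS                # back to Q8.8 (floor; assumption)
    return (to_signed16(value_old) - step) & 0xFFFF  # wrap to 16 bits (assumption)


# W_old = 1.0, gradient = 0.5, learning rate = 0.5
w_new = sgd_update_q8_8(0x0100, 0x0080, 0x0080)
```

Here `w_new` is `0x00C0`, which is 0.75 in Q8.8, matching 1.0 - 0.5 × 0.5.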
Read ports
The unified buffer provides dedicated output ports for each consumer:
To systolic array
// Input data (left side of array)
output logic [15:0] ub_rd_input_data_out_0,
output logic [15:0] ub_rd_input_data_out_1,
output logic ub_rd_input_valid_out_0,
output logic ub_rd_input_valid_out_1,
// Weights (top of array)
output logic [15:0] ub_rd_weight_data_out_0,
output logic [15:0] ub_rd_weight_data_out_1,
output logic ub_rd_weight_valid_out_0,
output logic ub_rd_weight_valid_out_1,
Source: unified_buffer.sv:33-43
To VPU
// Bias values
output logic [15:0] ub_rd_bias_data_out_0,
output logic [15:0] ub_rd_bias_data_out_1,
// Target values (Y)
output logic [15:0] ub_rd_Y_data_out_0,
output logic [15:0] ub_rd_Y_data_out_1,
// Activation values (H)
output logic [15:0] ub_rd_H_data_out_0,
output logic [15:0] ub_rd_H_data_out_1,
Source: unified_buffer.sv:45-55
Each output is duplicated for the two columns supported by the 2×2 systolic array.
Memory layout example
Typical memory layout for a simple network:
Address | Content
--------|------------------
0-7 | Input matrix X (2×4)
8-11 | Weight matrix W1 (2×2)
12-15 | Weight matrix W2 (2×2)
16-17 | Bias vector b1 (2)
18-19 | Bias vector b2 (2)
20-23 | Target matrix Y (2×2)
24-27 | Cached H1 values
28-31 | Cached H2 values
32 | Leak factor
33 | Inverse batch size × 2
34-... | Gradients and temporaries
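A layout like this can be written down as data and checked for overlapping regions before it is programmed into the control unit. The sketch below encodes the address ranges from the table. The dictionary keys and function names are hypothetical.

```python
UNIFIED_BUFFER_WIDTH = 128

# Inclusive (start, end) address ranges taken from the table above.
LAYOUT = {
    "X": (0, 7),
    "W1": (8, 11),
    "W2": (12, 15),
    "b1": (16, 17),
    "b2": (18, 19),
    "Y": (20, 23),
    "H1": (24, 27),
    "H2": (28, 31),
    "leak_factor": (32, 32),
    "inv_batch_size_x2": (33, 33),
}


def find_overlaps(layout):
    """Return pairs of address-adjacent regions whose ranges overlap."""
    regions = sorted(layout.items(), key=lambda kv: kv[1][0])
    overlaps = []
    for (name_a, (_, end_a)), (name_b, (start_b, _)) in zip(regions, regions[1:]):
        if start_b <= end_a:
            overlaps.append((name_a, name_b))
    return overlaps


def first_free_address(layout):
    """First address after the highest allocated region."""
    return max(end for _, end in layout.values()) + 1


free_entries = UNIFIED_BUFFER_WIDTH - first_free_address(LAYOUT)
```

With this layout there are no overlaps, the first free address is 34, and 94 of the 128 entries remain for gradients and temporaries.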
Bandwidth
- Write: 2 values per cycle (from VPU or host)
- Read: Up to 14 values per cycle (7 pointers × 2 columns)
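- No pointer conflicts: Reads and writes use separate pointers, so they can proceed in the same cycle; keeping their address ranges from overlapping is left to the control unit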
Latency
- Write: 1 cycle (registered)
- Read: 1 cycle (registered)
- Auto-increment: Sequential reads stream at 1 value per cycle
Reset behavior
On reset, all memory and control state clears:
if (rst) begin
for (int i = 0; i < UNIFIED_BUFFER_WIDTH; i++) begin
ub_memory[i] <= '0;
end
wr_ptr <= '0;
// All read pointers reset to 0
// All counters reset to 0
end
Source: unified_buffer.sv:283-339