Unified buffer

The unified buffer (UB) is the central memory system in the Tiny TPU. It stores all matrices, vectors, and intermediate values needed for neural network training, providing dual-port read/write access to support concurrent operations.

Module interface

module unified_buffer #(
    parameter int UNIFIED_BUFFER_WIDTH = 128,
    parameter int SYSTOLIC_ARRAY_WIDTH = 2
)(
    input logic clk,
    input logic rst,

    // Write ports from VPU to UB
    input logic [15:0] ub_wr_data_in [SYSTOLIC_ARRAY_WIDTH],
    input logic ub_wr_valid_in [SYSTOLIC_ARRAY_WIDTH],

    // Write ports from host to UB (for loading parameters)
    input logic [15:0] ub_wr_host_data_in [SYSTOLIC_ARRAY_WIDTH],
    input logic ub_wr_host_valid_in [SYSTOLIC_ARRAY_WIDTH],

    // Read instruction inputs
    input logic ub_rd_start_in,
    input logic ub_rd_transpose,
    input logic [8:0] ub_ptr_select,
    input logic [15:0] ub_rd_addr_in,
    input logic [15:0] ub_rd_row_size,
    input logic [15:0] ub_rd_col_size,

    // Learning rate input
    input logic [15:0] learning_rate_in,

    // Read ports to various destinations...
);

Source: unified_buffer.sv:6-60

Memory organization

Storage capacity

The unified buffer's storage is a single one-dimensional array of 16-bit words:

logic [15:0] ub_memory [0:UNIFIED_BUFFER_WIDTH-1];

With UNIFIED_BUFFER_WIDTH = 128:

Total capacity: 128 entries × 16 bits = 2,048 bits (256 bytes)
Each entry: 16-bit signed fixed-point (Q8.8)

Source: unified_buffer.sv:62
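To make the Q8.8 entry format concrete, here is a minimal Python sketch of converting between real numbers and 16-bit buffer words. The helper names `to_q88` and `from_q88` are hypothetical, and the rounding and saturation behavior are assumptions; the hardware source is not shown doing either.

```python
FRAC_BITS = 8        # Q8.8: 8 integer bits (including sign), 8 fractional bits
WORD_MASK = 0xFFFF   # one 16-bit buffer entry


def to_q88(value: float) -> int:
    """Encode a real number as a 16-bit two's-complement Q8.8 word."""
    scaled = round(value * (1 << FRAC_BITS))   # assumed rounding mode
    scaled = max(-32768, min(32767, scaled))   # assumed saturation at the range limits
    return scaled & WORD_MASK


def from_q88(word: int) -> float:
    """Decode a 16-bit Q8.8 word back to a real number."""
    word &= WORD_MASK
    if word & 0x8000:        # sign bit set: undo two's complement
        word -= 1 << 16
    return word / (1 << FRAC_BITS)


print(hex(to_q88(1.5)))   # 0x180
print(from_q88(0xFF00))   # -1.0
```

The representable range is -128.0 to just under 128.0, in steps of 1/256.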

Stored data types

The unified buffer stores all data needed for training:

Input matrices (X) - Training batch activations
Weight matrices (W) - Layer parameters
Bias vectors (b) - Layer biases
Activation values (H) - Post-activation outputs for backprop
Target values (Y) - Ground truth labels
Hyperparameters:
- Activation leak factors
- Inverse batch size constants
Intermediate gradients - During backpropagation

Matrices are stored in row-major format. For a 2-column matrix that starts at an even address, column 0 values sit at even indices and column 1 values at odd indices.
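The row-major rule can be written as a one-line address calculation. The sketch below is illustrative only; `element_address` is a hypothetical helper, not a function in the design.

```python
def element_address(base: int, row: int, col: int, num_cols: int) -> int:
    """Buffer index of element (row, col) of a row-major matrix starting at base."""
    return base + row * num_cols + col


# A 4-row, 2-column matrix stored at base address 0
addresses = [[element_address(0, r, c, 2) for c in range(2)] for r in range(4)]
print(addresses)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With an even base address, column 0 lands on even indices and column 1 on odd indices; an odd base address would swap the two.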

Write operations

Write from VPU

The VPU writes computation results back to the buffer:

for (int i = SYSTOLIC_ARRAY_WIDTH-1; i >= 0; i--) begin
    if (ub_wr_valid_in[i]) begin
        ub_memory[wr_ptr] <= ub_wr_data_in[i];
        wr_ptr = wr_ptr + 1;
    end
end

Source: unified_buffer.sv:344-351

The loop counts down (i--) to maintain row-major storage order when writing multi-column data.

Write from host

The host can load initial parameters (weights, biases, inputs):

for (int i = SYSTOLIC_ARRAY_WIDTH-1; i >= 0; i--) begin
    if (ub_wr_host_valid_in[i]) begin
        ub_memory[wr_ptr] <= ub_wr_host_data_in[i];
        wr_ptr = wr_ptr + 1;
    end
end

Source: unified_buffer.sv:348-351

Write pointer

A single write pointer tracks the next write location:

logic [15:0] wr_ptr;

The write pointer increments by one for each valid value written. The VPU and host write paths both use this same pointer, so the control unit must sequence writes carefully to avoid overwriting data.
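The following Python sketch models the write loop as excerpted above, to show where each lane's value ends up. It is a behavioral approximation, not the RTL: the class name `WritePointerModel` is hypothetical, and reset is assumed to leave the pointer at 0.

```python
class WritePointerModel:
    """Software model of the shared write pointer and the descending lane loop."""

    def __init__(self, depth: int = 128, lanes: int = 2):
        self.memory = [0] * depth
        self.wr_ptr = 0          # assumed value after reset
        self.lanes = lanes

    def write(self, data, valid):
        # Mirrors: for (int i = SYSTOLIC_ARRAY_WIDTH-1; i >= 0; i--)
        for i in reversed(range(self.lanes)):
            if valid[i]:
                # A Python IndexError here corresponds to writing past the buffer
                self.memory[self.wr_ptr] = data[i]
                self.wr_ptr += 1


ub = WritePointerModel()
ub.write([0x0100, 0x0200], [True, True])
print(ub.memory[:2], ub.wr_ptr)  # [512, 256] 2
```

Because the loop starts at the highest lane, lane 1's value is stored at the lower address and lane 0's value directly after it. Lanes whose valid flag is low do not advance the pointer.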

Read operations

The unified buffer supports seven simultaneous read pointers, each serving a different consumer:

logic [15:0] rd_input_ptr;        // 0: Input data to systolic array
logic [15:0] rd_weight_ptr;       // 1: Weights to systolic array
logic [15:0] rd_bias_ptr;         // 2: Bias values to VPU
logic [15:0] rd_Y_ptr;            // 3: Target values to VPU
logic [15:0] rd_H_ptr;            // 4: Activation values to VPU
logic [15:0] rd_grad_bias_ptr;    // 5: Bias gradients to grad descent
logic [15:0] rd_grad_weight_ptr;  // 6: Weight gradients to grad descent

Source: unified_buffer.sv:75-117

Read instruction format

Reads are initiated by setting control signals:

input logic ub_rd_start_in,         // Start read operation
input logic ub_rd_transpose,        // Transpose during read
input logic [8:0] ub_ptr_select,    // Which pointer to use (0-6)
input logic [15:0] ub_rd_addr_in,   // Starting address
input logic [15:0] ub_rd_row_size,  // Number of rows
input logic [15:0] ub_rd_col_size,  // Number of columns

Source: unified_buffer.sv:22-27

Pointer selection

The ub_ptr_select signal determines which read operation to configure:

always_comb begin
    if (ub_rd_start_in) begin
        case (ub_ptr_select)
            0: begin  // Input data pointer
                rd_input_transpose = ub_rd_transpose;
                rd_input_ptr = ub_rd_addr_in;
                // ...
            end
            1: begin  // Weight data pointer
                rd_weight_transpose = ub_rd_transpose;
                // ...
            end
            2: begin  // Bias pointer
                rd_bias_ptr = ub_rd_addr_in;
                // ...
            end
            // Cases 3-6 for Y, H, grad_bias, grad_weight...
        endcase
    end
end

Source: unified_buffer.sv:168-244

Transpose support

The unified buffer can transpose matrices on-the-fly during reads:

Input transpose (pointer 0)

if(ub_rd_transpose) begin
    // Switch columns and rows
    rd_input_row_size = ub_rd_col_size;
    rd_input_col_size = ub_rd_row_size;
end else begin
    rd_input_row_size = ub_rd_row_size;
    rd_input_col_size = ub_rd_col_size;
end

Source: unified_buffer.sv:176-182

Weight transpose (pointer 1)

Weight reading is more complex due to systolic array requirements:

if(ub_rd_transpose) begin
    rd_weight_row_size = ub_rd_col_size;
    rd_weight_col_size = ub_rd_row_size;
    rd_weight_ptr = ub_rd_addr_in + ub_rd_col_size - 1;  // Start at bottom-right
    ub_rd_col_size_out = ub_rd_row_size;
end else begin
    rd_weight_row_size = ub_rd_row_size;
    rd_weight_col_size = ub_rd_col_size;
    rd_weight_ptr = ub_rd_addr_in + ub_rd_row_size*ub_rd_col_size - ub_rd_col_size;
    ub_rd_col_size_out = ub_rd_col_size;
end
rd_weight_skip_size = ub_rd_col_size + 1;

Source: unified_buffer.sv:187-202

Weights are read in reverse order (bottom-up, right-to-left) to match the systolic array’s data flow requirements. The rd_weight_skip_size determines the stride between elements.
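As a worked example of the pointer arithmetic in the excerpt, the Python sketch below reproduces the start address, effective dimensions, and skip size for a weight read. The function name `weight_read_setup` and the returned dictionary are hypothetical; only the formulas come from the excerpt.

```python
def weight_read_setup(addr: int, rows: int, cols: int, transpose: bool) -> dict:
    """Mirror the weight-pointer setup for ub_ptr_select == 1."""
    if transpose:
        row_size, col_size = cols, rows
        start_ptr = addr + cols - 1                # last element of the first stored row
    else:
        row_size, col_size = rows, cols
        start_ptr = addr + rows * cols - cols      # first element of the last stored row
    return {
        "row_size": row_size,
        "col_size": col_size,
        "start_ptr": start_ptr,
        "skip_size": cols + 1,                     # rd_weight_skip_size
    }


# A 2x2 weight matrix stored at addresses 8-11
print(weight_read_setup(8, 2, 2, transpose=False)["start_ptr"])  # 10
print(weight_read_setup(8, 2, 2, transpose=True)["start_ptr"])   # 9
```

For the 2×2 case the skip size is 3 in both modes, since it depends only on the column count given in the instruction.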

Staggered delivery

To support systolic computation, the unified buffer staggers data delivery using time counters:

logic [15:0] rd_input_time_counter;

if (rd_input_time_counter + 1 < rd_input_row_size + rd_input_col_size) begin
    for (int i = 0; i < SYSTOLIC_ARRAY_WIDTH; i++) begin
        if(rd_input_time_counter >= i && 
           rd_input_time_counter < rd_input_row_size + i && 
           i < rd_input_col_size) begin
            ub_rd_input_valid_out[i] <= 1'b1;
            ub_rd_input_data_out[i] <= ub_memory[rd_input_ptr];
            rd_input_ptr = rd_input_ptr + 1;
        end else begin
            ub_rd_input_valid_out[i] <= 1'b0;
        end
    end
    rd_input_time_counter <= rd_input_time_counter + 1;
end

Source: unified_buffer.sv:371-397

Staggering example

For a 2×2 matrix (2 rows, 2 columns), delivery takes rows + columns - 1 = 3 time steps, and each column receives exactly 2 values:

Time 0: Column 0 gets data, Column 1 idle
Time 1: Column 0 gets data, Column 1 gets data
Time 2: Column 0 idle,      Column 1 gets data

This creates the diagonal wave pattern needed for systolic computation.
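The lane conditions in the excerpt can be evaluated directly to see which columns are valid at each time step. This Python sketch models only the valid flags, not the data or pointer movement, and `stagger_schedule` is a hypothetical helper name.

```python
def stagger_schedule(row_size: int, col_size: int, lanes: int = 2):
    """Return, for each time step, the list of per-lane valid flags."""
    schedule = []
    t = 0
    # Mirrors: rd_input_time_counter + 1 < rd_input_row_size + rd_input_col_size
    while t + 1 < row_size + col_size:
        valid = [
            (t >= i) and (t < row_size + i) and (i < col_size)
            for i in range(lanes)
        ]
        schedule.append(valid)
        t += 1
    return schedule


for t, valid in enumerate(stagger_schedule(2, 2)):
    print(t, valid)
# 0 [True, False]
# 1 [True, True]
# 2 [False, True]
```

Lane i starts i steps late and stays active for row_size steps, so a read with R rows and C columns is active for R + C - 1 steps.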

Gradient descent integration

The unified buffer contains embedded gradient descent modules:

generate
    for (i=0; i<SYSTOLIC_ARRAY_WIDTH; i++) begin : gradient_descent_gen
        gradient_descent gradient_descent_inst (
            .clk(clk),
            .rst(rst),
            .lr_in(learning_rate_in),
            .grad_in(ub_wr_data_in[i]),
            .value_old_in(value_old_in[i]),
            .grad_descent_valid_in(grad_descent_valid_in[i]),
            .grad_bias_or_weight(grad_bias_or_weight),
            .value_updated_out(value_updated_out[i]),
            .grad_descent_done_out(grad_descent_done_out[i])
        );
    end
endgenerate

Source: unified_buffer.sv:132-146

Update mechanism

When gradient descent completes:

if (grad_descent_done_out[i]) begin
    ub_memory[grad_descent_ptr] <= value_updated_out[i];
    grad_descent_ptr = grad_descent_ptr + 1;
end

This allows in-place parameter updates:

W_new = W_old - learning_rate × ∂L/∂W

Source: unified_buffer.sv:356-361
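The update rule can be tried out numerically on Q8.8 integers. The internals of the gradient_descent module are not shown here, so the truncating multiply and 16-bit wraparound in this Python sketch are assumptions, and the names `q88_mul`, `wrap16`, and `sgd_update` are hypothetical.

```python
def q88_mul(a: int, b: int) -> int:
    """Multiply two signed Q8.8 integers; drop the extra 8 fractional bits (assumed truncation)."""
    return (a * b) >> 8


def wrap16(x: int) -> int:
    """Wrap an integer to the signed 16-bit range (assumed overflow behavior)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def sgd_update(value_old: int, grad: int, lr: int) -> int:
    """W_new = W_old - learning_rate * grad, all operands as signed Q8.8 integers."""
    return wrap16(value_old - q88_mul(lr, grad))


# W_old = 1.0 (256), grad = 0.25 (64), learning rate = 0.5 (128)
print(sgd_update(256, 64, 128))        # 224
print(sgd_update(256, 64, 128) / 256)  # 0.875
```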

Read ports

The unified buffer provides dedicated output ports for each consumer:

To systolic array

// Input data (left side of array)
output logic [15:0] ub_rd_input_data_out_0,
output logic [15:0] ub_rd_input_data_out_1,
output logic ub_rd_input_valid_out_0,
output logic ub_rd_input_valid_out_1,

// Weights (top of array)
output logic [15:0] ub_rd_weight_data_out_0,
output logic [15:0] ub_rd_weight_data_out_1,
output logic ub_rd_weight_valid_out_0,
output logic ub_rd_weight_valid_out_1,

Source: unified_buffer.sv:33-43

To VPU

// Bias values
output logic [15:0] ub_rd_bias_data_out_0,
output logic [15:0] ub_rd_bias_data_out_1,

// Target values (Y)
output logic [15:0] ub_rd_Y_data_out_0,
output logic [15:0] ub_rd_Y_data_out_1,

// Activation values (H)
output logic [15:0] ub_rd_H_data_out_0,
output logic [15:0] ub_rd_H_data_out_1,

Source: unified_buffer.sv:45-55

Each output is duplicated for the two columns supported by the 2×2 systolic array.

Memory layout example

Typical memory layout for a simple network:

Address | Content
--------|------------------
0-7     | Input matrix X (2×4)
8-11    | Weight matrix W1 (2×2)
12-15   | Weight matrix W2 (2×2)
16-17   | Bias vector b1 (2)
18-19   | Bias vector b2 (2)
20-23   | Target matrix Y (2×2)
24-27   | Cached H1 values
28-31   | Cached H2 values
32      | Leak factor
33      | Inverse batch size × 2
34-...  | Gradients and temporaries
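A layout like this one is easy to get wrong by hand, so it can help to check it in software before loading the buffer. The Python sketch below copies the address ranges from the table and confirms that they do not overlap and fit within 128 entries. The region names and the `first_free_address` helper are hypothetical.

```python
# (first_address, last_address), both inclusive, copied from the table above
LAYOUT = {
    "X": (0, 7),
    "W1": (8, 11),
    "W2": (12, 15),
    "b1": (16, 17),
    "b2": (18, 19),
    "Y": (20, 23),
    "H1": (24, 27),
    "H2": (28, 31),
    "leak_factor": (32, 32),
    "inv_batch_size_x2": (33, 33),
}


def first_free_address(layout: dict, depth: int = 128) -> int:
    """Validate the regions and return the first address after the last one."""
    next_free = 0
    for name, (first, last) in sorted(layout.items(), key=lambda kv: kv[1][0]):
        if first < next_free:
            raise ValueError(f"{name} overlaps the previous region")
        if last < first or last >= depth:
            raise ValueError(f"{name} does not fit in a {depth}-entry buffer")
        next_free = last + 1
    return next_free


print(first_free_address(LAYOUT))  # 34
```

The result, 34, matches the start of the gradients and temporaries region in the table.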

Performance characteristics

Bandwidth

Write: 2 values per cycle (from VPU or host)
Read: Up to 14 values per cycle (7 pointers × 2 columns)
No conflicts: Reads and writes use separate pointers

Latency

Write: 1 cycle (registered)
Read: 1 cycle (registered)
Auto-increment: Sequential reads stream at 1 value per cycle

Reset behavior

On reset, all memory and control state clears:

if (rst) begin
    for (int i = 0; i < UNIFIED_BUFFER_WIDTH; i++) begin
        ub_memory[i] <= '0;
    end
    wr_ptr <= '0;
    // All read pointers reset to 0
    // All counters reset to 0
end

Source: unified_buffer.sv:283-339
