Tiny TPU is a minimal tensor processing unit inspired by Google’s TPU v1 and v2 designs. The architecture implements a complete hardware accelerator capable of executing both forward and backward propagation for neural network training.

System architecture

The Tiny TPU consists of five major components that work together to accelerate matrix operations and neural network computations:
  1. Processing element (PE) - The fundamental computational unit
  2. Systolic array - A 2D grid of processing elements
  3. Vector processing unit (VPU) - Element-wise operations pipeline
  4. Unified buffer (UB) - Dual-port memory for intermediate values
  5. Control unit - Instruction decoder and system controller
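The roles of the first two components can be sketched in software. The model below is a hedged approximation in Python: each processing element performs one multiply-accumulate per clock, with the activation forwarded to the right neighbour, the weight forwarded downward, and the partial sum accumulating as it flows down the column. The function name `pe_step` and the argument names are illustrative, not identifiers from the RTL.

```python
def pe_step(x_in, w_in, psum_in):
    """One clock cycle of a processing element (behavioural sketch).

    x_in:    activation arriving from the left neighbour
    w_in:    weight arriving from the neighbour above
    psum_in: partial sum arriving from the neighbour above
    """
    psum_out = psum_in + x_in * w_in  # the MAC: partial sums accumulate vertically
    x_out = x_in                      # activation passes horizontally (left to right)
    w_out = w_in                      # weight passes vertically (top to bottom)
    return x_out, w_out, psum_out
```

Chaining `pe_step` across a 2D grid of such elements reproduces the systolic array's matrix multiplication, one diagonal wavefront per cycle.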

Top-level module

The top-level TPU module connects all major components:
module tpu #(
    parameter int SYSTOLIC_ARRAY_WIDTH = 2
)(
    input logic clk,
    input logic rst,

    // Write ports from host to unified buffer
    input logic [15:0] ub_wr_host_data_in [0:SYSTOLIC_ARRAY_WIDTH-1],
    input logic ub_wr_host_valid_in [0:SYSTOLIC_ARRAY_WIDTH-1],

    // Read instruction inputs
    input logic ub_rd_start_in,
    input logic ub_rd_transpose,
    input logic [8:0] ub_ptr_select,
    input logic [15:0] ub_rd_addr_in,
    input logic [15:0] ub_rd_row_size,
    input logic [15:0] ub_rd_col_size,

    // Learning rate and VPU control
    input logic [15:0] learning_rate_in,
    input logic [3:0] vpu_data_pathway,
    input logic sys_switch_in,
    input logic [15:0] vpu_leak_factor_in,
    input logic [15:0] inv_batch_size_times_two_in
);
Source: tpu.sv:4-31

Data flow

The TPU follows a specific data flow pattern:

Forward pass

  1. Input loading: Matrices are loaded from the host into the unified buffer
  2. Systolic computation: Input and weight matrices flow through the systolic array
    • Inputs flow horizontally (left to right)
    • Weights flow vertically (top to bottom)
    • Partial sums accumulate vertically
  3. VPU processing: Results pass through the VPU pipeline:
    • Bias addition
    • Leaky ReLU activation
  4. Result storage: Outputs are written back to the unified buffer
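The four forward-pass steps can be summarized as a reference computation. This is a behavioural sketch in plain Python, not the RTL: the leak factor default of 0.01 is an assumption (the hardware takes it as the `vpu_leak_factor_in` port), and the helper names are illustrative.

```python
def leaky_relu(x, leak=0.01):
    """VPU activation stage (leak value is an assumed default)."""
    return x if x > 0 else leak * x

def matmul(a, b):
    """What the systolic array computes, expressed directly."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def forward_pass(inputs, weights, bias, leak=0.01):
    """Steps 2-3: systolic matmul, then VPU bias add + leaky ReLU."""
    pre_act = matmul(inputs, weights)                      # systolic array
    return [[leaky_relu(pre_act[i][j] + bias[j], leak)     # VPU pipeline
             for j in range(len(bias))]
            for i in range(len(pre_act))]
```

Steps 1 and 4 (loading from the host and writing results back) correspond to the unified buffer's write and read ports and have no arithmetic to model.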

Backward pass

  1. Loss computation: VPU computes loss derivatives
  2. Gradient computation: Systolic array computes weight and activation gradients
  3. Activation derivative: VPU applies activation function derivatives
  4. Parameter update: Gradient descent modules update weights and biases
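The four backward-pass stages can likewise be sketched for a single scalar weight. Assumptions in this sketch: the loss is mean squared error (suggested by the `inv_batch_size_times_two_in` port, which would supply the 2/N factor), the leak default is 0.01, and all names are illustrative rather than taken from the RTL.

```python
def leaky_relu_deriv(x, leak=0.01):
    """Step 3: derivative of the leaky ReLU activation."""
    return 1.0 if x > 0 else leak

def sgd_step(w, x, pre_act, pred, target, lr, inv_batch_times_two, leak=0.01):
    """One scalar weight update mirroring the four backward-pass stages."""
    dloss = inv_batch_times_two * (pred - target)    # 1. MSE loss derivative (assumed)
    delta = dloss * leaky_relu_deriv(pre_act, leak)  # 2-3. activation derivative
    grad_w = x * delta                               # gradient via a MAC
    return w - lr * grad_w                           # 4. gradient-descent update
```

In the hardware, the gradient MACs run on the systolic array while the derivative and loss stages run on the VPU.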

Key features

Fixed-point arithmetic

All computations use 16-bit fixed-point representation (Q8.8 format):
  • 8 bits for integer part
  • 8 bits for fractional part
  • Signed values using two’s complement
The fixed-point library in fixedpoint.sv provides modules for multiplication (fxp_mul), addition (fxp_add), and other arithmetic operations with overflow detection.
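Q8.8 arithmetic can be illustrated in software. The sketch below models the 16-bit two's-complement representation and the multiply rescaling (the 32-bit product of two Q8.8 values is a Q16.16 value, shifted right by 8 to return to Q8.8). It is a plain-Python illustration, not the `fxp_mul` RTL, and it omits the overflow detection the library provides.

```python
def to_q88(x):
    """Encode a real number as 16-bit two's-complement Q8.8."""
    return int(round(x * 256)) & 0xFFFF

def from_q88(v):
    """Decode 16-bit two's-complement Q8.8 back to a float."""
    if v & 0x8000:            # sign bit set: interpret as negative
        v -= 0x10000
    return v / 256.0

def q88_mul(a, b):
    """Multiply two Q8.8 values; the >> 8 rescales the product to Q8.8."""
    sa = a - 0x10000 if a & 0x8000 else a
    sb = b - 0x10000 if b & 0x8000 else b
    return ((sa * sb) >> 8) & 0xFFFF
```

For example, 1.5 encodes as 0x0180 (384), since 1.5 × 256 = 384.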

Pipelined architecture

The VPU implements a pipelined architecture where multiple modules can process different data simultaneously:
vpu_data_pathway[3:0]:
  0000: No modules active
  1100: Forward pass (bias → leaky relu)
  1111: Transition (bias → leaky relu → loss → leaky relu derivative)
  0001: Backward pass (leaky relu derivative only)
Source: vpu.sv:10-17
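Reading the three listed encodings together, each bit appears to gate one pipeline stage, most-significant bit first: bias, leaky ReLU, loss, leaky ReLU derivative. That bit assignment is an inference from the table above, not confirmed against `vpu.sv`; a sketch of the decode under that assumption:

```python
def vpu_stage_enables(pathway):
    """Decode the 4-bit vpu_data_pathway into per-stage enables.

    Bit order (MSB first) is an assumption inferred from the documented
    encodings: [bias, leaky_relu, loss, leaky_relu_derivative].
    """
    bits = f"{pathway:04b}"
    return {
        "bias": bits[0] == "1",
        "leaky_relu": bits[1] == "1",
        "loss": bits[2] == "1",
        "leaky_relu_derivative": bits[3] == "1",
    }
```

Under this decoding, `1100` enables exactly the forward-pass stages and `0001` exactly the backward-pass derivative, matching the table.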

Configurable dimensions

The systolic array width is configurable via the SYSTOLIC_ARRAY_WIDTH parameter:
  • Default: 2×2 array
  • Scalable to larger dimensions (e.g., 256×256, 512×512)
Larger array dimensions require modifications to the unified buffer size and interconnect logic.

Performance characteristics

Throughput

Each processing element performs one multiply-accumulate (MAC) operation per clock cycle:
  • 2×2 array: 4 MACs per cycle
  • Single-cycle operation for activated PEs

Memory bandwidth

The unified buffer provides:
  • Dual-port read/write capability
  • Staggered data delivery for systolic flow
  • Transpose support for efficient matrix operations
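Staggered delivery means row *i* of a matrix enters the array delayed by *i* cycles, so that matching operands meet in the correct PE on the correct clock. A small sketch of that skewing, with zeros standing in for idle cycles (a behavioural illustration, not the unified buffer's RTL):

```python
def stagger(matrix):
    """Skew a matrix for systolic delivery: row i is delayed i cycles.

    Returns a list of per-cycle input vectors; zeros model idle lanes.
    """
    rows, cols = len(matrix), len(matrix[0])
    total_cycles = rows + cols - 1          # last element of last row arrives here
    out = [[0] * rows for _ in range(total_cycles)]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            out[i + j][i] = v               # element j of row i arrives at cycle i+j
    return out
```

For a 2×2 matrix this yields three cycles of input, matching the diagonal wavefront the array consumes.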

Implementation details

Clock and reset

All modules use synchronous design:
  • Positive edge-triggered flip-flops
  • Asynchronous active-high reset

Data widths

Standardized 16-bit data paths throughout:
  • Input activations: 16 bits signed
  • Weights: 16 bits signed
  • Partial sums: 16 bits signed
  • Bias values: 16 bits signed

Next steps

Explore each component in detail:
  • Processing element - Learn about the PE multiply-accumulate unit
  • Systolic array - Understand the 2D PE grid architecture
  • Vector processing unit - Explore the VPU pipeline stages
  • Unified buffer - Discover the memory architecture
