
What is Tiny TPU?

Tiny TPU is a minimal tensor processing unit (TPU) designed from the ground up, drawing on Google's TPU V1 and V2 architectures. Built entirely in SystemVerilog, the project demonstrates how to design a hardware accelerator that performs neural network operations at the chip level. Unlike closed-source TPU architectures, Tiny TPU is completely open source and aims to be the ultimate guide for breaking into chip accelerator design, even if you just learned high school math and only know y = mx + b.

Quick start

Get your first TPU simulation running in minutes

Architecture

Explore the systolic array, VPU, and unified buffer design

Instruction set

Learn the 88-bit ISA that controls the TPU

Development

Add new modules and contribute to the project

Key features

Systolic array

A 2D grid of processing elements that perform multiply-accumulate operations every clock cycle with data flowing horizontally and partial sums vertically

Vector processing unit

Pipelined modules for bias addition, Leaky ReLU activation, MSE loss computation, and backpropagation

Unified buffer

Dual-port memory for storing input matrices, weights, biases, and intermediate activation values

Custom ISA

An 88-bit instruction set architecture for controlling data flow and computation stages
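To illustrate how a wide instruction word can be carved into fields, the sketch below packs and unpacks a hypothetical 88-bit word in Python. The field names and widths here are invented for the example; they are not Tiny TPU's actual encoding, only a demonstration of the bit-packing idea behind any fixed-width ISA.

```python
# Hypothetical 88-bit instruction layout (NOT Tiny TPU's real encoding).
# Fields are (name, width-in-bits), packed from the least-significant bit up.
FIELDS = [("opcode", 8), ("src_addr", 16), ("dst_addr", 16),
          ("rows", 8), ("cols", 8), ("flags", 32)]  # 88 bits total

def encode(values):
    """Pack a dict of field values into one 88-bit integer."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        assert 0 <= value < (1 << width), f"{name} out of range"
        word |= value << shift
        shift += width
    return word

def decode(word):
    """Unpack an 88-bit integer back into its named fields."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out

instr = encode({"opcode": 0x2A, "src_addr": 0x0100, "rows": 4, "cols": 4})
assert decode(instr)["opcode"] == 0x2A
```

The same pattern maps directly to SystemVerilog, where each field is simply a bit-slice of the 88-bit instruction bus.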

How it works

The Tiny TPU architecture consists of five main components working together:
  1. Control unit: reads 88-bit instructions that orchestrate data movement and computation across all subsystems
  2. Unified buffer: stores all matrices, weights, and intermediate values, with dual-port access for simultaneous reads and writes
  3. Systolic array: a 2D grid of processing elements (PEs), each performing multiply-accumulate operations; data flows horizontally, partial sums flow vertically, and weights remain stationary
  4. Vector processing unit: applies element-wise operations like bias addition and activation functions in a pipelined architecture
  5. Output: results are written back to the unified buffer or used for further computation stages
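The vector processing unit's element-wise stage can be sketched in a few lines of Python. This is a behavioral model only, not the pipelined SystemVerilog implementation, and the Leaky ReLU slope of 1/8 is an illustrative, hardware-friendly (power-of-two) assumption rather than Tiny TPU's documented value.

```python
def leaky_relu(x, alpha=0.125):
    # alpha = 1/8 is an assumed, power-of-two slope (cheap in hardware:
    # a 3-bit right shift); the real design's alpha may differ.
    return x if x >= 0 else alpha * x

def vpu_stage(partial_sums, biases, alpha=0.125):
    """Element-wise bias addition followed by Leaky ReLU activation,
    mirroring the VPU's post-processing of systolic-array outputs."""
    return [leaky_relu(p + b, alpha) for p, b in zip(partial_sums, biases)]

print(vpu_stage([2.0, -4.0], [1.0, 0.0]))  # [3.0, -0.5]
```

In the real pipeline each of these operations occupies its own stage, so a new element enters every clock cycle while earlier elements are still being processed.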

Processing element operation

At the heart of the systolic array is the processing element (PE), which performs the fundamental multiply-accumulate operation:
output_sum = (input_data × weight) + input_partial_sum
Each PE:
  • Multiplies incoming data by a stored weight
  • Adds the result to an incoming partial sum
  • Passes data horizontally to the next PE
  • Passes the computed sum vertically downward
The systolic array design allows for highly parallel matrix multiplication with minimal data movement, making it extremely efficient for neural network workloads.
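The PE behavior above can be modeled end to end in Python. This is a behavioral sketch of a weight-stationary array, not the SystemVerilog implementation: it applies the PE equation output_sum = (input_data × weight) + input_partial_sum cell by cell, with partial sums accumulating down each column exactly as they would flow vertically through the array.

```python
def matmul_systolic(A, W):
    """Behavioral model of a weight-stationary systolic array.

    Each position (i, j) acts as a PE holding the stationary weight
    W[i][j]; input data streams across rows and partial sums accumulate
    down columns, so the value leaving the bottom of column j for input
    row r equals (A @ W)[r][j].
    """
    rows_a, inner, cols_w = len(A), len(W), len(W[0])
    result = [[0] * cols_w for _ in range(rows_a)]
    for r in range(rows_a):            # one input row of A streamed at a time
        for j in range(cols_w):        # one column of PEs per output column
            partial_sum = 0
            for i in range(inner):     # partial sums flow vertically, PE to PE
                partial_sum = A[r][i] * W[i][j] + partial_sum  # the PE equation
            result[r][j] = partial_sum
    return result

A = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
assert matmul_systolic(A, W) == [[19, 22], [43, 50]]
```

Note that this model is sequential; the hardware's advantage is that all PEs evaluate the same equation simultaneously, one wavefront per clock cycle.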

Why Tiny TPU?

The details of TPU architecture are typically closed source, as is most chip design. This project was created by a dedicated group with no prior professional hardware architecture experience to:
  • Provide a complete, open-source reference for tensor processing unit design
  • Demonstrate how to approach complex hardware problems with an inventive mindset
  • Make chip accelerator design accessible to all levels of technical expertise
  • Serve as a learning resource for SystemVerilog, hardware testing, and digital design
This project uses real hardware description language (SystemVerilog), actual simulation tools (Icarus Verilog), and industry-standard testing frameworks (cocotb) — the same tools used in professional chip design.

What you’ll learn

By working through this documentation and exploring the source code, you’ll gain hands-on experience with:
  • Hardware description languages: Writing and understanding SystemVerilog modules
  • Digital design: Implementing multiply-accumulate units, memory buffers, and control logic
  • Systolic architectures: Understanding data flow in matrix multiplication accelerators
  • Verification: Testing hardware with Python-based testbenches using cocotb
  • Waveform analysis: Debugging digital circuits with GTKWave
  • Fixed-point arithmetic: Implementing neural network operations in hardware
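As a taste of the last item, here is the core trick of fixed-point multiplication modeled in Python: the raw product of two fixed-point values carries twice the fractional bits, so it must be shifted right to renormalize. The Q4.12 format (12 fractional bits) is an illustrative assumption; Tiny TPU's actual widths may differ.

```python
FRAC_BITS = 12  # illustrative Q4.12 format; the real design may use other widths

def to_fixed(x):
    """Quantize a real number to a Q4.12 integer."""
    return round(x * (1 << FRAC_BITS))

def fixed_mul(a, b):
    """Multiply two Q4.12 values. The raw product has 2 * FRAC_BITS
    fractional bits, so shift right by FRAC_BITS to renormalize."""
    return (a * b) >> FRAC_BITS

def to_float(x):
    """Convert a Q4.12 integer back to a real number."""
    return x / (1 << FRAC_BITS)

a, b = to_fixed(1.5), to_fixed(-0.25)
print(to_float(fixed_mul(a, b)))  # -0.375
```

The renormalizing shift truncates low-order bits, which is exactly the rounding-error trade-off a hardware MAC unit has to make.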

Next steps

Get started

Run your first TPU simulation and see matrix multiplication in action

Explore the architecture

Deep dive into the systolic array, VPU, and unified buffer designs
