
What is Tiny TPU?
Tiny TPU is a minimal tensor processing unit (TPU) designed from the ground up, drawing on Google's TPU v1 and v2 architectures. Built entirely in SystemVerilog, the project demonstrates how to design a hardware accelerator that performs neural network operations at the chip level. Unlike closed-source TPU architectures, Tiny TPU is completely open source and designed to be the ultimate guide for breaking into chip accelerator design, even if you just learned high school math and only know y = mx + b.
Quick start
Get your first TPU simulation running in minutes
Architecture
Explore the systolic array, VPU, and unified buffer design
Instruction set
Learn the 88-bit ISA that controls the TPU
Development
Add new modules and contribute to the project
Key features
Systolic array
A 2D grid of processing elements that performs multiply-accumulate operations every clock cycle, with data flowing horizontally and partial sums flowing vertically
Vector processing unit
Pipelined modules for bias addition, Leaky ReLU activation, MSE loss computation, and backpropagation
Unified buffer
Dual-port memory for storing input matrices, weights, biases, and intermediate activation values
Custom ISA
An 88-bit instruction set architecture for controlling data flow and computation stages
How it works
The Tiny TPU architecture consists of the following main components working together:
Control unit
Reads 88-bit instructions that orchestrate data movement and computation across all subsystems
Unified buffer
Stores all matrices, weights, and intermediate values with dual-port access for simultaneous reads and writes
Systolic array
A 2D grid of processing elements (PEs) where each PE performs multiply-accumulate operations. Data flows horizontally, partial sums flow vertically, and weights remain stationary
Vector processing unit
Applies element-wise operations like bias addition and activation functions in a pipelined architecture
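As a rough sketch in Python (the language the project's cocotb testbenches are written in), the VPU's element-wise stages can be modeled as bias addition followed by activation. The function names and the Leaky ReLU slope `alpha` are illustrative assumptions, not the project's actual parameters:

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: pass positives through, scale negatives by alpha.
    The slope 0.01 is an assumed default, not Tiny TPU's exact value."""
    return x if x >= 0 else alpha * x

def vpu_stage(row, bias, alpha=0.01):
    """Model of the VPU's element-wise path: add a bias to each value,
    then apply the activation. In hardware these would be successive
    pipeline stages; here they are applied in sequence per element."""
    return [leaky_relu(v + b, alpha) for v, b in zip(row, bias)]
```

In the real pipeline each stage would register its result and accept a new element every clock cycle; this functional model only captures what is computed, not when.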
Processing element operation
At the heart of the systolic array is the processing element (PE), which performs the fundamental multiply-accumulate operation:
- Multiplies incoming data by a stored weight
- Adds the result to an incoming partial sum
- Passes data horizontally to the next PE
- Passes the computed sum vertically downward
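The four steps above can be sketched as a single clocked update in Python. The names are illustrative; the real PE is a SystemVerilog module:

```python
def pe_step(weight, a_in, psum_in):
    """One clock cycle of a single processing element.
    weight stays resident in the PE; a_in arrives from the left,
    psum_in arrives from above. Returns (a_out, psum_out):
    a_out is forwarded to the PE on the right, psum_out to the PE below."""
    psum_out = psum_in + a_in * weight  # multiply-accumulate
    a_out = a_in                        # data passes through unchanged
    return a_out, psum_out
```
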
The systolic array design allows for highly parallel matrix multiplication with minimal data movement, making it extremely efficient for neural network workloads.
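Under these data-flow assumptions (weights held in place, each activation row skewed by one cycle as it enters from the left, partial sums marching down each column), a small cycle-level Python model of the whole array might look like the following. It is a sketch for intuition, not the project's RTL, and the skew schedule is an assumption about one common weight-stationary arrangement:

```python
def systolic_matmul(A, W):
    """Cycle-level model of a weight-stationary systolic array.
    PE[i][j] permanently holds W[i][j]. Activations flow left-to-right,
    partial sums flow top-to-bottom; results emerge at the bottom row.
    A is m x n, W is n x p, so the PE grid is n rows by p columns."""
    m, n, p = len(A), len(W), len(W[0])
    a_reg = [[0] * p for _ in range(n)]   # activation register in each PE
    s_reg = [[0] * p for _ in range(n)]   # partial-sum register in each PE
    out = [[0] * p for _ in range(m)]
    for t in range(m + n + p - 2):
        # Snapshot last cycle's registers to model clocked flip-flops.
        a_old = [row[:] for row in a_reg]
        s_old = [row[:] for row in s_reg]
        for i in range(n):
            for j in range(p):
                # Skewed feed: row i receives A[t - i][i] at the left edge,
                # so contributions to one output row align down each column.
                r = t - i
                if j == 0:
                    a_in = A[r][i] if 0 <= r < m else 0
                else:
                    a_in = a_old[i][j - 1]
                s_in = s_old[i - 1][j] if i > 0 else 0
                a_reg[i][j] = a_in                    # pass data rightward
                s_reg[i][j] = s_in + a_in * W[i][j]   # MAC, pass sum downward
        # The bottom of column j now holds result row t - (n - 1) - j.
        for j in range(p):
            r = t - (n - 1) - j
            if 0 <= r < m:
                out[r][j] = s_reg[n - 1][j]
    return out
```

Note that each element of A is loaded exactly once and then reused as it streams across a whole row of PEs, which is the "minimal data movement" property described above.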
Why Tiny TPU?
The details of TPU architecture are typically closed source, as is most chip design. This project was created by a dedicated group with no prior professional hardware architecture experience to:
- Provide a complete, open-source reference for tensor processing unit design
- Demonstrate how to approach complex hardware problems with an inventive mindset
- Make chip accelerator design accessible to all levels of technical expertise
- Serve as a learning resource for SystemVerilog, hardware testing, and digital design
This project uses a real hardware description language (SystemVerilog), an actual simulation tool (Icarus Verilog), and an industry-standard testing framework (cocotb) — the same tools used in professional chip design.
What you’ll learn
By working through this documentation and exploring the source code, you'll gain hands-on experience with:
- Hardware description languages: Writing and understanding SystemVerilog modules
- Digital design: Implementing multiply-accumulate units, memory buffers, and control logic
- Systolic architectures: Understanding data flow in matrix multiplication accelerators
- Verification: Testing hardware with Python-based testbenches using cocotb
- Waveform analysis: Debugging digital circuits with GTKWave
- Fixed-point arithmetic: Implementing neural network operations in hardware
Next steps
Get started
Run your first TPU simulation and see matrix multiplication in action
Explore the architecture
Deep dive into the systolic array, VPU, and unified buffer designs
