Architecture
The loss module follows a parent-child structure:

- loss_parent: Top-level module instantiating two loss_child modules
- loss_child: Individual processing unit computing the gradient for one output column
Module ports
loss_parent
- System clock signal
- Active-high reset signal
- Network output (prediction) for column 1
- Network output (prediction) for column 2
- Target value (ground truth) for column 1 from the unified buffer
- Target value (ground truth) for column 2 from the unified buffer
- Valid signal for column 1 inputs
- Valid signal for column 2 inputs
- Precomputed scaling factor (2/N) as a fixed-point value from the unified buffer
- Computed gradient for column 1
- Computed gradient for column 2
- Valid signal for column 1 output
- Valid signal for column 2 output
loss_child
- System clock signal
- Active-high reset signal
- Network output (prediction)
- Target value (ground truth)
- Input valid signal
- Scaling factor (2/N) as a fixed-point value
- Computed gradient
- Output valid signal
Loss function
The mean squared error loss is defined as:

L = (1/N) × Σ (H - Y)²

where:
- N is the batch size
- H is the network output (prediction)
- Y is the target value

Differentiating with respect to each prediction gives ∂L/∂H = (2/N) × (H - Y), which is the gradient the hardware computes.
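The loss and its gradient can be sketched as a floating-point reference model. This is an illustrative golden model, not code from the repository; the function names are invented for this sketch.

```python
def mse_loss(h, y):
    """L = (1/N) * sum((H - Y)^2) over a batch of N predictions."""
    n = len(h)
    return sum((hi - yi) ** 2 for hi, yi in zip(h, y)) / n

def mse_gradient(h, y):
    """dL/dH_i = (2/N) * (H_i - Y_i), the value each loss_child computes."""
    n = len(h)
    return [(2 / n) * (hi - yi) for hi, yi in zip(h, y)]
```

A model like this is useful as a testbench reference against the fixed-point RTL outputs.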
Operation
The loss_child module implements a two-stage pipeline for MSE gradient computation:

Pipeline stages

- Stage 1 - Difference computation: Compute diff = H - Y using fxp_addsub in subtraction mode. This represents the prediction error.
- Stage 2 - Scaling: Multiply the difference by 2/N using fxp_mul, producing the final gradient: gradient = (2/N) × (H - Y)
- Registered output: On the clock edge, register the gradient and propagate the valid signal.
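The two stages can be modeled in Python as a cycle-free golden model. The Q8.8 word width and the saturating/truncating behavior here are assumptions for illustration; fxp_sub and fxp_mul are simplified stand-ins for the RTL's fxp_addsub and fxp_mul, with operands held as Python signed integers rather than raw two's-complement words.

```python
FRAC_BITS = 8                       # Q8.8: 8 fractional bits
MIN16, MAX16 = -(1 << 15), (1 << 15) - 1

def saturate(v):
    """Clamp to the signed 16-bit range (assumed overflow policy)."""
    return max(MIN16, min(MAX16, v))

def fxp_sub(a, b):
    # Stage 1: diff = H - Y (the prediction error)
    return saturate(a - b)

def fxp_mul(a, b):
    # Stage 2: multiply, then re-position the binary point (>> 8 for Q8.8)
    return saturate((a * b) >> FRAC_BITS)

def loss_child_gradient(h, y, scale):
    # gradient = (2/N) * (H - Y), all operands in Q8.8
    return fxp_mul(fxp_sub(h, y), scale)
```

For example, H = 3.0 (0x0300), Y = 1.0 (0x0100), and 2/N = 0.5 (0x0080) yield a gradient of 1.0 (0x0100).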
Fixed-point arithmetic
The module uses 16-bit signed fixed-point (Q8.8 format) for all operations:

- Subtraction: fxp_addsub with sub=1 computes H - Y. See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:186 for the implementation. Handles sign extension and overflow detection.
- Multiplication: fxp_mul scales the difference by 2/N. See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:278 for the implementation. Properly positions the binary point after multiplication.
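For working with Q8.8 values in a testbench, the encoding can be sketched as follows. This is a generic Q8.8 helper, not taken from fixedpoint.sv; it assumes round-to-nearest on encode and a representable range of -128.0 to +127.996.

```python
def to_q88(x):
    """Encode a float as a 16-bit two's-complement Q8.8 word."""
    raw = round(x * 256)            # shift the binary point left by 8 bits
    if not -(1 << 15) <= raw <= (1 << 15) - 1:
        raise OverflowError(f"{x} is outside the Q8.8 range")
    return raw & 0xFFFF             # 16-bit two's-complement bit pattern

def from_q88(raw):
    """Decode a 16-bit Q8.8 word back to a float."""
    if raw & 0x8000:                # sign-extend the 16-bit pattern
        raw -= 1 << 16
    return raw / 256
```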
Precomputed scaling factor
The factor 2/N is:
- Precomputed by the host and stored in unified buffer
- Represented as fixed-point (e.g., for N=4, 2/N = 0.5 = 0x0080 in Q8.8)
- Shared across all gradient computations in the batch
- Avoids expensive division operations in hardware
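The host-side precomputation can be sketched in a few lines; this is an assumed host routine (the function name is illustrative), matching the N=4 example above.

```python
def scale_factor_q88(batch_size):
    """Compute 2/N as a Q8.8 word on the host, avoiding hardware division."""
    return round((2 / batch_size) * 256) & 0xFFFF

# For N=4: 2/4 = 0.5, which encodes to 0x0080 in Q8.8.
```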
Integration with VPU
The loss module is active only during the transition pathway:

- Pathway 1111 (transition): systolic → bias → leaky_relu → loss → leaky_relu_derivative → output

When vpu_data_pathway[1] is set to 1:

- Leaky ReLU outputs (H matrix) route to the loss module's H inputs
- Target values (Y) are provided from the unified buffer
- Scaling factor (2/N) is provided from the unified buffer
- Loss gradients route to the leaky ReLU derivative module
- H values are simultaneously cached for use in the backward pass

See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/vpu.sv:268-302 for the loss routing logic.
Data flow
Implementation details
- Latency: 1 clock cycle (pipelined combinational logic with registered output)
- Throughput: 2 gradients per cycle
- Overflow handling: Both subtraction and multiplication detect overflow
- Reset behavior: Gradient output and valid signals cleared to zero
- Valid signal propagation: Input valid signal is registered and becomes output valid signal
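The timing behavior above can be sketched as a minimal cycle-level model: the gradient and valid flag computed in one cycle appear on the registered outputs in the next, giving the stated 1-cycle latency. The class and method names are invented for this sketch, and the truncating Q8.8 arithmetic is an assumption.

```python
class LossChildModel:
    """Cycle-level sketch of one loss_child: registered output, 1-cycle latency."""

    def __init__(self):
        self.grad_out = 0
        self.valid_out = False

    def reset(self):
        # Reset clears the gradient output and valid signal to zero
        self.grad_out, self.valid_out = 0, False

    def clock(self, h, y, scale, valid_in):
        # Combinational: difference then scaling (Q8.8, truncating model)
        grad = ((h - y) * scale) >> 8
        # Registered: outputs update on the clock edge; valid propagates
        self.grad_out, self.valid_out = grad, valid_in
```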
Transition phase operation
The loss module is critical during the transition between forward and backward passes:

- Final forward layer: Produces output H (predictions)
- Loss computation: Computes gradients comparing H to targets Y
- Backward pass start: Gradients seed the backpropagation process
- H caching: Output layer activations are cached for computing activation derivatives
Source files
- Parent module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/loss_parent.sv
- Child module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/loss_child.sv