Architecture
The loss module follows a parent-child structure:

- loss_parent: Top-level module instantiating two loss_child modules
- loss_child: Individual processing unit computing the gradient for one output column
Module ports
loss_parent
- System clock signal
- Active-high reset signal
- Network output (prediction) for column 1
- Network output (prediction) for column 2
- Target value (ground truth) for column 1 from the unified buffer
- Target value (ground truth) for column 2 from the unified buffer
- Valid signal for column 1 inputs
- Valid signal for column 2 inputs
- Precomputed scaling factor (2/N) as a fixed-point value from the unified buffer
- Computed gradient for column 1
- Computed gradient for column 2
- Valid signal for column 1 output
- Valid signal for column 2 output
loss_child
- System clock signal
- Active-high reset signal
- Network output (prediction)
- Target value (ground truth)
- Input valid signal
- Scaling factor (2/N) as a fixed-point value
- Computed gradient
- Output valid signal
Loss function
The mean squared error loss is defined as:

L = (1/N) × Σ (H - Y)²

where:
- N is the batch size
- H is the network output (prediction)
- Y is the target value

Differentiating with respect to each prediction gives ∂L/∂H = (2/N) × (H - Y), which is the gradient the hardware computes.
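The loss and its gradient can be sketched as a floating-point reference model. This is an illustrative golden model, not code from the repository; the function names are invented for this sketch.

```python
def mse_loss(h, y):
    """L = (1/N) * sum((H - Y)^2) over a batch of N predictions."""
    n = len(h)
    return sum((hi - yi) ** 2 for hi, yi in zip(h, y)) / n

def mse_gradient(h, y):
    """dL/dH_i = (2/N) * (H_i - Y_i), the value each loss_child computes."""
    n = len(h)
    return [(2 / n) * (hi - yi) for hi, yi in zip(h, y)]
```

A model like this is useful as a testbench reference against the fixed-point RTL outputs.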
Operation
The loss_child module implements a two-stage pipeline for MSE gradient computation:

Pipeline stages

- Stage 1 - Difference computation: Compute diff = H - Y using fxp_addsub in subtraction mode. This represents the prediction error.
- Stage 2 - Scaling: Multiply the difference by 2/N using fxp_mul, producing the final gradient: gradient = (2/N) × (H - Y)
- Registered output: On the clock edge, register the gradient and propagate the valid signal.
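The two stages can be modeled in Python as a cycle-free golden model. The Q8.8 word width and the saturating/truncating behavior here are assumptions for illustration; fxp_sub and fxp_mul are simplified stand-ins for the RTL's fxp_addsub and fxp_mul, with operands held as Python signed integers rather than raw two's-complement words.

```python
FRAC_BITS = 8                       # Q8.8: 8 fractional bits
MIN16, MAX16 = -(1 << 15), (1 << 15) - 1

def saturate(v):
    """Clamp to the signed 16-bit range (assumed overflow policy)."""
    return max(MIN16, min(MAX16, v))

def fxp_sub(a, b):
    # Stage 1: diff = H - Y (the prediction error)
    return saturate(a - b)

def fxp_mul(a, b):
    # Stage 2: multiply, then re-position the binary point (>> 8 for Q8.8)
    return saturate((a * b) >> FRAC_BITS)

def loss_child_gradient(h, y, scale):
    # gradient = (2/N) * (H - Y), all operands in Q8.8
    return fxp_mul(fxp_sub(h, y), scale)
```

For example, H = 3.0 (0x0300), Y = 1.0 (0x0100), and 2/N = 0.5 (0x0080) yield a gradient of 1.0 (0x0100).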
Fixed-point arithmetic
The module uses 16-bit signed fixed-point (Q8.8 format) for all operations:

- Subtraction: fxp_addsub with sub=1 computes H - Y. See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:186 for the implementation. Handles sign extension and overflow detection.
- Multiplication: fxp_mul scales the difference by 2/N. See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:278 for the implementation. Properly positions the binary point after multiplication.
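For working with Q8.8 values in a testbench, the encoding can be sketched as follows. This is a generic Q8.8 helper, not taken from fixedpoint.sv; it assumes round-to-nearest on encode and a representable range of -128.0 to +127.996.

```python
def to_q88(x):
    """Encode a float as a 16-bit two's-complement Q8.8 word."""
    raw = round(x * 256)            # shift the binary point left by 8 bits
    if not -(1 << 15) <= raw <= (1 << 15) - 1:
        raise OverflowError(f"{x} is outside the Q8.8 range")
    return raw & 0xFFFF             # 16-bit two's-complement bit pattern

def from_q88(raw):
    """Decode a 16-bit Q8.8 word back to a float."""
    if raw & 0x8000:                # sign-extend the 16-bit pattern
        raw -= 1 << 16
    return raw / 256
```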
Precomputed scaling factor
The factor 2/N is:
- Precomputed by the host and stored in unified buffer
- Represented as fixed-point (e.g., for N=4, 2/N = 0.5 = 0x0080 in Q8.8)
- Shared across all gradient computations in the batch
- Avoids expensive division operations in hardware
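The host-side precomputation can be sketched in a few lines; this is an assumed host routine (the function name is illustrative), matching the N=4 example above.

```python
def scale_factor_q88(batch_size):
    """Compute 2/N as a Q8.8 word on the host, avoiding hardware division."""
    return round((2 / batch_size) * 256) & 0xFFFF

# For N=4: 2/4 = 0.5, which encodes to 0x0080 in Q8.8.
```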
Integration with VPU
The loss module is active only during the transition pathway:

- Pathway 1111 (transition): systolic → bias → leaky_relu → loss → leaky_relu_derivative → output

When vpu_data_pathway[1] is set to 1:

- Leaky ReLU outputs (H matrix) route to the loss module's H inputs
- Target values (Y) are provided from the unified buffer
- Scaling factor (2/N) is provided from the unified buffer
- Loss gradients route to the leaky ReLU derivative module
- H values are simultaneously cached for use in the backward pass

See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/vpu.sv:268-302 for the loss routing logic.
Data flow
Implementation details
- Latency: 1 clock cycle (pipelined combinational logic with registered output)
- Throughput: 2 gradients per cycle
- Overflow handling: Both subtraction and multiplication detect overflow
- Reset behavior: Gradient output and valid signals cleared to zero
- Valid signal propagation: Input valid signal is registered and becomes output valid signal
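The timing behavior above can be sketched as a minimal cycle-level model: the gradient and valid flag computed in one cycle appear on the registered outputs in the next, giving the stated 1-cycle latency. The class and method names are invented for this sketch, and the truncating Q8.8 arithmetic is an assumption.

```python
class LossChildModel:
    """Cycle-level sketch of one loss_child: registered output, 1-cycle latency."""

    def __init__(self):
        self.grad_out = 0
        self.valid_out = False

    def reset(self):
        # Reset clears the gradient output and valid signal to zero
        self.grad_out, self.valid_out = 0, False

    def clock(self, h, y, scale, valid_in):
        # Combinational: difference then scaling (Q8.8, truncating model)
        grad = ((h - y) * scale) >> 8
        # Registered: outputs update on the clock edge; valid propagates
        self.grad_out, self.valid_out = grad, valid_in
```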
Transition phase operation
The loss module is critical during the transition between forward and backward passes:

- Final forward layer: Produces output H (predictions)
- Loss computation: Computes gradients comparing H to targets Y
- Backward pass start: Gradients seed the backpropagation process
- H caching: Output layer activations are cached for computing activation derivatives
Source files
- Parent module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/loss_parent.sv
- Child module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/loss_child.sv