Architecture
The module follows the standard parent-child hierarchy:
- leaky_relu_derivative_parent: Top-level module instantiating two child modules
- leaky_relu_derivative_child: Processing unit computing the derivative for one column
Module ports
leaky_relu_derivative_parent
- System clock signal
- Active-high reset signal
- Leak factor (α) used in forward pass, shared across both columns
- Valid signal for column 1 input
- Valid signal for column 2 input
- Upstream gradient for column 1
- Upstream gradient for column 2
- Cached forward pass activation (H) for column 1
- Cached forward pass activation (H) for column 2
- Computed gradient for column 1
- Computed gradient for column 2
- Valid signal for column 1 output
- Valid signal for column 2 output
leaky_relu_derivative_child
- System clock signal
- Active-high reset signal
- Input valid signal
- Upstream gradient (∂L/∂H)
- Leak factor (α)
- Forward pass activation value (H) for determining derivative
- Output gradient (∂L/∂Z)
- Output valid signal
Derivative function
The derivative of leaky ReLU applies the chain rule:

∂L/∂Z = ∂L/∂H × f'(Z), where f'(Z) = 1 if Z >= 0, else α

- ∂L/∂H is the upstream gradient (from the next layer)
- f'(Z) is the activation derivative
- ∂L/∂Z is the gradient to propagate to the previous layer
Operation
Algorithm
The derivative module determines the activation derivative based on the sign of the cached forward pass activation (H):
- Check forward pass value: Examine the sign of lr_d_H_data_in
- Conditional gradient computation:
  - If H >= 0: Derivative is 1, pass gradient through unchanged: output = input
  - If H < 0: Derivative is α, scale gradient: output = input × α
- Register output: On clock edge, output the computed gradient with valid signal
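The branch logic above can be sketched as a small software model. This is an illustrative Python version, not the RTL: values are floats here, while the hardware operates on Q8.8 fixed-point values.

```python
def leaky_relu_derivative(upstream_grad: float, h: float, leak_factor: float) -> float:
    """Compute dL/dZ from the upstream gradient dL/dH and the cached
    forward-pass activation H, mirroring the branch structure above."""
    if h >= 0:
        # Derivative is 1: pass the gradient through unchanged.
        return upstream_grad
    # Derivative is the leak factor: scale the gradient.
    return upstream_grad * leak_factor
```

The hardware performs the same selection combinationally and latches the result on the next clock edge.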
Pipeline stages
- Sign detection: Check if cached activation H is non-negative (combinational)
- Conditional computation:
  - Non-negative path: Direct assignment (no operation)
  - Negative path: Fixed-point multiply using fxp_mul
- Registered output: Result and valid signal latched on clock edge
Why use H instead of Z?
The module uses the activated value H rather than the pre-activation Z to determine the derivative:
- For standard leaky ReLU: sign(H) = sign(Z), so either works
- Using H is convenient because it’s already available from the forward pass
- H values are cached in the VPU during the transition pathway
- This avoids needing to cache additional pre-activation values
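The sign(H) = sign(Z) claim is easy to verify numerically. The snippet below uses an illustrative Python leaky ReLU (not the RTL) with a positive leak factor:

```python
def leaky_relu(z: float, alpha: float = 0.1) -> float:
    """Leaky ReLU forward pass: H = Z for Z >= 0, alpha * Z otherwise."""
    return z if z >= 0 else alpha * z

# For any alpha > 0, scaling a negative Z by alpha keeps it negative,
# so H and Z always land on the same side of zero.
for z in [-3.0, -0.5, 0.0, 0.25, 7.0]:
    h = leaky_relu(z)
    assert (h >= 0) == (z >= 0)
```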
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/leaky_relu_derivative_child.sv:31 for the implementation.
Fixed-point arithmetic
The module uses 16-bit signed fixed-point (Q8.8 format):
- Multiplication: When H < 0, fxp_mul computes gradient × leak_factor
  - Handles binary point alignment
  - Detects overflow conditions
  - See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:278
- Pass-through: When H >= 0, gradient passes unchanged (derivative = 1)
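A software sketch of the Q8.8 arithmetic described above. The `>>8` realignment with truncation is an assumption about fxp_mul's behavior; the actual binary-point and overflow handling lives in fixedpoint.sv.

```python
def to_q8_8(x: float) -> int:
    """Encode a float as a 16-bit two's-complement Q8.8 value (truncating)."""
    return int(x * 256) & 0xFFFF

def from_q8_8(v: int) -> float:
    """Decode a 16-bit two's-complement Q8.8 value to a float."""
    if v & 0x8000:          # sign bit set: negative value
        v -= 0x10000
    return v / 256.0

def q8_8_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 values; >>8 realigns the binary point.
    Overflow detection is omitted in this sketch."""
    sa = a - 0x10000 if a & 0x8000 else a
    sb = b - 0x10000 if b & 0x8000 else b
    return ((sa * sb) >> 8) & 0xFFFF
```

For example, `q8_8_mul(to_q8_8(0.5), to_q8_8(0.1))` yields 0x000C, matching the worked example later in this page.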
Integration with VPU
The leaky ReLU derivative module is active during the transition and backward pass pathways:
- Pathway 1111 (transition): systolic → bias → leaky_relu → loss → leaky_relu_derivative → output
- Pathway 0001 (backward): systolic → leaky_relu_derivative → output

In both pathways the derivative stage is enabled because vpu_data_pathway[0] is set to 1:
Transition pathway (1111)
- Loss module gradients route to derivative inputs
- Cached H values (from leaky ReLU forward pass) route to H inputs
- Leak factor provided from unified buffer
- Outputs route to final VPU output (back to unified buffer)
Backward pathway (0001)
- Systolic array outputs (upstream gradients) route to derivative inputs
- H values provided from unified buffer (pre-cached from forward pass)
- Leak factor provided from unified buffer
- Outputs route to final VPU output for further backpropagation
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/vpu.sv:304-328 for the derivative routing logic.
Data flow
- Transition phase: loss gradients and cached H values feed the derivative module; results return to the unified buffer.
- Backward phase: systolic array gradients and buffer-stored H values feed the derivative module; results continue backpropagation.
H value caching
The VPU includes special logic for caching H values:
- During transition pathway (1111): H values from leaky ReLU are cached in internal registers
- Cache update: See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/vpu.sv:282-285
- Cache usage: Cached values route to the derivative module during transition
- For subsequent backward passes: H values are loaded from the unified buffer (pre-stored during forward pass)
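The caching behavior can be modeled roughly as follows. This is a behavioral sketch under the assumptions above; the class, method, and signal names are illustrative, not the RTL names.

```python
TRANSITION = 0b1111  # pathway code for the transition phase
BACKWARD = 0b0001    # pathway code for the backward phase

class HCache:
    """Per-column H registers, written only during the transition pathway."""

    def __init__(self, num_cols: int = 2):
        self.h = [0.0] * num_cols

    def select(self, pathway: int, h_from_relu: list, h_from_buffer: list) -> list:
        """Return the H values routed to the derivative module this cycle."""
        if pathway == TRANSITION:
            self.h = list(h_from_relu)      # cache forward-pass activations
            return self.h                   # cached values feed the derivative module
        if pathway == BACKWARD:
            return list(h_from_buffer)      # backward pass reads the unified buffer
        return self.h                       # otherwise hold the cached values
```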
Implementation details
- Latency: 1 clock cycle (registered output)
- Throughput: 2 gradients per cycle
- Sign check: Uses MSB of H value (sign bit)
- Multiplication: Only performed for negative activations
- Reset behavior: Outputs and valid signals cleared to zero
- Valid signal: Propagated from input to output with one cycle delay
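These details can be captured in a small cycle-level model (a behavioral sketch for one column, not a translation of the RTL; floats stand in for Q8.8 values):

```python
class DerivativeChildModel:
    """One column: 1-cycle latency, valid propagated with the data,
    outputs cleared on reset."""

    def __init__(self, leak_factor: float):
        self.leak_factor = leak_factor
        self.grad_out = 0.0
        self.valid_out = False

    def clock(self, reset: bool, valid_in: bool, grad_in: float, h_in: float):
        """Advance one clock edge."""
        if reset:
            self.grad_out = 0.0
            self.valid_out = False
            return
        # Sign check on H selects pass-through or the scaled path;
        # the result and valid flag are latched on the edge.
        self.grad_out = grad_in if h_in >= 0 else grad_in * self.leak_factor
        self.valid_out = valid_in
```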
Gradient flow example
Consider a batch element where:
- Upstream gradient: ∂L/∂H = 0.5 (0x0080 in Q8.8)
- Cached activation: H = -0.2 (0xFFCD in Q8.8)
- Leak factor: α = 0.1 (0x0019 in Q8.8)

The computation proceeds as:
- Check H: H < 0, so use the scaled path
- Multiply: 0.5 × 0.1 = 0.05
- Output: ∂L/∂Z ≈ 0.05 (0x000C in Q8.8)
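Checking the raw Q8.8 arithmetic for this example (assuming a truncating multiply) also shows the quantization error: 0x000C decodes to 0.046875, slightly under the ideal 0.05, because 0x0019 encodes ≈0.0977 rather than 0.1 and the binary-point shift truncates.

```python
grad = 0x0080    # dL/dH = 0.5 in Q8.8
alpha = 0x0019   # leak factor: 25/256 ~= 0.0977, nearest encoding below 0.1

# Multiply the raw values, then shift right 8 bits to realign the
# binary point (truncating), as described in the fixed-point section.
out = ((grad * alpha) >> 8) & 0xFFFF
assert out == 0x000C
print(out / 256)  # 0.046875, the quantized result near 0.05
```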
Source files
- Parent module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/leaky_relu_derivative_parent.sv
- Child module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/leaky_relu_derivative_child.sv