Architecture
The module follows the standard parent-child hierarchy:
- leaky_relu_derivative_parent: Top-level module instantiating two child modules
- leaky_relu_derivative_child: Processing unit computing the derivative for one column
Module ports
leaky_relu_derivative_parent
- System clock signal
- Active-high reset signal
- Leak factor (α) used in forward pass, shared across both columns
- Valid signal for column 1 input
- Valid signal for column 2 input
- Upstream gradient for column 1
- Upstream gradient for column 2
- Cached forward pass activation (H) for column 1
- Cached forward pass activation (H) for column 2
- Computed gradient for column 1
- Computed gradient for column 2
- Valid signal for column 1 output
- Valid signal for column 2 output
leaky_relu_derivative_child
- System clock signal
- Active-high reset signal
- Input valid signal
- Upstream gradient (∂L/∂H)
- Leak factor (α)
- Forward pass activation value (H) for determining derivative
- Output gradient (∂L/∂Z)
- Output valid signal
Derivative function
The derivative of leaky ReLU applies the chain rule:

∂L/∂Z = ∂L/∂H × f'(Z), where f'(Z) = 1 if Z >= 0, else α

- ∂L/∂H is the upstream gradient (from the next layer)
- f'(Z) is the activation derivative
- ∂L/∂Z is the gradient to propagate to the previous layer
Operation
Algorithm
The derivative module determines the activation derivative based on the sign of the cached forward pass activation (H):
- Check forward pass value: Examine the sign of lr_d_H_data_in
- Conditional gradient computation:
  - If H >= 0: Derivative is 1, pass gradient through unchanged: output = input
  - If H < 0: Derivative is α, scale gradient: output = input × α
- Register output: On clock edge, output the computed gradient with valid signal
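The branch logic above can be sketched as a small software model. This is an illustrative Python version, not the RTL: values are floats here, while the hardware operates on Q8.8 fixed-point values.

```python
def leaky_relu_derivative(upstream_grad: float, h: float, leak_factor: float) -> float:
    """Compute dL/dZ from the upstream gradient dL/dH and the cached
    forward-pass activation H, mirroring the branch structure above."""
    if h >= 0:
        # Derivative is 1: pass the gradient through unchanged.
        return upstream_grad
    # Derivative is the leak factor: scale the gradient.
    return upstream_grad * leak_factor
```

The hardware performs the same selection combinationally and latches the result on the next clock edge.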
Pipeline stages
- Sign detection: Check if cached activation H is non-negative (combinational)
- Conditional computation:
  - Non-negative path: Direct assignment (no operation)
  - Negative path: Fixed-point multiply using fxp_mul
- Registered output: Result and valid signal latched on clock edge
Why use H instead of Z?
The module uses the activated value H rather than the pre-activation Z to determine the derivative:
- For standard leaky ReLU: sign(H) = sign(Z), so either works
- Using H is convenient because it’s already available from the forward pass
- H values are cached in the VPU during the transition pathway
- This avoids needing to cache additional pre-activation values
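The sign(H) = sign(Z) claim is easy to verify numerically. The snippet below uses an illustrative Python leaky ReLU (not the RTL) with a positive leak factor:

```python
def leaky_relu(z: float, alpha: float = 0.1) -> float:
    """Leaky ReLU forward pass: H = Z for Z >= 0, alpha * Z otherwise."""
    return z if z >= 0 else alpha * z

# For any alpha > 0, scaling a negative Z by alpha keeps it negative,
# so H and Z always land on the same side of zero.
for z in [-3.0, -0.5, 0.0, 0.25, 7.0]:
    h = leaky_relu(z)
    assert (h >= 0) == (z >= 0)
```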
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/leaky_relu_derivative_child.sv:31 for the implementation.
Fixed-point arithmetic
The module uses 16-bit signed fixed-point (Q8.8 format):
- Multiplication: When H < 0, fxp_mul computes gradient × leak_factor
  - Handles binary point alignment
  - Detects overflow conditions
  - See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:278
- Pass-through: When H >= 0, gradient passes unchanged (derivative = 1)
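A software sketch of the Q8.8 arithmetic described above. The `>>8` realignment with truncation is an assumption about fxp_mul's behavior; the actual binary-point and overflow handling lives in fixedpoint.sv.

```python
def to_q8_8(x: float) -> int:
    """Encode a float as a 16-bit two's-complement Q8.8 value (truncating)."""
    return int(x * 256) & 0xFFFF

def from_q8_8(v: int) -> float:
    """Decode a 16-bit two's-complement Q8.8 value to a float."""
    if v & 0x8000:          # sign bit set: negative value
        v -= 0x10000
    return v / 256.0

def q8_8_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 values; >>8 realigns the binary point.
    Overflow detection is omitted in this sketch."""
    sa = a - 0x10000 if a & 0x8000 else a
    sb = b - 0x10000 if b & 0x8000 else b
    return ((sa * sb) >> 8) & 0xFFFF
```

For example, `q8_8_mul(to_q8_8(0.5), to_q8_8(0.1))` yields 0x000C, matching the worked example later in this page.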
Integration with VPU
The leaky ReLU derivative module is active during the transition and backward pass pathways:
- Pathway 1111 (transition): systolic → bias → leaky_relu → loss → leaky_relu_derivative → output
- Pathway 0001 (backward): systolic → leaky_relu_derivative → output

In both pathways the derivative stage is enabled because vpu_data_pathway[0] is set to 1:
Transition pathway (1111)
- Loss module gradients route to derivative inputs
- Cached H values (from leaky ReLU forward pass) route to H inputs
- Leak factor provided from unified buffer
- Outputs route to final VPU output (back to unified buffer)
Backward pathway (0001)
- Systolic array outputs (upstream gradients) route to derivative inputs
- H values provided from unified buffer (pre-cached from forward pass)
- Leak factor provided from unified buffer
- Outputs route to final VPU output for further backpropagation
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/vpu.sv:304-328 for the derivative routing logic.
Data flow
- Transition phase: loss gradients and cached H values feed the derivative module; results return to the unified buffer.
- Backward phase: systolic array gradients and buffer-stored H values feed the derivative module; results continue backpropagation.
H value caching
The VPU includes special logic for caching H values:
- During transition pathway (1111): H values from leaky ReLU are cached in internal registers
- Cache update: See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/vpu.sv:282-285
- Cache usage: Cached values route to the derivative module during transition
- For subsequent backward passes: H values are loaded from the unified buffer (pre-stored during forward pass)
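The caching behavior can be modeled roughly as follows. This is a behavioral sketch under the assumptions above; the class, method, and signal names are illustrative, not the RTL names.

```python
TRANSITION = 0b1111  # pathway code for the transition phase
BACKWARD = 0b0001    # pathway code for the backward phase

class HCache:
    """Per-column H registers, written only during the transition pathway."""

    def __init__(self, num_cols: int = 2):
        self.h = [0.0] * num_cols

    def select(self, pathway: int, h_from_relu: list, h_from_buffer: list) -> list:
        """Return the H values routed to the derivative module this cycle."""
        if pathway == TRANSITION:
            self.h = list(h_from_relu)      # cache forward-pass activations
            return self.h                   # cached values feed the derivative module
        if pathway == BACKWARD:
            return list(h_from_buffer)      # backward pass reads the unified buffer
        return self.h                       # otherwise hold the cached values
```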
Implementation details
- Latency: 1 clock cycle (registered output)
- Throughput: 2 gradients per cycle
- Sign check: Uses MSB of H value (sign bit)
- Multiplication: Only performed for negative activations
- Reset behavior: Outputs and valid signals cleared to zero
- Valid signal: Propagated from input to output with one cycle delay
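These details can be captured in a small cycle-level model (a behavioral sketch for one column, not a translation of the RTL; floats stand in for Q8.8 values):

```python
class DerivativeChildModel:
    """One column: 1-cycle latency, valid propagated with the data,
    outputs cleared on reset."""

    def __init__(self, leak_factor: float):
        self.leak_factor = leak_factor
        self.grad_out = 0.0
        self.valid_out = False

    def clock(self, reset: bool, valid_in: bool, grad_in: float, h_in: float):
        """Advance one clock edge."""
        if reset:
            self.grad_out = 0.0
            self.valid_out = False
            return
        # Sign check on H selects pass-through or the scaled path;
        # the result and valid flag are latched on the edge.
        self.grad_out = grad_in if h_in >= 0 else grad_in * self.leak_factor
        self.valid_out = valid_in
```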
Gradient flow example
Consider a batch element where:
- Upstream gradient: ∂L/∂H = 0.5 (0x0080 in Q8.8)
- Cached activation: H = -0.2 (0xFFCD in Q8.8)
- Leak factor: α = 0.1 (0x0019 in Q8.8)

The computation proceeds as:
- Check H: H < 0, so use the scaled path
- Multiply: 0.5 × 0.1 = 0.05
- Output: ∂L/∂Z ≈ 0.05 (0x000C in Q8.8)
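Checking the raw Q8.8 arithmetic for this example (assuming a truncating multiply) also shows the quantization error: 0x000C decodes to 0.046875, slightly under the ideal 0.05, because 0x0019 encodes ≈0.0977 rather than 0.1 and the binary-point shift truncates.

```python
grad = 0x0080    # dL/dH = 0.5 in Q8.8
alpha = 0x0019   # leak factor: 25/256 ~= 0.0977, nearest encoding below 0.1

# Multiply the raw values, then shift right 8 bits to realign the
# binary point (truncating), as described in the fixed-point section.
out = ((grad * alpha) >> 8) & 0xFFFF
assert out == 0x000C
print(out / 256)  # 0.046875, the quantized result near 0.05
```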
Source files
- Parent module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/leaky_relu_derivative_parent.sv
- Child module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/leaky_relu_derivative_child.sv