The gradient descent module implements the weight and bias update step of the training process. It applies the gradient descent optimization algorithm to adjust network parameters based on computed gradients, enabling the network to learn from training data.

Module ports

  • clk (input logic): System clock signal
  • rst (input logic): Active-high reset signal
  • lr_in (input logic [15:0]): Learning rate (η) as a 16-bit fixed-point value
  • value_old_in (input logic [15:0]): Current parameter value (weight or bias) before the update
  • grad_in (input logic [15:0]): Computed gradient for this parameter
  • grad_descent_valid_in (input logic): Start signal indicating the gradient and parameter inputs are valid
  • grad_bias_or_weight (input logic): Parameter type selector: 0 = weight, 1 = bias
  • value_updated_out (output logic [15:0]): Updated parameter value after the gradient descent step
  • grad_descent_done_out (output logic): Completion signal indicating the update is finished

Gradient descent algorithm

The module implements the standard gradient descent update rule:
θ_new = θ_old - η × ∇L
Where:
  • θ represents a parameter (weight or bias)
  • η is the learning rate
  • ∇L is the gradient of the loss with respect to the parameter
  • The negative sign indicates moving opposite to the gradient (downhill)
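
In floating point (ignoring the module's fixed-point encoding for the moment), the rule reduces to a one-line sketch:

```python
def gradient_descent_step(theta_old: float, lr: float, grad: float) -> float:
    """One gradient descent update: theta_new = theta_old - lr * grad."""
    return theta_old - lr * grad

# A positive gradient decreases the parameter, moving "downhill" on the loss.
w_new = gradient_descent_step(0.5, 0.1, 0.2)  # 0.5 - 0.1 * 0.2 = 0.48
```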

Operation

Update process

  1. Gradient scaling: Multiply gradient by learning rate using fxp_mul
    • Computes: scaled_gradient = grad × lr
    • Uses fixed-point multiplication to preserve precision
  2. Parameter update: Subtract scaled gradient from old value using fxp_addsub
    • Computes: value_new = value_old - scaled_gradient
    • Subtraction mode (sub=1) implements the negative gradient direction
  3. Output registration: On clock edge, register updated value and assert done signal

Pipeline stages

  1. Combinational computation:
    • Multiply: mul_out = grad × lr
    • Subtract: sub_value_out = sub_in_a - mul_out
  2. Registered output:
    • When grad_descent_valid_in is high, latch value_updated_out = sub_value_out
    • Assert grad_descent_done_out to signal completion
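
The two stages can be mirrored in a small cycle-level Python model (an illustrative sketch of the behavior described above, not a translation of the RTL):

```python
class GradDescentModel:
    """Cycle-level sketch: combinational multiply/subtract, with outputs
    registered on the rising clock edge when the valid input is high."""

    def __init__(self):
        self.value_updated_out = 0.0
        self.grad_descent_done_out = False

    def posedge_clk(self, value_old_in, grad_in, lr_in, valid_in):
        # Combinational stage: scaled = grad * lr, then new = old - scaled
        sub_value_out = value_old_in - grad_in * lr_in
        # Sequential stage: done tracks valid; value registers only when valid
        self.grad_descent_done_out = valid_in
        if valid_in:
            self.value_updated_out = sub_value_out

m = GradDescentModel()
m.posedge_clk(0.5, 0.2, 0.1, valid_in=True)   # done goes high, output 0.48
m.posedge_clk(0.0, 0.0, 0.0, valid_in=False)  # done drops, output holds
```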

Weight vs. bias handling

The module uses different update strategies based on grad_bias_or_weight:

Weight updates (grad_bias_or_weight = 0)

Weights may require accumulated updates:
  • If grad_descent_done_out is already asserted, use value_updated_out as the base
  • Otherwise, use value_old_in as the base
  • This allows multiple gradient contributions to be accumulated
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:50-55 for the implementation.

Bias updates (grad_bias_or_weight = 1)

Biases use simple updates:
  • Always use value_old_in as the base
  • Each update is independent
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:58-60.
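
A behavioral sketch of this base-value selection (the function name and argument order are illustrative, but the decision mirrors the two cases above):

```python
def select_base_value(grad_bias_or_weight: int, done: bool,
                      value_updated_out: int, value_old_in: int) -> int:
    """Choose the value the scaled gradient is subtracted from."""
    if grad_bias_or_weight == 1:
        # Bias: every update starts from the freshly loaded old value.
        return value_old_in
    # Weight: once a prior update has completed (done asserted), keep
    # accumulating onto the registered output instead of the old value.
    return value_updated_out if done else value_old_in
```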

Fixed-point arithmetic

The module uses 16-bit fixed-point representation (Q8.8 format):
  • Multiplication (fxp_mul):
    • Computes grad × lr with proper binary point handling
    • See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:33-38
    • Implementation at https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:278
  • Subtraction (fxp_addsub):
    • Computes value - (grad × lr) with sub=1 mode
    • See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:40-46
    • Implementation at https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:186
Both operations include overflow detection (though overflow signals are not currently used).
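
A behavioral Q8.8 model of the two operations is sketched below. It assumes the product's low 8 bits are truncated and that subtraction wraps on overflow; the actual rounding and overflow handling should be checked against fixedpoint.sv.

```python
Q = 8  # fractional bits in Q8.8

def to_signed16(x: int) -> int:
    """Interpret a 16-bit pattern as a signed two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def fxp_mul_q88(a: int, b: int) -> int:
    """Q8.8 multiply: full signed product, then drop the low 8 bits.
    (Truncation is an assumption; the RTL may round instead.)"""
    prod = to_signed16(a) * to_signed16(b)
    return (prod >> Q) & 0xFFFF

def fxp_sub_q88(a: int, b: int) -> int:
    """Q8.8 subtract, wrapping on overflow (overflow flag not modeled)."""
    return (to_signed16(a) - to_signed16(b)) & 0xFFFF
```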

Control flow

The module uses combinational and sequential logic:

Combinational logic

  • Multiplexer for selecting subtraction input based on parameter type
  • Fixed-point arithmetic units (multiply and subtract)
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:48-62.

Sequential logic

  • Done signal registered from valid input signal
  • Updated value registered when valid input is high
  • Reset behavior clears all outputs to zero
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:64-77.

Integration with system

The gradient descent module operates outside the main VPU pipeline:

Typical usage flow

  1. Gradient computation: VPU computes gradients for all parameters
  2. Gradient accumulation: Gradients may be summed across batch (external to this module)
  3. Parameter update: For each parameter:
    • Load old value from unified buffer
    • Load corresponding gradient
    • Assert grad_descent_valid_in
    • Wait for grad_descent_done_out
    • Write updated value back to unified buffer
  4. Iteration: Repeat for all weights and biases in the network

Learning rate

The learning rate is:
  • Set by the host system based on training hyperparameters
  • Represented as fixed-point (e.g., η = 0.01 ≈ 0x0003 in Q8.8, since 0.01 × 256 ≈ 2.56)
  • Typically remains constant during training (though can be adjusted for learning rate schedules)
  • Shared across all parameter updates in an epoch
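
For reference, a float-to-Q8.8 conversion helper (round-to-nearest here; truncation is equally plausible and differs by one LSB):

```python
def float_to_q88(x: float) -> int:
    """Encode a float as 16-bit Q8.8 two's complement (round-to-nearest)."""
    return int(round(x * 256)) & 0xFFFF

# Small learning rates quantize coarsely: 0.01 * 256 = 2.56, so
# eta = 0.01 encodes as 0x0003 (~0.0117), a sizeable quantization error.
```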

Implementation details

  • Latency: 1 clock cycle (combinational arithmetic + registered output)
  • Throughput: 1 parameter update per cycle
  • Parallelism: Module can be instantiated multiple times for concurrent updates
  • Reset behavior: All outputs cleared to zero
  • Done signal timing: Asserted one cycle after valid input

Example update

Consider updating a weight:
  • Old weight: w_old = 0.5 (0x0080 in Q8.8)
  • Gradient: ∂L/∂w = 0.2 (≈ 0x0033 in Q8.8)
  • Learning rate: η = 0.1 (≈ 0x0019 in Q8.8)
The module computes:
  1. Scale gradient: 0.2 × 0.1 = 0.02 (≈ 0x0005 in Q8.8)
  2. Update weight: 0x0080 − 0x0005 = 0x007B (≈ 0.4805, close to the exact 0.48)
  3. Output: w_new ≈ 0.48
The weight has moved in the direction that reduces the loss; the small discrepancy from the exact value comes from Q8.8 quantization.
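
The example can be replayed numerically (assuming a truncating Q8.8 encoding and multiplication; the RTL's rounding mode may shift the result by one LSB):

```python
def q88(x: float) -> int:
    """Float to Q8.8, truncating toward zero (assumed encoding)."""
    return int(x * 256) & 0xFFFF

w_old, grad, lr = q88(0.5), q88(0.2), q88(0.1)  # 0x0080, 0x0033, 0x0019
scaled = (grad * lr) >> 8                        # truncating Q8.8 multiply
w_new = (w_old - scaled) & 0xFFFF
# Truncation gives scaled = 0x0004; round-to-nearest would give 0x0005.
# w_new comes out as 0x007C here (~0.4844), close to the exact 0.48.
```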

Training loop integration

The gradient descent module is used during the parameter update phase of each training iteration:
  1. Forward pass: Compute predictions (VPU forward pathway)
  2. Loss computation: Compare predictions to targets (VPU transition pathway)
  3. Backward pass: Compute gradients (VPU backward pathway + systolic array)
  4. Parameter update: Apply gradient descent (this module)
  5. Repeat: Next training iteration with updated parameters

Optimization considerations

Current implementation

  • Basic gradient descent (no momentum, no adaptive learning rates)
  • Single parameter updated per cycle
  • Simple accumulation logic for weight updates

Potential enhancements

  • Momentum: Add velocity term to smooth updates
  • Adaptive learning rates: Per-parameter learning rate adjustment (Adam, RMSprop)
  • Parallelization: Multiple gradient descent modules for faster updates
  • Learning rate decay: Automatic reduction over time
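
As an illustration of the momentum enhancement (purely hypothetical with respect to this module; the function name and defaults are invented):

```python
def momentum_step(theta: float, velocity: float, grad: float,
                  lr: float = 0.1, beta: float = 0.9):
    """Classical momentum: v <- beta * v + grad; theta <- theta - lr * v.
    In hardware this would require one extra state register per parameter."""
    velocity = beta * velocity + grad
    return theta - lr * velocity, velocity
```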

Source files

  • Module implementation: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv
