This page shows actual instruction sequences from test/test_tpu.py that implement forward and backward propagation for a two-layer neural network.

Network architecture

The test implements XOR learning with:
  • Input layer: 2 features
  • Hidden layer: 2 neurons with Leaky ReLU activation
  • Output layer: 1 neuron with Leaky ReLU activation
  • Loss: Mean Squared Error (MSE)
  • Batch size: 4 samples

Training data

X = np.array([[0., 0.],
              [0., 1.],
              [1., 0.],
              [1., 1.]])

Y = np.array([0, 1, 1, 0])  # XOR truth table

Initial parameters

W1 = np.array([[0.2985, -0.5792], 
               [0.0913, 0.4234]])
B1 = [-0.4939, 0.189]

W2 = np.array([0.5266, 0.2958])
B2 = np.array([0.6358])

learning_rate = 0.75
leak_factor = 0.5
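
For reference, the Leaky ReLU used throughout and its derivative can be written in NumPy as follows (a software sketch of the math, not code from the test):
import numpy as np

def leaky_relu(x, leak=0.5):
    # f(x) = x for x > 0, otherwise leak * x
    return np.where(x > 0, x, leak * x)

def leaky_relu_grad(h, leak=0.5):
    # Derivative evaluated from the activation output: since leak > 0,
    # the output has the same sign as the input, so h > 0 iff z > 0
    return np.where(h > 0, 1.0, leak)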

Initialization sequence

Before computation begins, configure global parameters:
# Set learning rate (stays constant)
dut.learning_rate_in.value = to_fixed(0.75)

# Set Leaky ReLU leak factor
dut.vpu_leak_factor_in.value = to_fixed(0.5)

# Set batch scaling for MSE gradient: 2/batch_size = 2/4 = 0.5
dut.inv_batch_size_times_two_in.value = to_fixed(2/len(X))
These parameters remain set throughout training and don’t need to be included in each instruction.
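
The exact fixed-point format behind to_fixed is not shown in this excerpt; given the 16-bit data fields, one plausible sketch is a signed Q8.8 conversion (the real helper in test/test_tpu.py may use a different format):
def to_fixed(x, frac_bits=8, width=16):
    # Scale to an integer and wrap into a two's-complement field
    val = int(round(x * (1 << frac_bits)))
    return val & ((1 << width) - 1)

to_fixed(0.75)   # 0x00C0
to_fixed(-0.5)   # 0xFF80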

Loading data into Unified Buffer

Data is loaded using the dual-port write interface:
# Load X matrix (4x2) - using both write channels
for i in range(len(X) - 1):
    dut.ub_wr_host_data_in[0].value = to_fixed(X[i + 1][0])
    dut.ub_wr_host_valid_in[0].value = 1
    dut.ub_wr_host_data_in[1].value = to_fixed(X[i][1])
    dut.ub_wr_host_valid_in[1].value = 1
    await RisingEdge(dut.clk)

# Load Y vector (4x1) - using only channel 0
for i in range(len(Y) - 1):
    dut.ub_wr_host_data_in[0].value = to_fixed(Y[i + 1])
    dut.ub_wr_host_valid_in[0].value = 1
    dut.ub_wr_host_data_in[1].value = 0
    dut.ub_wr_host_valid_in[1].value = 0
    await RisingEdge(dut.clk)

# Similarly load W1, B1, W2, B2...
Instruction fields:
  • ub_wr_host_valid_in_1 [bit 3]: 1 when channel 0 has data
  • ub_wr_host_valid_in_2 [bit 4]: 1 when channel 1 has data
  • ub_wr_host_data_in_1 [35:20]: First data value
  • ub_wr_host_data_in_2 [51:36]: Second data value
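
These fields can be packed into the 88-bit instruction word as sketched below; only the write-related fields listed above are set, the helper name is illustrative, and the to_fixed sketch from earlier is assumed:
def pack_write_instruction(data0=None, data1=None):
    word = 0
    if data0 is not None:
        word |= 1 << 3                             # ub_wr_host_valid_in_1
        word |= (to_fixed(data0) & 0xFFFF) << 20   # ub_wr_host_data_in_1 [35:20]
    if data1 is not None:
        word |= 1 << 4                             # ub_wr_host_valid_in_2
        word |= (to_fixed(data1) & 0xFFFF) << 36   # ub_wr_host_data_in_2 [51:36]
    return word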

Forward pass - Layer 1: H1 = LeakyReLU(X @ W1^T + B1)

Step 1: Load W1^T into systolic array

dut.ub_rd_start_in.value = 1
dut.ub_rd_transpose.value = 1      # Transpose during read
dut.ub_ptr_select.value = 1        # Route to systolic top (weights)
dut.ub_rd_addr_in.value = 12       # W1 stored at address 12
dut.ub_rd_row_size.value = 2
dut.ub_rd_col_size.value = 2
Instruction encoding (88-bit):
  • Bit 1 (ub_rd_start_in): 1
  • Bit 2 (ub_rd_transpose): 1
  • Bits [6:5] (ub_rd_col_size): 10 (2 columns)
  • Bits [14:7] (ub_rd_row_size): 00000010 (2 rows)
  • Bits [16:15] (ub_rd_addr_in): Implementation specific
  • Bits [19:17] (ub_ptr_sel): 001 (systolic top)

Step 2: Load X and configure VPU for forward pass

dut.ub_rd_start_in.value = 1
dut.ub_rd_transpose.value = 0      # No transpose
dut.ub_ptr_select.value = 0        # Route to systolic left (inputs)
dut.ub_rd_addr_in.value = 0        # X stored at address 0
dut.ub_rd_row_size.value = 4       # Batch size
dut.ub_rd_col_size.value = 2
dut.vpu_data_pathway.value = 0b1100  # Bias + Activation
Key fields:
  • Bits [55:52] (vpu_data_pathway): 1100 (forward pass routing)
  • Bits [14:7] (ub_rd_row_size): 00000100 (4 rows)

Step 3: Load B1 bias vector

dut.ub_rd_start_in.value = 1
dut.ub_rd_transpose.value = 0
dut.ub_ptr_select.value = 2        # Route to VPU bias module
dut.ub_rd_addr_in.value = 16       # B1 stored at address 16  
dut.ub_rd_row_size.value = 4       # Repeat bias for batch
dut.ub_rd_col_size.value = 2
dut.sys_switch_in.value = 0
Result: Systolic array computes X @ W1^T, VPU adds B1 and applies Leaky ReLU. Output H1 written back to UB.
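
In floating point, this step corresponds to the following NumPy reference (using the helpers defined above; the hardware works in fixed point, so values differ slightly):
Z1 = X @ W1.T + B1      # systolic array product plus bias, shape (4, 2)
H1 = leaky_relu(Z1)     # VPU activation, written back to the UB as H1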

Forward pass - Layer 2: H2 = LeakyReLU(H1 @ W2^T + B2)

Step 1: Load W2^T into systolic array

dut.ub_rd_start_in.value = 1
dut.ub_rd_transpose.value = 1
dut.ub_ptr_select.value = 1
dut.ub_rd_addr_in.value = 18       # W2 at address 18
dut.ub_rd_row_size.value = 1       # W2 is 1x2
dut.ub_rd_col_size.value = 2

Step 2: Load H1 and configure for loss computation

dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = 0
dut.ub_rd_addr_in.value = 21       # H1 stored at address 21
dut.ub_rd_row_size.value = 4
dut.ub_rd_col_size.value = 2
dut.vpu_data_pathway.value = 0b1111  # Bias + Activation + Loss
Key difference: vpu_data_pathway = 0b1111 activates the MSE loss module

Step 3: Load B2 bias

dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = 2        # VPU bias
dut.ub_rd_addr_in.value = 20
dut.ub_rd_row_size.value = 4
dut.ub_rd_col_size.value = 1

Step 4: Load target Y for loss

dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = 3        # VPU loss module
dut.ub_rd_addr_in.value = 8        # Y at address 8
dut.ub_rd_row_size.value = 4
dut.ub_rd_col_size.value = 1
Result: Computes H2 and immediately calculates dL/dZ2 = (H2 - Y) × 2/batch_size
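
The floating-point equivalent of this step (W2 is stored as a length-2 vector here, so no explicit transpose is needed in NumPy):
Z2 = H1 @ W2 + B2                  # layer-2 pre-activation, shape (4,)
H2 = leaky_relu(Z2)                # network output
loss = np.mean((H2 - Y) ** 2)      # MSE, for reference only
dL_dZ2 = (H2 - Y) * 2 / len(X)     # loss-module output: (H2 - Y) * 2/batch_size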

Backward pass - Layer 2: dL/dZ1 = dL/dZ2 @ W2 ⊙ LeakyReLU’(Z1)

Step 1: Load W2 (not transposed)

dut.ub_rd_start_in.value = 1
dut.ub_rd_transpose.value = 0      # No transpose for backprop
dut.ub_ptr_select.value = 1
dut.ub_rd_addr_in.value = 18
dut.ub_rd_row_size.value = 1
dut.ub_rd_col_size.value = 2

Step 2: Load dL/dZ2 gradient

dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = 0
dut.ub_rd_addr_in.value = 29       # dL/dZ2 at address 29
dut.ub_rd_row_size.value = 4
dut.ub_rd_col_size.value = 1
dut.vpu_data_pathway.value = 0b0001  # Activation derivative only
Key field: vpu_data_pathway = 0b0001 for backpropagation through activation

Step 3: Load H1 for activation derivative

dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = 4        # VPU activation derivative
dut.ub_rd_addr_in.value = 21
dut.ub_rd_row_size.value = 4
dut.ub_rd_col_size.value = 2
Result: Gradient propagated through layer 2, producing dL/dZ1
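
Equivalently, in NumPy (the activation derivative is taken from H1, matching Step 3 above; for a positive leak factor this equals the derivative at Z1):
dL_dZ1 = np.outer(dL_dZ2, W2) * leaky_relu_grad(H1)   # shape (4, 2)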

Weight gradient computation

Weight gradients are computed by tiled matrix multiplication with the VPU in bypass mode, so the systolic-array products are accumulated without any VPU processing.
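
The computation being tiled is dL/dW1 = (dL/dZ1)^T @ X, accumulated over 2-sample tiles; in NumPy (tile boundaries inferred from the 2-row tile size used in the instructions below):
dL_dW1 = np.zeros((2, 2))
for t in range(0, 4, 2):                      # two tiles of 2 samples each
    dL_dW1 += dL_dZ1[t:t+2].T @ X[t:t+2]      # one systolic pass per tile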

Computing dL/dW1 (first tile)

# Load first X tile into systolic top
dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = 1
dut.ub_rd_addr_in.value = 0
dut.ub_rd_row_size.value = 2       # Tile size
dut.ub_rd_col_size.value = 2

# Load first (dL/dZ1)^T tile into systolic left
dut.ub_rd_start_in.value = 1
dut.ub_rd_transpose.value = 1      # Transpose gradient
dut.ub_ptr_select.value = 0
dut.ub_rd_addr_in.value = 33
dut.ub_rd_row_size.value = 2
dut.ub_rd_col_size.value = 2
dut.vpu_data_pathway.value = 0b0000  # Bypass - no VPU processing
Key field: vpu_data_pathway = 0b0000 bypasses VPU for gradient accumulation

Gradient descent update

# Route old weights to gradient descent
dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = 6        # VPU gradient descent (weights)
dut.ub_rd_addr_in.value = 12       # Current W1
dut.ub_rd_row_size.value = 2
dut.ub_rd_col_size.value = 2
Result: VPU gradient descent module computes W_new = W_old - learning_rate × dL/dW
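
In NumPy terms, the update applied to W1 is simply the following; B1, W2 and B2 are updated with the same W_new = W_old - learning_rate × gradient rule:
W1 -= learning_rate * dL_dW1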

Timing and synchronization

Instructions follow this typical pattern:
# Cycle 1: Assert start and configure
dut.ub_rd_start_in.value = 1
dut.ub_ptr_select.value = X
# ... other config ...
await RisingEdge(dut.clk)

# Cycle 2: Clear start, operation continues
dut.ub_rd_start_in.value = 0
dut.ub_ptr_select.value = 0
await RisingEdge(dut.clk)

# Wait for completion
await FallingEdge(dut.vpu_valid_out_1)
The sys_switch_in signal toggles during multi-cycle operations to control when the systolic array is actively shifting data.
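
This pattern can be factored into a small cocotb helper; the following is an illustrative sketch (the coroutine name and argument list are not part of the test, and the completion wait may differ for operations that produce no VPU output):
from cocotb.triggers import RisingEdge, FallingEdge

async def issue_read(dut, addr, rows, cols, ptr_select, transpose=0):
    # Cycle 1: assert start and configure the read
    dut.ub_rd_start_in.value = 1
    dut.ub_rd_transpose.value = transpose
    dut.ub_ptr_select.value = ptr_select
    dut.ub_rd_addr_in.value = addr
    dut.ub_rd_row_size.value = rows
    dut.ub_rd_col_size.value = cols
    await RisingEdge(dut.clk)

    # Cycle 2: deassert start; the read streams on its own
    dut.ub_rd_start_in.value = 0
    dut.ub_ptr_select.value = 0
    await RisingEdge(dut.clk)

    # Wait for the VPU output valid to drop, signalling completion
    await FallingEdge(dut.vpu_valid_out_1)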

Complete instruction count

For one training iteration (forward + backward pass):
  • Data loading: ~12 instructions (dual-channel writes)
  • Forward layer 1: 3 read operations (W1, X, B1)
  • Forward layer 2: 4 read operations (W2, H1, B2, Y)
  • Backward layer 2: 3 read operations (W2, dL/dZ2, H1)
  • Weight gradients: 8 read operations (tiled computation for W1, W2)
  • Gradient descent: 4 read operations (update W1, B1, W2, B2)
Total: ~34 instructions per training iteration
For the complete test sequence with all signal values, see test/test_tpu.py:66-590 in the source repository.
