MultiheadAttention
Multi-Head Attention mechanism. The core building block of Transformers.
Constructor
- `embedDim` - Total dimension of the model (must be divisible by `numHeads`)
- `numHeads` - Number of parallel attention heads
- `options.bias` - Whether to add bias to projections (default: true)
- `options.dropout` - Dropout probability for attention weights (default: 0.0)

Throws:
- `InvalidParameterError` - If `embedDim` is not divisible by `numHeads`
Mathematical Formulation
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where head_i = Attention(Q W_Q^i, K W_K^i, V W_V^i)

where:
- `Q` (Query), `K` (Key), `V` (Value) are input projections
- `d_k` is the dimension of each head (`embed_dim / num_heads`)
- `W_Q`, `W_K`, `W_V`, `W_O` are learnable weight matrices
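The formula above can be sketched in plain TypeScript over nested arrays. This is a library-agnostic illustration of single-head scaled dot-product attention, not the library's actual implementation:

```typescript
// Numerically stable softmax over one row of scores.
function softmax(row: number[]): number[] {
  const max = Math.max(...row);
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
// Returns: (seq_len_q, d_v)
function attention(Q: number[][], K: number[][], V: number[][]): number[][] {
  const dK = K[0].length;
  return Q.map((q) => {
    // scores_j = (q · k_j) / sqrt(d_k)
    const scores = K.map(
      (k) => k.reduce((s, kj, j) => s + q[j] * kj, 0) / Math.sqrt(dK)
    );
    const weights = softmax(scores);
    // Output row is the weights-weighted sum of value rows.
    return V[0].map((_, c) =>
      V.reduce((s, vRow, r) => s + weights[r] * vRow[c], 0)
    );
  });
}
```

Because the weights come out of a softmax, each output row is a convex combination of the value rows; with one-hot values, each output row sums to 1.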
Shape
Input:
- Query: `(batch, seq_len_q, embed_dim)` or `(seq_len_q, embed_dim)`
- Key: `(batch, seq_len_k, embed_dim)` or `(seq_len_k, embed_dim)`
- Value: `(batch, seq_len_v, embed_dim)` or `(seq_len_v, embed_dim)`

Output: `(batch, seq_len_q, embed_dim)` or `(seq_len_q, embed_dim)`
Properties
- `embedDim: number` - Model dimension
- `numHeads: number` - Number of attention heads
- `headDim: number` - Dimension per head (`embedDim / numHeads`)
Methods
forward
- `query` - Query tensor
- `key` - Key tensor (defaults to `query` for self-attention)
- `value` - Value tensor (defaults to `query` for self-attention)

Throws:
- `ShapeError` - If input shapes are incompatible
- `DTypeError` - If inputs have unsupported dtypes
Examples
Self-Attention
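A hypothetical usage sketch based on the constructor and `forward` signatures documented above; the import path and the `randn` tensor helper are assumptions, not the library's confirmed names:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { MultiheadAttention, randn } from "my-tensor-lib";

const mha = new MultiheadAttention(512, 8); // embedDim=512, numHeads=8

// Self-attention: key and value default to query.
const x = randn([2, 10, 512]);  // (batch, seq_len, embed_dim)
const out = mha.forward(x);     // (2, 10, 512)
```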
Cross-Attention
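A hypothetical cross-attention sketch under the same assumptions (import path and `randn` are placeholder names), where queries come from one sequence and keys/values from another:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { MultiheadAttention, randn } from "my-tensor-lib";

const mha = new MultiheadAttention(512, 8);

// Cross-attention: query from one sequence, keys/values from another.
const decoderState = randn([2, 10, 512]); // (batch, seq_len_q, embed_dim)
const encoderOut = randn([2, 20, 512]);   // (batch, seq_len_k, embed_dim)
const out = mha.forward(decoderState, encoderOut, encoderOut); // (2, 10, 512)
```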
With Dropout
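A hypothetical sketch of the `options.dropout` constructor option documented above (import path and `randn` are placeholders; whether dropout is active at inference depends on the library's train/eval handling):

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { MultiheadAttention, randn } from "my-tensor-lib";

// options.dropout sets the dropout probability on attention weights.
const mha = new MultiheadAttention(512, 8, { dropout: 0.1 });

const x = randn([2, 10, 512]);
const out = mha.forward(x);
```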
TransformerEncoderLayer
A single layer of the Transformer encoder. Consists of:
- Multi-head self-attention
- Add & Norm (residual connection + layer normalization)
- Feed-forward network (FFN)
- Add & Norm
Constructor
- `dModel` - Model dimension (embedding dimension)
- `nHead` - Number of attention heads
- `dFF` / `dimFeedforward` - Dimension of feedforward network (default: 2048)
- `options.dropout` - Dropout probability (default: 0.1)
- `options.eps` - Layer norm epsilon (default: 1e-5)

Throws:
- `InvalidParameterError` - If `dModel` is not divisible by `nHead`
Architecture
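The sequence of sublayers listed above corresponds to this post-norm data flow (a sketch; consult the implementation for the exact dropout placement):

```
src ─► Self-Attention ─► (+ src) ─► LayerNorm ─► FFN ─► (+ residual) ─► LayerNorm ─► out
```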
Shape
Input: `(batch, seq_len, d_model)` or `(seq_len, d_model)`
Output: Same shape as input
Methods
forward
- `src` - Source sequence tensor

Throws:
- `ShapeError` - If input shape is invalid
- `DTypeError` - If input has unsupported dtype
Examples
Single Encoder Layer
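A hypothetical sketch based on the constructor and `forward` signatures documented above; the import path and `randn` helper are assumed names:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer, randn } from "my-tensor-lib";

const layer = new TransformerEncoderLayer(512, 8, 2048); // dModel, nHead, dFF

const src = randn([2, 10, 512]); // (batch, seq_len, d_model)
const out = layer.forward(src);  // same shape as input
```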
Stacked Encoder Layers
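Because each layer preserves its input shape, layers can be chained in a plain loop. A hypothetical sketch under the same placeholder-import assumptions:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer, randn } from "my-tensor-lib";

// Six identical encoder layers, as in the original Transformer encoder.
const layers = Array.from(
  { length: 6 },
  () => new TransformerEncoderLayer(512, 8, 2048)
);

let x = randn([2, 10, 512]);
for (const layer of layers) {
  x = layer.forward(x); // shape preserved through every layer
}
```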
Complete Transformer Model
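A hypothetical sketch of an encoder-only model. `Embedding` and `Linear` are assumed class names (and positional encoding is omitted for brevity); substitute your library's own components:

```typescript
// Hypothetical imports — substitute your library's actual modules.
import { Embedding, Linear, TransformerEncoderLayer } from "my-tensor-lib";

class TextEncoder {
  embed = new Embedding(30000, 512); // vocabSize, dModel
  layers = Array.from(
    { length: 6 },
    () => new TransformerEncoderLayer(512, 8, 2048)
  );
  head = new Linear(512, 2); // e.g. per-token binary classification

  forward(tokenIds: unknown) {
    let x = this.embed.forward(tokenIds); // (batch, seq_len, 512)
    for (const layer of this.layers) x = layer.forward(x);
    return this.head.forward(x);          // (batch, seq_len, 2)
  }
}
```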
Understanding Attention
Attention Mechanism
Attention computes a weighted sum of values based on similarity between queries and keys.
Multi-Head Attention Benefits
- Multiple Representations: Different heads can attend to different aspects
- Parallel Processing: Heads computed independently
- Richer Patterns: Captures various relationships in the data
- Better Gradients: Helps with optimization
Self-Attention vs Cross-Attention
Self-Attention (Q = K = V):
- Each position attends to all positions in the same sequence
- Used in encoder layers
- Captures relationships within the input
Cross-Attention:
- Query from one sequence, keys/values from another
- Used in decoder layers (encoder-decoder attention)
- Connects different sequences (e.g., source to target in translation)
Common Patterns
Vision Transformer (ViT) Encoder
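A hypothetical sketch using ViT-Base-style hyperparameters (dModel=768, 12 heads, dFF=3072, 12 layers); the import path and `randn` helper are assumed names:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer, randn } from "my-tensor-lib";

// ViT-Base-style settings: dModel=768, 12 heads, dFF=3072, 12 layers.
const vitEncoder = Array.from(
  { length: 12 },
  () => new TransformerEncoderLayer(768, 12, 3072)
);

// 196 patch tokens + 1 class token for a 224×224 image with 16×16 patches.
let patches = randn([1, 197, 768]);
for (const layer of vitEncoder) patches = layer.forward(patches);
```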
Masked Self-Attention (for Decoders)
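The `forward` signature documented above does not list a mask parameter, so here is a library-agnostic sketch of how a causal mask works: positions after the current one are set to negative infinity in the raw scores, so softmax assigns them zero weight.

```typescript
// Build a causal (lower-triangular) mask: position i may attend to j <= i.
// Allowed positions get 0; disallowed positions get -Infinity.
function causalMask(seqLen: number): number[][] {
  return Array.from({ length: seqLen }, (_, i) =>
    Array.from({ length: seqLen }, (_, j) => (j <= i ? 0 : -Infinity))
  );
}

// Add the mask to raw attention scores before the softmax.
function applyMask(scores: number[][], mask: number[][]): number[][] {
  return scores.map((row, i) => row.map((s, j) => s + mask[i][j]));
}
```

After `applyMask`, softmax turns every `-Infinity` score into a weight of exactly zero, which is what prevents a decoder position from attending to the future.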
Performance Considerations
- Memory Usage: Attention is O(n²) in sequence length
- Batch Size: Larger batches improve efficiency
- Number of Heads: More heads capture richer attention patterns, but with a fixed embedDim each head gets smaller (headDim = embedDim / numHeads)
- Feed-Forward Size: Usually 4x the model dimension
- Dropout: Essential for regularization in transformers
Configuration Examples
BERT-style Encoder
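A hypothetical configuration sketch using BERT-Base hyperparameters (12 layers, dModel=768, 12 heads, dFF=3072); the import path is a placeholder:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer } from "my-tensor-lib";

// BERT-Base hyperparameters: 12 layers, dModel=768, 12 heads, dFF=3072.
const bertEncoder = Array.from(
  { length: 12 },
  () => new TransformerEncoderLayer(768, 12, 3072, { dropout: 0.1 })
);
```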
GPT-style Decoder (Encoder Layer as Building Block)
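A hypothetical sketch at GPT-2-small scale (12 layers, dModel=768, 12 heads, dFF=3072). Note that a true decoder also needs causal masking in self-attention, which the documented `TransformerEncoderLayer` API does not expose; the import path is a placeholder:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer } from "my-tensor-lib";

// GPT-2-small-scale hyperparameters: 12 layers, dModel=768, 12 heads, dFF=3072.
// NOTE: a real decoder additionally requires causal masking.
const gptBlocks = Array.from(
  { length: 12 },
  () => new TransformerEncoderLayer(768, 12, 3072, { dropout: 0.1 })
);
```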
Small Transformer (for Testing)
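A hypothetical tiny configuration for fast unit tests, chosen here for illustration (the values themselves are arbitrary; the import path is a placeholder):

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer } from "my-tensor-lib";

// Tiny configuration for fast tests: 2 layers, dModel=64, 4 heads, dFF=128.
const tinyEncoder = Array.from(
  { length: 2 },
  () => new TransformerEncoderLayer(64, 4, 128, { dropout: 0.0 })
);
```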
See Also
- Recurrent Layers - RNN, LSTM, GRU alternatives
- Normalization - LayerNorm used in transformers
- Linear Layer - For projections
- Activation Functions - GELU commonly used with transformers