MultiheadAttention
Multi-Head Attention mechanism. The core building block of Transformers.
Constructor
- `embedDim` - Total dimension of the model (must be divisible by `numHeads`)
- `numHeads` - Number of parallel attention heads
- `options.bias` - Whether to add bias to projections (default: true)
- `options.dropout` - Dropout probability for attention weights (default: 0.0)

Throws:
- `InvalidParameterError` - If `embedDim` is not divisible by `numHeads`
Mathematical Formulation
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where head_i = Attention(Q W_Q^i, K W_K^i, V W_V^i)

where:
- `Q` (Query), `K` (Key), `V` (Value) are input projections
- `d_k` is the dimension of each head (`embed_dim / num_heads`)
- `W_Q`, `W_K`, `W_V`, `W_O` are learnable weight matrices
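The formula above can be sketched in plain TypeScript over nested arrays. This is a library-agnostic illustration of single-head scaled dot-product attention, not the library's actual implementation:

```typescript
// Numerically stable softmax over one row of scores.
function softmax(row: number[]): number[] {
  const max = Math.max(...row);
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
// Returns: (seq_len_q, d_v)
function attention(Q: number[][], K: number[][], V: number[][]): number[][] {
  const dK = K[0].length;
  return Q.map((q) => {
    // scores_j = (q · k_j) / sqrt(d_k)
    const scores = K.map(
      (k) => k.reduce((s, kj, j) => s + q[j] * kj, 0) / Math.sqrt(dK)
    );
    const weights = softmax(scores);
    // Output row is the weights-weighted sum of value rows.
    return V[0].map((_, c) =>
      V.reduce((s, vRow, r) => s + weights[r] * vRow[c], 0)
    );
  });
}
```

Because the weights come out of a softmax, each output row is a convex combination of the value rows; with one-hot values, each output row sums to 1.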
Shape
Input:
- Query: `(batch, seq_len_q, embed_dim)` or `(seq_len_q, embed_dim)`
- Key: `(batch, seq_len_k, embed_dim)` or `(seq_len_k, embed_dim)`
- Value: `(batch, seq_len_v, embed_dim)` or `(seq_len_v, embed_dim)`

Output: `(batch, seq_len_q, embed_dim)` or `(seq_len_q, embed_dim)`
Properties
- `embedDim: number` - Model dimension
- `numHeads: number` - Number of attention heads
- `headDim: number` - Dimension per head (`embedDim / numHeads`)
Methods
forward
- `query` - Query tensor
- `key` - Key tensor (defaults to `query` for self-attention)
- `value` - Value tensor (defaults to `query` for self-attention)

Throws:
- `ShapeError` - If input shapes are incompatible
- `DTypeError` - If inputs have unsupported dtypes
Examples
Self-Attention
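A hypothetical usage sketch based on the constructor and `forward` signatures documented above; the import path and the `randn` tensor helper are assumptions, not the library's confirmed names:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { MultiheadAttention, randn } from "my-tensor-lib";

const mha = new MultiheadAttention(512, 8); // embedDim=512, numHeads=8

// Self-attention: key and value default to query.
const x = randn([2, 10, 512]);  // (batch, seq_len, embed_dim)
const out = mha.forward(x);     // (2, 10, 512)
```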
Cross-Attention
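A hypothetical cross-attention sketch under the same assumptions (import path and `randn` are placeholder names), where queries come from one sequence and keys/values from another:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { MultiheadAttention, randn } from "my-tensor-lib";

const mha = new MultiheadAttention(512, 8);

// Cross-attention: query from one sequence, keys/values from another.
const decoderState = randn([2, 10, 512]); // (batch, seq_len_q, embed_dim)
const encoderOut = randn([2, 20, 512]);   // (batch, seq_len_k, embed_dim)
const out = mha.forward(decoderState, encoderOut, encoderOut); // (2, 10, 512)
```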
With Dropout
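A hypothetical sketch of the `options.dropout` constructor option documented above (import path and `randn` are placeholders; whether dropout is active at inference depends on the library's train/eval handling):

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { MultiheadAttention, randn } from "my-tensor-lib";

// options.dropout sets the dropout probability on attention weights.
const mha = new MultiheadAttention(512, 8, { dropout: 0.1 });

const x = randn([2, 10, 512]);
const out = mha.forward(x);
```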
TransformerEncoderLayer
A single layer of the Transformer encoder. Consists of:
- Multi-head self-attention
- Add & Norm (residual connection + layer normalization)
- Feed-forward network (FFN)
- Add & Norm
Constructor
- `dModel` - Model dimension (embedding dimension)
- `nHead` - Number of attention heads
- `dFF` / `dimFeedforward` - Dimension of feedforward network (default: 2048)
- `options.dropout` - Dropout probability (default: 0.1)
- `options.eps` - Layer norm epsilon (default: 1e-5)

Throws:
- `InvalidParameterError` - If `dModel` is not divisible by `nHead`
Architecture
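The sequence of sublayers listed above corresponds to this post-norm data flow (a sketch; consult the implementation for the exact dropout placement):

```
src ─► Self-Attention ─► (+ src) ─► LayerNorm ─► FFN ─► (+ residual) ─► LayerNorm ─► out
```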
Shape
Input: `(batch, seq_len, d_model)` or `(seq_len, d_model)`
Output: Same shape as input
Methods
forward
- `src` - Source sequence tensor

Throws:
- `ShapeError` - If input shape is invalid
- `DTypeError` - If input has unsupported dtype
Examples
Single Encoder Layer
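A hypothetical sketch based on the constructor and `forward` signatures documented above; the import path and `randn` helper are assumed names:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer, randn } from "my-tensor-lib";

const layer = new TransformerEncoderLayer(512, 8, 2048); // dModel, nHead, dFF

const src = randn([2, 10, 512]); // (batch, seq_len, d_model)
const out = layer.forward(src);  // same shape as input
```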
Stacked Encoder Layers
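Because each layer preserves its input shape, layers can be chained in a plain loop. A hypothetical sketch under the same placeholder-import assumptions:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer, randn } from "my-tensor-lib";

// Six identical encoder layers, as in the original Transformer encoder.
const layers = Array.from(
  { length: 6 },
  () => new TransformerEncoderLayer(512, 8, 2048)
);

let x = randn([2, 10, 512]);
for (const layer of layers) {
  x = layer.forward(x); // shape preserved through every layer
}
```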
Complete Transformer Model
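A hypothetical sketch of an encoder-only model. `Embedding` and `Linear` are assumed class names (and positional encoding is omitted for brevity); substitute your library's own components:

```typescript
// Hypothetical imports — substitute your library's actual modules.
import { Embedding, Linear, TransformerEncoderLayer } from "my-tensor-lib";

class TextEncoder {
  embed = new Embedding(30000, 512); // vocabSize, dModel
  layers = Array.from(
    { length: 6 },
    () => new TransformerEncoderLayer(512, 8, 2048)
  );
  head = new Linear(512, 2); // e.g. per-token binary classification

  forward(tokenIds: unknown) {
    let x = this.embed.forward(tokenIds); // (batch, seq_len, 512)
    for (const layer of this.layers) x = layer.forward(x);
    return this.head.forward(x);          // (batch, seq_len, 2)
  }
}
```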
Understanding Attention
Attention Mechanism
Attention computes a weighted sum of values based on similarity between queries and keys.
Multi-Head Attention Benefits
- Multiple Representations: Different heads can attend to different aspects
- Parallel Processing: Heads computed independently
- Richer Patterns: Captures various relationships in the data
- Better Gradients: Helps with optimization
Self-Attention vs Cross-Attention
Self-Attention (Q = K = V):
- Each position attends to all positions in the same sequence
- Used in encoder layers
- Captures relationships within the input
Cross-Attention:
- Query from one sequence, keys/values from another
- Used in decoder layers (encoder-decoder attention)
- Connects different sequences (e.g., source to target in translation)
Common Patterns
Vision Transformer (ViT) Encoder
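A hypothetical sketch using ViT-Base-style hyperparameters (dModel=768, 12 heads, dFF=3072, 12 layers); the import path and `randn` helper are assumed names:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer, randn } from "my-tensor-lib";

// ViT-Base-style settings: dModel=768, 12 heads, dFF=3072, 12 layers.
const vitEncoder = Array.from(
  { length: 12 },
  () => new TransformerEncoderLayer(768, 12, 3072)
);

// 196 patch tokens + 1 class token for a 224×224 image with 16×16 patches.
let patches = randn([1, 197, 768]);
for (const layer of vitEncoder) patches = layer.forward(patches);
```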
Masked Self-Attention (for Decoders)
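The `forward` signature documented above does not list a mask parameter, so here is a library-agnostic sketch of how a causal mask works: positions after the current one are set to negative infinity in the raw scores, so softmax assigns them zero weight.

```typescript
// Build a causal (lower-triangular) mask: position i may attend to j <= i.
// Allowed positions get 0; disallowed positions get -Infinity.
function causalMask(seqLen: number): number[][] {
  return Array.from({ length: seqLen }, (_, i) =>
    Array.from({ length: seqLen }, (_, j) => (j <= i ? 0 : -Infinity))
  );
}

// Add the mask to raw attention scores before the softmax.
function applyMask(scores: number[][], mask: number[][]): number[][] {
  return scores.map((row, i) => row.map((s, j) => s + mask[i][j]));
}
```

After `applyMask`, softmax turns every `-Infinity` score into a weight of exactly zero, which is what prevents a decoder position from attending to the future.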
Performance Considerations
- Memory Usage: Attention is O(n²) in sequence length
- Batch Size: Larger batches improve efficiency
- Number of Heads: More heads capture richer attention patterns, but with a fixed embedDim each head gets smaller (headDim = embedDim / numHeads)
- Feed-Forward Size: Usually 4x the model dimension
- Dropout: Essential for regularization in transformers
Configuration Examples
BERT-style Encoder
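A hypothetical configuration sketch using BERT-Base hyperparameters (12 layers, dModel=768, 12 heads, dFF=3072); the import path is a placeholder:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer } from "my-tensor-lib";

// BERT-Base hyperparameters: 12 layers, dModel=768, 12 heads, dFF=3072.
const bertEncoder = Array.from(
  { length: 12 },
  () => new TransformerEncoderLayer(768, 12, 3072, { dropout: 0.1 })
);
```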
GPT-style Decoder (Encoder Layer as Building Block)
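A hypothetical sketch at GPT-2-small scale (12 layers, dModel=768, 12 heads, dFF=3072). Note that a true decoder also needs causal masking in self-attention, which the documented `TransformerEncoderLayer` API does not expose; the import path is a placeholder:

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer } from "my-tensor-lib";

// GPT-2-small-scale hyperparameters: 12 layers, dModel=768, 12 heads, dFF=3072.
// NOTE: a real decoder additionally requires causal masking.
const gptBlocks = Array.from(
  { length: 12 },
  () => new TransformerEncoderLayer(768, 12, 3072, { dropout: 0.1 })
);
```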
Small Transformer (for Testing)
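A hypothetical tiny configuration for fast unit tests, chosen here for illustration (the values themselves are arbitrary; the import path is a placeholder):

```typescript
// Hypothetical import path — substitute your library's actual entry point.
import { TransformerEncoderLayer } from "my-tensor-lib";

// Tiny configuration for fast tests: 2 layers, dModel=64, 4 heads, dFF=128.
const tinyEncoder = Array.from(
  { length: 2 },
  () => new TransformerEncoderLayer(64, 4, 128, { dropout: 0.0 })
);
```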
See Also
- Recurrent Layers - RNN, LSTM, GRU alternatives
- Normalization - LayerNorm used in transformers
- Linear Layer - For projections
- Activation Functions - GELU commonly used with transformers