ModernDecoderLM
Decoder-only language model with rotary positional embeddings (RoPE) and RMSNorm normalization. The model follows GPT-style causal language modeling but uses modern architectural choices:
- RMSNorm instead of LayerNorm (Zhang & Sennrich, 2019)
- SwiGLU activation instead of GELU (Shazeer, 2020; PaLM, 2022)
- Rotary positional embeddings (Su et al., 2021)
- Optional mixture of experts (MoE) layers
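To make the rotary-embedding choice concrete, here is a minimal pure-Python sketch of RoPE applied to a single head vector. It is illustrative only; the model's actual implementation details (base frequency, pairing scheme) may differ.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply rotary position embedding to one head vector (pure-Python sketch).

    Consecutive pairs (vec[2i], vec[2i+1]) are rotated by an angle
    theta_i = pos / base**(i / d), so dot products between rotated query
    and key vectors depend only on their relative position.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.append(x * cos_t - y * sin_t)
        out.append(x * sin_t + y * cos_t)
    return out
```

Because each pair is rotated, the vector's norm is preserved, and the inner product between a rotated query at position m and a rotated key at position n depends only on m − n, which is what makes RoPE a relative positional encoding.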
Constructor
Model configuration containing all hyperparameters including vocabulary size, model dimensions, number of layers and heads, dropout rates, and architectural choices.
Attributes
The configuration object passed during initialization.
Token embedding layer mapping vocabulary indices to d_model dimensional vectors. Shape: (vocab_size, d_model).
List of n_layers decoder blocks, each containing multi-head attention and feedforward layers.
Final RMSNorm layer applied before the language model head.
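For reference, RMSNorm differs from LayerNorm by skipping mean subtraction and the bias term. A minimal sketch over one vector (the model's own eps and parameterization may differ):

```python
def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm over the last dimension (pure-Python sketch).

    Rescales x by the root-mean-square of its elements, then applies a
    learned per-dimension gain. No mean subtraction, no bias.
    """
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]
```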
Output projection from d_model to vocab_size. Weights are tied with token_embed when tie_embeddings=True.
forward
Input token IDs of shape (batch, seq_len). Values must be in range [0, vocab_size).
Attention mask of shape (batch, seq_len) with 1 for tokens to attend to and 0 for padding. Defaults to all ones.
Target token IDs for computing cross-entropy loss. Shape must match input_ids. Use -100 to ignore specific positions.
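The -100 convention means ignored positions contribute nothing to the loss. A pure-Python sketch of that reduction (the model presumably uses a framework cross-entropy with ignore_index; this just shows the semantics):

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position logit lists; labels: list of target ids.
    Positions labeled ignore_index (e.g. padding or prompt tokens) are
    excluded from both the sum and the denominator.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue
        log_z = math.log(sum(math.exp(v) for v in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0
```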
Returns
Output logits of shape (batch, seq_len, vocab_size) representing next-token predictions.
Cross-entropy loss computed when labels are provided. Scalar tensor.
Example
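A typical usage sketch, assuming a `ModelConfig` constructor and a `(logits, loss)` return convention; the exact config class name, field names, and return signature are illustrative assumptions, not part of this API:

```python
# Illustrative only: ModelConfig and its field names are assumptions.
config = ModelConfig(vocab_size=32000, d_model=512, n_layers=8, n_heads=8)
model = ModernDecoderLM(config)

input_ids = torch.randint(0, config.vocab_size, (2, 128))  # (batch, seq_len)
logits, loss = model(input_ids, labels=input_ids)

assert logits.shape == (2, 128, config.vocab_size)  # next-token logits
```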
Complexity
The forward pass costs O(n_layers · seq_len² · d_model) due to the quadratic attention score computation: each head is O(seq_len² · d_model / n_heads), and there are n_heads heads per layer.

DecoderBlock
Transformer decoder block implementing pre-normalization with residual connections. Each block applies:
- Multi-head self-attention with RoPE
- Feedforward network (SwiGLU or MoE)
- RMSNorm before each sub-layer, with a residual connection around each sub-layer
Architecture
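The pre-norm residual flow can be sketched as follows, with the sub-layers passed in as callables and a scalar standing in for the hidden state for clarity:

```python
def decoder_block(x, attn, ffn, norm1, norm2):
    """Pre-norm residual flow of one decoder block (sketch).

    Normalization is applied *before* each sub-layer; the residual adds
    the sub-layer output back onto the un-normalized stream.
    """
    x = x + attn(norm1(x))  # self-attention sub-layer
    x = x + ffn(norm2(x))   # feedforward (SwiGLU or MoE) sub-layer
    return x
```

Pre-normalization keeps the residual stream an unnormalized sum of sub-layer outputs, which tends to stabilize training of deep stacks compared to post-norm.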
Constructor
Model configuration. The block uses d_model, n_heads, dropout, and attention configuration parameters.
Attributes
Multi-head self-attention layer with optional RoPE, GQA, and attention sinks.
RMSNorm layer applied before attention.
Feedforward network. Uses SwiGLU by default or MixtureOfExperts when use_moe=True.
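For reference, SwiGLU gates an up-projection with a SiLU-activated projection before projecting back down. A single-token pure-Python sketch (weight shapes and naming here are illustrative):

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feedforward for one token (pure-Python sketch).

    hidden = SiLU(W_gate x) * (W_up x), then projected back by W_down.
    Matrices are lists of rows: d_ff rows for gate/up, d_model for down.
    """
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```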
RMSNorm layer applied before feedforward network.
Dropout applied to attention and feedforward outputs.
forward
Input tensor of shape (batch, seq_len, d_model).
Additive attention bias of shape (batch, 1, seq_len, seq_len) with zeros for valid positions and -inf for masked positions.
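Building that additive bias from a padding mask plus causality can be sketched as below; the broadcastable head dimension (the `1` in the documented shape) is omitted here for brevity:

```python
NEG_INF = float("-inf")

def build_attention_bias(attention_mask, causal=True):
    """Combine padding and causal masking into an additive bias (sketch).

    attention_mask: per-sequence list of 1 (real token) / 0 (padding).
    Returns one (seq_len, seq_len) matrix per sequence: 0.0 where query i
    may attend to key j, -inf elsewhere. Added to scores before softmax.
    """
    biases = []
    for mask in attention_mask:
        n = len(mask)
        bias = [[0.0 if mask[j] == 1 and (not causal or j <= i) else NEG_INF
                 for j in range(n)] for i in range(n)]
        biases.append(bias)
    return biases
```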
Returns
Transformed hidden states with same shape (batch, seq_len, d_model).