ModernDecoderLM

Decoder-only language model with rotary positional embeddings (RoPE) and RMSNorm. The model follows GPT-style causal language modeling but adopts several modern architectural choices:
  • RMSNorm instead of LayerNorm (Zhang & Sennrich, 2019)
  • SwiGLU activation instead of GELU (Shazeer, 2020; PaLM, 2022)
  • Rotary positional embeddings (Su et al., 2021)
  • Optional mixture of experts (MoE) layers
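As a numerical illustration of the first bullet, here is a minimal RMSNorm in plain PyTorch (a sketch of the standard formulation from Zhang & Sennrich, not this library's implementation):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm rescales each feature vector by its root mean square; unlike
    # LayerNorm there is no mean subtraction and no bias term, which saves
    # one reduction per call.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return weight * x / rms

x = torch.tensor([[3.0, 4.0]])
w = torch.ones(2)
out = rms_norm(x, w)  # each feature divided by sqrt((3^2 + 4^2) / 2)
```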

Constructor

from modern_llm.models.transformer import ModernDecoderLM
from modern_llm.config.model_config import ModernLLMConfig

config = ModernLLMConfig(
    vocab_size=50257,
    d_model=768,
    n_layers=12,
    n_heads=12,
    max_seq_len=2048
)
model = ModernDecoderLM(config)
config : ModernLLMConfig, required
    Model configuration containing all hyperparameters, including vocabulary size, model dimensions, number of layers and heads, dropout rates, and architectural choices.

Attributes

config : ModernLLMConfig
    The configuration object passed during initialization.
token_embed : nn.Embedding
    Token embedding layer mapping vocabulary indices to d_model-dimensional vectors. Shape: (vocab_size, d_model).
blocks : nn.ModuleList
    List of n_layers decoder blocks, each containing multi-head attention and feedforward sub-layers.
final_norm : RMSNorm
    Final RMSNorm layer applied before the language model head.
lm_head : nn.Linear
    Output projection from d_model to vocab_size. Weights are tied with token_embed when tie_embeddings=True.
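The weight tying mentioned for lm_head follows the usual PyTorch pattern of sharing one parameter tensor between the embedding and the output projection; a standalone sketch (not this library's code):

```python
import torch
from torch import nn

vocab_size, d_model = 100, 16
token_embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the output projection to the input embedding: both modules now
# reference the same (vocab_size, d_model) parameter tensor, so updates
# to one are seen by the other.
lm_head.weight = token_embed.weight
```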

forward

def forward(
    self,
    input_ids: Tensor,
    attention_mask: Optional[Tensor] = None,
    labels: Optional[Tensor] = None,
) -> Dict[str, Optional[Tensor]]
Causal language model forward pass.
input_ids : Tensor, required
    Input token IDs of shape (batch, seq_len). Values must be in the range [0, vocab_size).
attention_mask : Optional[Tensor]
    Attention mask of shape (batch, seq_len), with 1 for tokens to attend to and 0 for padding. Defaults to all ones.
labels : Optional[Tensor]
    Target token IDs for computing the cross-entropy loss. Shape must match input_ids. Use -100 to ignore specific positions.

Returns

logits : Tensor
    Output logits of shape (batch, seq_len, vocab_size) representing next-token predictions.
loss : Optional[Tensor]
    Scalar cross-entropy loss, computed only when labels are provided; otherwise None.

Example

import torch
from modern_llm.models.transformer import ModernDecoderLM
from modern_llm.config.model_config import ModernLLMConfig

# Initialize model
config = ModernLLMConfig(
    vocab_size=50257,
    d_model=768,
    n_layers=12,
    n_heads=12,
    max_seq_len=2048,
    dropout=0.1,
    use_rope=True,
    tie_embeddings=True
)
model = ModernDecoderLM(config)

# Forward pass for generation
input_ids = torch.randint(0, config.vocab_size, (2, 128))
outputs = model(input_ids)
logits = outputs["logits"]  # Shape: (2, 128, 50257)

# Forward pass with training loss
labels = torch.randint(0, config.vocab_size, (2, 128))
outputs = model(input_ids, labels=labels)
loss = outputs["loss"]  # Scalar tensor
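The -100 convention for labels matches PyTorch's default ignore_index in cross_entropy; a standalone sketch of what happens to ignored positions:

```python
import torch
import torch.nn.functional as F

vocab_size = 100
logits = torch.randn(2, 8, vocab_size)        # (batch, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (2, 8))
labels[:, :2] = -100                          # e.g. mask out prompt tokens

# ignore_index defaults to -100, so masked positions contribute nothing
# to the loss (or to its gradient).
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
```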

Complexity

The forward pass has complexity O(n_layers · seq_len² · d_model), dominated by the quadratic attention score computation.
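A back-of-the-envelope sketch of that scaling, counting only the QK^T score multiply-adds summed across all heads:

```python
def attention_score_flops(n_layers: int, seq_len: int, d_model: int) -> int:
    # Each layer forms a (seq_len x seq_len) score matrix whose entries are
    # dot products over the head dimension; summed over all heads this is
    # seq_len^2 * d_model multiply-adds per layer.
    return n_layers * seq_len ** 2 * d_model

base = attention_score_flops(12, 1024, 768)
doubled = attention_score_flops(12, 2048, 768)
ratio = doubled / base  # doubling the context length quadruples attention cost
```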

DecoderBlock

Transformer decoder block implementing pre-normalization with residual connections. Each block applies two sub-layers:
  1. Multi-head self-attention with RoPE
  2. Feedforward network (SwiGLU or MoE)
Each sub-layer is preceded by RMSNorm and wrapped in a residual connection (pre-norm).

Architecture

h' = h + MultiHeadAttention(RMSNorm(h))
h'' = h' + SwiGLU(RMSNorm(h'))
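The two update equations can be wired up as follows. This is a self-contained sketch, not the library's DecoderBlock: the attention and feedforward sub-layers are stand-in identities, and the RMSNorm here omits the learned scale.

```python
import torch
from torch import nn

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Simplified RMSNorm without the learned per-feature scale.
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class PreNormBlockSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins; the real block uses MultiHeadAttention and SwiGLU/MoE.
        self.attn = nn.Identity()
        self.ffn = nn.Identity()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h = h + self.attn(rms_norm(h))  # h'  = h  + Attn(RMSNorm(h))
        h = h + self.ffn(rms_norm(h))   # h'' = h' + FFN(RMSNorm(h'))
        return h

block = PreNormBlockSketch()
out = block(torch.randn(2, 16, 8))  # shape is preserved: (2, 16, 8)
```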

Constructor

from modern_llm.models.transformer import DecoderBlock
from modern_llm.config.model_config import ModernLLMConfig

config = ModernLLMConfig(d_model=768, n_heads=12)
block = DecoderBlock(config)
config : ModernLLMConfig, required
    Model configuration. The block uses d_model, n_heads, dropout, and the attention configuration parameters.

Attributes

attn : MultiHeadAttention
    Multi-head self-attention layer with optional RoPE, GQA, and attention sinks.
attn_norm : RMSNorm
    RMSNorm layer applied before attention.
ffn : Union[SwiGLU, MixtureOfExperts]
    Feedforward network: SwiGLU by default, or MixtureOfExperts when use_moe=True.
ffn_norm : RMSNorm
    RMSNorm layer applied before the feedforward network.
dropout : nn.Dropout
    Dropout applied to the attention and feedforward outputs.

forward

def forward(
    self,
    hidden_states: Tensor,
    attention_mask: Optional[Tensor] = None
) -> Tensor
Apply decoder block transformations.
hidden_states : Tensor, required
    Input tensor of shape (batch, seq_len, d_model).
attention_mask : Optional[Tensor]
    Additive attention bias of shape (batch, 1, seq_len, seq_len), with zeros for valid positions and -inf for masked positions.

Returns

output : Tensor
    Transformed hidden states with the same shape, (batch, seq_len, d_model).

Example

import torch
from modern_llm.models.transformer import DecoderBlock
from modern_llm.config.model_config import ModernLLMConfig

config = ModernLLMConfig(
    d_model=768,
    n_heads=12,
    ffn_hidden_size=3072,
    dropout=0.1
)
block = DecoderBlock(config)

# Process hidden states
hidden_states = torch.randn(2, 128, 768)
output = block(hidden_states)
print(output.shape)  # torch.Size([2, 128, 768])
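Building the additive bias that forward expects from a simple 1/0 padding mask can be done as follows (build_additive_mask is a hypothetical helper for illustration, not part of the library):

```python
import torch

def build_additive_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    # Combine a causal (lower-triangular) mask with per-token padding into
    # the additive (batch, 1, seq_len, seq_len) bias format: 0 where a
    # query may attend, -inf where it may not.
    batch, seq_len = padding_mask.shape
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    keep = causal.unsqueeze(0) & padding_mask.bool().unsqueeze(1)
    bias = torch.zeros(batch, 1, seq_len, seq_len)
    return bias.masked_fill(~keep.unsqueeze(1), float("-inf"))

padding = torch.tensor([[1, 1, 1, 0]])  # last position is padding
mask = build_additive_mask(padding)     # shape (1, 1, 4, 4)
```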

Complexity

Dominated by attention: O(seq_len² · d_model) per block.
