Design philosophy
Modern LLM implements a decoder-only transformer architecture that incorporates research-backed improvements from recent large language model development. The design prioritizes:

- Training efficiency: RMSNorm and SwiGLU reduce computation while maintaining model quality
- Position encoding: RoPE enables better length extrapolation than learned position embeddings
- Long-context stability: Attention sinks improve performance on sequences beyond training length
- Memory efficiency: Grouped Query Attention (GQA) reduces KV cache memory requirements
The architecture follows the decoder-only design of the original transformer (Vaswani et al., 2017), later popularized by GPT-style models, with modern enhancements from PaLM (Chowdhery et al., 2022), LLaMA (Touvron et al., 2023), and related work.
Model structure
The architecture consists of a stack of identical decoder blocks, each containing:

- Multi-head attention with optional RoPE and attention sinks
- Feedforward network using SwiGLU activation
- RMSNorm applied before each sub-layer (pre-normalization)
- Residual connections around each sub-layer
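The pre-norm block structure above can be sketched as follows (a minimal sketch: `attn`, `ffn`, and the two norms are stand-ins for the components described in the rest of this page):

```python
import numpy as np

def decoder_block(x, attn, ffn, norm1, norm2):
    # Pre-normalization: RMSNorm is applied *before* each sub-layer,
    # with a residual connection around both the attention and the FFN.
    x = x + attn(norm1(x))
    x = x + ffn(norm2(x))
    return x

# Sanity check with identity stand-ins for every component:
x = np.arange(4.0)
y = decoder_block(x, lambda v: v, lambda v: v, lambda v: v, lambda v: v)
```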
Key components
Each architectural component is chosen for specific benefits.

RMSNorm - Efficient normalization

RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean-centering (re-centering) step. It normalizes activations using only the root mean square, reducing normalization compute by roughly 7-64% depending on hardware.

Formula:

y = x * γ / sqrt(mean(x²) + ε)

Learn more in the RMSNorm documentation.
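The formula maps directly to code. A minimal sketch (NumPy is used here purely for illustration):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # y = x * gamma / sqrt(mean(x^2) + eps)
    # Unlike LayerNorm, no mean is subtracted: only the RMS is used.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x * gamma / rms
```

After normalization the activations have unit root mean square along the last axis, and `gamma` provides a learnable per-channel rescale.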
RoPE - Rotary position embeddings
Rotary Position Embeddings (Su et al., 2021) encode position information by rotating query and key vectors in complex space. Unlike learned position embeddings, RoPE:
- Encodes relative positions naturally through rotation
- Extrapolates better to sequences longer than training length
- Requires no additional parameters
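These properties can be seen in a small sketch. The pairwise-rotation layout below is one common convention (implementations differ in how they pair dimensions); the key property is that dot products between rotated queries and keys depend only on the *relative* position:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each pair (x[2i], x[2i+1]) by angle pos * theta_i,
    # where theta_i decreases geometrically with the pair index.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each rotation is orthogonal, vector norms are preserved, and `rope(q, m) · rope(k, n)` depends only on `m - n`, which is exactly the relative-position encoding property.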
SwiGLU - Gated activation

SwiGLU (Shazeer, 2020) combines the Swish activation with the gating mechanism of the GLU family. The PaLM model (Chowdhery et al., 2022) demonstrated that SwiGLU improves quality over standard activations like GELU.

Formula:

SwiGLU(x) = W_o[(W_g x) ⊙ swish(W_v x)]

Learn more in the SwiGLU documentation.
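The formula above can be sketched directly; the weight shapes here are illustrative assumptions (gate and value projections map d → d_ff, the output projection maps back d_ff → d):

```python
import numpy as np

def swish(x):
    # swish(x) = x * sigmoid(x), also known as SiLU
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_g, W_v, W_o):
    # SwiGLU(x) = W_o [(W_g x) ⊙ swish(W_v x)]
    # The gate path (W_g x) is multiplied elementwise with the
    # swish-activated value path (W_v x).
    return W_o @ ((W_g @ x) * swish(W_v @ x))
```

Note that a SwiGLU FFN has three weight matrices where a standard GELU FFN has two, which is why the hidden width is often shrunk (e.g. to 8/3·d) to keep parameter counts comparable.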
Attention sinks - Long-context stability
Attention sinks (inspired by Press et al., 2021) are learnable tokens prepended to the sequence that every token can attend to. They improve model stability on long sequences by providing consistent attention targets.

Learn more in the Attention sinks documentation.
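A minimal sketch of the idea, assuming the sink tokens are simply prepended to the key/value sequences before the usual softmax attention (the actual implementation may handle sinks differently, e.g. as bias terms):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attention_with_sinks(q, K, V, sink_k, sink_v):
    # Prepend learnable sink keys/values so every query always has
    # a consistent target to attend to, regardless of content.
    K_all = np.concatenate([sink_k, K], axis=0)
    V_all = np.concatenate([sink_v, V], axis=0)
    scores = q @ K_all.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ V_all
```

Because softmax weights always sum to 1, attention mass that would otherwise be forced onto arbitrary early tokens can land on the sinks instead, which is the stabilizing effect described above.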
Complexity analysis
Computational complexity
For a model with:

- L layers
- d model dimension
- h attention heads
- s sequence length

the dominant per-layer cost is O(s·d²) for the attention projections and the feedforward network, plus O(s²·d) for computing attention scores, giving O(L·(s·d² + s²·d)) overall.
Memory complexity
Per-layer memory requirements:

| Component | Parameters | Activation memory (per token) |
|---|---|---|
| Attention QKV | 3d² | 3d |
| Attention output | d² | d |
| SwiGLU gate | d × 2·d_ff | 2·d_ff |
| SwiGLU proj | d_ff × d | d |
| RMSNorm (×2) | 2d | 2d |

d_ff is the feedforward hidden dimension (typically 4d).
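The parameter column can be totalled with a small helper (a sketch that simply sums the table rows, writing the feedforward hidden width as `d_ff`):

```python
def params_per_layer(d, d_ff):
    # Sum of the per-layer parameter counts from the table:
    # QKV (3d^2) + attention output (d^2) + SwiGLU gate (2*d*d_ff)
    # + SwiGLU projection (d_ff*d) + two RMSNorm scales (2d)
    return 3 * d * d + d * d + 2 * d * d_ff + d_ff * d + 2 * d
```

With d_ff = 4d this works out to roughly 16d² per layer, which is a handy rule of thumb for sizing models.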
Grouped Query Attention (GQA) reduces KV cache memory by sharing key/value projections across multiple query heads. For example, with 32 query heads and 8 KV heads, the KV cache size is reduced by 4×.
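The 4× figure follows directly from the cache being linear in the number of KV heads. A quick estimate (a sketch; assumes K and V are cached in half precision, 2 bytes per element):

```python
def kv_cache_bytes(layers, seq_len, head_dim, n_kv_heads, bytes_per_elem=2):
    # Two cached tensors per layer (K and V), each of shape
    # seq_len x n_kv_heads x head_dim.
    return 2 * layers * seq_len * n_kv_heads * head_dim * bytes_per_elem
```

For example, a 32-layer model at 4096 tokens with head_dim 128 needs 2 GiB of KV cache with 32 KV heads, but only 0.5 GiB with 8 KV heads, the 4× reduction noted above.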
Configuration example
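As an illustrative sketch of what a typical small-scale configuration might look like (all field names and values here are hypothetical assumptions, not taken from any released model):

```python
# Hypothetical small decoder-only configuration (illustrative values).
config = dict(
    n_layers=12,
    d_model=768,
    n_heads=12,
    n_kv_heads=4,        # GQA: 3 query heads share each KV head
    d_ff=3072,           # feedforward hidden width, 4 * d_model
    vocab_size=32000,
    max_seq_len=2048,
    rope_base=10000.0,   # RoPE frequency base
    n_sink_tokens=4,     # learnable attention-sink tokens
    norm_eps=1e-6,       # RMSNorm epsilon
)
```

Reasonable configurations keep `n_heads` divisible by `n_kv_heads` (so query heads group evenly) and `d_model` divisible by `n_heads`.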
References
Attention Is All You Need
Vaswani et al., 2017 - Original transformer architecture
Root Mean Square Layer Normalization
Zhang & Sennrich, 2019 - RMSNorm paper
RoFormer: Enhanced Transformer with Rotary Position Embedding
Su et al., 2021 - RoPE position encoding
GLU Variants Improve Transformer
Shazeer, 2020 - SwiGLU activation function
PaLM: Scaling Language Modeling with Pathways
Chowdhery et al., 2022 - Modern architecture design
Train Short, Test Long
Press et al., 2021 - Attention sinks motivation
LLaMA: Open and Efficient Foundation Language Models
Touvron et al., 2023 - Reference decoder-only architecture
Next steps
RMSNorm
Learn about efficient normalization
RoPE
Understand rotary position embeddings
SwiGLU
Explore gated activation functions
Attention sinks
Deep dive into long-context stability