
Design philosophy

Modern LLM implements a decoder-only transformer architecture that incorporates research-backed improvements from recent large language model development. The design prioritizes:
  • Training efficiency: RMSNorm cuts normalization overhead, and SwiGLU improves quality at comparable compute
  • Position encoding: RoPE enables better length extrapolation than learned position embeddings
  • Long-context stability: Attention sinks improve performance on sequences beyond training length
  • Memory efficiency: Grouped Query Attention (GQA) reduces KV cache memory requirements
The architecture builds on the original transformer (Vaswani et al., 2017) and follows the decoder-only design popularized by GPT-style models, with modern enhancements from PaLM (Chowdhery et al., 2022), LLaMA (Touvron et al., 2023), and related work.

Model structure

The architecture consists of a stack of identical decoder blocks, each containing:
  1. Multi-head attention with optional RoPE and attention sinks
  2. Feedforward network using SwiGLU activation
  3. RMSNorm applied before each sub-layer (pre-normalization)
  4. Residual connections around each sub-layer
Input Tokens

Token Embedding

┌────────────────────┐
│  Decoder Block 1   │
│  ┌──────────────┐  │
│  │   RMSNorm    │  │
│  ├──────────────┤  │
│  │ Multi-Head   │  │
│  │  Attention   │  │
│  │  (with RoPE) │  │
│  ├──────────────┤  │
│  │   + Residual │  │
│  ├──────────────┤  │
│  │   RMSNorm    │  │
│  ├──────────────┤  │
│  │   SwiGLU     │  │
│  │   Feedforward│  │
│  ├──────────────┤  │
│  │   + Residual │  │
│  └──────────────┘  │
└────────────────────┘

... (N blocks)

Final RMSNorm

LM Head (Logits)
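The pre-norm ordering shown above (RMSNorm → sub-layer → residual, twice per block) can be sketched in plain Python. This is a structural sketch only: `attention`, `ffn`, and `norm` are hypothetical stand-ins passed in as functions, not part of the library's API.

```python
def decoder_block(x, attention, ffn, norm):
    # Pre-normalization: the norm runs *before* each sub-layer,
    # and a residual connection wraps around it.
    h = [a + b for a, b in zip(x, attention(norm(x)))]  # attention sub-layer
    return [a + b for a, b in zip(h, ffn(norm(h)))]     # feedforward sub-layer
```

With sub-layers that output zeros, the block reduces to the identity — exactly the behavior the residual connections guarantee, which is what makes deep stacks of these blocks trainable.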

Key components

Each architectural component is chosen for specific benefits:
RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean-centering (re-centering) step and the learned bias. It normalizes activations using only the root mean square, which the original paper reports reduces running time by roughly 7–64% depending on the hardware.

Formula: y = x × γ / sqrt(mean(x²) + ε)

Learn more in RMSNorm documentation.
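The formula can be checked with a minimal pure-Python sketch (γ taken as all-ones here for simplicity):

```python
def rmsnorm(x, gamma, eps=1e-5):
    # y = x * gamma / sqrt(mean(x^2) + eps)
    # Unlike LayerNorm, there is no mean subtraction and no bias term.
    ms = sum(v * v for v in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [v * scale * g for v, g in zip(x, gamma)]
```

With γ = 1, the output always has a root mean square of approximately 1, regardless of the input's scale.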
Rotary Position Embeddings (Su et al., 2021) encode position information by rotating query and key vectors in complex space. Unlike learned position embeddings, RoPE:
  • Encodes relative positions naturally through rotation
  • Extrapolates better to sequences longer than training length
  • Requires no additional parameters
Learn more in RoPE documentation.
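The relative-position property can be illustrated with a minimal 2-D sketch. This rotates a single dimension pair; a full implementation rotates every consecutive pair, with per-pair frequencies derived from `rope_theta`.

```python
import math

def rotate_pair(x, y, pos, freq=1.0):
    # RoPE rotates each (x, y) dimension pair by an angle of pos * freq radians;
    # in a full implementation, freq = theta ** (-2i / d) for pair index i.
    a = pos * freq
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def dot(p, q):
    return p[0] * q[0] + p[1] * q[1]

# The attention score q·k depends only on the relative offset (here 3),
# not on the absolute positions:
q, k = (1.0, 2.0), (0.5, -1.0)
s1 = dot(rotate_pair(*q, pos=7), rotate_pair(*k, pos=4))
s2 = dot(rotate_pair(*q, pos=10), rotate_pair(*k, pos=7))
```

Because the score depends only on the offset, the model needs no extra parameters for positions and generalizes more gracefully to offsets it rarely saw in training.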
SwiGLU (Shazeer, 2020) combines the Swish activation with the gating mechanism of the GLU family. The PaLM model (Chowdhery et al., 2022) demonstrated that SwiGLU improves quality over standard activations such as GELU.

Formula: SwiGLU(x) = W_o[swish(W_g x) ⊙ (W_v x)]

Learn more in SwiGLU documentation.
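A minimal sketch of the gated feedforward, with tiny hand-written weights standing in for the learned projection matrices W_g, W_v, and W_o:

```python
import math

def swish(v):
    # swish(v) = v * sigmoid(v), also known as SiLU
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(x, W_g, W_v, W_o):
    # SwiGLU(x) = W_o [ swish(W_g x) ⊙ (W_v x) ]
    def matvec(W, v):
        return [sum(w * a for w, a in zip(row, v)) for row in W]
    gate = [swish(g) for g in matvec(W_g, x)]   # gate path, through swish
    val = matvec(W_v, x)                        # value path, linear
    hidden = [g * v for g, v in zip(gate, val)] # elementwise gating
    return matvec(W_o, hidden)
```

The gate path decides, per hidden unit, how much of the value path passes through — this multiplicative interaction is what distinguishes GLU variants from plain activations like GELU.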
Attention sinks (inspired by Press et al., 2021) are learnable tokens prepended to the sequence that every token can attend to. They improve model stability on long sequences by providing consistent attention targets.

Learn more in Attention sinks documentation.
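The mechanism can be sketched at the level of attention weights. The sink logits here are hypothetical learnable scalars prepended before the softmax, so every query always has stable positions to attend to.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_weights(token_scores, sink_scores):
    # Prepend the (learnable) sink logits; the sinks absorb probability
    # mass that softmax would otherwise force onto real tokens.
    return softmax(sink_scores + token_scores)
```

Note the output length is `len(sink_scores) + len(token_scores)`: the sink positions receive real attention weight, which is the point — they give long sequences a consistent place to park attention.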

Complexity analysis

Computational complexity

For a model with:
  • L layers
  • d model dimension
  • h attention heads
  • s sequence length
The forward pass complexity is:
O(L × s² × d)  [attention]
+ O(L × s × d × 4d)  [feedforward with 4d hidden size]
≈ O(L × s × d × (s + 4d))
The quadratic attention term dominates for long sequences. Note that the head count h cancels out: h heads of dimension d/h cost the same as one head of dimension d. RoPE's per-position rotations are compatible with KV caching during generation, and attention sinks keep generation stable on sequences beyond the training length.
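The estimate can be turned into a rough FLOP counter. This is a back-of-the-envelope sketch that counts only the dominant matrix multiplies, assuming one multiply-accumulate is 2 FLOPs:

```python
def forward_flops(L, s, d, ffn_hidden):
    # Attention: QK^T and attn·V are each ~s^2 * d multiply-adds per layer
    # (h heads of dimension d/h sum back to d, so h drops out).
    attn = 2 * (2 * s * s * d)
    # Feedforward: two matmuls of size d x ffn_hidden per token
    # (SwiGLU's extra gate matrix is ignored in this rough count).
    ffn = 2 * (2 * s * d * ffn_hidden)
    return L * (attn + ffn)

# Example with the small configuration used later in this page:
flops = forward_flops(12, 2048, 768, 3072)
```

Doubling the sequence length quadruples the attention term but only doubles the feedforward term, which is why the s² term dominates at long context.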

Memory complexity

Per-layer memory requirements:
Component           Parameters   Activation memory (per token)
Attention QKV       3d²          3d
Attention output    d²           d
SwiGLU gate         d × 2h       2h
SwiGLU proj         h × d        d
RMSNorm (×2)        2d           2d

Where h here is the feedforward hidden dimension (typically 4d), not the attention head count used in the complexity analysis above.
Grouped Query Attention (GQA) reduces KV cache memory by sharing key/value projections across multiple query heads. For example, with 32 query heads and 8 KV heads, the KV cache size is reduced by 4×.
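The cache saving is easy to check. This sketch counts cache size in elements per token, assuming a per-head dimension of d/h:

```python
def kv_cache_per_token(n_kv_heads, head_dim, n_layers):
    # Two cached tensors (K and V), each n_kv_heads * head_dim
    # elements per token, per layer.
    return 2 * n_kv_heads * head_dim * n_layers

full = kv_cache_per_token(32, 128, 32)  # MHA: one KV head per query head
gqa = kv_cache_per_token(8, 128, 32)    # GQA: 8 KV heads shared by 32 query heads
```

Only the KV heads are cached, so the query head count never enters the formula — reducing 32 KV heads to 8 shrinks the cache by exactly 4×.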

Configuration example

Here’s a typical small-scale configuration:
from modern_llm.config import ModernLLMConfig

config = ModernLLMConfig(
    vocab_size=50257,
    d_model=768,
    n_heads=12,
    n_layers=12,
    ffn_hidden_size=3072,  # 4 × d_model
    max_seq_len=2048,
    
    # Modern optimizations
    use_rope=True,
    rope_theta=10000.0,
    use_attention_sinks=True,
    num_attention_sinks=2,
    use_gqa=True,
    gqa_groups=4,
    
    # Regularization
    dropout=0.1,
    rmsnorm_eps=1e-5,
)
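As a sanity check, the configuration above implies a parameter count of roughly 152 million. This is a rough estimate built from the per-layer counts in the Memory complexity section; exact totals depend on implementation details such as tied embeddings, sink parameters, and the smaller K/V projections under GQA, all of which are ignored here.

```python
def approx_params(vocab, d, n_layers, ffn_hidden):
    embed = vocab * d  # token embedding (assumed tied with the LM head)
    per_layer = (
        3 * d * d             # QKV projections (ignores the GQA reduction)
        + d * d               # attention output projection
        + 2 * d * ffn_hidden  # SwiGLU gate + value projections
        + ffn_hidden * d      # SwiGLU output projection
        + 2 * d               # two RMSNorm scale vectors
    )
    return embed + n_layers * per_layer + d  # + final RMSNorm

total = approx_params(50257, 768, 12, 3072)
```

Under these assumptions the total works out to about 152M parameters — somewhat above GPT-2 small's 124M at the same d_model and depth, because SwiGLU uses three feedforward matrices instead of two.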

References

Attention Is All You Need

Vaswani et al., 2017 - Original transformer architecture

Root Mean Square Layer Normalization

Zhang & Sennrich, 2019 - RMSNorm paper

RoFormer: Enhanced Transformer with Rotary Position Embedding

Su et al., 2021 - RoPE position encoding

GLU Variants Improve Transformer

Shazeer, 2020 - SwiGLU activation function

PaLM: Scaling Language Modeling with Pathways

Chowdhery et al., 2022 - Modern architecture design

Train Short, Test Long

Press et al., 2021 - Attention sinks motivation

Next steps

RMSNorm

Learn about efficient normalization

RoPE

Understand rotary position embeddings

SwiGLU

Explore gated activation functions

Attention sinks

Deep dive into long-context stability
