Design philosophy
Modern LLM implements a decoder-only transformer architecture that incorporates research-backed improvements from recent large language model development. The design prioritizes:

- Training efficiency: RMSNorm and SwiGLU reduce computation while maintaining model quality
- Position encoding: RoPE enables better length extrapolation than learned position embeddings
- Long-context stability: Attention sinks improve performance on sequences beyond training length
- Memory efficiency: Grouped Query Attention (GQA) reduces KV cache memory requirements
The architecture follows the decoder-only design of the original transformer (Vaswani et al., 2017), later popularized by GPT-style models, with modern enhancements from PaLM (Chowdhery et al., 2022), LLaMA (Touvron et al., 2023), and related work.
Model structure
The architecture consists of a stack of identical decoder blocks, each containing:

- Multi-head attention with optional RoPE and attention sinks
- Feedforward network using SwiGLU activation
- RMSNorm applied before each sub-layer (pre-normalization)
- Residual connections around each sub-layer
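The pre-norm block structure above can be sketched as follows (a minimal sketch: `attn`, `ffn`, and the two norms are stand-ins for the components described in the rest of this page):

```python
import numpy as np

def decoder_block(x, attn, ffn, norm1, norm2):
    # Pre-normalization: RMSNorm is applied *before* each sub-layer,
    # with a residual connection around both the attention and the FFN.
    x = x + attn(norm1(x))
    x = x + ffn(norm2(x))
    return x

# Sanity check with identity stand-ins for every component:
x = np.arange(4.0)
y = decoder_block(x, lambda v: v, lambda v: v, lambda v: v, lambda v: v)
```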
Key components
Each architectural component is chosen for specific benefits.

RMSNorm - Efficient normalization

RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean-centering (re-centering) step. It normalizes activations using only the root mean square, reducing normalization compute by roughly 7-64% depending on hardware.

Formula:

y = x * γ / sqrt(mean(x²) + ε)

Learn more in the RMSNorm documentation.
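The formula maps directly to code. A minimal sketch (NumPy is used here purely for illustration):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # y = x * gamma / sqrt(mean(x^2) + eps)
    # Unlike LayerNorm, no mean is subtracted: only the RMS is used.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x * gamma / rms
```

After normalization the activations have unit root mean square along the last axis, and `gamma` provides a learnable per-channel rescale.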
RoPE - Rotary position embeddings
Rotary Position Embeddings (Su et al., 2021) encode position information by rotating query and key vectors in complex space. Unlike learned position embeddings, RoPE:
- Encodes relative positions naturally through rotation
- Extrapolates better to sequences longer than training length
- Requires no additional parameters
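These properties can be seen in a small sketch. The pairwise-rotation layout below is one common convention (implementations differ in how they pair dimensions); the key property is that dot products between rotated queries and keys depend only on the *relative* position:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each pair (x[2i], x[2i+1]) by angle pos * theta_i,
    # where theta_i decreases geometrically with the pair index.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each rotation is orthogonal, vector norms are preserved, and `rope(q, m) · rope(k, n)` depends only on `m - n`, which is exactly the relative-position encoding property.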
SwiGLU - Gated activation

SwiGLU (Shazeer, 2020) combines the Swish activation with the gating mechanism of the GLU family. The PaLM model (Chowdhery et al., 2022) demonstrated that SwiGLU improves quality over standard activations like GELU.

Formula:

SwiGLU(x) = W_o[(W_g x) ⊙ swish(W_v x)]

Learn more in the SwiGLU documentation.
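The formula above can be sketched directly; the weight shapes here are illustrative assumptions (gate and value projections map d → d_ff, the output projection maps back d_ff → d):

```python
import numpy as np

def swish(x):
    # swish(x) = x * sigmoid(x), also known as SiLU
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_g, W_v, W_o):
    # SwiGLU(x) = W_o [(W_g x) ⊙ swish(W_v x)]
    # The gate path (W_g x) is multiplied elementwise with the
    # swish-activated value path (W_v x).
    return W_o @ ((W_g @ x) * swish(W_v @ x))
```

Note that a SwiGLU FFN has three weight matrices where a standard GELU FFN has two, which is why the hidden width is often shrunk (e.g. to 8/3·d) to keep parameter counts comparable.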
Attention sinks - Long-context stability
Attention sinks (inspired by Press et al., 2021) are learnable tokens prepended to the sequence that every token can attend to. They improve model stability on long sequences by providing consistent attention targets.

Learn more in the Attention sinks documentation.
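A minimal sketch of the idea, assuming the sink tokens are simply prepended to the key/value sequences before the usual softmax attention (the actual implementation may handle sinks differently, e.g. as bias terms):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attention_with_sinks(q, K, V, sink_k, sink_v):
    # Prepend learnable sink keys/values so every query always has
    # a consistent target to attend to, regardless of content.
    K_all = np.concatenate([sink_k, K], axis=0)
    V_all = np.concatenate([sink_v, V], axis=0)
    scores = q @ K_all.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ V_all
```

Because softmax weights always sum to 1, attention mass that would otherwise be forced onto arbitrary early tokens can land on the sinks instead, which is the stabilizing effect described above.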
Complexity analysis
Computational complexity
For a model with:

- L layers
- d model dimension
- h attention heads
- s sequence length

the dominant per-layer cost is O(s·d²) for the attention projections and the feedforward network, plus O(s²·d) for computing attention scores, giving O(L·(s·d² + s²·d)) overall.
Memory complexity
Per-layer memory requirements:

| Component | Parameters | Activation memory (per token) |
|---|---|---|
| Attention QKV | 3d² | 3d |
| Attention output | d² | d |
| SwiGLU gate | d × 2·d_ff | 2·d_ff |
| SwiGLU proj | d_ff × d | d |
| RMSNorm (×2) | 2d | 2d |

d_ff is the feedforward hidden dimension (typically 4d).
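The parameter column can be totalled with a small helper (a sketch that simply sums the table rows, writing the feedforward hidden width as `d_ff`):

```python
def params_per_layer(d, d_ff):
    # Sum of the per-layer parameter counts from the table:
    # QKV (3d^2) + attention output (d^2) + SwiGLU gate (2*d*d_ff)
    # + SwiGLU projection (d_ff*d) + two RMSNorm scales (2d)
    return 3 * d * d + d * d + 2 * d * d_ff + d_ff * d + 2 * d
```

With d_ff = 4d this works out to roughly 16d² per layer, which is a handy rule of thumb for sizing models.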
Grouped Query Attention (GQA) reduces KV cache memory by sharing key/value projections across multiple query heads. For example, with 32 query heads and 8 KV heads, the KV cache size is reduced by 4×.
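The 4× figure follows directly from the cache being linear in the number of KV heads. A quick estimate (a sketch; assumes K and V are cached in half precision, 2 bytes per element):

```python
def kv_cache_bytes(layers, seq_len, head_dim, n_kv_heads, bytes_per_elem=2):
    # Two cached tensors per layer (K and V), each of shape
    # seq_len x n_kv_heads x head_dim.
    return 2 * layers * seq_len * n_kv_heads * head_dim * bytes_per_elem
```

For example, a 32-layer model at 4096 tokens with head_dim 128 needs 2 GiB of KV cache with 32 KV heads, but only 0.5 GiB with 8 KV heads, the 4× reduction noted above.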
Configuration example
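As an illustrative sketch of what a typical small-scale configuration might look like (all field names and values here are hypothetical assumptions, not taken from any released model):

```python
# Hypothetical small decoder-only configuration (illustrative values).
config = dict(
    n_layers=12,
    d_model=768,
    n_heads=12,
    n_kv_heads=4,        # GQA: 3 query heads share each KV head
    d_ff=3072,           # feedforward hidden width, 4 * d_model
    vocab_size=32000,
    max_seq_len=2048,
    rope_base=10000.0,   # RoPE frequency base
    n_sink_tokens=4,     # learnable attention-sink tokens
    norm_eps=1e-6,       # RMSNorm epsilon
)
```

Reasonable configurations keep `n_heads` divisible by `n_kv_heads` (so query heads group evenly) and `d_model` divisible by `n_heads`.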
References
Attention Is All You Need
Vaswani et al., 2017 - Original transformer architecture
Root Mean Square Layer Normalization
Zhang & Sennrich, 2019 - RMSNorm paper
RoFormer: Enhanced Transformer with Rotary Position Embedding
Su et al., 2021 - RoPE position encoding
GLU Variants Improve Transformer
Shazeer, 2020 - SwiGLU activation function
PaLM: Scaling Language Modeling with Pathways
Chowdhery et al., 2022 - Modern architecture design
Train Short, Test Long
Press et al., 2021 - Attention sinks motivation
LLaMA: Open and Efficient Foundation Language Models
Touvron et al., 2023 - Reference decoder-only architecture
Next steps
RMSNorm
Learn about efficient normalization
RoPE
Understand rotary position embeddings
SwiGLU
Explore gated activation functions
Attention sinks
Deep dive into long-context stability