Modern LLM
A from-scratch implementation of a frontier-style LLM training pipeline, demonstrating modern architectural choices and a complete alignment workflow. The 253M parameter model achieves 33% lower perplexity than GPT-2 on WikiText-2.

What is Modern LLM?
Modern LLM is a comprehensive, production-ready implementation of a language model training pipeline that showcases:

- Modern architecture - RoPE, RMSNorm, SwiGLU, and attention sinks
- Complete alignment pipeline - Pretrain → SFT → DPO → Verifier
- Research-grounded code - Every component includes paper references and mathematical documentation
- Practical performance - Achieves 27.03 perplexity on WikiText-2, significantly beating GPT-2’s 40.64
Quick Start
Get up and running in 5 minutes with a smoke test, then explore pre-trained checkpoints
Installation
Complete installation guide with environment setup and dependency management
Architecture
Deep dive into the model architecture with RoPE, RMSNorm, SwiGLU, and attention sinks
Training Pipeline
Learn about the complete training workflow from pretraining to alignment
Performance results
Our 253M parameter model demonstrates strong performance across key metrics:

Perplexity on WikiText-2
| Model | Parameters | Perplexity | vs GPT-2 |
|---|---|---|---|
| GPT-2 (baseline) | 124M | 40.64 | — |
| Modern LLM (pretrain) | 253M | 27.03 | -33% |
| Modern LLM (SFT) | 253M | 34.14 | -16% |
| Modern LLM (DPO) | 253M | 34.32 | -16% |
Lower perplexity is better. The pretrained model achieves the best perplexity score, while SFT and DPO models are optimized for instruction-following and preference alignment rather than raw language modeling.
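Perplexity, as reported in the table above, is the exponential of the average per-token negative log-likelihood. A minimal sketch of the computation (the function name is ours, not the project's API):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` holds the model's log-probability of each
    ground-truth token in the evaluation text.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns every token probability 1/4 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 8)
```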
Key features
RMSNorm
Root Mean Square Layer Normalization (Zhang & Sennrich, 2019). Faster than LayerNorm with no mean subtraction; used in LLaMA and PaLM. Stabilizes training without centering.
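The idea fits in a few lines of plain Python (an illustrative sketch, not the project's implementation): each activation is divided by the vector's root mean square and scaled by a learned weight, with no mean subtraction and no bias.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm (Zhang & Sennrich, 2019): scale by the reciprocal RMS.

    Unlike LayerNorm there is no mean subtraction and no bias term,
    which saves work while still stabilizing activations.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```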
RoPE
Rotary Position Embeddings (Su et al., 2021). Encodes relative positions via rotation matrices applied to Q/K. Better length extrapolation than absolute embeddings.
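Concretely, each pair of dimensions in a query/key vector is rotated by an angle proportional to its position, so attention scores depend only on relative offsets. A per-head sketch (names are ours):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary position embedding (Su et al., 2021) for one head vector.

    Dimension pair (2i, 2i+1) is rotated by pos * theta_i with
    theta_i = base^(-2i/d), so q . k depends only on relative position.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because rotations compose, the dot product between a query at position m and a key at position n depends only on m - n, which is what gives RoPE its relative-position behavior and better extrapolation.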
SwiGLU
Swish-Gated Linear Unit (Shazeer, 2020; PaLM, 2022). A gated linear unit with the Swish activation; 2-4% better than GELU at a similar parameter count.
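The gating itself is simple; a sketch with the two linear projections of the input abstracted away as `gate` and `value` (our names, not the project's):

```python
import math

def swish(x, beta=1.0):
    """Swish / SiLU activation: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def swiglu(gate, value):
    """SwiGLU (Shazeer, 2020): Swish(x @ W) elementwise-times (x @ V).

    `gate` and `value` stand for two projections of the same input;
    the feed-forward block applies a third projection to the result.
    """
    return [swish(g) * v for g, v in zip(gate, value)]
```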
Attention Sinks
Streaming Attention (Press et al., 2021; Xiao et al., 2023). Learnable “sink” tokens that every position can attend to, stabilizing generation beyond the training context length.
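One way to picture the mechanism (a simplified sketch under our own naming, not the project's code): a per-head sink logit joins the softmax over attention scores, absorbing probability mass so tokens are never forced to attend heavily to uninformative positions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend_with_sink(scores, sink_logit=0.0):
    """Attention weights with a learnable sink (Xiao et al., 2023).

    Returns weights over [sink] + real positions; the sink's weight is
    discarded after the softmax, so it only soaks up excess mass.
    """
    return softmax([sink_logit] + scores)
```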
Training pipeline
The complete alignment workflow follows modern best practices:

Pretraining
Train the base language model on WikiText-103 and TinyStories (600M tokens total). The model learns general language understanding and generation capabilities.

Output: 253M parameter base model with 27.03 perplexity on WikiText-2
Supervised Fine-Tuning (SFT)
Fine-tune on instruction-response pairs from the Alpaca dataset (52K examples). The model learns to follow instructions and respond helpfully.

Reference: Ouyang et al., 2022 (InstructGPT)
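A common convention in SFT (assumed here, not confirmed from the project's code) is to compute the cross-entropy loss only over response tokens, masking out the prompt:

```python
def sft_loss(token_logprobs, is_response):
    """Masked next-token loss for SFT.

    `token_logprobs` are the model's log-probs of the target tokens;
    `is_response` flags which positions belong to the response.
    Prompt positions contribute nothing to the loss.
    """
    losses = [-lp for lp, m in zip(token_logprobs, is_response) if m]
    return sum(losses) / len(losses)
```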
Direct Preference Optimization (DPO)
Align the model with human preferences using the Anthropic HH-RLHF dataset (161K preference pairs). The model learns to generate responses humans prefer.

Reference: Rafailov et al., 2023
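The DPO objective is a logistic loss on the policy-vs-reference log-probability margin between the chosen and rejected responses. A per-pair sketch (variable names are ours):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (Rafailov et al., 2023).

    Arguments are summed log-probs of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    beta controls how far the policy may drift from the reference.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid
```

At initialization the policy equals the reference, so the margin is zero and the loss starts at -log 0.5 ≈ 0.693; it falls as the policy widens the chosen-over-rejected margin relative to the reference, with no reward model or RL loop required.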
Why Modern LLM?
Research-grounded implementation
Every component includes detailed mathematical documentation and paper references. The codebase serves as both a working implementation and an educational resource.
Complete alignment workflow
Unlike many implementations that stop at pretraining, Modern LLM includes the full alignment pipeline used in production LLMs:
- Pretraining for language understanding
- SFT for instruction-following
- DPO for preference alignment (without RL complexity)
- Verifier for answer validation
Modern architectural choices
The architecture implements state-of-the-art components from recent research:
- RMSNorm instead of LayerNorm (faster, used in LLaMA)
- RoPE instead of absolute positions (better extrapolation)
- SwiGLU instead of GELU (2-4% better performance)
- Attention sinks for long-context stability
- Grouped Query Attention (optional) for efficient inference
Practical performance
The 253M parameter model achieves:
- 27.03 perplexity on WikiText-2 (vs GPT-2’s 40.64)
- 33% improvement over GPT-2 baseline
- Competitive performance at roughly 2x GPT-2’s parameter count
- Efficient training on consumer hardware (RTX 3060)
What’s included
Model architecture
- Decoder-only Transformer
- RoPE positional encodings
- RMSNorm layers
- SwiGLU activations
- Attention sinks
- Optional GQA and MoE
Training pipeline
- Language model pretraining
- Supervised fine-tuning
- Direct preference optimization
- Verifier training
- Automatic mixed precision
- Gradient accumulation
Evaluation tools
- Perplexity computation
- Few-shot task evaluation
- Generation quality metrics
- GPT-2 baseline comparison
- Attention visualization
Model configuration
The 253M parameter model uses the following configuration:

Next steps
Run the quick start
Get started in 5 minutes with a smoke test
Install dependencies
Set up your environment with all required packages
Explore the architecture
Learn about the model architecture in detail
Train your own model
Follow the complete training workflow
References
- RMSNorm: Zhang, B., & Sennrich, R. (2019). Root Mean Square Layer Normalization. NeurIPS.
- RoPE: Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
- SwiGLU: Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202.
- Attention Sinks: Xiao, G., et al. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453.
- InstructGPT: Ouyang, L., et al. (2022). Training language models to follow instructions. NeurIPS.
- DPO: Rafailov, R., et al. (2023). Direct Preference Optimization. NeurIPS.
- Verifier: Lightman, H., et al. (2023). Let’s Verify Step by Step. arXiv:2305.20050.