Modern LLM
A from-scratch implementation of a frontier-style LLM training pipeline, demonstrating modern architectural choices and a complete alignment workflow. The 253M parameter model achieves 33% lower perplexity than GPT-2 on WikiText-2.

What is Modern LLM?
Modern LLM is a comprehensive, production-ready implementation of a language model training pipeline that showcases:

- Modern architecture - RoPE, RMSNorm, SwiGLU, and attention sinks
- Complete alignment pipeline - Pretrain → SFT → DPO → Verifier
- Research-grounded code - Every component includes paper references and mathematical documentation
- Practical performance - Achieves 27.03 perplexity on WikiText-2, significantly beating GPT-2’s 40.64
Quick Start
Get up and running in 5 minutes with a smoke test, then explore pre-trained checkpoints
Installation
Complete installation guide with environment setup and dependency management
Architecture
Deep dive into the model architecture with RoPE, RMSNorm, SwiGLU, and attention sinks
Training Pipeline
Learn about the complete training workflow from pretraining to alignment
Performance results
Our 253M parameter model demonstrates strong performance across key metrics:

Perplexity on WikiText-2
| Model | Parameters | Perplexity | vs GPT-2 |
|---|---|---|---|
| GPT-2 (baseline) | 124M | 40.64 | — |
| Modern LLM (pretrain) | 253M | 27.03 | -33% |
| Modern LLM (SFT) | 253M | 34.14 | -16% |
| Modern LLM (DPO) | 253M | 34.32 | -16% |
Lower perplexity is better. The pretrained model achieves the best perplexity score, while SFT and DPO models are optimized for instruction-following and preference alignment rather than raw language modeling.
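Perplexity, as reported in the table above, is the exponential of the average per-token negative log-likelihood. A minimal sketch of the computation (the function name is ours, not the project's API):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` holds the model's log-probability of each
    ground-truth token in the evaluation text.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns every token probability 1/4 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 8)
```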
Key features
RMSNorm
Root Mean Square Layer Normalization (Zhang & Sennrich, 2019). Faster than LayerNorm with no mean subtraction; used in LLaMA and PaLM. Stabilizes training without centering.
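The idea fits in a few lines of plain Python (an illustrative sketch, not the project's implementation): each activation is divided by the vector's root mean square and scaled by a learned weight, with no mean subtraction and no bias.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm (Zhang & Sennrich, 2019): scale by the reciprocal RMS.

    Unlike LayerNorm there is no mean subtraction and no bias term,
    which saves work while still stabilizing activations.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```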
RoPE
Rotary Position Embeddings (Su et al., 2021). Encodes relative positions via rotation matrices applied to Q/K. Better length extrapolation than absolute embeddings.
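Concretely, each pair of dimensions in a query/key vector is rotated by an angle proportional to its position, so attention scores depend only on relative offsets. A per-head sketch (names are ours):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary position embedding (Su et al., 2021) for one head vector.

    Dimension pair (2i, 2i+1) is rotated by pos * theta_i with
    theta_i = base^(-2i/d), so q . k depends only on relative position.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because rotations compose, the dot product between a query at position m and a key at position n depends only on m - n, which is what gives RoPE its relative-position behavior and better extrapolation.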
SwiGLU
Swish-Gated Linear Unit (Shazeer, 2020; PaLM, 2022). A gated linear unit with the Swish activation; 2-4% better than GELU at a similar parameter count.
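The gating itself is simple; a sketch with the two linear projections of the input abstracted away as `gate` and `value` (our names, not the project's):

```python
import math

def swish(x, beta=1.0):
    """Swish / SiLU activation: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def swiglu(gate, value):
    """SwiGLU (Shazeer, 2020): Swish(x @ W) elementwise-times (x @ V).

    `gate` and `value` stand for two projections of the same input;
    the feed-forward block applies a third projection to the result.
    """
    return [swish(g) * v for g, v in zip(gate, value)]
```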
Attention Sinks
Streaming Attention (Press et al., 2021; Xiao et al., 2023). Learnable “sink” tokens that every position can attend to, stabilizing generation beyond the training context length.
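One way to picture the mechanism (a simplified sketch under our own naming, not the project's code): a per-head sink logit joins the softmax over attention scores, absorbing probability mass so tokens are never forced to attend heavily to uninformative positions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend_with_sink(scores, sink_logit=0.0):
    """Attention weights with a learnable sink (Xiao et al., 2023).

    Returns weights over [sink] + real positions; the sink's weight is
    discarded after the softmax, so it only soaks up excess mass.
    """
    return softmax([sink_logit] + scores)
```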
Training pipeline
The complete alignment workflow follows modern best practices:

Pretraining
Train the base language model on WikiText-103 and TinyStories (600M tokens total). The model learns general language understanding and generation capabilities.

Output: 253M parameter base model with 27.03 perplexity on WikiText-2
Supervised Fine-Tuning (SFT)
Fine-tune on instruction-response pairs from the Alpaca dataset (52K examples). The model learns to follow instructions and respond helpfully.

Reference: Ouyang et al., 2022 (InstructGPT)
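A common convention in SFT (assumed here, not confirmed from the project's code) is to compute the cross-entropy loss only over response tokens, masking out the prompt:

```python
def sft_loss(token_logprobs, is_response):
    """Masked next-token loss for SFT.

    `token_logprobs` are the model's log-probs of the target tokens;
    `is_response` flags which positions belong to the response.
    Prompt positions contribute nothing to the loss.
    """
    losses = [-lp for lp, m in zip(token_logprobs, is_response) if m]
    return sum(losses) / len(losses)
```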
Direct Preference Optimization (DPO)
Align the model with human preferences using the Anthropic HH-RLHF dataset (161K preference pairs). The model learns to generate responses humans prefer.

Reference: Rafailov et al., 2023
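The DPO objective is a logistic loss on the policy-vs-reference log-probability margin between the chosen and rejected responses. A per-pair sketch (variable names are ours):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (Rafailov et al., 2023).

    Arguments are summed log-probs of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    beta controls how far the policy may drift from the reference.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid
```

At initialization the policy equals the reference, so the margin is zero and the loss starts at -log 0.5 ≈ 0.693; it falls as the policy widens the chosen-over-rejected margin relative to the reference, with no reward model or RL loop required.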
Why Modern LLM?
Research-grounded implementation
Every component includes detailed mathematical documentation and paper references. The codebase serves as both a working implementation and an educational resource.
Complete alignment workflow
Unlike many implementations that stop at pretraining, Modern LLM includes the full alignment pipeline used in production LLMs:
- Pretraining for language understanding
- SFT for instruction-following
- DPO for preference alignment (without RL complexity)
- Verifier for answer validation
Modern architectural choices
The architecture implements state-of-the-art components from recent research:
- RMSNorm instead of LayerNorm (faster, used in LLaMA)
- RoPE instead of absolute positions (better extrapolation)
- SwiGLU instead of GELU (2-4% better performance)
- Attention sinks for long-context stability
- Grouped Query Attention (optional) for efficient inference
Practical performance
The 253M parameter model achieves:
- 27.03 perplexity on WikiText-2 (vs GPT-2’s 40.64)
- 33% improvement over GPT-2 baseline
- Competitive performance at roughly 2x GPT-2’s parameter count
- Efficient training on consumer hardware (RTX 3060)
What’s included
Model architecture
- Decoder-only Transformer
- RoPE positional encodings
- RMSNorm layers
- SwiGLU activations
- Attention sinks
- Optional GQA and MoE
Training pipeline
- Language model pretraining
- Supervised fine-tuning
- Direct preference optimization
- Verifier training
- Automatic mixed precision
- Gradient accumulation
Evaluation tools
- Perplexity computation
- Few-shot task evaluation
- Generation quality metrics
- GPT-2 baseline comparison
- Attention visualization
Model configuration
The 253M parameter model uses the following configuration:

Next steps
Run the quick start
Get started in 5 minutes with a smoke test
Install dependencies
Set up your environment with all required packages
Explore the architecture
Learn about the model architecture in detail
Train your own model
Follow the complete training workflow
References
- RMSNorm: Zhang, B., & Sennrich, R. (2019). Root Mean Square Layer Normalization. NeurIPS.
- RoPE: Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
- SwiGLU: Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202.
- Attention Sinks: Xiao, G., et al. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453.
- InstructGPT: Ouyang, L., et al. (2022). Training language models to follow instructions. NeurIPS.
- DPO: Rafailov, R., et al. (2023). Direct Preference Optimization. NeurIPS.
- Verifier: Lightman, H., et al. (2023). Let’s Verify Step by Step. arXiv:2305.20050.