WikiText-2 validation results
| Model | Parameters | PPL | vs GPT-2 |
|---|---|---|---|
| GPT-2 (baseline) | 124M | 40.64 | — |
| Ours (pretrain) | 253M | 27.03 | -33% |
| Ours (SFT) | 253M | 34.14 | -16% |
| Ours (DPO) | 253M | 34.32 | -16% |
Analysis
Pretraining performance
The base model’s 27.03 perplexity represents a 33% improvement over GPT-2, demonstrating that modern architectural choices (RoPE, RMSNorm, SwiGLU) provide measurable benefits for language modeling capability. This result is particularly notable given:- Smaller training corpus: ~600M tokens vs GPT-2’s 8B tokens
- From-scratch training: No transfer learning or pretraining initialization
- Comparable parameters: 253M vs GPT-2’s 124M (scaling accounts for ~10% of improvement)
Alignment stage tradeoffs
Perplexity increases after supervised fine-tuning (SFT) and direct preference optimization (DPO):- SFT: 27.03 → 34.14 (+26%)
- DPO: 34.14 → 34.32 (+0.5%)
| Stage | Objective | Purpose |
|---|---|---|
| Pretrain | Minimize perplexity | General language modeling |
| SFT | Match instruction format | Follow human instructions |
| DPO | Maximize preference margin | Align with human values |
Why perplexity degrades during alignment
- Distribution shift: The model moves from modeling Wikipedia/stories to modeling instruction-response pairs
- Constrained generation: Instruction-tuned models learn to reject certain completions even if they’re linguistically plausible
- Preference optimization: DPO explicitly optimizes margins between preferred/rejected responses, not likelihood
Evaluation methodology
Perplexity is calculated on WikiText-2 validation split using the standard formula:- Batch size: 32
- Sequence length: 1024 tokens
- Tokenizer: GPT-2 BPE (vocab size 50,257)
References
- Ouyang, L., et al. (2022). Training language models to follow instructions. NeurIPS.
- Rafailov, R., et al. (2023). Direct Preference Optimization. NeurIPS.