The pretrained model achieves 27.03 perplexity on WikiText-2, significantly outperforming GPT-2’s 40.64 despite being trained from scratch with less data.

WikiText-2 validation results

| Model | Parameters | PPL | vs GPT-2 |
|---|---|---|---|
| GPT-2 (baseline) | 124M | 40.64 | |
| Ours (pretrain) | 253M | 27.03 | -33% |
| Ours (SFT) | 253M | 34.14 | -16% |
| Ours (DPO) | 253M | 34.32 | -16% |
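The "vs GPT-2" column is the relative change in perplexity against the 40.64 baseline, and can be reproduced with a few lines:

```python
# Relative perplexity change vs the GPT-2 baseline (PPL 40.64 on WikiText-2).
baseline = 40.64
results = {"pretrain": 27.03, "sft": 34.14, "dpo": 34.32}

for stage, ppl in results.items():
    change = (ppl - baseline) / baseline * 100
    print(f"{stage}: {change:+.0f}%")  # pretrain: -33%, sft: -16%, dpo: -16%
```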

Analysis

Pretraining performance

The base model’s 27.03 perplexity represents a 33% improvement over GPT-2, demonstrating that modern architectural choices (RoPE, RMSNorm, SwiGLU) provide measurable benefits for language modeling capability. This result is particularly notable given:
  • Smaller training corpus: ~600M tokens vs GPT-2’s 8B tokens
  • From-scratch training: No transfer learning or pretraining initialization
  • Larger parameter budget: 253M vs GPT-2’s 124M (scaling is estimated to account for ~10% of the improvement)
The perplexity improvement validates the architectural design decisions and training pipeline implementation.
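As one concrete example of these architectural choices, RMSNorm replaces LayerNorm's mean-centering and bias with a pure root-mean-square rescale. A minimal NumPy sketch (illustrative only, not the project's actual implementation):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root mean square over the feature dimension, then
    # apply a learned per-feature gain. Unlike LayerNorm, there is no
    # mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.ones(4)
y = rms_norm(x, w)  # output has unit root-mean-square per row
```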

Alignment stage tradeoffs

Perplexity increases after supervised fine-tuning (SFT) and direct preference optimization (DPO):
  • SFT: 27.03 → 34.14 (+26%)
  • DPO: 34.14 → 34.32 (+0.5%)
This degradation is expected and well-documented in the alignment literature (Ouyang et al., 2022). The alignment stages optimize for different objectives:
| Stage | Objective | Purpose |
|---|---|---|
| Pretrain | Minimize perplexity | General language modeling |
| SFT | Match instruction format | Follow human instructions |
| DPO | Maximize preference margin | Align with human values |
The perplexity increase reflects a fundamental tradeoff: models optimized for instruction-following and preference alignment sacrifice some generative modeling capability. This is acceptable because downstream applications prioritize response quality over raw perplexity.
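One common way the SFT objective differs from pretraining is loss masking: cross-entropy is computed only over response tokens, with prompt tokens masked out. A hedged sketch of that idea (the mask convention here is illustrative, not necessarily what this project uses):

```python
import numpy as np

def masked_cross_entropy(logits, targets, loss_mask):
    # logits: (seq, vocab); targets: (seq,); loss_mask: (seq,) with 1 on
    # response tokens and 0 on prompt tokens, so only the response
    # contributes to the SFT loss.
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return (token_nll * loss_mask).sum() / loss_mask.sum()
```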

Why perplexity degrades during alignment

  1. Distribution shift: The model moves from modeling Wikipedia/stories to modeling instruction-response pairs
  2. Constrained generation: Instruction-tuned models learn to reject certain completions even if they’re linguistically plausible
  3. Preference optimization: DPO explicitly optimizes margins between preferred/rejected responses, not likelihood
Despite higher perplexity, the aligned models demonstrate better instruction-following behavior in practice.
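The margin objective from Rafailov et al. (2023) makes the third point concrete: the loss depends only on log-probability differences between chosen and rejected responses, each measured relative to a frozen reference model, so raw likelihood (and hence perplexity) is never directly optimized. A minimal sketch, taking summed sequence log-probabilities as inputs:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO loss: -log sigmoid of the implicit reward margin between the
    # chosen and rejected responses. Raising pi_chosen relative to
    # pi_rejected lowers the loss; absolute likelihood does not appear.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a zero margin the loss is log 2; any positive margin between chosen and rejected responses reduces it, regardless of how likely either response is in absolute terms.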

Evaluation methodology

Perplexity is calculated on WikiText-2 validation split using the standard formula:
PPL = exp(average cross-entropy loss)
Evaluation parameters:
  • Batch size: 32
  • Sequence length: 1024 tokens
  • Tokenizer: GPT-2 BPE (vocab size 50,257)
Results are directly comparable to published GPT-2 benchmarks using the same evaluation protocol.
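The computation itself reduces to a token-weighted average of per-batch cross-entropy, exponentiated at the end. A hedged sketch of that aggregation (not the repo's exact evaluation script):

```python
import math

def eval_perplexity(batch_losses, batch_token_counts):
    # Weight each batch's mean cross-entropy by its token count before
    # averaging, so a partial final batch does not skew the result, then
    # exponentiate: PPL = exp(average per-token cross-entropy).
    total_nll = sum(l * n for l, n in zip(batch_losses, batch_token_counts))
    total_tokens = sum(batch_token_counts)
    return math.exp(total_nll / total_tokens)

# Illustrative numbers only: a uniform per-token loss of ln(27.03) over one
# batch of 32 sequences x 1024 tokens recovers the reported perplexity.
ppl = eval_perplexity([math.log(27.03)], [32 * 1024])
```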

References

  • Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
