The model is evaluated on downstream tasks using few-shot prompting to measure discriminative and reasoning capabilities beyond language modeling.

Few-shot evaluation results

| Model | SST-2 Acc | GSM8K EM |
|---|---|---|
| GPT-2 | 56.0% | N/A |
| Ours (pretrain) | 49.5% | 0.0% |
| Ours (SFT) | 53.5% | 0.0% |
| Ours (DPO) | 49.5% | 0.0% |

SST-2 sentiment classification

The Stanford Sentiment Treebank (SST-2) evaluates binary sentiment classification using few-shot prompting. The model underperforms GPT-2 by 6.5 percentage points at the pretrain stage.
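A 5-shot prompt for this setup can be sketched as follows; the template and demonstrations are illustrative stand-ins, not the exact ones used in this evaluation:

```python
# Illustrative few-shot prompt builder for SST-2; the actual template and
# demonstration examples used in this evaluation are assumptions.
DEMOS = [
    ("a gripping, well-acted thriller", "positive"),
    ("the plot is a mess and the jokes fall flat", "negative"),
]

def build_prompt(review: str, demos=DEMOS) -> str:
    """Concatenate labeled demonstrations, then the unlabeled query."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {review}\nSentiment:")
    return "\n\n".join(blocks)

prompt = build_prompt("an unforgettable film")
```

The model's completion after the final `Sentiment:` is then read off as the predicted label.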

GSM8K math reasoning

GSM8K consists of grade-school math problems that require multi-step arithmetic reasoning. All models achieve 0% exact match, indicating that the 253M-parameter scale is insufficient for mathematical reasoning tasks.

Stage-wise gains

| Stage | SST-2 Acc | Δ SST-2 | GSM8K EM | Δ GSM8K |
|---|---|---|---|---|
| Pretrain | 49.5% | | 0.0% | |
| SFT | 53.5% | +4.0% | 0.0% | +0.0% |
| DPO | 49.5% | -4.0% | 0.0% | +0.0% |

Supervised fine-tuning impact

Instruction tuning provides a +4.0 percentage-point gain on SST-2, demonstrating that exposure to the Alpaca instruction dataset (52K examples) teaches the model to follow task instructions more reliably.
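For reference, SFT examples in Alpaca-style pipelines are typically rendered with the standard Alpaca prompt template; whether this run used exactly this formatting is an assumption:

```python
# Standard Alpaca prompt template (no-input variant). The exact formatting
# used during SFT in this project is an assumption.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(instruction: str, response: str) -> str:
    """Render one SFT training example as prompt followed by target text."""
    return ALPACA_TEMPLATE.format(instruction=instruction) + response

text = format_example("Classify the sentiment of: 'a gripping thriller'",
                      "positive")
```

During training, the loss is usually computed only on the response tokens, so the model learns to complete the `### Response:` section.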

Direct preference optimization regression

DPO causes a 4.0 percentage-point drop in SST-2 accuracy, returning the model to its pretrain baseline. This regression is a known tradeoff in preference optimization: the model optimizes for preference margins rather than downstream task accuracy. The DPO stage used the Anthropic HH-RLHF dataset (161K preference pairs), which targets helpfulness and harmlessness rather than classification accuracy.

Why task metrics lag GPT-2

Despite a lower (better) perplexity than GPT-2 (27.03 vs. 40.64), the model underperforms it on SST-2. Several factors explain the discrepancy:

1. Evaluation methodology differences

Few-shot prompting is highly sensitive to prompt format:
  • GPT-2 has extensively studied prompt templates from years of research
  • Our model uses generic templates without optimization
  • Small prompt changes can cause 10-15% accuracy swings

2. Training data composition

Pretraining data strongly influences task performance:
| Dataset | Our Mix | GPT-2 Mix | SST-2 Relevance |
|---|---|---|---|
| Wikipedia | 77% | ~20% | Factual, neutral |
| Creative stories | 12% | ~5% | Narrative style |
| WebText | 0% | ~75% | Opinion/review content |
GPT-2’s WebText includes Reddit posts, reviews, and discussions—much closer to SST-2’s movie review domain. Our Wikipedia-heavy diet emphasizes factual content over sentiment-laden text.

3. Perplexity vs. discriminative tasks

These metrics measure different capabilities:
  • Perplexity: Generative modeling quality (“how well can you predict next tokens?”)
  • SST-2: Discriminative classification (“can you label sentiment correctly?”)
A model can excel at language modeling while struggling with classification if its training data doesn’t contain similar discriminative patterns.
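The distinction matters for how few-shot classification is typically scored: rather than free generation, the model's likelihood of each label continuation is compared. A minimal sketch, where `log_prob(prompt, continuation)` is a hypothetical stand-in for the model's scoring API:

```python
# Likelihood-based label scoring for few-shot classification.
# `log_prob(prompt, continuation)` is a hypothetical model API returning the
# log-probability of `continuation` given `prompt`.
def classify(prompt: str, labels, log_prob) -> str:
    """Pick the label whose continuation the model finds most likely."""
    return max(labels, key=lambda label: log_prob(prompt, " " + label))

# Toy stand-in model: strongly favors "positive" after the word "great".
def toy_log_prob(prompt, continuation):
    if "great" in prompt and continuation.strip() == "positive":
        return -0.1
    return -2.0

pred = classify("Review: a great film\nSentiment:", ["positive", "negative"],
                toy_log_prob)
# pred == "positive"
```

A model can assign low perplexity to both continuations overall while still ranking the wrong label higher, which is exactly the gap between generative and discriminative quality.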

4. Model capacity constraints

While 253M parameters suffice for language modeling, discriminative tasks may require different representational capacity or architectural biases that GPT-2 developed through its specific training regime.

GSM8K error analysis

Of the 50 evaluated problems, all were answered incorrectly:

| Error Type | Pretrain | SFT | DPO | Description |
|---|---|---|---|---|
| Reasoning | 47 | 47 | 44 | Wrong approach or logic |
| Extraction | 2 | 2 | 6 | Answer present but not extracted |
| Arithmetic | 1 | 1 | 0 | Calculation errors |

Error type breakdown

Reasoning errors (88-94% of failures) indicate the model lacks multi-step mathematical reasoning capability. Example failures:
  • Janet’s duck eggs problem: Predicted 3, correct answer 18
  • House flipping profit: Predicted 50, correct answer 70,000
  • Sprint distance: Predicted 33, correct answer 540
Extraction errors occur when the model generates the correct numerical answer in its reasoning but fails to format it properly for extraction. Arithmetic errors are rare, suggesting basic calculation is not the bottleneck—the issue is problem decomposition and logical reasoning.
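A common way to extract the final answer from a chain-of-thought generation is to take the last number in the output; the evaluation's actual extraction logic is an assumption, but a sketch of this approach:

```python
import re

# Take the last number in the generation as the final answer; the exact
# extraction logic used in this evaluation is an assumption.
def extract_answer(generation: str):
    """Return the last number in the text (commas stripped), or None."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", generation)
    if not numbers:
        return None
    return numbers[-1].replace(",", "").rstrip(".")

ans = extract_answer("16 - 3 - 4 = 9 eggs at $2 each, so 9 * 2 = 18 dollars.")
# → "18"
```

An extraction error in this scheme occurs when the model reaches the right number mid-reasoning but then emits more text containing other numbers, so the last-number heuristic picks the wrong one.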

Expected performance

The 0% exact match is expected for 253M models:
  • GSM8K requires multi-hop reasoning and symbolic manipulation
  • Literature shows meaningful performance emerges at 7B+ parameters
  • Both our model and GPT-2 (124M) achieve 0%
This validates that the model scale, not the implementation, is the limiting factor.

Verifier reranking results

A separate verifier model was trained to score solution correctness:
| Model | Base Accuracy | +Verifier | Gain |
|---|---|---|---|
| Ours (DPO) | 0.0% | 0.0% | +0.0% |
| GPT-2 | 0.0% | 0.0% | +0.0% |
With 3 candidates per problem and verifier reranking, no improvement was observed. This indicates the base models never generate correct solutions among their top candidates—reranking cannot fix fundamental reasoning failures.
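The reranking procedure can be sketched as follows, with `generate` and `verifier_score` as hypothetical stand-ins for the actual model APIs:

```python
# Verifier reranking: sample k candidate solutions, score each with the
# verifier, keep the highest-scoring one. `generate` and `verifier_score`
# are hypothetical stand-ins for the real model interfaces.
def rerank(problem: str, generate, verifier_score, k: int = 3) -> str:
    """Return the candidate solution the verifier scores highest."""
    candidates = [generate(problem) for _ in range(k)]
    return max(candidates, key=verifier_score)
```

Reranking only reorders what the generator produces: if none of the k candidates is correct, the best achievable accuracy is unchanged, which matches the 0% result above.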

Evaluation protocol

SST-2 setup

  • Format: Few-shot prompting with 5 examples
  • Samples: 200 validation examples
  • Metric: Binary accuracy (positive/negative)
  • Temperature: 0.0 (greedy decoding)

GSM8K setup

  • Format: Chain-of-thought prompting
  • Samples: 50 test problems
  • Metric: Exact match on final numerical answer
  • Candidates: 3 per problem for verifier reranking
  • Temperature: 0.7
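Exact match on the final numerical answer is usually computed with light numeric normalization so that, e.g., "18" and "18.0" compare equal; the precise normalization used here is an assumption:

```python
# Exact-match scoring on the final numeric answer; the normalization
# details used in this evaluation are an assumption.
def exact_match(predicted, gold) -> bool:
    """Compare answers as numbers so '18' and '18.0' both count as correct."""
    try:
        return float(predicted) == float(gold)
    except (TypeError, ValueError):
        return False

# Toy usage: one correct and one incorrect prediction.
em = sum(exact_match(p, g) for p, g in [("18", "18"), ("3", "18")]) / 2
# → 0.5
```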

Key insights

  1. Perplexity ≠ task performance: Superior language modeling doesn’t guarantee better downstream task accuracy
  2. Data distribution matters: Training data composition has outsized impact on few-shot task performance
  3. Alignment tradeoffs are real: DPO improves preference alignment at the cost of task metrics
  4. Scale is fundamental: Some capabilities (multi-step reasoning) require significantly larger models

References

  • Socher, R., et al. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP.
  • Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. NeurIPS.
