The model is evaluated on downstream tasks using few-shot prompting to measure discriminative and reasoning capabilities beyond language modeling.

Few-shot evaluation results

| Model | SST-2 Acc | GSM8K EM |
|---|---|---|
| GPT-2 | 56.0% | N/A |
| Ours (pretrain) | 49.5% | 0.0% |
| Ours (SFT) | 53.5% | 0.0% |
| Ours (DPO) | 49.5% | 0.0% |

SST-2 sentiment classification

The Stanford Sentiment Treebank (SST-2) evaluates binary sentiment classification using few-shot prompting. The model underperforms GPT-2 by 6.5 percentage points at the pretrain stage.
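A 5-shot prompt for this setup can be sketched as follows; the template and demonstrations are illustrative stand-ins, not the exact ones used in this evaluation:

```python
# Illustrative few-shot prompt builder for SST-2; the actual template and
# demonstration examples used in this evaluation are assumptions.
DEMOS = [
    ("a gripping, well-acted thriller", "positive"),
    ("the plot is a mess and the jokes fall flat", "negative"),
]

def build_prompt(review: str, demos=DEMOS) -> str:
    """Concatenate labeled demonstrations, then the unlabeled query."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {review}\nSentiment:")
    return "\n\n".join(blocks)

prompt = build_prompt("an unforgettable film")
```

The model's completion after the final `Sentiment:` is then read off as the predicted label.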

GSM8K math reasoning

GSM8K consists of grade-school math problems that require multi-step arithmetic reasoning. All models achieve 0% exact match, indicating that the 253M-parameter scale is insufficient for mathematical reasoning tasks.

Stage-wise gains

| Stage | SST-2 Acc | Δ SST-2 | GSM8K EM | Δ GSM8K |
|---|---|---|---|---|
| Pretrain | 49.5% | | 0.0% | |
| SFT | 53.5% | +4.0% | 0.0% | +0.0% |
| DPO | 49.5% | -4.0% | 0.0% | +0.0% |

Supervised fine-tuning impact

Instruction tuning provides a +4.0 percentage-point gain on SST-2, demonstrating that exposure to the Alpaca instruction dataset (52K examples) teaches the model to follow task instructions more reliably.
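For reference, SFT examples in Alpaca-style pipelines are typically rendered with the standard Alpaca prompt template; whether this run used exactly this formatting is an assumption:

```python
# Standard Alpaca prompt template (no-input variant). The exact formatting
# used during SFT in this project is an assumption.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(instruction: str, response: str) -> str:
    """Render one SFT training example as prompt followed by target text."""
    return ALPACA_TEMPLATE.format(instruction=instruction) + response

text = format_example("Classify the sentiment of: 'a gripping thriller'",
                      "positive")
```

During training, the loss is usually computed only on the response tokens, so the model learns to complete the `### Response:` section.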

Direct preference optimization regression

DPO causes a 4.0 percentage-point drop in SST-2 accuracy, returning the model to its pretrain baseline. This regression is a known tradeoff in preference optimization: the model optimizes for preference margins rather than downstream task accuracy. The DPO stage used the Anthropic HH-RLHF dataset (161K preference pairs), which targets helpfulness and harmlessness rather than classification accuracy.

Why task metrics lag GPT-2

Despite a lower (better) perplexity than GPT-2 (27.03 vs. 40.64), the model underperforms it on SST-2. Several factors explain the discrepancy:

1. Evaluation methodology differences

Few-shot prompting is highly sensitive to prompt format:
  • GPT-2 has extensively studied prompt templates from years of research
  • Our model uses generic templates without optimization
  • Small prompt changes can cause 10-15% accuracy swings

2. Training data composition

Pretraining data strongly influences task performance:
| Dataset | Our Mix | GPT-2 Mix | SST-2 Relevance |
|---|---|---|---|
| Wikipedia | 77% | ~20% | Factual, neutral |
| Creative stories | 12% | ~5% | Narrative style |
| WebText | 0% | ~75% | Opinion/review content |
GPT-2’s WebText includes Reddit posts, reviews, and discussions—much closer to SST-2’s movie review domain. Our Wikipedia-heavy diet emphasizes factual content over sentiment-laden text.

3. Perplexity vs. discriminative tasks

These metrics measure different capabilities:
  • Perplexity: Generative modeling quality (“how well can you predict next tokens?”)
  • SST-2: Discriminative classification (“can you label sentiment correctly?”)
A model can excel at language modeling while struggling with classification if its training data doesn’t contain similar discriminative patterns.
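The distinction matters for how few-shot classification is typically scored: rather than free generation, the model's likelihood of each label continuation is compared. A minimal sketch, where `log_prob(prompt, continuation)` is a hypothetical stand-in for the model's scoring API:

```python
# Likelihood-based label scoring for few-shot classification.
# `log_prob(prompt, continuation)` is a hypothetical model API returning the
# log-probability of `continuation` given `prompt`.
def classify(prompt: str, labels, log_prob) -> str:
    """Pick the label whose continuation the model finds most likely."""
    return max(labels, key=lambda label: log_prob(prompt, " " + label))

# Toy stand-in model: strongly favors "positive" after the word "great".
def toy_log_prob(prompt, continuation):
    if "great" in prompt and continuation.strip() == "positive":
        return -0.1
    return -2.0

pred = classify("Review: a great film\nSentiment:", ["positive", "negative"],
                toy_log_prob)
# pred == "positive"
```

A model can assign low perplexity to both continuations overall while still ranking the wrong label higher, which is exactly the gap between generative and discriminative quality.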

4. Model capacity constraints

While 253M parameters suffice for language modeling, discriminative tasks may require different representational capacity or architectural biases that GPT-2 developed through its specific training regime.

GSM8K error analysis

Of the 50 evaluated problems, all were answered incorrectly:

| Error Type | Pretrain | SFT | DPO | Description |
|---|---|---|---|---|
| Reasoning | 47 | 47 | 44 | Wrong approach or logic |
| Extraction | 2 | 2 | 6 | Answer present but not extracted |
| Arithmetic | 1 | 1 | 0 | Calculation errors |

Error type breakdown

Reasoning errors (88-94% of failures) indicate the model lacks multi-step mathematical reasoning capability. Example failures:
  • Janet’s duck eggs problem: Predicted 3, correct answer 18
  • House flipping profit: Predicted 50, correct answer 70,000
  • Sprint distance: Predicted 33, correct answer 540
Extraction errors occur when the model generates the correct numerical answer in its reasoning but fails to format it properly for extraction. Arithmetic errors are rare, suggesting basic calculation is not the bottleneck—the issue is problem decomposition and logical reasoning.
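A common way to extract the final answer from a chain-of-thought generation is to take the last number in the output; the evaluation's actual extraction logic is an assumption, but a sketch of this approach:

```python
import re

# Take the last number in the generation as the final answer; the exact
# extraction logic used in this evaluation is an assumption.
def extract_answer(generation: str):
    """Return the last number in the text (commas stripped), or None."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", generation)
    if not numbers:
        return None
    return numbers[-1].replace(",", "").rstrip(".")

ans = extract_answer("16 - 3 - 4 = 9 eggs at $2 each, so 9 * 2 = 18 dollars.")
# → "18"
```

An extraction error in this scheme occurs when the model reaches the right number mid-reasoning but then emits more text containing other numbers, so the last-number heuristic picks the wrong one.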

Expected performance

The 0% exact match is expected for 253M models:
  • GSM8K requires multi-hop reasoning and symbolic manipulation
  • Literature shows meaningful performance emerges at 7B+ parameters
  • Both our model and GPT-2 (124M) achieve 0%
This validates that the model scale, not the implementation, is the limiting factor.

Verifier reranking results

A separate verifier model was trained to score solution correctness:
| Model | Base Accuracy | +Verifier | Gain |
|---|---|---|---|
| Ours (DPO) | 0.0% | 0.0% | +0.0% |
| GPT-2 | 0.0% | 0.0% | +0.0% |
With 3 candidates per problem and verifier reranking, no improvement was observed. This indicates the base models never generate correct solutions among their top candidates—reranking cannot fix fundamental reasoning failures.
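The reranking procedure can be sketched as follows, with `generate` and `verifier_score` as hypothetical stand-ins for the actual model APIs:

```python
# Verifier reranking: sample k candidate solutions, score each with the
# verifier, keep the highest-scoring one. `generate` and `verifier_score`
# are hypothetical stand-ins for the real model interfaces.
def rerank(problem: str, generate, verifier_score, k: int = 3) -> str:
    """Return the candidate solution the verifier scores highest."""
    candidates = [generate(problem) for _ in range(k)]
    return max(candidates, key=verifier_score)
```

Reranking only reorders what the generator produces: if none of the k candidates is correct, the best achievable accuracy is unchanged, which matches the 0% result above.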

Evaluation protocol

SST-2 setup

  • Format: Few-shot prompting with 5 examples
  • Samples: 200 validation examples
  • Metric: Binary accuracy (positive/negative)
  • Temperature: 0.0 (greedy decoding)

GSM8K setup

  • Format: Chain-of-thought prompting
  • Samples: 50 test problems
  • Metric: Exact match on final numerical answer
  • Candidates: 3 per problem for verifier reranking
  • Temperature: 0.7
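Exact match on the final numerical answer is usually computed with light numeric normalization so that, e.g., "18" and "18.0" compare equal; the precise normalization used here is an assumption:

```python
# Exact-match scoring on the final numeric answer; the normalization
# details used in this evaluation are an assumption.
def exact_match(predicted, gold) -> bool:
    """Compare answers as numbers so '18' and '18.0' both count as correct."""
    try:
        return float(predicted) == float(gold)
    except (TypeError, ValueError):
        return False

# Toy usage: one correct and one incorrect prediction.
em = sum(exact_match(p, g) for p, g in [("18", "18"), ("3", "18")]) / 2
# → 0.5
```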

Key insights

  1. Perplexity ≠ task performance: Superior language modeling doesn’t guarantee better downstream task accuracy
  2. Data distribution matters: Training data composition has outsized impact on few-shot task performance
  3. Alignment tradeoffs are real: DPO improves preference alignment at the cost of task metrics
  4. Scale is fundamental: Some capabilities (multi-step reasoning) require significantly larger models

References

  • Socher, R., et al. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP.
  • Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. NeurIPS.
