Few-shot evaluation results
| Model | SST-2 Acc | GSM8K EM |
|---|---|---|
| GPT-2 | 56.0% | N/A |
| Ours (pretrain) | 49.5% | 0.0% |
| Ours (SFT) | 53.5% | 0.0% |
| Ours (DPO) | 49.5% | 0.0% |
SST-2 sentiment classification
The Stanford Sentiment Treebank (SST-2) evaluates binary sentiment classification via few-shot prompting. At the pretrain stage, the model underperforms GPT-2 by 6.5 percentage points.
GSM8K math reasoning
Grade-school math problems requiring multi-step arithmetic reasoning. All models achieve 0% exact match, indicating that the 253M-parameter scale is insufficient for mathematical reasoning tasks.
Stage-wise gains
| Stage | SST-2 Acc | Δ SST-2 | GSM8K EM | Δ GSM8K |
|---|---|---|---|---|
| Pretrain | 49.5% | — | 0.0% | — |
| SFT | 53.5% | +4.0% | 0.0% | +0.0% |
| DPO | 49.5% | -4.0% | 0.0% | +0.0% |
Supervised fine-tuning impact
Instruction tuning provides a 4.0-percentage-point gain on SST-2, demonstrating that the model learns to follow task instructions better when exposed to the Alpaca instruction dataset (52K examples).
Direct preference optimization regression
DPO causes a 4.0-percentage-point drop in SST-2 accuracy, returning the model to its pretrain baseline. This regression is a known tradeoff in preference optimization: the model optimizes for preference margins rather than downstream task accuracy. The DPO stage used the Anthropic HH-RLHF dataset (161K preference pairs), which targets helpfulness and harmlessness rather than classification accuracy.
Why task metrics lag GPT-2
Despite superior perplexity (27.03 vs. 40.64), the model underperforms GPT-2 on SST-2. Several factors explain this discrepancy:
1. Evaluation methodology differences
Few-shot prompting is highly sensitive to prompt format:
- GPT-2 benefits from prompt templates refined over years of published research
- Our model uses generic templates without optimization
- Small prompt changes can cause 10-15% accuracy swings
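This sensitivity is easy to see in how prompts are assembled. The sketch below uses two hypothetical templates and made-up example reviews, not the ones from our evaluation harness; the point is only that the label-slot wording varies while the task stays identical:

```python
# Two hypothetical SST-2 templates; small wording differences like these
# are the kind of change that can swing few-shot accuracy by 10-15%.
TEMPLATE_A = "Review: {text}\nSentiment: {label}\n\n"
TEMPLATE_B = "{text}\nIs this review positive or negative? {label}\n\n"

def build_prompt(template, shots, query):
    """Concatenate labeled few-shot examples, then the unlabeled query."""
    prompt = "".join(template.format(text=t, label=l) for t, l in shots)
    # Leave the label slot empty so the model completes it.
    return prompt + template.format(text=query, label="").rstrip() + " "

shots = [("a gorgeous, witty film", "positive"),
         ("flat and uninspired", "negative")]
print(build_prompt(TEMPLATE_A, shots, "an utterly forgettable plot"))
```

A model tuned (implicitly, via its pretraining mix) toward one phrasing can lose several points when scored with the other, without any change in capability.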
2. Training data composition
Pretraining data strongly influences task performance:
| Dataset | Our Mix | GPT-2 Mix | SST-2 Relevance |
|---|---|---|---|
| Wikipedia | 77% | ~20% | Factual, neutral |
| Creative stories | 12% | ~5% | Narrative style |
| WebText | 0% | ~75% | Opinion/review content |
3. Perplexity vs. discriminative tasks
These metrics measure different capabilities:
- Perplexity: Generative modeling quality (“how well can you predict next tokens?”)
- SST-2: Discriminative classification (“can you label sentiment correctly?”)
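The distinction can be made concrete with a minimal sketch. The numbers below are toy values (assumed, not taken from our runs): perplexity averages per-token negative log-likelihoods over a sequence, while few-shot classification only compares the model's scores for two candidate labels, so a globally fluent model can still rank the wrong label higher:

```python
import math

def perplexity(token_nlls):
    """Generative metric: exp of the mean next-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def classify(label_logprobs):
    """Discriminative metric: pick the label continuation the model scores
    highest; only the ranking of the candidates matters."""
    return max(label_logprobs, key=label_logprobs.get)

print(perplexity([3.2, 3.4, 3.3]))                      # sequence-level fluency
print(classify({"positive": -4.1, "negative": -3.8}))   # -> "negative"
```

Improving the first number gives no guarantee about the second, which is exactly the perplexity/SST-2 gap observed above.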
4. Model capacity constraints
While 253M parameters suffice for language modeling, discriminative tasks may require different representational capacity or architectural biases that GPT-2 developed through its specific training regime.
GSM8K error analysis
Of the 50 evaluated problems, all were answered incorrectly:
| Error Type | Pretrain | SFT | DPO | Description |
|---|---|---|---|---|
| Reasoning | 47 | 47 | 44 | Wrong approach or logic |
| Extraction | 2 | 2 | 6 | Answer present but not extracted |
| Arithmetic | 1 | 1 | 0 | Calculation errors |
Error type breakdown
Reasoning errors (88-94% of failures) indicate the model lacks multi-step mathematical reasoning capability. Example failures:
- Janet’s duck eggs problem: Predicted 3, correct answer 18
- House flipping profit: Predicted 50, correct answer 70,000
- Sprint distance: Predicted 33, correct answer 540
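The smaller "Extraction" category (answer present but not extracted) is typically a parsing issue. The sketch below is an assumption about how such a harness works, not our actual evaluation code: a last-number extractor misses answers that appear mid-reasoning but are followed by further text containing other numbers:

```python
import re

def extract_final_answer(completion: str):
    """Return the last number in a chain-of-thought completion, or None.
    A brittle extractor like this explains 'Extraction' errors: the correct
    value may appear mid-reasoning but not as the final number."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    if not numbers:
        return None
    return float(numbers[-1].replace(",", ""))

print(extract_final_answer("16 - 3 - 4 = 9, 9 * 2 = 18. The answer is 18"))  # 18.0
print(extract_final_answer("She earns 1,250 dollars"))                        # 1250.0
```

Exact match then compares the extracted value against the gold answer, so a correct chain of thought with a malformed final line still scores 0.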
Expected performance
The 0% exact match is expected for 253M models:
- GSM8K requires multi-hop reasoning and symbolic manipulation
- Literature shows meaningful performance emerges at 7B+ parameters
- Both our model and GPT-2 (124M) achieve 0%
Verifier reranking results
A separate verifier model was trained to score solution correctness:
| Model | Base Accuracy | +Verifier | Gain |
|---|---|---|---|
| Ours (DPO) | 0.0% | 0.0% | +0.0% |
| GPT-2 | 0.0% | 0.0% | +0.0% |
Evaluation protocol
SST-2 setup
- Format: Few-shot prompting with 5 examples
- Samples: 200 validation examples
- Metric: Binary accuracy (positive/negative)
- Temperature: 0.0 (greedy decoding)
GSM8K setup
- Format: Chain-of-thought prompting
- Samples: 50 test problems
- Metric: Exact match on final numerical answer
- Candidates: 3 per problem for verifier reranking
- Temperature: 0.7
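The verifier reranking step reduces to choosing the highest-scoring candidate. In the sketch below, `toy_verifier` is a hypothetical stand-in for the trained scorer (which would return something like P(correct | problem, solution)), included only to show the interface:

```python
def rerank(candidates, verifier_score):
    """Pick the candidate the verifier scores highest. With 3 samples per
    problem, reranking can only help if at least one candidate is actually
    correct -- which is why the gain here is +0.0%: no sampled candidate
    ever contained the right answer."""
    return max(candidates, key=verifier_score)

def toy_verifier(solution: str) -> float:
    # Hypothetical stand-in score, not the real trained model.
    return float(len(solution))

cands = ["answer is 3", "the final answer is 18", "maybe 50"]
print(rerank(cands, toy_verifier))  # -> "the final answer is 18"
```

This makes the zero gain unsurprising: a reranker cannot create a correct solution, only surface one that sampling already produced.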
Key insights
- Perplexity ≠ task performance: Superior language modeling doesn’t guarantee better downstream task accuracy
- Data distribution matters: Training data composition has outsized impact on few-shot task performance
- Alignment tradeoffs are real: DPO improves preference alignment at the cost of task metrics
- Scale is fundamental: Some capabilities (multi-step reasoning) require significantly larger models
References
- Socher, R., et al. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP.
- Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
- Rafailov, R., et al. (2023). Direct Preference Optimization. NeurIPS.