Modern LLM includes comprehensive evaluation tools for sentiment classification (SST-2), math reasoning (GSM8K), and custom tasks. Evaluation supports both scratch-trained models and HuggingFace baselines.
## Quick start

Evaluate a trained model on all tasks:

```bash
# Evaluate a single checkpoint
python scripts/evaluation/evaluate_tasks.py \
    --checkpoint experiments/my-run/final_checkpoint.pt \
    --max-sst2 500 \
    --max-gsm8k 100

# Evaluate with baseline comparison
python scripts/evaluation/evaluate_tasks.py \
    --checkpoint experiments/my-run/final_checkpoint.pt \
    --include-baselines

# Evaluate all pipeline stages
python scripts/evaluation/evaluate_tasks.py \
    --stage-checkpoints experiments/runs/gpu-full/
```
## SST-2 sentiment classification

SST-2 evaluates binary sentiment classification using few-shot prompting.

### Running SST-2 evaluation
**Scratch model:**

```bash
python scripts/evaluation/eval_sst2.py \
    --checkpoint experiments/my-model/checkpoint.pt \
    --max-samples 500 \
    --output experiments/results/sst2_results.json
```

**HuggingFace model:**

```bash
python scripts/evaluation/eval_sst2.py \
    --hf-model gpt2 \
    --max-samples 500 \
    --output experiments/results/sst2_gpt2.json
```
### SST-2 from Python

```python
import torch
from scripts.evaluation.eval_sst2 import (
    load_scratch_model,
    evaluate_sst2,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
model, tokenizer = load_scratch_model(
    "experiments/my-run/checkpoint.pt",
    device,
)

# Evaluate
results = evaluate_sst2(
    model=model,
    tokenizer=tokenizer,
    device=device,
    max_samples=500,
    is_hf_model=False,
)

print(f"SST-2 Accuracy: {results['accuracy']:.2%}")
print(f"Correct: {results['correct']}/{results['total']}")
```
### Few-shot prompting

SST-2 evaluation uses a question-format prompt with examples:

```python
FEW_SHOT_PROMPT = """Is this review positive or negative?

Review: "I love this movie, it's fantastic!"
Answer: positive

Review: "This was terrible and boring."
Answer: negative

Review: "A wonderful experience from start to finish."
Answer: positive

Review: "{text}"
Answer:"""
```
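Filling the `{text}` slot is plain `str.format` substitution; a minimal sketch (template abbreviated to a single in-context example):

```python
# Abbreviated version of the few-shot template above
FEW_SHOT_PROMPT = """Is this review positive or negative?

Review: "I love this movie, it's fantastic!"
Answer: positive

Review: "{text}"
Answer:"""

# Substitute the review under test into the template
prompt = FEW_SHOT_PROMPT.format(text="A wonderful experience from start to finish.")

# The prompt ends right after "Answer:", so the model's next token is the label
print(prompt.endswith("Answer:"))  # True
```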
The model's next-token logits for "positive" vs. "negative" determine the prediction.

Few-shot prompting achieves ~70% accuracy on GPT-2-scale models, versus ~50% for simple prompts. The question format ("Is this positive or negative?") helps the model understand the task structure. The `predict_sentiment()` function extracts the logits for the "positive" and "negative" tokens and compares them; note the leading space in `" positive"`, which matters for BPE tokenizers like GPT-2's, where `" positive"` and `"positive"` map to different tokens:

```python
# Get logits for the last position
next_logits = logits[0, -1, :]

# Compare positive vs. negative token logits
pos_tokens = tokenizer.encode(" positive", add_special_tokens=False)
neg_tokens = tokenizer.encode(" negative", add_special_tokens=False)

pos_prob = next_logits[pos_tokens[0]].item()
neg_prob = next_logits[neg_tokens[0]].item()

return "positive" if pos_prob > neg_prob else "negative"
```
## GSM8K math reasoning

GSM8K evaluates grade-school math problem solving with chain-of-thought reasoning and optional verifier reranking.

### Running GSM8K evaluation
**Without verifier:**

```bash
python scripts/evaluation/eval_gsm8k.py \
    --checkpoint experiments/my-model/checkpoint.pt \
    --max-samples 100 \
    --n-samples 1
```

**With verifier:**

```bash
python scripts/evaluation/eval_gsm8k.py \
    --checkpoint experiments/my-model/checkpoint.pt \
    --verifier experiments/verifier/checkpoint.pt \
    --max-samples 100 \
    --n-samples 8
```
### GSM8K from Python

```python
import torch
from scripts.evaluation.eval_gsm8k import (
    load_scratch_model,
    load_verifier,
    evaluate_gsm8k,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and optional verifier
model, tokenizer = load_scratch_model(
    "experiments/sft/checkpoint.pt",
    device,
)
verifier = None
# verifier = load_verifier("experiments/verifier/checkpoint.pt", device)

# Evaluate
results = evaluate_gsm8k(
    model=model,
    tokenizer=tokenizer,
    device=device,
    verifier=verifier,
    max_samples=100,
    n_samples_per_q=8 if verifier else 1,
)

print(f"Exact Match: {results['exact_match_no_verifier']:.2%}")
if verifier:
    print(f"EM with verifier: {results['exact_match_with_verifier']:.2%}")
    print(f"Improvement: {results['verifier_improvement']:+.2%}")
```
### Answer extraction

GSM8K uses multiple strategies to extract the numeric answer:

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Extract the final numeric answer from model output."""
    # Look for the #### marker (GSM8K format)
    match = re.search(r"####\s*([\d,.-]+)", text)
    if match:
        return match.group(1).replace(",", "")

    # Look for "the answer is X" pattern
    match = re.search(r"answer is\s*([\d,.-]+)", text, re.IGNORECASE)
    if match:
        return match.group(1).replace(",", "")

    # Fall back to the last number in the text
    numbers = re.findall(r"[\d,]+\.?\d*", text)
    if numbers:
        return numbers[-1].replace(",", "")

    return None
```
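To see the fallback order in action, here is a small usage sketch; the function is re-declared in condensed form so the snippet runs standalone (in the repo it would be imported from `scripts.evaluation.eval_gsm8k`):

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Condensed re-implementation of the extraction fallbacks, for illustration."""
    # Strategies 1 and 2: explicit answer markers
    for pattern, flags in [(r"####\s*([\d,.-]+)", 0),
                           (r"answer is\s*([\d,.-]+)", re.IGNORECASE)]:
        match = re.search(pattern, text, flags)
        if match:
            return match.group(1).replace(",", "")
    # Strategy 3: last number anywhere in the text
    numbers = re.findall(r"[\d,]+\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

print(extract_answer("She has 3 + 4 = 7 apples. #### 7"))  # 7    (GSM8K marker)
print(extract_answer("So the answer is 1,200 dollars."))   # 1200 (phrase + comma stripping)
print(extract_answer("We compute 12 * 3 = 36 in total."))  # 36   (last-number fallback)
print(extract_answer("I am not sure."))                    # None
```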
### Error taxonomy

GSM8K evaluation classifies errors into three categories:

1. **Extraction errors**: the correct answer appears in the output but wasn't extracted.

    ```python
    if gold_norm in model_output:
        return "extraction"
    ```

2. **Arithmetic errors**: the answer is close (within 20%) to the correct value.

    ```python
    pred_num = float(pred_norm)
    gold_num = float(gold_norm)
    if abs(pred_num - gold_num) < abs(gold_num) * 0.2:
        return "arithmetic"
    ```

3. **Reasoning errors**: wrong approach or logic (the default category).

The error taxonomy helps identify what models struggle with:

```json
{
  "error_taxonomy": {
    "extraction": 12,
    "arithmetic": 8,
    "reasoning": 35
  }
}
```
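Putting the three rules together, the classifier looks roughly like the sketch below (the function name `classify_error` and its exact signature are assumptions; the repo's version may differ):

```python
def classify_error(pred_norm: str, gold_norm: str, model_output: str) -> str:
    """Bucket an incorrect GSM8K prediction into one of three error categories."""
    # Extraction error: the gold answer is present in the raw output,
    # but the extraction heuristics missed it
    if gold_norm in model_output:
        return "extraction"

    # Arithmetic error: the extracted number is within 20% of the gold value
    try:
        pred_num = float(pred_norm)
        gold_num = float(gold_norm)
        if abs(pred_num - gold_num) < abs(gold_num) * 0.2:
            return "arithmetic"
    except (TypeError, ValueError):
        pass  # non-numeric prediction falls through to "reasoning"

    # Default: wrong approach or logic
    return "reasoning"

print(classify_error("18", "17", "So she earns 18 dollars. #### 18"))   # arithmetic
print(classify_error("21", "17", "First 17 pies, minus 4. #### 21"))    # extraction
print(classify_error("100", "17", "The answer is 100."))                # reasoning
```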
### Verifier reranking

When a verifier is provided, multiple solutions are generated and reranked:

```python
# Generate multiple candidate solutions
solutions = generate_solutions(
    model, tokenizer, question, device, n_samples=8
)

# Score each solution with the verifier
scores = []
for sol in solutions:
    score = verifier.score(
        tokenizer(question + sol, return_tensors="pt")["input_ids"].to(device)
    )
    scores.append((score, sol))

# Select the highest-scoring solution
best_sol = max(scores, key=lambda x: x[0])[1]
best_pred = extract_answer(best_sol)
```
Verifier reranking can improve accuracy by selecting the most promising solution from multiple generations. The improvement varies based on verifier quality.
## Unified evaluation

The `evaluate_tasks.py` script evaluates models on all tasks and generates comparison tables.

### Comparing pipeline stages
```bash
python scripts/evaluation/evaluate_tasks.py \
    --stage-checkpoints experiments/runs/gpu-full/ \
    --output-dir experiments/results/
```
This finds checkpoints for pretrain, SFT, and DPO stages and evaluates each:
```text
Found checkpoints:
  pretrain: experiments/runs/gpu-full/gpu-full-pretrain/checkpoint_final.pt
  sft: experiments/runs/gpu-full/gpu-full-sft/checkpoint_final.pt
  dpo: experiments/runs/gpu-full/gpu-full-dpo/checkpoint_final.pt

Evaluating stage: pretrain
  Evaluating SST-2... 65.0% accuracy
  Evaluating GSM8K... 8.5% EM

Evaluating stage: sft
  Evaluating SST-2... 72.5% accuracy
  Evaluating GSM8K... 15.2% EM

Evaluating stage: dpo
  Evaluating SST-2... 75.0% accuracy
  Evaluating GSM8K... 16.8% EM
```
### Generated outputs

The evaluation script generates three files:

- `task_metrics.json`
- `baseline_comparison.md`
- `stage_gains.md`
**`task_metrics.json`:**

```json
[
  {
    "model": "gpu-full-pretrain",
    "stage": "pretrain",
    "is_hf_baseline": false,
    "sst2_accuracy": 0.650,
    "gsm8k_em": 0.085,
    "gsm8k_errors": {
      "extraction": 5,
      "arithmetic": 12,
      "reasoning": 68
    }
  },
  {
    "model": "gpu-full-sft",
    "stage": "sft",
    "sst2_accuracy": 0.725,
    "gsm8k_em": 0.152
  },
  ...
]
```
**`baseline_comparison.md`:**

```markdown
# Model Comparison

| Model | SST-2 Acc | GSM8K EM | Notes |
|-------|-----------|----------|-------|
| gpt2 | 68.0% | N/A | HF Baseline |
| distilgpt2 | 62.5% | N/A | HF Baseline |
| gpu-full-pretrain | 65.0% | 8.5% | pretrain |
| gpu-full-sft | 72.5% | 15.2% | sft |
| gpu-full-dpo | 75.0% | 16.8% | dpo |
```
**`stage_gains.md`:**

```markdown
# Stage-wise Gains

| Stage | SST-2 Acc | GSM8K EM | Δ SST-2 | Δ GSM8K |
|-------|-----------|----------|---------|---------|
| PRETRAIN | 65.0% | 8.5% | +0.0% | +0.0% |
| SFT | 72.5% | 15.2% | +7.5% | +6.7% |
| DPO | 75.0% | 16.8% | +2.5% | +1.6% |
```
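The Δ columns in `stage_gains.md` are each stage's metric minus the previous stage's. A small sketch of that computation, using the numbers from the example run above:

```python
# Stage metrics in pipeline order (values from the example run above)
stages = [
    ("pretrain", {"sst2": 0.650, "gsm8k": 0.085}),
    ("sft",      {"sst2": 0.725, "gsm8k": 0.152}),
    ("dpo",      {"sst2": 0.750, "gsm8k": 0.168}),
]

# Delta for each stage is the gain over the immediately preceding stage
prev = None
for name, metrics in stages:
    d_sst2 = metrics["sst2"] - prev["sst2"] if prev else 0.0
    d_gsm8k = metrics["gsm8k"] - prev["gsm8k"] if prev else 0.0
    print(f"{name.upper():8s} Δ SST-2 {d_sst2:+.1%}  Δ GSM8K {d_gsm8k:+.1%}")
    prev = metrics
```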
### Including baselines

```bash
python scripts/evaluation/evaluate_tasks.py \
    --checkpoint experiments/my-model/checkpoint.pt \
    --include-baselines
```
This evaluates GPT-2 and DistilGPT-2 for comparison:
```text
Evaluating HF baseline: gpt2
  SST-2 Accuracy: 68.0%

Evaluating HF baseline: distilgpt2
  SST-2 Accuracy: 62.5%

Evaluating: experiments/my-model/checkpoint.pt
  SST-2 Accuracy: 72.5%
  GSM8K EM: 15.2%
```
## Custom evaluation

Create custom evaluation scripts using the same loading utilities:

```python
import torch

from scripts.evaluation.eval_sst2 import load_scratch_model

def evaluate_custom_task(model, tokenizer, device):
    """Evaluate on a custom task."""
    # Load your dataset
    dataset = ...

    correct = 0
    total = 0
    for example in dataset:
        prompt = format_prompt(example)
        inputs = tokenizer(prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            outputs = model(inputs["input_ids"])

        prediction = process_output(outputs)
        if prediction == example["label"]:
            correct += 1
        total += 1

    return {"accuracy": correct / total}

# Load model
model, tokenizer = load_scratch_model("checkpoint.pt", "cuda")

# Evaluate
results = evaluate_custom_task(model, tokenizer, "cuda")
print(f"Custom task accuracy: {results['accuracy']:.2%}")
```
## Complete example

Full evaluation pipeline:

```python
import json
import torch
from pathlib import Path
from typing import Optional

from scripts.evaluation.eval_sst2 import (
    load_scratch_model,
    evaluate_sst2,
)
from scripts.evaluation.eval_gsm8k import (
    evaluate_gsm8k,
    load_verifier,
)

def evaluate_model_comprehensive(
    checkpoint_path: str,
    verifier_path: Optional[str] = None,
    output_dir: str = "experiments/results",
):
    """Run all evaluations for a model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load model
    print(f"Loading model: {checkpoint_path}")
    model, tokenizer = load_scratch_model(checkpoint_path, device)

    # SST-2
    print("Evaluating SST-2...")
    sst2_results = evaluate_sst2(
        model, tokenizer, device,
        max_samples=500,
        is_hf_model=False,
    )
    print(f"  SST-2 Accuracy: {sst2_results['accuracy']:.2%}")

    # GSM8K (with optional verifier reranking)
    print("Evaluating GSM8K...")
    verifier = None
    if verifier_path:
        verifier = load_verifier(verifier_path, device)
    gsm8k_results = evaluate_gsm8k(
        model, tokenizer, device,
        verifier=verifier,
        max_samples=100,
        n_samples_per_q=8 if verifier else 1,
    )
    print(f"  GSM8K EM: {gsm8k_results['exact_match_no_verifier']:.2%}")

    # Combine and save results
    results = {
        "checkpoint": checkpoint_path,
        "sst2": sst2_results,
        "gsm8k": gsm8k_results,
    }
    with open(output_dir / "evaluation_results.json", "w") as f:
        json.dump(results, f, indent=2)

    print(f"\nResults saved to {output_dir}/evaluation_results.json")
    return results

# Run evaluation
evaluate_model_comprehensive(
    checkpoint_path="experiments/my-run/final_checkpoint.pt",
    verifier_path="experiments/verifier/checkpoint.pt",
)
```
## See also