Modern LLM includes comprehensive evaluation tools for sentiment classification (SST-2), math reasoning (GSM8K), and custom tasks. Evaluation supports both scratch-trained models and HuggingFace baselines.

Quick start

Evaluate a trained model on all tasks:
# Evaluate single checkpoint
python scripts/evaluation/evaluate_tasks.py \
  --checkpoint experiments/my-run/final_checkpoint.pt \
  --max-sst2 500 \
  --max-gsm8k 100

# Evaluate with baseline comparison
python scripts/evaluation/evaluate_tasks.py \
  --checkpoint experiments/my-run/final_checkpoint.pt \
  --include-baselines

# Evaluate all pipeline stages
python scripts/evaluation/evaluate_tasks.py \
  --stage-checkpoints experiments/runs/gpu-full/

SST-2 sentiment classification

SST-2 evaluates binary sentiment classification using few-shot prompting.

Running SST-2 evaluation

python scripts/evaluation/eval_sst2.py \
  --checkpoint experiments/my-model/checkpoint.pt \
  --max-samples 500 \
  --output experiments/results/sst2_results.json

SST-2 from Python

import torch
from transformers import AutoTokenizer
from scripts.evaluation.eval_sst2 import (
    load_scratch_model,
    evaluate_sst2,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
model, tokenizer = load_scratch_model(
    "experiments/my-run/checkpoint.pt",
    device,
)

# Evaluate
results = evaluate_sst2(
    model=model,
    tokenizer=tokenizer,
    device=device,
    max_samples=500,
    is_hf_model=False,
)

print(f"SST-2 Accuracy: {results['accuracy']:.2%}")
print(f"Correct: {results['correct']}/{results['total']}")

Few-shot prompting

SST-2 evaluation uses a question-format prompt with examples:
FEW_SHOT_PROMPT = """Is this review positive or negative?

Review: "I love this movie, it's fantastic!"
Answer: positive

Review: "This was terrible and boring."
Answer: negative

Review: "A wonderful experience from start to finish."
Answer: positive

Review: "{text}"
Answer:"""
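The {text} slot is filled per example with a plain str.format call. A minimal sketch using an abbreviated version of the template above:

```python
# Abbreviated form of the few-shot template, to show the fill step
TEMPLATE = 'Is this review positive or negative?\n\nReview: "{text}"\nAnswer:'

# Fill the {text} slot with the review under evaluation
prompt = TEMPLATE.format(text="An instant classic.")
print(prompt.endswith("Answer:"))  # True
```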
The model’s next-token logits for “positive” vs “negative” determine the prediction.
Few-shot prompting achieves ~70% accuracy on GPT-2-scale models, versus ~50% (near chance) for simple prompts. The question format (“Is this review positive or negative?”) helps the model understand the task structure.

The predict_sentiment() function extracts the logits for the “positive” and “negative” tokens and compares them:
# Get logits at the last position
next_logits = logits[0, -1, :]

# Compare the first-token logits for " positive" vs " negative"
pos_tokens = tokenizer.encode(" positive", add_special_tokens=False)
neg_tokens = tokenizer.encode(" negative", add_special_tokens=False)

pos_logit = next_logits[pos_tokens[0]].item()
neg_logit = next_logits[neg_tokens[0]].item()

return "positive" if pos_logit > neg_logit else "negative"
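The decision rule itself is a single comparison. A self-contained sketch with a dummy logit vector standing in for a real model forward pass (the token ids here are made up):

```python
import torch

def pick_sentiment(next_logits: torch.Tensor, pos_id: int, neg_id: int) -> str:
    """Compare the logits of the two label tokens at the next position."""
    return "positive" if next_logits[pos_id].item() > next_logits[neg_id].item() else "negative"

# Dummy vocabulary of 10 tokens; pretend ids 3 and 7 map to " positive" / " negative"
logits = torch.zeros(10)
logits[3], logits[7] = 2.5, 1.0
print(pick_sentiment(logits, pos_id=3, neg_id=7))  # positive
```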

GSM8K math reasoning

GSM8K evaluates grade-school math problem solving with chain-of-thought reasoning and optional verifier reranking.

Running GSM8K evaluation

python scripts/evaluation/eval_gsm8k.py \
  --checkpoint experiments/my-model/checkpoint.pt \
  --max-samples 100 \
  --n-samples 1

GSM8K from Python

import torch
from scripts.evaluation.eval_gsm8k import (
    load_scratch_model,
    load_verifier,
    evaluate_gsm8k,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and optional verifier
model, tokenizer = load_scratch_model(
    "experiments/sft/checkpoint.pt",
    device,
)

verifier = None
# verifier = load_verifier("experiments/verifier/checkpoint.pt", device)

# Evaluate
results = evaluate_gsm8k(
    model=model,
    tokenizer=tokenizer,
    device=device,
    verifier=verifier,
    max_samples=100,
    n_samples_per_q=8 if verifier else 1,
)

print(f"Exact Match: {results['exact_match_no_verifier']:.2%}")
if verifier:
    print(f"EM with verifier: {results['exact_match_with_verifier']:.2%}")
    print(f"Improvement: {results['verifier_improvement']:+.2%}")

Answer extraction

GSM8K uses multiple strategies to extract numeric answers:
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Extract final numeric answer from model output."""
    # Look for #### marker (GSM8K format)
    match = re.search(r"####\s*([\d,.-]+)", text)
    if match:
        return match.group(1).replace(",", "")
    
    # Look for "the answer is X" pattern
    match = re.search(r"answer is\s*([\d,.-]+)", text, re.IGNORECASE)
    if match:
        return match.group(1).replace(",", "")
    
    # Fall back to last number in text
    numbers = re.findall(r"[\d,]+\.?\d*", text)
    if numbers:
        return numbers[-1].replace(",", "")
    
    return None
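The cascade order matters: the #### marker wins even when other numbers appear in the text. A compressed, self-contained version of the same cascade, exercised on a few inputs (re.IGNORECASE is applied to both patterns here for brevity):

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    # Same cascade as above: #### marker, then "answer is", then last number
    for pattern in (r"####\s*([\d,.-]+)", r"answer is\s*([\d,.-]+)"):
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1).replace(",", "")
    numbers = re.findall(r"[\d,]+\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

print(extract_answer("... so there are 12 left.\n#### 12"))  # 12
print(extract_answer("Therefore the answer is 1,200"))       # 1200
print(extract_answer("6 boxes times 7 pens is 42 pens"))     # 42
print(extract_answer("I am not sure."))                      # None
```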

Error taxonomy

GSM8K evaluation classifies errors into three categories:
  1. Extraction errors: The correct answer appears in the output but wasn’t extracted
    if gold_norm in model_output:
        return "extraction"
    
  2. Arithmetic errors: The answer is close (within 20%) to the correct value
    pred_num = float(pred_norm)
    gold_num = float(gold_norm)
    if abs(pred_num - gold_num) < abs(gold_num) * 0.2:
        return "arithmetic"
    
  3. Reasoning errors: Wrong approach or logic (default category)
The error taxonomy helps identify what models struggle with:
{
  "error_taxonomy": {
    "extraction": 12,
    "arithmetic": 8,
    "reasoning": 35
  }
}
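A classifier combining the three rules might look like the following sketch (classify_error is an illustrative name; the script's actual helper may differ):

```python
from typing import Optional

def classify_error(pred_norm: Optional[str], gold_norm: str, model_output: str) -> str:
    """Assign a failed prediction to one of the three error categories."""
    # Extraction error: the gold answer appears in the output but wasn't extracted
    if gold_norm in model_output:
        return "extraction"
    # Arithmetic error: extracted answer is within 20% of the gold value
    if pred_norm is not None:
        try:
            pred_num, gold_num = float(pred_norm), float(gold_norm)
            if abs(pred_num - gold_num) < abs(gold_num) * 0.2:
                return "arithmetic"
        except ValueError:
            pass
    # Default: wrong approach or logic
    return "reasoning"

print(classify_error("11", "12", "5 + 6 = 11"))  # arithmetic
```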

Verifier reranking

When a verifier is provided, multiple solutions are generated and reranked:
# Generate multiple solutions
solutions = generate_solutions(
    model, tokenizer, question, device, n_samples=8
)

# Score with verifier
scores = []
for sol in solutions:
    score = verifier.score(
        tokenizer(question + sol, return_tensors="pt")["input_ids"].to(device)
    )
    scores.append((score, sol))

# Select highest-scoring solution
best_sol = max(scores, key=lambda x: x[0])[1]
best_pred = extract_answer(best_sol)
Verifier reranking can improve accuracy by selecting the most promising solution from multiple generations. The improvement varies based on verifier quality.

Unified evaluation

The evaluate_tasks.py script evaluates models on all tasks and generates comparison tables.

Comparing pipeline stages

python scripts/evaluation/evaluate_tasks.py \
  --stage-checkpoints experiments/runs/gpu-full/ \
  --output-dir experiments/results/
This finds checkpoints for pretrain, SFT, and DPO stages and evaluates each:
Found checkpoints:
  pretrain: experiments/runs/gpu-full/gpu-full-pretrain/checkpoint_final.pt
  sft: experiments/runs/gpu-full/gpu-full-sft/checkpoint_final.pt
  dpo: experiments/runs/gpu-full/gpu-full-dpo/checkpoint_final.pt

Evaluating stage: pretrain
  Evaluating SST-2... 65.0% accuracy
  Evaluating GSM8K... 8.5% EM

Evaluating stage: sft
  Evaluating SST-2... 72.5% accuracy
  Evaluating GSM8K... 15.2% EM

Evaluating stage: dpo
  Evaluating SST-2... 75.0% accuracy
  Evaluating GSM8K... 16.8% EM

Generated outputs

The evaluation script generates three files:
[
  {
    "model": "gpu-full-pretrain",
    "stage": "pretrain",
    "is_hf_baseline": false,
    "sst2_accuracy": 0.650,
    "gsm8k_em": 0.085,
    "gsm8k_errors": {
      "extraction": 5,
      "arithmetic": 12,
      "reasoning": 68
    }
  },
  {
    "model": "gpu-full-sft",
    "stage": "sft",
    "sst2_accuracy": 0.725,
    "gsm8k_em": 0.152
  },
  ...
]
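One way to turn the results list into a quick comparison table, using the fields shown above (a minimal sketch; the script's own table output may be formatted differently):

```python
# Per-stage results, as in the JSON output above
results = [
    {"model": "gpu-full-pretrain", "stage": "pretrain", "sst2_accuracy": 0.650, "gsm8k_em": 0.085},
    {"model": "gpu-full-sft", "stage": "sft", "sst2_accuracy": 0.725, "gsm8k_em": 0.152},
    {"model": "gpu-full-dpo", "stage": "dpo", "sst2_accuracy": 0.750, "gsm8k_em": 0.168},
]

# Fixed-width columns via format specs; percentages rendered with one decimal
header = f"{'stage':<10}{'SST-2':>8}{'GSM8K':>8}"
rows = [f"{r['stage']:<10}{r['sst2_accuracy']:>7.1%}{r['gsm8k_em']:>8.1%}" for r in results]
table = "\n".join([header] + rows)
print(table)
```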

Including baselines

python scripts/evaluation/evaluate_tasks.py \
  --checkpoint experiments/my-model/checkpoint.pt \
  --include-baselines
This evaluates GPT-2 and DistilGPT-2 for comparison:
Evaluating HF baseline: gpt2
  SST-2 Accuracy: 68.0%

Evaluating HF baseline: distilgpt2
  SST-2 Accuracy: 62.5%

Evaluating: experiments/my-model/checkpoint.pt
  SST-2 Accuracy: 72.5%
  GSM8K EM: 15.2%

Custom evaluation

Create custom evaluation scripts using the same loading utilities:
import torch
from pathlib import Path
from scripts.evaluation.eval_sst2 import load_scratch_model

def evaluate_custom_task(model, tokenizer, device):
    """Evaluate on custom task."""
    # Load your dataset
    dataset = ...
    
    correct = 0
    total = 0
    
    for example in dataset:
        prompt = format_prompt(example)
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        
        with torch.no_grad():
            outputs = model(inputs["input_ids"])
            prediction = process_output(outputs)
        
        if prediction == example["label"]:
            correct += 1
        total += 1
    
    return {"accuracy": correct / total}

# Load model
model, tokenizer = load_scratch_model("checkpoint.pt", "cuda")

# Evaluate
results = evaluate_custom_task(model, tokenizer, "cuda")
print(f"Custom task accuracy: {results['accuracy']:.2%}")

Complete example

Full evaluation pipeline:
import torch
from pathlib import Path
from scripts.evaluation.eval_sst2 import (
    load_scratch_model,
    evaluate_sst2,
)
from scripts.evaluation.eval_gsm8k import (
    evaluate_gsm8k,
    load_verifier,
)

def evaluate_model_comprehensive(
    checkpoint_path: str,
    verifier_path: str | None = None,
    output_dir: str = "experiments/results",
):
    """Run all evaluations for a model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Load model
    print(f"Loading model: {checkpoint_path}")
    model, tokenizer = load_scratch_model(checkpoint_path, device)
    
    # SST-2
    print("Evaluating SST-2...")
    sst2_results = evaluate_sst2(
        model, tokenizer, device,
        max_samples=500,
        is_hf_model=False,
    )
    print(f"  SST-2 Accuracy: {sst2_results['accuracy']:.2%}")
    
    # GSM8K
    print("Evaluating GSM8K...")
    verifier = None
    if verifier_path:
        verifier = load_verifier(verifier_path, device)
    
    gsm8k_results = evaluate_gsm8k(
        model, tokenizer, device,
        verifier=verifier,
        max_samples=100,
        n_samples_per_q=8 if verifier else 1,
    )
    print(f"  GSM8K EM: {gsm8k_results['exact_match_no_verifier']:.2%}")
    
    # Combine results
    results = {
        "checkpoint": checkpoint_path,
        "sst2": sst2_results,
        "gsm8k": gsm8k_results,
    }
    
    # Save
    import json
    with open(output_dir / "evaluation_results.json", "w") as f:
        json.dump(results, f, indent=2)
    
    print(f"\nResults saved to {output_dir}/evaluation_results.json")
    return results

# Run evaluation
evaluate_model_comprehensive(
    checkpoint_path="experiments/my-run/final_checkpoint.pt",
    verifier_path="experiments/verifier/checkpoint.pt",
)
