run_verifier_training

from modern_llm.training.train_verifier import run_verifier_training
Train a verifier model to score the correctness of math problem solutions. The verifier is a small encoder-based classifier that predicts whether a solution is correct or incorrect.

Parameters

train_config
TrainingConfig
required
Training configuration with hyperparameters, batch sizes, and logging settings.
verifier_config
VerifierConfig
required
Verifier model architecture configuration specifying embedding dimension, layers, and attention heads.
dataset_config
VerifierDatasetConfig
required
Dataset configuration for verifier training data; its fields default to GSM8K with synthetic negatives.
tokenizer_name
str
default:"gpt2"
HuggingFace tokenizer identifier used to tokenize question-answer pairs.
eval_split
Optional[str]
default:"None"
Optional evaluation split name (e.g., “test”). If provided, runs evaluation during training.

Returns

checkpoint_path
Path
Path to the final verifier checkpoint.

Usage

from pathlib import Path
from modern_llm.config import TrainingConfig
from modern_llm.models.verifier import VerifierConfig
from modern_llm.training.train_verifier import (
    run_verifier_training,
    VerifierDatasetConfig,
)

# Configure verifier architecture
verifier_config = VerifierConfig(
    vocab_size=50257,  # Updated from tokenizer
    d_model=512,
    num_layers=4,
    n_heads=8,
    max_position_embeddings=512,
    dropout=0.1,
)

# Configure training
train_config = TrainingConfig(
    run_name="verifier-gsm8k",
    dataset_name="gsm8k",
    tokenizer_name="gpt2",
    output_dir=Path("experiments/verifier"),
    batch_size=32,
    micro_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    max_steps=3000,
    warmup_steps=100,
    weight_decay=0.01,
    eval_every=300,
    save_every=1000,
    log_every=50,
    mixed_precision="bf16",
)

# Configure dataset
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    max_length=512,
    num_examples=7473,  # Full GSM8K train set
    negative_ratio=1.0,  # 1 negative per positive
)

# Train verifier
verifier_ckpt = run_verifier_training(
    train_config=train_config,
    verifier_config=verifier_config,
    dataset_config=dataset_config,
    tokenizer_name="gpt2",
    eval_split="test",
)

print(f"Verifier training complete: {verifier_ckpt}")

VerifierDatasetConfig

from modern_llm.training.train_verifier import VerifierDatasetConfig
Configuration for verifier training data.

Parameters

dataset_name
str
default:"gsm8k"
Name of the math dataset to use for verifier training. Currently supports GSM8K.
split
str
default:"train"
Dataset split to use (train or test).
max_length
int
default:"512"
Maximum sequence length for question + answer pairs. Longer sequences are truncated.
num_examples
Optional[int]
default:"None"
Maximum number of examples to use. If None, uses the entire dataset.
negative_ratio
float
default:"1.0"
Number of negative (incorrect) examples to generate per positive (correct) example. Higher ratios create more balanced datasets but increase training time.

Usage

from modern_llm.training.train_verifier import VerifierDatasetConfig

# Default: balanced positive/negative
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    max_length=512,
    negative_ratio=1.0,
)

# More negatives for harder training
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    negative_ratio=2.0,  # 2 negatives per positive
)

# Small subset for debugging
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    num_examples=100,  # Only 100 examples
)

Verifier Training Details

Data Generation

The verifier trainer automatically generates training data from GSM8K:

Positive Examples:
  • Question + correct answer (labeled as correct)

Negative Examples:
  • Question + wrong answer (labeled as incorrect)
  • Wrong answers are generated by:
    1. Using answers from other problems
    2. Perturbing correct answers (adding to or multiplying by small amounts)
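The two perturbation strategies above can be sketched in a few lines. This is a minimal illustration only; `make_negative` is a hypothetical helper, not the trainer's actual internals:

```python
import random

def make_negative(correct_answer: str, other_answers: list[str]) -> str:
    """Generate a plausible-but-wrong answer for a math problem.

    Sketch only: the real trainer's negative-generation code may differ.
    """
    value = float(correct_answer)
    others = [a for a in other_answers if a != correct_answer]
    strategy = random.choice(["swap", "add", "multiply"])
    if strategy == "swap" and others:
        # Reuse the answer from a different problem
        return random.choice(others)
    if strategy == "multiply":
        # Perturb multiplicatively by a small factor
        return str(value * random.choice([2, 10]))
    # Perturb additively by a small offset
    return str(value + random.choice([1, 2, 5, 10]))
```

All three strategies guarantee the returned string differs from the correct answer, so every generated example is a true negative.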

Format

Each training example is formatted as:
Question: {math_problem}

Answer: {answer}
The verifier predicts a binary label:
  • 1 = correct solution
  • 0 = incorrect solution
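Putting the format and label together, a single training example might be built like this (a sketch; the trainer's real preprocessing and field names are assumptions):

```python
def build_example(question: str, answer: str, is_correct: bool) -> dict:
    """Format a question-answer pair with its binary correctness label."""
    text = f"Question: {question}\n\nAnswer: {answer}"
    return {"text": text, "label": 1 if is_correct else 0}

pos = build_example("If John has 5 apples and buys 3 more, how many does he have?", "8", True)
neg = build_example("If John has 5 apples and buys 3 more, how many does he have?", "12", False)
```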

Model Architecture

The verifier is a transformer encoder with a classification head:
verifier_config = VerifierConfig(
    d_model=512,       # Hidden dimension
    num_layers=4,      # Encoder layers
    n_heads=8,         # Attention heads
    max_position_embeddings=512,
)
Typically 10-50M parameters for efficient inference.
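As a rough sanity check on that size range, the standard transformer estimate (embedding tables plus about 12·d_model² parameters per layer) puts the example configuration comfortably inside it:

```python
vocab_size = 50257
d_model = 512
num_layers = 4
max_pos = 512

# Token and position embedding tables
embed_params = vocab_size * d_model + max_pos * d_model

# Per encoder layer: ~4*d^2 for attention (Q, K, V, output projections)
# plus ~8*d^2 for a 4x-expansion feed-forward block
layer_params = num_layers * 12 * d_model ** 2

total = embed_params + layer_params
print(f"~{total / 1e6:.1f}M parameters")  # ~38.6M, within the 10-50M range
```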

Training Configuration

Learning Rate

Verifiers train with moderate learning rates:
train_config = TrainingConfig(
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=3000,
)

Batch Size

train_config = TrainingConfig(
    batch_size=32,
    micro_batch_size=4,
    gradient_accumulation_steps=8,
)

Dataset Balance

Control positive/negative ratio:
# Balanced (1:1)
dataset_config = VerifierDatasetConfig(negative_ratio=1.0)

# More negatives (1:2)
dataset_config = VerifierDatasetConfig(negative_ratio=2.0)

Evaluation Metrics

The verifier trainer logs:
  • Loss: Binary cross-entropy loss
  • Accuracy: Percentage of correct predictions
# Training logs
step=100 loss=0.4321 accuracy=82.50% lr=1.000e-04
step=200 loss=0.3156 accuracy=87.75% lr=1.000e-04

# Evaluation logs
eval step=300 accuracy=85.23%
Target accuracy: 85-95% on held-out test set.
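Both logged metrics are straightforward to reproduce from first principles. A minimal pure-Python sketch of binary cross-entropy and accuracy over predicted correctness probabilities:

```python
import math

def bce_and_accuracy(probs: list[float], labels: list[int]) -> tuple[float, float]:
    """Binary cross-entropy loss and accuracy for predicted P(correct)."""
    eps = 1e-12  # guard against log(0)
    loss = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(labels)
    # A prediction counts as correct when P(correct) >= 0.5 matches the label
    accuracy = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)
    return loss, accuracy

loss, acc = bce_and_accuracy([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 1])
```

In practice the trainer computes the same quantities over 2-class logits, but the arithmetic is identical.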

Using Trained Verifiers

After training, use the verifier to score solution candidates:
from modern_llm.models.verifier import VerifierModel
from modern_llm.utils.checkpointing import load_checkpoint
from transformers import AutoTokenizer
import torch

# Load verifier
ckpt = load_checkpoint("experiments/verifier/verifier_final.pt")
verifier = VerifierModel.from_config(ckpt["config"])
verifier.load_state_dict(ckpt["model_state"])
verifier.eval()

# Load tokenizer (GPT-2 ships without a pad token, so reuse EOS for padding)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Score a solution
question = "If John has 5 apples and buys 3 more, how many does he have?"
answer = "8"

text = f"Question: {question}\n\nAnswer: {answer}"
tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = verifier(**tokens)
    probs = torch.softmax(outputs["logits"], dim=-1)
    correctness_score = probs[0, 1].item()  # Probability of correct

print(f"Correctness score: {correctness_score:.2%}")

Use Cases

Best-of-N Sampling

Generate multiple solutions and select the highest-scoring one:
# Generate N candidate solutions
# (generate_solution is a user-supplied generation helper)
candidates = [generate_solution(question) for _ in range(10)]

# Score each candidate with the verifier
# (score_solution wraps the scoring code shown above)
scores = [score_solution(verifier, question, ans) for ans in candidates]

# Select the highest-scoring candidate
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best_solution = candidates[best_idx]
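A minimal `score_solution` consistent with the scoring walkthrough above might look like this. It assumes a hypothetical `verifier.classify(text)` method returning an (incorrect, correct) logit pair; adapt it to the real model and tokenizer API:

```python
import math

def score_solution(verifier, question: str, answer: str) -> float:
    """Return the verifier's probability that `answer` is correct.

    Sketch only: `verifier.classify` is an assumed helper standing in
    for the tokenize-and-forward code shown earlier.
    """
    text = f"Question: {question}\n\nAnswer: {answer}"
    logit_incorrect, logit_correct = verifier.classify(text)
    # A two-class softmax reduces to a sigmoid on the logit difference
    return 1.0 / (1.0 + math.exp(logit_incorrect - logit_correct))
```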

Outcome-Based Reward Modeling

Use verifier scores as rewards for RL training:
# In an RLHF training loop (model.generate and verifier.score are
# placeholders for your policy's generation and the scoring code above)
solution = model.generate(question)
score = verifier.score(question, solution)
reward = score  # Use the correctness probability as the RL reward signal

Solution Filtering

Filter out low-confidence predictions:
solution = model.generate(question)
score = verifier.score(question, solution)

if score > 0.8:
    # High confidence, use this solution
    return solution
else:
    # Low confidence, regenerate or abstain
    return None
