run_verifier_training
from modern_llm.training.train_verifier import run_verifier_training
Train a verifier model to score the correctness of math problem solutions. The verifier is a small encoder-based classifier that predicts whether a solution is correct or incorrect.
Parameters
train_config
TrainingConfig
required
Training configuration with hyperparameters, batch sizes, and logging settings.
verifier_config
VerifierConfig
required
Verifier model architecture configuration specifying embedding dimension, layers, and attention heads.
dataset_config
VerifierDatasetConfig
required
Dataset configuration for verifier training data. Defaults to GSM8K with synthetic negatives.
tokenizer_name
str
required
HuggingFace tokenizer identifier used to tokenize question-answer pairs.
eval_split
Optional[str]
default:"None"
Optional evaluation split name (e.g., “test”). If provided, runs evaluation during training.
Returns
Path to the final verifier checkpoint.
Usage
from pathlib import Path
from modern_llm.config import TrainingConfig
from modern_llm.models.verifier import VerifierConfig
from modern_llm.training.train_verifier import (
    run_verifier_training,
    VerifierDatasetConfig,
)

# Configure verifier architecture
verifier_config = VerifierConfig(
    vocab_size=50257,  # Updated from tokenizer
    d_model=512,
    num_layers=4,
    n_heads=8,
    max_position_embeddings=512,
    dropout=0.1,
)

# Configure training
train_config = TrainingConfig(
    run_name="verifier-gsm8k",
    dataset_name="gsm8k",
    tokenizer_name="gpt2",
    output_dir=Path("experiments/verifier"),
    batch_size=32,
    micro_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    max_steps=3000,
    warmup_steps=100,
    weight_decay=0.01,
    eval_every=300,
    save_every=1000,
    log_every=50,
    mixed_precision="bf16",
)

# Configure dataset
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    max_length=512,
    num_examples=7473,  # Full GSM8K train set
    negative_ratio=1.0,  # 1 negative per positive
)

# Train verifier
verifier_ckpt = run_verifier_training(
    train_config=train_config,
    verifier_config=verifier_config,
    dataset_config=dataset_config,
    tokenizer_name="gpt2",
    eval_split="test",
)
print(f"Verifier training complete: {verifier_ckpt}")
VerifierDatasetConfig
from modern_llm.training.train_verifier import VerifierDatasetConfig
Configuration for verifier training data.
Parameters
dataset_name
str
default:"gsm8k"
Name of the math dataset to use for verifier training. Currently supports GSM8K.
split
str
default:"train"
Dataset split to use (train or test).
max_length
int
default:"512"
Maximum sequence length for question + answer pairs. Longer sequences are truncated.
num_examples
Optional[int]
default:"None"
Maximum number of examples to use. If None, uses the entire dataset.
negative_ratio
float
default:"1.0"
Number of negative (incorrect) examples to generate per positive (correct) example. Higher ratios create more balanced datasets but increase training time.
Usage
from modern_llm.training.train_verifier import VerifierDatasetConfig
# Default: balanced positive/negative
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    max_length=512,
    negative_ratio=1.0,
)

# More negatives for harder training
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    negative_ratio=2.0,  # 2 negatives per positive
)

# Small subset for debugging
dataset_config = VerifierDatasetConfig(
    dataset_name="gsm8k",
    split="train",
    num_examples=100,  # Only 100 examples
)
Verifier Training Details
Data Generation
The verifier trainer automatically generates training data from GSM8K:
Positive Examples:
- Question + correct answer (labeled as correct)
Negative Examples:
- Question + wrong answer (labeled as incorrect)
- Wrong answers are generated by:
- Using answers from other problems
- Perturbing correct answers (adding/multiplying by small amounts)
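The two negative-generation strategies above can be sketched as follows. `make_negatives` is a hypothetical helper for illustration, not the library's actual implementation:

```python
import random

def make_negatives(questions, answers, negative_ratio=1.0, seed=0):
    """Generate wrong answers by swapping answers between problems or
    perturbing the correct numeric answer (the two strategies above)."""
    rng = random.Random(seed)
    negatives = []
    n_target = int(len(answers) * negative_ratio)
    for _ in range(n_target):
        i = rng.randrange(len(answers))
        if rng.random() < 0.5:
            # Strategy 1: use the answer from a different problem
            j = rng.randrange(len(answers))
            wrong = answers[j] if j != i else answers[(j + 1) % len(answers)]
        else:
            # Strategy 2: perturb the correct answer by a small amount
            wrong = str(int(answers[i]) + rng.choice([-2, -1, 1, 2]))
        negatives.append((questions[i], wrong))
    return negatives
```

Seeding the generator keeps the synthetic negatives reproducible across runs.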
Each training example is formatted as:
Question: {math_problem}
Answer: {answer}
The verifier predicts a binary label:
1 = correct solution
0 = incorrect solution
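Putting the template and labels together, a single training example might be assembled like this. `build_example` is illustrative, not the library's API, and the exact separator between question and answer is an assumption:

```python
def build_example(question, answer, is_correct):
    # Text follows the "Question: ... / Answer: ..." template above;
    # label is 1 for a correct solution, 0 for an incorrect one
    text = f"Question: {question}\n\nAnswer: {answer}"
    return {"text": text, "label": 1 if is_correct else 0}

pos = build_example("If John has 5 apples and buys 3 more, how many does he have?", "8", True)
neg = build_example("If John has 5 apples and buys 3 more, how many does he have?", "9", False)
```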
Model Architecture
The verifier is a transformer encoder with a classification head:
verifier_config = VerifierConfig(
    d_model=512,     # Hidden dimension
    num_layers=4,    # Encoder layers
    n_heads=8,       # Attention heads
    max_position_embeddings=512,
)
Verifiers typically have 10-50M parameters, keeping inference fast.
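As a rough sanity check on that range, here is a back-of-the-envelope count for the configuration above. It assumes a 4x MLP expansion and a 2-way classification head; the exact breakdown depends on the implementation:

```python
def approx_param_count(vocab_size, d_model, num_layers, max_pos):
    embed = vocab_size * d_model + max_pos * d_model  # token + position embeddings
    attn = 4 * d_model * d_model                      # Q, K, V, output projections
    mlp = 2 * d_model * (4 * d_model)                 # up/down projections (4x expansion)
    head = d_model * 2                                # binary classification head
    return embed + num_layers * (attn + mlp) + head

n = approx_param_count(vocab_size=50257, d_model=512, num_layers=4, max_pos=512)
print(f"{n / 1e6:.1f}M parameters")  # 38.6M parameters
```

Most of the budget goes to the token embedding table; the 4 encoder layers contribute only about 12.6M of the total.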
Training Configuration
Learning Rate
Verifiers train with moderate learning rates:
train_config = TrainingConfig(
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=3000,
)
Batch Size
train_config = TrainingConfig(
    batch_size=32,
    micro_batch_size=4,
    gradient_accumulation_steps=8,
)
Dataset Balance
Control positive/negative ratio:
# Balanced (1:1)
dataset_config = VerifierDatasetConfig(negative_ratio=1.0)
# More negatives (1:2)
dataset_config = VerifierDatasetConfig(negative_ratio=2.0)
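One consequence of `negative_ratio` worth keeping in mind: total dataset size grows as positives × (1 + negative_ratio). For the full GSM8K train split of 7,473 problems:

```python
def total_examples(num_positives, negative_ratio):
    # Each positive contributes itself plus negative_ratio synthetic negatives
    return num_positives + int(num_positives * negative_ratio)

print(total_examples(7473, 1.0))  # 14946 examples at a 1:1 balance
print(total_examples(7473, 2.0))  # 22419 examples at a 1:2 balance
```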
Evaluation Metrics
The verifier trainer logs:
- Loss: Binary cross-entropy loss
- Accuracy: Percentage of correct predictions
# Training logs
step=100 loss=0.4321 accuracy=82.50% lr=1.000e-04
step=200 loss=0.3156 accuracy=87.75% lr=1.000e-04
# Evaluation logs
eval step=300 accuracy=85.23%
Target accuracy: 85-95% on held-out test set.
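The two logged metrics can be computed from verifier logits as in this sketch, which uses two-class cross-entropy (equivalent to binary cross-entropy over a two-logit head):

```python
import torch
import torch.nn.functional as F

def loss_and_accuracy(logits, labels):
    # logits: (batch, 2); labels: (batch,) with values in {0, 1}
    loss = F.cross_entropy(logits, labels)
    preds = logits.argmax(dim=-1)
    accuracy = (preds == labels).float().mean()
    return loss.item(), accuracy.item()

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [3.0, 0.0], [-1.0, 1.0]])
labels = torch.tensor([0, 1, 1, 1])
loss, acc = loss_and_accuracy(logits, labels)  # acc = 0.75 on this toy batch
```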
Using Trained Verifiers
After training, use the verifier to score solution candidates:
from modern_llm.models.verifier import VerifierModel
from modern_llm.utils.checkpointing import load_checkpoint
from transformers import AutoTokenizer
import torch
# Load verifier
ckpt = load_checkpoint("experiments/verifier/verifier_final.pt")
verifier = VerifierModel.from_config(ckpt["config"])
verifier.load_state_dict(ckpt["model_state"])
verifier.eval()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Score a solution
question = "If John has 5 apples and buys 3 more, how many does he have?"
answer = "8"
text = f"Question: {question}\n\nAnswer: {answer}"
tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = verifier(**tokens)
    probs = torch.softmax(outputs["logits"], dim=-1)

correctness_score = probs[0, 1].item()  # Probability of correct
print(f"Correctness score: {correctness_score:.2%}")
Use Cases
Best-of-N Sampling
Generate multiple solutions and select the highest-scoring one:
# Generate N candidate solutions
candidates = [generate_solution(question) for _ in range(10)]
# Score each candidate
scores = [score_solution(verifier, question, ans) for ans in candidates]
# Select best
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best_solution = candidates[best_idx]
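`generate_solution` and `score_solution` above are placeholder helpers. A minimal `score_solution`, assuming the two-logit verifier interface shown earlier (index 1 = probability of correct) and passing the tokenizer explicitly:

```python
import torch

def score_solution(verifier, tokenizer, question, answer, max_length=512):
    """Return the verifier's probability that `answer` correctly solves `question`."""
    text = f"Question: {question}\n\nAnswer: {answer}"
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        logits = verifier(**tokens)["logits"]
    # Softmax over the two classes; index 1 is the "correct" class
    return torch.softmax(logits, dim=-1)[0, 1].item()
```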
Outcome-Based Reward Modeling
Use verifier scores as rewards for RL training:
# In RLHF training loop
solution = model.generate(question)
score = verifier.score(question, solution)
reward = score # Use as RL reward signal
Solution Filtering
Filter out low-confidence predictions:
solution = model.generate(question)
score = verifier.score(question, solution)
if score > 0.8:
    # High confidence, use this solution
    return solution
else:
    # Low confidence, regenerate or abstain
    return None