Verifier training is the fourth and final stage of the Modern LLM pipeline. It trains a separate encoder model to predict whether a solution to a math or reasoning problem is correct, enabling verification-guided inference and best-of-N sampling.

Overview

The verifier is a small transformer encoder that learns to score the correctness of (question, answer) pairs. Unlike the language model stages (pretrain/SFT/DPO), the verifier is:
  • Trained independently - Does not depend on other pipeline stages
  • Much smaller - Typically 4 layers, 512 hidden dim (~20M params)
  • Classification task - Binary prediction: correct (1) or incorrect (0)
  • Fast inference - Can score multiple candidate answers quickly

Use cases

  • Best-of-N sampling - Generate N candidate answers, score each with the verifier, and return the highest-scoring one. This significantly improves accuracy on math and reasoning tasks.
  • Verification during search - Use the verifier to guide beam search or tree search (e.g., in solution verification or code generation).
  • Data filtering - Score synthetic training data to filter out incorrect examples before fine-tuning.
  • Process reward modeling - Extend to score partial solutions for step-by-step verification (not yet implemented).

Dataset

The verifier is trained on GSM8K (Cobbe et al., 2021), a dataset of grade school math word problems:
  • Training set: 7,473 problems
  • Test set: 1,319 problems
  • Format: Question + step-by-step solution + final numeric answer

Synthetic negatives

Since GSM8K only provides correct solutions, we generate negative examples (wrong answers) using two strategies:
  1. Answer substitution: Use the answer from a different problem
  2. Perturbation: Modify the correct answer by ±1, ±10, ×0.5, ×2, etc.
This creates a balanced dataset with 50% positive and 50% negative examples.
The synthetic negatives are generated on-the-fly during dataset loading with controllable negative ratio (default: 1.0 = one negative per positive).
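The two negative-generation strategies can be sketched as follows (a minimal illustration; function and argument names are hypothetical, and the real logic lives in `VerifierDataset._generate_wrong_answer`):

```python
import random

def generate_wrong_answer(correct: float, all_answers: list[float],
                          rng: random.Random) -> float:
    """Produce a plausible but incorrect answer for a verifier negative.

    Strategy 1 (substitution): reuse the answer from a different problem.
    Strategy 2 (perturbation): nudge the correct answer by a small offset
    or scale factor. The exact perturbations here are illustrative.
    """
    if rng.random() < 0.5 and len(all_answers) > 1:
        # Substitution: pick some other problem's answer.
        return rng.choice([a for a in all_answers if a != correct])
    # Perturbation: +/-1, +/-10, x0.5, x2.
    ops = [lambda x: x + 1, lambda x: x - 1,
           lambda x: x + 10, lambda x: x - 10,
           lambda x: x * 0.5, lambda x: x * 2]
    wrong = rng.choice(ops)(correct)
    if wrong == correct:  # e.g. 0 * 2 == 0; force a difference
        wrong = correct + 1
    return wrong
```

Because both branches guarantee the result differs from the gold answer, every generated negative carries a valid label 0.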

Usage

python scripts/run_pipeline.py --config local --stage verifier
The verifier trains independently from pretrain/SFT/DPO. You can run it anytime without waiting for other stages.

Direct script usage

python scripts/train_verifier.py --config local
With custom hyperparameters:
python scripts/train_verifier.py \
    --max-steps 3000 \
    --lr 1e-4 \
    --batch-size 32 \
    --d-model 512 \
    --num-layers 4

Configuration

Config presets

# Quick test (~1 minute)
verifier_max_steps: 50
verifier_lr: 1e-4
verifier_batch_size: 32
verifier_micro_batch_size: 4
# Model size
d_model: 256
num_layers: 4
n_heads: 8
max_seq_len: 256
The verifier model is intentionally kept small for fast inference (the default is 4 layers, 512 dim; the quick-test preset above uses an even smaller 256 dim). Even this small model achieves 70-80% accuracy on GSM8K verification.

Model architecture

The verifier uses a standard transformer encoder:
class VerifierModel(nn.Module):
    def __init__(self, config: VerifierConfig):
        super().__init__()
        # Token and position embeddings
        self.embeddings = nn.Embedding(config.vocab_size, config.d_model)
        self.position_embeddings = nn.Embedding(
            config.max_position_embeddings, config.d_model
        )

        # Transformer encoder layers
        self.encoder = TransformerEncoder(
            d_model=config.d_model,
            num_layers=config.num_layers,
            n_heads=config.n_heads,
        )

        # Classification head
        self.classifier = nn.Linear(config.d_model, 2)  # [incorrect, correct]
Key differences from decoder LM:
  • Bidirectional attention (no causal masking)
  • CLS token pooling for classification
  • Small size for efficiency (~20M params vs 100M+ for LM)

Hyperparameter tuning

Learning rate (verifier_lr)
  • Default: 1e-4 (higher than LM training)
  • The verifier trains from scratch, so it can use a larger LR
  • Range: 5e-5 to 2e-4
Training steps (verifier_max_steps)
  • Default: 3000 steps
  • Usually converges in 2000-3000 steps
  • Monitor accuracy - should reach 70-80%
Model size
  • Default: d=512, L=4 (~20M params)
  • Larger models (d=768, L=6) give ~2-3% better accuracy
  • Smaller models (d=256, L=2) are faster but less accurate
Negative ratio (negative_ratio)
  • Default: 1.0 (one negative per positive)
  • Higher ratios (2.0, 3.0) create harder negatives
  • Lower ratios (0.5) focus on high-confidence examples

Training details

Optimization

Verifier training uses:
  • Optimizer: AdamW with β₁=0.9, β₂=0.99
  • Learning rate schedule: Cosine annealing from verifier_lr to 0
  • Gradient accumulation: Automatic (batch_size / micro_batch_size)
  • Mixed precision: BF16 on supported GPUs
  • Weight decay: 0.01
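The schedule and accumulation arithmetic above can be sketched in plain Python (an illustration of the documented settings; the real trainer may add warmup or other details not shown here):

```python
import math

def cosine_lr(step: int, max_steps: int,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine annealing from base_lr at step 0 toward min_lr at max_steps."""
    progress = min(step / max_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Micro-batches accumulated per optimizer step (batch_size / micro_batch_size)."""
    assert batch_size % micro_batch_size == 0
    return batch_size // micro_batch_size
```

For example, with the defaults (`verifier_batch_size: 32`, `verifier_micro_batch_size: 4`) each optimizer step accumulates gradients over 8 micro-batches.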

Loss function

Standard cross-entropy over the two classes (equivalent to binary cross-entropy on the "correct" logit):
loss = F.cross_entropy(
    logits,          # (batch_size, 2)
    labels,          # (batch_size,) ∈ {0, 1}
    reduction='mean'
)
No class weighting needed since positives and negatives are balanced.

Data format

Examples are formatted as:
Question: A bakery sells 5 loaves of bread for $3 each. How much do they make?

Answer: 15
The verifier predicts:
  • 1 (correct) if the answer matches the gold answer
  • 0 (incorrect) if the answer is wrong
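A minimal sketch of turning a (question, answer, gold) triple into verifier input text plus a 0/1 label, following the template above (the function name and exact separators are assumptions; the real pipeline formats and tokenizes elsewhere):

```python
def format_example(question: str, answer: str, gold: str) -> tuple[str, int]:
    """Render a (question, answer) pair as verifier input text plus a label.

    Label is 1 if the candidate answer matches the gold answer, else 0.
    """
    text = f"Question: {question}\n\nAnswer: {answer}"
    label = 1 if answer.strip() == gold.strip() else 0
    return text, label
```

Positives use the gold answer itself; negatives substitute a synthetically generated wrong answer, yielding label 0.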

Metrics

Key metrics logged during training:

Accuracy - Primary metric
  • % of examples correctly classified
  • Target: 70-80% on train, 65-75% on test
  • Random baseline: 50%
Loss - Binary cross-entropy
  • Should decrease from ~0.7 to ~0.3-0.4
  • Lower loss doesn’t always mean better accuracy

Checkpoints

Verifier saves checkpoints at regular intervals:
  • Regular checkpoints: Every save_every steps (default: 2000)
    • Format: <run_name>-verifier_step{N}.pt
  • Final checkpoint: At end of training
    • Format: <run_name>-verifier_final.pt
Checkpoint structure:
checkpoint = {
    'model_state': OrderedDict(...),      # Verifier encoder weights
    'optimizer_state': {...},             # Optimizer state
    'config': {...},                      # VerifierConfig
    'step': 3000,                         # Training step
    'run_name': 'local-full-verifier',
}

Monitoring

Verifier training progress:
Verifier: 20.3M parameters
Loading verifier dataset: gsm8k
Training examples: 14946 (7473 positive, 7473 negative)
Eval examples: 2638 (1319 positive, 1319 negative)
Starting verifier training for 3000 steps

Verifier Training: 100%|████████| 3000/3000 [0:58:30<00:00, loss=0.3456, acc=73.45%]
step=500 loss=0.5234 accuracy=68.23% lr=9.511e-05
step=1000 loss=0.4123 accuracy=71.45% lr=8.090e-05
step=1500 loss=0.3678 accuracy=73.12% lr=6.180e-05
step=2000 loss=0.3456 accuracy=73.89% lr=4.090e-05
step=2500 loss=0.3312 accuracy=74.23% lr=2.045e-05
step=3000 loss=0.3245 accuracy=74.56% lr=1.545e-06

eval step=3000 accuracy=72.34%
Verifier training complete. Final checkpoint: experiments/runs/local-full/verifier_final.pt

Quality indicators

Good verifier training:
  • Train accuracy reaches 70-80%
  • Eval accuracy is within 2-3% of train
  • Loss decreases smoothly
Signs of overfitting:
  • Train accuracy > 85% but eval accuracy < 70%
  • Large train/eval accuracy gap
  • Solution: Reduce steps, increase weight decay, or add regularization
Signs of underfitting:
  • Accuracy plateaus below 65%
  • Loss stays high (>0.5)
  • Solution: Train longer, increase model size, or check data quality
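The overfitting check above can be automated with simple early stopping on eval accuracy (a sketch only; the training loop is not stated to implement this):

```python
def should_stop(eval_accs: list[float], patience: int = 3) -> bool:
    """Stop if eval accuracy has not improved over the last `patience` evals.

    eval_accs: eval accuracies in chronological order.
    """
    if len(eval_accs) <= patience:
        return False
    best_before = max(eval_accs[:-patience])
    # No recent eval beat the earlier best -> plateau, stop training.
    return max(eval_accs[-patience:]) <= best_before
```

Pairing this with the documented eval cadence would catch the "train accuracy climbing while eval accuracy stalls" pattern early.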

Implementation details

Verifier implementation is located at:
  • src/modern_llm/training/train_verifier.py - Training loop
  • src/modern_llm/models/verifier.py - Model architecture
  • scripts/train_verifier.py - CLI wrapper
  • scripts/run_pipeline.py:run_verifier() - Pipeline integration
Key functions:

run_verifier_training(train_config, verifier_config, dataset_config, tokenizer_name)
src/modern_llm/training/train_verifier.py:349-425 - Main verifier entrypoint:
  1. Initialize verifier model from scratch
  2. Load GSM8K with synthetic negatives
  3. Setup optimizer and trainer
  4. Run training with eval
  5. Save final checkpoint
VerifierDataset._generate_wrong_answer(correct, all_answers, idx)
src/modern_llm/training/train_verifier.py:117-146 - Generates synthetic negative examples:
  1. Choose strategy (substitution or perturbation)
  2. Apply transformation to correct answer
  3. Return plausible but wrong answer
VerifierModel.forward(input_ids, attention_mask, labels)
src/modern_llm/models/verifier.py - Forward pass:
  1. Embed tokens and add positional encodings
  2. Pass through encoder layers
  3. Pool to get sequence representation
  4. Classify as correct/incorrect
  5. Compute loss if labels provided
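The pooling and classification steps (3-4) can be sketched with NumPy, assuming the encoder output is already computed and that CLS pooling takes the first position (a simplified stand-in for the real PyTorch forward):

```python
import numpy as np

def pool_and_classify(hidden_states: np.ndarray,
                      w_cls: np.ndarray, b_cls: np.ndarray) -> np.ndarray:
    """CLS pooling + 2-way classification over encoder output.

    hidden_states: (batch, seq_len, d_model) encoder output.
    w_cls: (2, d_model) classifier weights; b_cls: (2,) bias.
    Returns per-class probabilities of shape (batch, 2),
    ordered [incorrect, correct].
    """
    cls = hidden_states[:, 0, :]          # (batch, d_model) -- CLS pooling
    logits = cls @ w_cls.T + b_cls        # (batch, 2)
    # Numerically stable softmax over the two classes
    z = logits - logits.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
```

The `verifier.score()` probability used later corresponds to the second ("correct") column of this output.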

Using the verifier

Once trained, use the verifier for inference:

Best-of-N sampling

Generate multiple answers and return the best:
import numpy as np

from modern_llm.models import VerifierModel
from modern_llm.utils.checkpointing import load_checkpoint

# Load verifier
ckpt = load_checkpoint("experiments/runs/local-full/verifier_final.pt")
verifier = VerifierModel.from_checkpoint(ckpt)

# Generate N candidates with your trained language model
question = "What is 15 × 23?"
candidates = [
    language_model.generate(question) for _ in range(8)
]

# Score each candidate
scores = [
    verifier.score(question, answer)
    for answer in candidates
]

# Return the highest-scoring candidate
best_answer = candidates[np.argmax(scores)]

Verification API

# Score a single answer
score = verifier.score(
    question="What is 2 + 2?",
    answer="4"
)  # Returns probability ∈ [0, 1]

# Binary decision
is_correct = verifier.verify(
    question="What is 2 + 2?",
    answer="5"
)  # Returns True/False

Performance tips

To improve accuracy:
  • Increase model size (d=768, L=6)
  • Train longer (5000 steps)
  • Use hard negatives (negative_ratio=2.0)
  • Add more data (other math datasets like MATH, MathQA)
To speed up inference:
  • Use a smaller model (d=256, L=2)
  • Quantize to int8 (not yet supported)
  • Batch multiple candidates together
  • Cache the verifier on GPU for repeated use
To reduce training memory:
  • Lower micro_batch_size
  • Use gradient checkpointing
  • Reduce max_seq_len to 256
  • Use a smaller model
To generalize beyond GSM8K:
  • Train on multiple datasets (GSM8K + MATH + MathQA)
  • Use harder problems (AQuA, MATH Level 5)
  • Add code execution problems (HumanEval, MBPP)
  • Fine-tune on your specific domain

Evaluation

Evaluate the verifier on held-out test set:
python scripts/evaluate_pipeline.py \
    --checkpoint experiments/runs/local-full/verifier_final.pt \
    --stage verifier
This reports:
  • Accuracy: % of correct classifications
  • Precision: Of examples marked correct, % truly correct
  • Recall: Of truly correct examples, % marked correct
  • F1 score: Harmonic mean of precision and recall
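These four metrics reduce to simple counts over the confusion matrix; a self-contained sketch (helper name hypothetical, not part of the evaluation script):

```python
def classification_metrics(preds: list[int], labels: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary verifier predictions.

    "Positive" means predicted/labelled correct (class 1).
    """
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because the eval set is balanced (equal positives and negatives), accuracy is already informative, but precision/recall reveal whether the verifier is biased toward accepting or rejecting answers.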

Expected results

On GSM8K test set:
  • Small model (d=512, L=4): 70-75% accuracy
  • Medium model (d=768, L=6): 75-80% accuracy
  • Large model (d=1024, L=8): 78-82% accuracy
These numbers are for verifying final numeric answers. Process verification (scoring intermediate steps) is more challenging and typically achieves lower accuracy.

Next steps

After verifier training completes, you have a complete Modern LLM system:
  1. Pretrained language model - General language understanding
  2. SFT model - Instruction-following capability
  3. DPO model - Preference-aligned responses
  4. Verifier - Answer correctness scoring
You can now:
  • Deploy the DPO model with best-of-N sampling using the verifier
  • Run comprehensive evaluations on standard benchmarks
  • Iterate with more data and improved hyperparameters
  • Extend the verifier to other domains (code, reasoning, etc.)

Pipeline overview

Return to the pipeline overview for next steps
