Overview
The verifier is a small transformer encoder that learns to score the correctness of (question, answer) pairs. Unlike the language model stages (pretrain/SFT/DPO), the verifier is:

- Trained independently - Does not depend on other pipeline stages
- Much smaller - Typically 4 layers, 512 hidden dim (~20M params)
- Classification task - Binary prediction: correct (1) or incorrect (0)
- Fast inference - Can score multiple candidate answers quickly
Use cases
Best-of-N sampling
Generate N candidate answers, score each with the verifier, and return the highest-scoring one. This significantly improves accuracy on math and reasoning tasks.

Verification during search
Use the verifier to guide beam search or tree search (e.g., in solution verification or code generation).

Data filtering
Score synthetic training data to filter out incorrect examples before fine-tuning.

Process reward modeling
Extend the verifier to score partial solutions for step-by-step verification (not yet implemented).

Dataset
The verifier is trained on GSM8K (Cobbe et al., 2021), a dataset of grade school math word problems:

- Training set: 7,473 problems
- Test set: 1,319 problems
- Format: Question + step-by-step solution + final numeric answer
Synthetic negatives
Since GSM8K only provides correct solutions, we generate negative examples (wrong answers) using two strategies:

- Answer substitution: Use the answer from a different problem
- Perturbation: Modify the correct answer by ±1, ±10, ×0.5, ×2, etc.
The synthetic negatives are generated on-the-fly during dataset loading with controllable negative ratio (default: 1.0 = one negative per positive).
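As an illustration, the two strategies above might be sketched as follows (the function and argument names here are hypothetical, not the repo's actual API):

```python
import random

def make_negative(correct_answer, other_answers, rng):
    """Generate a plausible wrong answer via substitution or perturbation.

    Illustrative sketch only; the repo's actual strategies live in
    train_verifier.py. `other_answers` holds answers from other problems.
    """
    if other_answers and rng.random() < 0.5:
        # Answer substitution: borrow the answer from a different problem.
        candidate = rng.choice(other_answers)
        if candidate != correct_answer:
            return candidate
    # Perturbation: nudge the correct answer by +/-1, +/-10, x0.5, or x2.
    perturb = rng.choice([
        lambda x: x + 1, lambda x: x - 1,
        lambda x: x + 10, lambda x: x - 10,
        lambda x: x * 0.5, lambda x: x * 2,
    ])
    return perturb(correct_answer)
```

The substitution branch falls through to perturbation when it happens to draw the correct answer, so the result is always a genuine negative (for nonzero answers).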
Usage
Using the pipeline runner (recommended)
The verifier trains independently from pretrain/SFT/DPO. You can run it anytime without waiting for other stages.
Direct script usage
Configuration
Config presets
The verifier model is intentionally kept small (4 layers, 512 dim) for fast inference. Even this small model achieves 70-80% accuracy on GSM8K verification.
Model architecture
The verifier uses a standard transformer encoder:

- Bidirectional attention (no causal masking)
- CLS token pooling for classification
- Small size for efficiency (~20M params vs 100M+ for LM)
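A back-of-the-envelope parameter count shows roughly where the ~20M figure comes from. The vocabulary size below is an assumed value for illustration, not the repo's actual tokenizer size:

```python
def encoder_param_estimate(d_model, n_layers, vocab_size=16_000):
    """Rough parameter count for a transformer encoder classifier.

    Per layer: ~4*d^2 for attention (Q, K, V, output projections)
    plus ~8*d^2 for a 4x-wide feed-forward block, i.e. ~12*d^2 total.
    Embeddings add vocab_size * d_model. Biases/norms are ignored.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings
```

With d=512 and L=4 this lands near 20M parameters, consistent with the size quoted above.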
Hyperparameter tuning
Learning rate (verifier_lr)
- Default: 1e-4 (higher than LM training) - the verifier trains from scratch, so it can use a larger LR
- Range: 5e-5 to 2e-4
Training steps (verifier_max_steps)
- Default: 3000 steps - usually converges in 2000-3000 steps
- Monitor accuracy - should reach 70-80%
Model size
- Default: d=512, L=4 (~20M params)
- Larger models (d=768, L=6) give ~2-3% better accuracy
- Smaller models (d=256, L=2) are faster but less accurate
Negative ratio (negative_ratio)
- Default: 1.0 (one negative per positive)
- Higher ratios (2.0, 3.0) create harder negatives
- Lower ratios (0.5) focus on high-confidence examples
Training details
Optimization
Verifier training uses:

- Optimizer: AdamW with β₁=0.9, β₂=0.99
- Learning rate schedule: Cosine annealing from verifier_lr to 0
- Gradient accumulation: Automatic (batch_size / micro_batch_size)
- Mixed precision: BF16 on supported GPUs
- Weight decay: 0.01
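The cosine annealing schedule can be sketched in a few lines (no warmup assumed here; the repo's actual schedule may differ):

```python
import math

def cosine_lr(step, max_steps, base_lr=1e-4):
    """Cosine annealing from base_lr at step 0 down to 0 at max_steps."""
    progress = min(step / max_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

At the midpoint of training this yields exactly half the base learning rate, decaying smoothly to zero by the final step.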
Loss function
Standard binary cross-entropy on the predicted probability of correctness.

Data format
Examples are formatted as (question, answer) pairs with a binary label:

- 1 (correct) if the answer matches the gold answer
- 0 (incorrect) if the answer is wrong
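The binary cross-entropy loss described above can be sketched on a single example (the eps clamping is a common implementation detail, not necessarily the repo's):

```python
import math

def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for one (predicted probability, 0/1 label) pair."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
```

A coin-flip prediction (prob=0.5) gives loss ln(2) ≈ 0.69, which is why training loss starts near ~0.7.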
Metrics
Key metrics logged during training:

Accuracy - Primary metric
- % of examples correctly classified
- Target: 70-80% on train, 65-75% on test
- Random baseline: 50%

Loss
- Should decrease from ~0.7 to ~0.3-0.4
- Lower loss doesn't always mean better accuracy
Checkpoints
The verifier saves checkpoints at regular intervals:

- Regular checkpoints: Every save_every steps (default: 2000)
  - Format: <run_name>-verifier_step{N}.pt
- Final checkpoint: At end of training
  - Format: <run_name>-verifier_final.pt
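A hypothetical helper reproducing the filename formats above (the out_dir argument is illustrative; the repo's actual save location may differ):

```python
from pathlib import Path

def checkpoint_path(out_dir, run_name, step=None):
    """Build a checkpoint path matching the naming scheme described above.

    step=None means the final checkpoint; otherwise a regular step checkpoint.
    """
    if step is None:
        name = f"{run_name}-verifier_final.pt"
    else:
        name = f"{run_name}-verifier_step{step}.pt"
    return Path(out_dir) / name
```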
Monitoring
Monitor verifier training progress through the logged loss and accuracy metrics.

Quality indicators
Good verifier training:

- Train accuracy reaches 70-80%
- Eval accuracy is within 2-3% of train
- Loss decreases smoothly
Signs of overfitting:
- Train accuracy > 85% but eval accuracy < 70%
- Large train/eval accuracy gap
- Solution: Reduce steps, increase weight decay, or add regularization

Signs of underfitting:
- Accuracy plateaus below 65%
- Loss stays high (>0.5)
- Solution: Train longer, increase model size, or check data quality
Implementation details
Verifier implementation is located at:

- src/modern_llm/training/train_verifier.py - Training loop
- src/modern_llm/models/verifier.py - Model architecture
- scripts/train_verifier.py - CLI wrapper
- scripts/run_pipeline.py:run_verifier() - Pipeline integration
src/modern_llm/training/train_verifier.py:349-425
Main verifier entrypoint:
- Initialize verifier model from scratch
- Load GSM8K with synthetic negatives
- Setup optimizer and trainer
- Run training with eval
- Save final checkpoint
src/modern_llm/training/train_verifier.py:117-146
Generates synthetic negative examples:
- Choose strategy (substitution or perturbation)
- Apply transformation to correct answer
- Return plausible but wrong answer
src/modern_llm/models/verifier.py
Forward pass:
- Embed tokens and add positional encodings
- Pass through encoder layers
- Pool to get sequence representation
- Classify as correct/incorrect
- Compute loss if labels provided
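The forward pass above can be sketched with toy NumPy shapes (illustrative only; the real model uses learned encoder layers and d=512):

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, vocab = 8, 6, 100  # toy sizes for illustration

# 1. Embed tokens and add positional encodings
tok_emb = rng.normal(size=(vocab, d))
pos_emb = rng.normal(size=(seq_len, d))
tokens = np.array([1, 5, 9, 2, 4, 3])
x = tok_emb[tokens] + pos_emb            # (seq_len, d)

# 2. Encoder layers would transform x here (omitted in this sketch)

# 3. Pool: take the first (CLS) position as the sequence representation
cls_repr = x[0]

# 4. Classify: linear head + sigmoid -> probability the answer is correct
w, b = rng.normal(size=d), 0.0
prob = 1.0 / (1.0 + np.exp(-(cls_repr @ w + b)))
```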
Using the verifier
Once trained, use the verifier for inference.

Best-of-N sampling
Generate multiple answers and return the best.

Verification API
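A minimal sketch of such an API: a best-of-N wrapper where `generate` and `score` are hypothetical callables standing in for the trained LM and verifier:

```python
def best_of_n(question, generate, score, n=8):
    """Sample n candidate answers and keep the one the verifier scores highest.

    generate(question) -> answer string (the language model, assumed stochastic)
    score(question, answer) -> float correctness score (the verifier)
    """
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda ans: score(question, ans))
```

Batching all n candidates through the verifier in one forward pass (rather than scoring one at a time) is the usual way to keep this fast.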
Performance tips
Improve accuracy
- Increase model size (d=768, L=6)
- Train longer (5000 steps)
- Use hard negatives (negative_ratio=2.0)
- Add more data (other math datasets like MATH, MathQA)
Speed up inference
- Use smaller model (d=256, L=2)
- Quantize to int8 (not yet supported)
- Batch multiple candidates together
- Cache verifier on GPU for repeated use
Reduce memory usage
- Lower micro_batch_size
- Use gradient checkpointing
- Reduce max_seq_len to 256
- Use smaller model
Generalize beyond GSM8K
- Train on multiple datasets (GSM8K + MATH + MathQA)
- Use harder problems (AQuA, MATH Level 5)
- Add code execution problems (HumanEval, MBPP)
- Fine-tune on your specific domain
Evaluation
Evaluate the verifier on the held-out test set:

- Accuracy: % of correct classifications
- Precision: Of examples marked correct, % truly correct
- Recall: Of truly correct examples, % marked correct
- F1 score: Harmonic mean of precision and recall
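These metrics can be computed directly from binary predictions; a minimal sketch:

```python
def classification_metrics(preds, labels):
    """Accuracy, precision, recall, and F1 for binary 0/1 predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```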
Expected results
On the GSM8K test set:

- Small model (d=512, L=4): 70-75% accuracy
- Medium model (d=768, L=6): 75-80% accuracy
- Large model (d=1024, L=8): 78-82% accuracy
These numbers are for verifying final numeric answers. Process verification (scoring intermediate steps) is more challenging and typically achieves lower accuracy.
Next steps
After verifier training completes, you have a complete Modern LLM system:

- Pretrained language model - General language understanding
- SFT model - Instruction-following capability
- DPO model - Preference-aligned responses
- Verifier - Answer correctness scoring
Recommended next steps:

- Deploy the DPO model with best-of-N sampling using the verifier
- Run comprehensive evaluations on standard benchmarks
- Iterate with more data and improved hyperparameters
- Extend the verifier to other domains (code, reasoning, etc.)
Pipeline overview
Return to the pipeline overview for next steps.