Overview
DPO (Rafailov et al., 2023) is an alternative to RLHF that directly optimizes the model to satisfy human preferences without requiring:
- A separate reward model
- Reinforcement learning (PPO)
- PPO's complex training instability issues
DPO objective
Given pairs (x, y_w, y_l), where y_w is the preferred (chosen) response and y_l is the rejected response to prompt x:
- π_θ = fine-tuning policy (your model)
- π_ref = reference policy (frozen SFT model)
- β = temperature parameter (default: 0.1)
- σ = sigmoid function
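Putting these symbols together, the objective (as given in Rafailov et al., 2023) is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
      \log \sigma\!\Bigl(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Bigr)
    \right]
```

The β-scaled log-ratios act as implicit rewards; the loss pushes the implicit reward of y_w above that of y_l.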
Modern LLM uses an implicit reference approach: reference log-probs are computed with
`torch.no_grad()` from the same model before computing gradients. This saves memory by not storing a separate reference model.
Supported datasets
| Dataset | Pairs | Domain | Description |
|---|---|---|---|
| Anthropic/hh-rlhf | 161K | General helpfulness | Human preference data from Anthropic’s RLHF work |
| OpenAssistant/oasst1 | 88K | Multi-turn dialog | Community-generated preference conversations |
| stanfordnlp/SHP | 385K | Reddit comments | Social preferences from Reddit upvotes |
Usage
Using the pipeline runner (recommended)
Direct script usage
Configuration
Config presets
DPO learning rates are even lower than for SFT (5e-6 vs. 1e-5) because DPO makes fine adjustments to an already well-tuned model.
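As a rough sketch, a preset collects the parameters discussed below. The key names mirror this page's parameter names; the actual preset files in the repo may use a different structure:

```python
# Hypothetical DPO preset; key names follow this page's parameter names,
# not necessarily the repo's actual config schema.
dpo_config = {
    "dpo_lr": 5e-6,          # lower than the SFT default of 1e-5
    "beta": 0.1,             # DPO temperature parameter
    "dpo_max_steps": 2000,   # 3000 for GPU runs
    "dpo_batch_size": 16,    # effective batch; reached via gradient accumulation
}
```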
Hyperparameter tuning
Beta (β) - temperature parameter
- Default: 0.1 balances preference strength and diversity
- Higher β (0.2-0.5): stronger preference signal, more aggressive alignment
- Lower β (0.01-0.05): gentler updates, preserves more SFT behavior
- Range: 0.01 to 0.5
Learning rate (`dpo_lr`)
- Default: 5e-6 provides stable updates
- Too high: model diverges from SFT behavior
- Too low: minimal preference learning
- Range: 1e-6 to 1e-5
Max steps (`dpo_max_steps`)
- Default: 2000 for local runs, 3000 for GPU
- Monitor accuracy: should reach 60-70% preference accuracy
- Stop if accuracy plateaus or starts decreasing
Batch size (`dpo_batch_size`)
- Default: 16 (smaller than SFT due to memory overhead)
- Each example requires 2 forward passes (chosen + rejected)
- Prefer gradient accumulation over larger micro-batches
Training details
Optimization
DPO training uses:
- Optimizer: AdamW with β₁=0.9, β₂=0.95
- Learning rate schedule: cosine annealing from `dpo_lr` to 0
- Gradient accumulation: automatic (batch_size / micro_batch_size)
- Mixed precision: BF16 on supported GPUs
- Weight decay: 0.0 (disabled to preserve the SFT model)
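A minimal PyTorch sketch of this optimizer setup (the `Linear` model and step count are placeholders):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the policy network
max_steps = 2000

# AdamW with beta1=0.9, beta2=0.95 and weight decay disabled,
# cosine-annealed from the DPO learning rate down to 0.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-6, betas=(0.9, 0.95), weight_decay=0.0
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=max_steps, eta_min=0.0
)
```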
Training loop
For each preference pair:
- Compute reference log-probs: a forward pass with `torch.no_grad()` gives π_ref(y_w|x) and π_ref(y_l|x). These serve as the implicit reference policy.
- Compute policy log-probs: a forward pass with gradients enabled gives π_θ(y_w|x); the rejected response's log-prob is reused from the reference pass, since the implicit reference shares the model's weights.
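A sequence log-prob can be computed by summing token log-probs over the response span. A minimal sketch, assuming the usual shapes; the helper name and exact masking convention are illustrative, not the repo's API:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """Sum of log p(label_t | context) over positions where mask == 1.

    logits: (B, T, V) model outputs aligned with targets;
    labels: (B, T) target token ids;
    mask:   (B, T) 1.0 on response tokens, 0.0 on prompt/padding.
    """
    logp = F.log_softmax(logits, dim=-1)                     # (B, T, V)
    tok = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (tok * mask).sum(dim=-1)                          # (B,)
```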
Loss components
The DPO loss has two terms:
- Preference term: pushes chosen responses to be more likely
- KL penalty (implicit): prevents the model from drifting too far from the reference
Metrics
Key metrics logged during training:
Loss - DPO objective value
- Should decrease from ~0.7 to ~0.3-0.5
- Sudden increases indicate instability (reduce LR or β)
Accuracy - % of examples where p(chosen) > p(rejected)
- Target: 60-70% (random baseline is 50%)
- Above 80% may indicate overfitting
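The accuracy metric can be sketched in a few lines (a hedged sketch; the repo's logging code may differ):

```python
import torch

def preference_accuracy(chosen_logps, rejected_logps):
    """Fraction of pairs where the model assigns higher log-prob
    to the chosen response than to the rejected one."""
    chosen = torch.as_tensor(chosen_logps)
    rejected = torch.as_tensor(rejected_logps)
    return (chosen > rejected).float().mean().item()
```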
Checkpoints
DPO saves checkpoints at regular intervals:
- Regular checkpoints: every `save_every` steps (default: 2000)
  - Format: `<run_name>-dpo_step{N}.pt`
- Final checkpoint: at end of training
  - Format: `<run_name>-dpo_final.pt`
  - This is your final aligned model
Monitoring
Track DPO training progress with these quality indicators.
Good DPO training:
- Accuracy improves steadily to 60-70%
- Loss decreases smoothly
- No sudden spikes or divergence
Overfitting:
- Accuracy > 80%
- Model becomes conservative or repetitive
- Solution: reduce steps, increase β, or add more data
Undertraining:
- Accuracy stays near 50-55%
- Minimal improvement over SFT
- Solution: increase steps, tune β, check data quality
Instability:
- Loss spikes or NaN
- Accuracy fluctuates wildly
- Solution: reduce LR (to 1e-6), lower β, or check for data issues
Implementation details
The DPO implementation is located at:
- `src/modern_llm/training/train_dpo.py` - main DPO training
- `src/modern_llm/alignment/dpo_loss.py` - DPO loss function
- `scripts/dpo.py` - CLI wrapper
- `scripts/run_pipeline.py:run_dpo()` - pipeline integration
src/modern_llm/training/train_dpo.py:355-416
Main DPO entrypoint:
- Load SFT model from checkpoint
- Load preference dataset pairs
- Setup DPO trainer with β parameter
- Run training loop with accuracy tracking
- Save final DPO checkpoint
src/modern_llm/alignment/dpo_loss.py
Computes DPO objective:
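A sketch of what this computes; the function name and signature are illustrative, not the repo's exact API:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective on summed sequence log-probs (all tensors shape (B,))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for y_w
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for y_l
    logits = beta * (chosen_ratio - rejected_ratio)
    loss = -F.logsigmoid(logits).mean()
    # Preference accuracy: fraction of pairs where the implicit reward
    # of the chosen response exceeds that of the rejected one.
    accuracy = (logits > 0).float().mean()
    return loss, accuracy
```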
src/modern_llm/training/train_dpo.py:236-313
Executes one DPO training step:
- Compute reference log-probs (no grad)
- Compute policy log-probs for chosen (with grad)
- Calculate DPO loss
- Backward and accumulate gradients
- Update on full batch
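Put together, one step might look like the following sketch. The `score` callable standing in for sequence-log-prob computation is hypothetical, and the real `train_dpo.py` differs in detail:

```python
import torch
import torch.nn.functional as F

def dpo_step(score, batch, beta=0.1, accum_steps=1):
    """One DPO micro-step; `score` maps a batch of inputs to summed log-probs (B,)."""
    # 1) Reference log-probs from the same weights, without gradients
    #    (the implicit-reference trick described above).
    with torch.no_grad():
        ref_w = score(batch["chosen"])
        ref_l = score(batch["rejected"])
    # 2) Policy log-probs with gradients for the chosen response; the rejected
    #    log-prob is reused from the no-grad pass, since the weights are shared.
    pol_w = score(batch["chosen"])
    pol_l = ref_l
    # 3) DPO loss, scaled for gradient accumulation. The caller invokes
    #    optimizer.step() / zero_grad() after accum_steps micro-batches.
    logits = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    loss = -F.logsigmoid(logits).mean() / accum_steps
    loss.backward()
    return loss.item() * accum_steps
```

With an implicit reference recomputed in the same step, the loss evaluates to ln 2 at the current weights, but gradients still flow through `pol_w` and push the chosen response up.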
Performance tips
Reduce memory usage
- Use micro_batch_size=1 (DPO needs 2x forward passes per example)
- Enable gradient checkpointing
- Use implicit reference (default) instead of storing π_ref
- Reduce max_seq_len to 512
Speed up training
- Sample large preference datasets (e.g., hh-rlhf:50000)
- Increase micro_batch_size if GPU memory allows (2-4)
- Use shorter sequences
- Enable bf16 mixed precision
Improve alignment quality
- Tune beta (try 0.05, 0.1, 0.2)
- Train longer (3000-5000 steps)
- Use multiple preference datasets
- Start with high-quality SFT model
Prevent model degradation
- Keep learning rate low (5e-6 or lower)
- Use lower beta (0.05-0.1) for gentler updates
- Monitor accuracy - stop if > 75%
- Compare generations to SFT baseline
Evaluation
After DPO, evaluate alignment quality in three ways.
Preference accuracy
Measure how often the model prefers chosen over rejected responses.
Qualitative comparison
Compare DPO vs. SFT generations on sample prompts.
A/B testing
Present both models' outputs to humans and measure preference:
- Which response is more helpful?
- Which is more harmless?
- Which follows instructions better?
Next steps
After DPO completes, you have a fully aligned language model! Optional next steps:
- Train a verifier to score response quality
- Run evaluations on standard benchmarks
- Deploy the model for inference
- Iterate: collect more preference data and run additional DPO
Verifier training
Train a model to score answer correctness for math and QA tasks