Overview
Direct Preference Optimization (DPO) is described in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model. It aligns a language model to human preferences using pairs of preferred and rejected completions, without requiring a separate reward model or an RL training loop. DPO directly optimizes the model to widen the log-likelihood margin between preferred and rejected completions relative to a reference model. In practice, this is achieved by suppressing the likelihood of rejected completions rather than increasing the likelihood of preferred ones.

Quick start
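A minimal training run can be sketched as follows, assuming a recent TRL version; the model and dataset names are illustrative examples, not requirements:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Example model and preference dataset (any causal LM and DPO-format dataset work)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO")
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

If no reference model is passed, the trainer creates one internally by copying the initial policy.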
Dataset format
DPO requires a preference dataset with `chosen` and `rejected` fields. An explicit `prompt` field is recommended. Both standard and conversational formats are supported.
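For concreteness, one example in each format might look like this (the texts are made up for illustration):

```python
# Standard format: plain strings, with an explicit prompt field
example = {
    "prompt": "What color is the sky?",
    "chosen": "The sky appears blue because air scatters short wavelengths of sunlight.",
    "rejected": "The sky is green.",
}

# Conversational format: lists of chat messages instead of plain strings
conversational_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It appears blue due to Rayleigh scattering."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}
```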
The beta parameter and reference model
The DPO loss is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

The `beta` parameter (default `0.1`) controls how much the trained model is allowed to deviate from the reference model:

- Higher `beta`: the model stays closer to the reference — less aggressive alignment.
- Lower `beta`: the model can deviate more — stronger preference signal but risk of over-optimization.
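Per example, the loss reduces to a negative log-sigmoid of a `beta`-scaled log-ratio margin. A minimal pure-Python sketch with toy summed log-probabilities (no TRL dependency; the numbers are made up):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no deviation from the reference, the margin is 0 and the loss is log 2.
baseline = dpo_sigmoid_loss(0.0, 0.0, 0.0, 0.0)

# beta scales the same log-ratio gap (here 2 nats) into different margins:
low_beta = dpo_sigmoid_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)   # margin = 0.2
high_beta = dpo_sigmoid_loss(-10.0, -12.0, -11.0, -11.0, beta=0.5)  # margin = 1.0
```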
Loss types
DPOTrainer supports multiple loss formulations via the loss_type parameter. You can also combine multiple losses:
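For example, a weighted mixture of losses can be configured like this (the specific mixture and weights are illustrative):

```python
from trl import DPOConfig

# Combine sigmoid DPO, BCO, and SFT losses; weights align positionally with loss_type
training_args = DPOConfig(
    output_dir="dpo-mixed-loss",
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)
```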
Available loss types
| `loss_type` | Description |
|---|---|
| `"sigmoid"` (default) | Standard DPO sigmoid loss (Bradley–Terry model) |
| `"hinge"` | Hinge loss from RSO/SLiC; `beta` acts as the reciprocal of the margin |
| `"ipo"` | Identity Preference Optimization; regularizes to avoid overfitting the preference margin |
| `"exo_pair"` | Reverse-KL preference optimization (requires `label_smoothing > 0`) |
| `"nca_pair"` | Optimizes absolute rather than relative likelihood |
| `"robust"` | Unbiased DPO loss under noisy preference labels; use `label_smoothing` to model the flip probability |
| `"bco_pair"` | Binary classifier on (prompt, chosen) vs. (prompt, rejected) pairs |
| `"sppo_hard"` | Nash equilibrium approximation with hard label probabilities |
| `"aot"` / `"aot_unpaired"` | Distributional alignment via optimal transport |
| `"apo_zero"` / `"apo_down"` | Anchored preference objective variants |
| `"discopop"` | Log-ratio modulated loss discovered by LLMs |
| `"sft"` | Standard negative log-likelihood on preferred responses only |
Key configuration parameters
Core DPO parameters
- `beta`: Controls deviation from the reference model. Higher values mean less deviation. For IPO (`loss_type="ipo"`), this is the regularization parameter τ.
- `loss_type`: Loss type(s) to use. Pass a list to combine multiple losses, weighted by `loss_weights`.
- `loss_weights`: Weights for each loss type when using multiple losses. Defaults to equal weights if not specified.
- `label_smoothing`: Label smoothing used in Robust DPO (probability of a preference label flip, range `[0.0, 0.5)`) and EXO (the ε parameter).

Reference model
- `precompute_ref_log_probs`: Precompute reference model log probabilities over the entire dataset before training starts, then discard the reference model. Saves memory during training.
- `precompute_ref_batch_size`: Batch size to use when precomputing reference log probabilities. Can be set higher than the training batch size to speed up preprocessing.
- `sync_ref_model`: Periodically synchronize the reference model with the active model using a mixup (TR-DPO). Not compatible with PEFT or `precompute_ref_log_probs=True`.

Data preprocessing
- `max_length`: Maximum total sequence length (prompt + chosen/rejected). Sequences exceeding this are truncated.
- `truncation_mode`: Which part of the sequence to keep when it exceeds `max_length`. Options: `"keep_start"` or `"keep_end"`.
- `padding_free`: Perform forward passes without padding. Requires FlashAttention 2 or 3.
`DPOConfig` overrides some `TrainingArguments` defaults: `logging_steps=10`, `gradient_checkpointing=True`, `bf16=True`, and `learning_rate=1e-6`.

Training with PEFT/LoRA
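Assuming the `peft` library is installed, a LoRA run can be sketched as follows (model, dataset, and LoRA hyperparameters are illustrative). When `peft_config` is passed, no separate reference model is needed: the trainer computes reference log probabilities by disabling the adapter:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Illustrative LoRA hyperparameters
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-lora"),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # adapters are wrapped around the base model in place
)
trainer.train()
```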
Training Vision-Language Models
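Vision-language preference tuning follows the same pattern, with a processor in place of the tokenizer. A hedged sketch (the model and dataset names are illustrative, and the dataset must carry an image column alongside the preference fields):

```python
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import DPOConfig, DPOTrainer

# Illustrative VLM checkpoint and vision preference dataset
model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
dataset = load_dataset("trl-lib/rlaif-v", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-vlm"),
    train_dataset=dataset,
    processing_class=processor,  # pass the full processor, not just a tokenizer
)
trainer.train()
```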
SFT before DPO
DPO works best when the model is already capable of generating reasonable responses. A common pipeline is:

1. SFT on preferred responses: fine-tune the model on the `chosen` completions from your preference dataset using `SFTTrainer`. This ensures the model can generate outputs in the expected format before DPO training.
2. DPO on preference pairs: continue training from the SFT checkpoint with `DPOTrainer`.

Logged metrics
| Metric | Description |
|---|---|
| `rewards/chosen` | Average implicit reward for chosen completions: β·log(π_θ(y⁺)/π_ref(y⁺)) |
| `rewards/rejected` | Average implicit reward for rejected completions |
| `rewards/margins` | Average reward margin (chosen minus rejected) |
| `rewards/accuracies` | Fraction of examples where the chosen reward exceeds the rejected reward |
| `logps/chosen` | Average log-probability on chosen completion tokens |
| `logps/rejected` | Average log-probability on rejected completion tokens |
| `loss` | Average DPO loss |
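The reward metrics all derive from the implicit reward β·log(π_θ/π_ref). A toy computation with assumed summed completion log-probabilities shows how they relate:

```python
# Toy batch of two examples; all log-probabilities are made up for illustration
policy_chosen   = [-12.0, -15.0]
policy_rejected = [-20.0, -14.0]
ref_chosen      = [-13.0, -15.5]
ref_rejected    = [-18.0, -15.0]
beta = 0.1

# Implicit rewards: beta * (policy logp - reference logp)
rewards_chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
rewards_rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]

# rewards/margins and rewards/accuracies follow directly
margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]
accuracy = sum(m > 0 for m in margins) / len(margins)
```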