
Overview

Direct Preference Optimization (DPO) is described in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model. It aligns a language model to human preferences using pairs of preferred and rejected completions, without requiring a separate reward model or an RL training loop. Instead, DPO directly optimizes the model with a simple classification loss that widens the log-likelihood margin between preferred and rejected completions relative to a reference model.

Quick start

from trl import DPOTrainer
from datasets import load_dataset

trainer = DPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()

Dataset format

DPO requires a preference dataset with chosen and rejected fields. An explicit prompt field is recommended. Both standard and conversational formats are supported.
# Standard format — explicit prompt (recommended)
{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}

# Standard format — implicit prompt
{"chosen": "The sky is blue.", "rejected": "The sky is green."}

# Conversational format — explicit prompt (recommended)
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "chosen": [{"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "assistant", "content": "It is green."}]}

# Conversational format — implicit prompt
{"chosen": [{"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is green."}]}
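With the implicit-prompt format, the trainer splits the shared prefix of chosen and rejected off as the prompt. A minimal sketch of that idea in plain Python (the helper name extract_shared_prompt is illustrative, not a TRL API):

```python
def extract_shared_prompt(example):
    # Illustrative helper, not a TRL API: move the longest shared message
    # prefix of "chosen" and "rejected" into an explicit "prompt" field.
    chosen, rejected = example["chosen"], example["rejected"]
    i = 0
    while i < min(len(chosen), len(rejected)) and chosen[i] == rejected[i]:
        i += 1
    return {"prompt": chosen[:i], "chosen": chosen[i:], "rejected": rejected[i:]}

example = {
    "chosen": [{"role": "user", "content": "What color is the sky?"},
               {"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "user", "content": "What color is the sky?"},
                 {"role": "assistant", "content": "It is green."}],
}
print(extract_shared_prompt(example))
```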
To convert a dataset with different column names:
from datasets import load_dataset

dataset = load_dataset("Vezora/Code-Preference-Pairs")

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["input"]}],
        "chosen": [{"role": "assistant", "content": example["accepted"]}],
        "rejected": [{"role": "assistant", "content": example["rejected"]}],
    }

dataset = dataset.map(preprocess_function, remove_columns=["instruction", "input", "accepted", "ID"])

The beta parameter and reference model

The DPO loss is:
L_DPO = -E[ log σ( β · ( log(π_θ(y⁺|x) / π_ref(y⁺|x)) - log(π_θ(y⁻|x) / π_ref(y⁻|x)) ) ) ]
The beta parameter (default 0.1) controls how much the trained model is allowed to deviate from the reference model:
  • Higher beta: the model stays closer to the reference — less aggressive alignment.
  • Lower beta: the model can deviate more — stronger preference signal but risk of over-optimization.
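The effect of beta can be seen directly in the per-example loss. A minimal numeric sketch in plain Python (not the TRL implementation; the log-probabilities are made-up values):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Per-example DPO loss: -log sigmoid(beta * margin), where the margin is the
    # difference of the policy-vs-reference log-ratios for chosen and rejected.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Same log-ratio margin of 2.0 nats under two beta values: the larger beta yields
# a lower loss, so a smaller deviation from the reference already satisfies the
# objective, keeping the trained model closer to the reference.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1), 3))  # 0.598
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.5), 3))  # 0.313
```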
By default, the reference model is a frozen copy of the initial model. You can also precompute its log probabilities to save memory during training:
from trl import DPOConfig

training_args = DPOConfig(
    beta=0.1,
    precompute_ref_log_probs=True,  # compute reference logprobs once, then discard the ref model
)

Loss types

DPOTrainer supports multiple loss formulations via the loss_type parameter. You can also combine multiple losses:
from trl import DPOConfig

# Single loss type (default: sigmoid / Bradley-Terry model)
training_args = DPOConfig(loss_type="sigmoid")

# Multi-loss combination (MPO recipe)
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)
| loss_type               | Description                                                                      |
|-------------------------|----------------------------------------------------------------------------------|
| "sigmoid" (default)     | Standard DPO sigmoid loss (Bradley-Terry model)                                  |
| "hinge"                 | Hinge loss from RSO/SLiC; beta acts as the reciprocal of the margin              |
| "ipo"                   | Identity Preference Optimization; avoids logit overfitting                       |
| "exo_pair"              | Reverse-KL preference optimization (requires label_smoothing > 0)                |
| "nca_pair"              | Optimizes absolute rather than relative likelihood                               |
| "robust"                | Unbiased DPO loss under noisy preference labels; use label_smoothing to model the flip probability |
| "bco_pair"              | Binary classifier on (prompt, chosen) vs (prompt, rejected) pairs                |
| "sppo_hard"             | Nash equilibrium approximation with hard label probabilities                     |
| "aot" / "aot_unpaired"  | Distributional alignment via optimal transport                                   |
| "apo_zero" / "apo_down" | Anchored preference objective variants                                           |
| "discopop"              | Log-ratio modulated loss discovered by LLMs                                      |
| "sft"                   | Standard negative log-likelihood on preferred responses only                     |
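For example, with noisy preference labels you might pick the robust loss and set label_smoothing to your estimated label-flip rate (a sketch; 0.1 here is an assumed noise rate, not a recommendation):

```python
from trl import DPOConfig

# Robust DPO: model an assumed 10% chance that each preference label is flipped.
training_args = DPOConfig(
    loss_type="robust",
    label_smoothing=0.1,
)
```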

Key configuration parameters

beta (float, default 0.1)
    Controls deviation from the reference model. Higher values mean less deviation. For IPO (loss_type="ipo"), this is the regularization parameter τ.
loss_type (str | list[str], default "sigmoid")
    Loss type(s) to use. Pass a list to combine multiple losses weighted by loss_weights.
loss_weights (list[float] | None, default None)
    Weights for each loss type when using multiple losses. Defaults to equal weights if not specified.
label_smoothing (float, default 0.0)
    Label smoothing used in Robust DPO (probability of a preference label flip, range [0.0, 0.5)) and EXO (the ε parameter).
precompute_ref_log_probs (bool, default False)
    Precompute reference model log probabilities over the entire dataset before training starts, then discard the reference model. Saves memory during training.
precompute_ref_batch_size (int | None, default None)
    Batch size for precomputing reference log probabilities. Can be set higher than the training batch size to speed up preprocessing.
sync_ref_model (bool, default False)
    Periodically synchronize the reference model with the active model via weighted mixing (TR-DPO). Not compatible with PEFT or precompute_ref_log_probs=True.
max_length (int | None, default 1024)
    Maximum total sequence length (prompt + chosen/rejected). Longer sequences are truncated.
truncation_mode (str, default "keep_start")
    Which end to keep when a sequence exceeds max_length: "keep_start" or "keep_end".
padding_free (bool, default False)
    Perform forward passes without padding. Requires FlashAttention 2 or 3.
DPOConfig overrides some TrainingArguments defaults: logging_steps=10, gradient_checkpointing=True, bf16=True, and learning_rate=1e-6.

Training with PEFT/LoRA

from datasets import load_dataset
from trl import DPOTrainer
from peft import LoraConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    "Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    peft_config=LoraConfig(),
)
trainer.train()
To continue training an existing PEFT model:
from datasets import load_dataset
from trl import DPOTrainer
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("trl-lib/Qwen3-4B-LoRA", is_trainable=True)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    train_dataset=dataset,
)
trainer.train()
When training adapters with DPO, use a learning rate around 1e-5 — slightly lower than for SFT adapters.

Training Vision-Language Models

from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=DPOConfig(max_length=None),
    train_dataset=load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train"),
)
trainer.train()
Set max_length=None for VLMs to prevent truncation from removing image tokens.

SFT before DPO

DPO works best when the model is already capable of generating reasonable responses. A common pipeline is:
1. SFT on preferred responses: Fine-tune the model on the chosen completions from your preference dataset using SFTTrainer. This ensures the model can generate outputs in the expected format before DPO training.
2. DPO alignment: Train the SFT-initialized model with DPOTrainer on the full preference dataset containing prompt, chosen, and rejected columns.
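Step 1 only needs the prompt and chosen columns. A small sketch of turning a conversational preference example into an SFT-style example (assuming the conversational "messages" column format that SFTTrainer accepts):

```python
def preference_to_sft(example):
    # Keep the prompt turns plus the chosen completion; drop the rejected one.
    return {"messages": example["prompt"] + example["chosen"]}

example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}
print(preference_to_sft(example))
```

Applied with dataset.map(preference_to_sft, remove_columns=dataset.column_names), this yields a dataset ready for the SFT stage.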

Logged metrics

| Metric             | Description                                                               |
|--------------------|---------------------------------------------------------------------------|
| rewards/chosen     | Average implicit reward for chosen completions: β·log(π_θ(y⁺)/π_ref(y⁺)) |
| rewards/rejected   | Average implicit reward for rejected completions                          |
| rewards/margins    | Average reward margin (chosen minus rejected)                             |
| rewards/accuracies | Fraction of examples where chosen reward > rejected reward                |
| logps/chosen       | Average log-probability on chosen completion tokens                       |
| logps/rejected     | Average log-probability on rejected completion tokens                     |
| loss               | Average DPO loss                                                          |
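The reward metrics are simple functions of the logged log-probabilities. A quick sketch with made-up numbers:

```python
beta = 0.1
logps = {"chosen": -10.0, "rejected": -12.0}      # policy log-probs (made up)
ref_logps = {"chosen": -11.0, "rejected": -11.0}  # reference log-probs (made up)

# rewards/chosen and rewards/rejected: beta-scaled policy-vs-reference log-ratios
rewards_chosen = beta * (logps["chosen"] - ref_logps["chosen"])
rewards_rejected = beta * (logps["rejected"] - ref_logps["rejected"])
# rewards/margins and rewards/accuracies follow directly
margin = rewards_chosen - rewards_rejected
accuracy = float(rewards_chosen > rewards_rejected)

print(rewards_chosen, rewards_rejected, margin, accuracy)
```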
