
Overview

Direct Preference Optimization (DPO) is described in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model. It aligns a language model to human preferences using pairs of preferred and rejected completions, without requiring a separate reward model or an RL training loop. Instead, DPO directly optimizes the model with a simple classification loss that widens the log-likelihood margin between preferred and rejected completions relative to a reference model.

Quick start

from trl import DPOTrainer
from datasets import load_dataset

trainer = DPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()

Dataset format

DPO requires a preference dataset with chosen and rejected fields. An explicit prompt field is recommended. Both standard and conversational formats are supported.
# Standard format — explicit prompt (recommended)
{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}

# Standard format — implicit prompt
{"chosen": "The sky is blue.", "rejected": "The sky is green."}

# Conversational format — explicit prompt (recommended)
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "chosen": [{"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "assistant", "content": "It is green."}]}

# Conversational format — implicit prompt
{"chosen": [{"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is green."}]}
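With the implicit-prompt format, the trainer splits the shared prefix of chosen and rejected off as the prompt. A minimal sketch of that idea in plain Python (the helper name extract_shared_prompt is illustrative, not a TRL API):

```python
def extract_shared_prompt(example):
    # Illustrative helper, not a TRL API: move the longest shared message
    # prefix of "chosen" and "rejected" into an explicit "prompt" field.
    chosen, rejected = example["chosen"], example["rejected"]
    i = 0
    while i < min(len(chosen), len(rejected)) and chosen[i] == rejected[i]:
        i += 1
    return {"prompt": chosen[:i], "chosen": chosen[i:], "rejected": rejected[i:]}

example = {
    "chosen": [{"role": "user", "content": "What color is the sky?"},
               {"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "user", "content": "What color is the sky?"},
                 {"role": "assistant", "content": "It is green."}],
}
print(extract_shared_prompt(example))
```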
To convert a dataset with different column names:
from datasets import load_dataset

dataset = load_dataset("Vezora/Code-Preference-Pairs")

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["input"]}],
        "chosen": [{"role": "assistant", "content": example["accepted"]}],
        "rejected": [{"role": "assistant", "content": example["rejected"]}],
    }

dataset = dataset.map(preprocess_function, remove_columns=["instruction", "input", "accepted", "ID"])

The beta parameter and reference model

The DPO loss is:
L_DPO = -E[ log σ( β · ( log(π_θ(y⁺|x) / π_ref(y⁺|x)) - log(π_θ(y⁻|x) / π_ref(y⁻|x)) ) ) ]
The beta parameter (default 0.1) controls how much the trained model is allowed to deviate from the reference model:
  • Higher beta: the model stays closer to the reference — less aggressive alignment.
  • Lower beta: the model can deviate more — stronger preference signal but risk of over-optimization.
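The effect of beta can be seen directly in the per-example loss. A minimal numeric sketch in plain Python (not the TRL implementation; the log-probabilities are made-up values):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Per-example DPO loss: -log sigmoid(beta * margin), where the margin is the
    # difference of the policy-vs-reference log-ratios for chosen and rejected.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Same log-ratio margin of 2.0 nats under two beta values: the larger beta yields
# a lower loss, so a smaller deviation from the reference already satisfies the
# objective, keeping the trained model closer to the reference.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1), 3))  # 0.598
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.5), 3))  # 0.313
```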
By default, the reference model is a frozen copy of the initial model. You can also precompute its log probabilities to save memory during training:
from trl import DPOConfig

training_args = DPOConfig(
    beta=0.1,
    precompute_ref_log_probs=True,  # compute reference logprobs once, then discard the ref model
)

Loss types

DPOTrainer supports multiple loss formulations via the loss_type parameter. You can also combine multiple losses:
from trl import DPOConfig

# Single loss type (default: sigmoid / Bradley-Terry model)
training_args = DPOConfig(loss_type="sigmoid")

# Multi-loss combination (MPO recipe)
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)
| loss_type               | Description                                                                      |
|-------------------------|----------------------------------------------------------------------------------|
| "sigmoid" (default)     | Standard DPO sigmoid loss (Bradley-Terry model)                                  |
| "hinge"                 | Hinge loss from RSO/SLiC; beta acts as the reciprocal of the margin              |
| "ipo"                   | Identity Preference Optimization; avoids logit overfitting                       |
| "exo_pair"              | Reverse-KL preference optimization (requires label_smoothing > 0)                |
| "nca_pair"              | Optimizes absolute rather than relative likelihood                               |
| "robust"                | Unbiased DPO loss under noisy preference labels; use label_smoothing to model the flip probability |
| "bco_pair"              | Binary classifier on (prompt, chosen) vs (prompt, rejected) pairs                |
| "sppo_hard"             | Nash equilibrium approximation with hard label probabilities                     |
| "aot" / "aot_unpaired"  | Distributional alignment via optimal transport                                   |
| "apo_zero" / "apo_down" | Anchored preference objective variants                                           |
| "discopop"              | Log-ratio modulated loss discovered by LLMs                                      |
| "sft"                   | Standard negative log-likelihood on preferred responses only                     |
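For example, with noisy preference labels you might pick the robust loss and set label_smoothing to your estimated label-flip rate (a sketch; 0.1 here is an assumed noise rate, not a recommendation):

```python
from trl import DPOConfig

# Robust DPO: model an assumed 10% chance that each preference label is flipped.
training_args = DPOConfig(
    loss_type="robust",
    label_smoothing=0.1,
)
```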

Key configuration parameters

beta (float, default 0.1)
    Controls deviation from the reference model. Higher values mean less deviation. For IPO (loss_type="ipo"), this is the regularization parameter τ.
loss_type (str | list[str], default "sigmoid")
    Loss type(s) to use. Pass a list to combine multiple losses weighted by loss_weights.
loss_weights (list[float] | None, default None)
    Weights for each loss type when using multiple losses. Defaults to equal weights if not specified.
label_smoothing (float, default 0.0)
    Label smoothing used in Robust DPO (probability of a preference label flip, range [0.0, 0.5)) and EXO (the ε parameter).
precompute_ref_log_probs (bool, default False)
    Precompute reference model log probabilities over the entire dataset before training starts, then discard the reference model. Saves memory during training.
precompute_ref_batch_size (int | None, default None)
    Batch size for precomputing reference log probabilities. Can be set higher than the training batch size to speed up preprocessing.
sync_ref_model (bool, default False)
    Periodically synchronize the reference model with the active model via weighted mixing (TR-DPO). Not compatible with PEFT or precompute_ref_log_probs=True.
max_length (int | None, default 1024)
    Maximum total sequence length (prompt + chosen/rejected). Longer sequences are truncated.
truncation_mode (str, default "keep_start")
    Which end to keep when a sequence exceeds max_length: "keep_start" or "keep_end".
padding_free (bool, default False)
    Perform forward passes without padding. Requires FlashAttention 2 or 3.
DPOConfig overrides some TrainingArguments defaults: logging_steps=10, gradient_checkpointing=True, bf16=True, and learning_rate=1e-6.

Training with PEFT/LoRA

from datasets import load_dataset
from trl import DPOTrainer
from peft import LoraConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    "Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    peft_config=LoraConfig(),
)
trainer.train()
To continue training an existing PEFT model:
from datasets import load_dataset
from trl import DPOTrainer
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("trl-lib/Qwen3-4B-LoRA", is_trainable=True)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    train_dataset=dataset,
)
trainer.train()
When training adapters with DPO, use a learning rate around 1e-5 — slightly lower than for SFT adapters.

Training Vision-Language Models

from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=DPOConfig(max_length=None),
    train_dataset=load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train"),
)
trainer.train()
Set max_length=None for VLMs to prevent truncation from removing image tokens.

SFT before DPO

DPO works best when the model is already capable of generating reasonable responses. A common pipeline is:
1. SFT on preferred responses: Fine-tune the model on the chosen completions from your preference dataset using SFTTrainer. This ensures the model can generate outputs in the expected format before DPO training.
2. DPO alignment: Train the SFT-initialized model with DPOTrainer on the full preference dataset containing prompt, chosen, and rejected columns.
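Step 1 only needs the prompt and chosen columns. A small sketch of turning a conversational preference example into an SFT-style example (assuming the conversational "messages" column format that SFTTrainer accepts):

```python
def preference_to_sft(example):
    # Keep the prompt turns plus the chosen completion; drop the rejected one.
    return {"messages": example["prompt"] + example["chosen"]}

example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}
print(preference_to_sft(example))
```

Applied with dataset.map(preference_to_sft, remove_columns=dataset.column_names), this yields a dataset ready for the SFT stage.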

Logged metrics

| Metric             | Description                                                               |
|--------------------|---------------------------------------------------------------------------|
| rewards/chosen     | Average implicit reward for chosen completions: β·log(π_θ(y⁺)/π_ref(y⁺)) |
| rewards/rejected   | Average implicit reward for rejected completions                          |
| rewards/margins    | Average reward margin (chosen minus rejected)                             |
| rewards/accuracies | Fraction of examples where chosen reward > rejected reward                |
| logps/chosen       | Average log-probability on chosen completion tokens                       |
| logps/rejected     | Average log-probability on rejected completion tokens                     |
| loss               | Average DPO loss                                                          |
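The reward metrics are simple functions of the logged log-probabilities. A quick sketch with made-up numbers:

```python
beta = 0.1
logps = {"chosen": -10.0, "rejected": -12.0}      # policy log-probs (made up)
ref_logps = {"chosen": -11.0, "rejected": -11.0}  # reference log-probs (made up)

# rewards/chosen and rewards/rejected: beta-scaled policy-vs-reference log-ratios
rewards_chosen = beta * (logps["chosen"] - ref_logps["chosen"])
rewards_rejected = beta * (logps["rejected"] - ref_logps["rejected"])
# rewards/margins and rewards/accuracies follow directly
margin = rewards_chosen - rewards_rejected
accuracy = float(rewards_chosen > rewards_rejected)

print(rewards_chosen, rewards_rejected, margin, accuracy)
```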
