TRL provides a full stack of trainers for post-training language models. Methods are organized into four categories:

Online methods

Generate completions during training and update the policy based on rewards.

Offline methods

Train on a fixed dataset of pre-collected completions or preferences.

Reward modeling

Train reward or process reward models to score model outputs.

Knowledge distillation

Transfer knowledge from a teacher model to a smaller student model.

Online methods

Online methods generate completions at training time and use those completions — along with a reward signal — to update the policy. They generally require more compute per step than offline methods but can achieve better alignment by training on the model’s own distribution.

GRPO

Status: Stable · vLLM supported
GRPO estimates advantages by comparing a group of completions sampled for the same prompt against each other, eliminating the need for a separate critic/value network. It was introduced in the DeepSeekMath paper and is the foundation of the DeepSeek-R1 training recipe.
When to use: When you want online RL with rule-based or model-based rewards and do not want to maintain a critic network. A strong choice for reasoning tasks (math, coding).
Expected dataset type: Prompt-only
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="grpo",
    num_generations=8,
    beta=0.001,
)
trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[reward_fn],
)
trainer.train()

RLOO

Status: Stable · vLLM supported
RLOO is a variance-reduced REINFORCE variant that uses leave-one-out baselines: each sample's advantage is estimated by comparing its reward to the average reward of all other samples drawn for the same prompt.
When to use: When you want a simple, critic-free online RL baseline with lower variance than standard REINFORCE.
Expected dataset type: Prompt-only
from trl import RLOOConfig, RLOOTrainer

training_args = RLOOConfig(
    num_generations=4,
    beta=0.03,
)
trainer = RLOOTrainer(
    model=model,
    reward_funcs=[reward_model],  # reward model or callable reward function(s)
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Online DPO

Status: Experimental · vLLM supported
Online DPO generates candidate completions online and uses a judge or reward model to create preference pairs on the fly, then applies the DPO objective.
When to use: When you want the simplicity of DPO but prefer online data collection to avoid off-policy issues in static datasets.
Expected dataset type: Prompt-only
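
A minimal usage sketch, assuming Online DPO follows the same experimental layout and judge-based calling convention as Nash-MD below; the `trl.experimental.online_dpo` import path is an assumption:

```python
from trl.experimental.judges import PairRMJudge
from trl.experimental.online_dpo import OnlineDPOConfig, OnlineDPOTrainer

trainer = OnlineDPOTrainer(
    model=model,
    judge=PairRMJudge(),      # a reward model can be passed instead of a judge
    args=OnlineDPOConfig(),
    processing_class=tokenizer,
    train_dataset=dataset,    # prompt-only dataset; pairs are built on the fly
)
trainer.train()
```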

Nash-MD

Status: Experimental · vLLM supported
Nash-MD frames preference learning as a two-player game and finds the Nash equilibrium policy using mirror descent. Instead of optimizing against a fixed reward model, it produces policies whose responses are consistently preferred over those of any competing policy.
When to use: When you want a game-theoretic alternative to RLHF that avoids reward over-optimization.
Expected dataset type: Prompt-only
from trl.experimental.judges import PairRMJudge
from trl.experimental.nash_md import NashMDConfig, NashMDTrainer

trainer = NashMDTrainer(
    model=model,
    judge=PairRMJudge(),
    args=NashMDConfig(),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()

PPO

Status: Experimental
PPO is a classic policy gradient algorithm that alternates between collecting rollouts and optimizing a clipped surrogate objective over multiple minibatch epochs. It requires a separate critic (value) network.
When to use: When you need a full actor-critic online RL setup and want a well-studied baseline.
Expected dataset type: Tokenized language modeling
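
A minimal sketch of the PPO setup, reflecting that PPO needs explicit reference, reward, and value models in addition to the policy; the `trl.experimental.ppo` import path is an assumption:

```python
from trl.experimental.ppo import PPOConfig, PPOTrainer  # import path is an assumption

trainer = PPOTrainer(
    args=PPOConfig(),
    processing_class=tokenizer,
    model=model,                # policy to optimize
    ref_model=ref_model,        # frozen reference for the KL penalty
    reward_model=reward_model,  # scores full completions
    value_model=value_model,    # critic required by PPO
    train_dataset=dataset,      # tokenized prompts
)
trainer.train()
```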

XPO

Status: Experimental · vLLM supported
XPO augments the online DPO objective with an exploration bonus that encourages the model to explore outside the support of the initial model and human feedback data.
When to use: When online DPO converges too quickly or fails to explore sufficiently.
Expected dataset type: Prompt-only
from trl.experimental.xpo import XPOConfig

training_args = XPOConfig(
    alpha=1e-5,  # exploration bonus weight
    beta=0.1,    # KL regularization coefficient
)
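
The config above can then be handed to the trainer; a sketch assuming `XPOTrainer` lives alongside `XPOConfig` in `trl.experimental.xpo` and accepts a judge the same way Nash-MD does:

```python
from trl.experimental.judges import PairRMJudge
from trl.experimental.xpo import XPOTrainer  # assumed to sit next to XPOConfig

trainer = XPOTrainer(
    model=model,
    judge=PairRMJudge(),
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```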

Offline methods

Offline methods train on a fixed, pre-collected dataset. They are computationally lighter than online methods and simpler to set up, but may suffer from distribution shift between the training data and the model’s own generation distribution.

SFT

Status: Stable
SFT is the standard recipe for teaching a base or pre-trained model to follow instructions or adopt a conversational style. The model is trained with a cross-entropy loss on the target completions.
When to use: As a first training step for base models, or whenever you have high-quality demonstration data.
Expected dataset type: Language modeling or prompt-completion
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(),
    train_dataset=dataset,
)
trainer.train()

DPO

Status: Stable
DPO directly optimizes a language model to align with human preferences using a binary cross-entropy loss over chosen/rejected pairs, without needing an explicit reward model or RL loop.
When to use: After SFT, when you have paired preference data (chosen/rejected). One of the most widely used alignment methods.
Expected dataset type: Preference (explicit prompt recommended)
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    loss_type="sigmoid",
    beta=0.1,
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

BCO

Status: Experimental
BCO reframes alignment as behavioral cloning from a reward-weighted distribution, yielding simple supervised objectives that avoid RL while remaining theoretically grounded. It works with both unpaired binary feedback and pairwise preference data.
When to use: When you have binary (liked/disliked) labels rather than ranked preference pairs, or when you want a simpler alternative to KTO.
Expected dataset type: Unpaired preference or preference (explicit prompt recommended)
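
A minimal sketch, assuming a `trl.experimental.bco` module that exposes `BCOConfig`/`BCOTrainer` with the same calling convention as KTO:

```python
from trl.experimental.bco import BCOConfig, BCOTrainer  # import path is an assumption

trainer = BCOTrainer(
    model=model,
    ref_model=ref_model,
    args=BCOConfig(),
    processing_class=tokenizer,
    train_dataset=dataset,  # unpaired examples with a boolean preference label
)
trainer.train()
```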

CPO

Status: Experimental
CPO trains the model to avoid adequate-but-imperfect completions, rather than just mimicking reference completions as in SFT. It also supports SimPO — a reference-free variant that uses average log probability as the implicit reward.
When to use: When you want preference optimization without a reference model, or specifically for translation tasks.
Expected dataset type: Preference (explicit prompt recommended)
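
A minimal sketch, assuming a `trl.experimental.cpo` module and that `loss_type="simpo"` selects the reference-free SimPO variant:

```python
from trl.experimental.cpo import CPOConfig, CPOTrainer  # import path is an assumption

training_args = CPOConfig(
    loss_type="simpo",  # "sigmoid" for standard CPO; "simpo" for the reference-free variant
    beta=0.1,
)
trainer = CPOTrainer(
    model=model,  # no ref_model: CPO is reference-free
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```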

KTO

Status: Experimental
KTO derives an alignment objective from prospect theory, learning directly from binary (liked/disliked) human feedback. It matches or surpasses DPO-style methods while handling imbalanced or noisy signals more gracefully.
When to use: When you have unpaired binary feedback rather than paired preference data, or when your preference data is imbalanced.
Expected dataset type: Unpaired preference or preference (explicit prompt recommended)
from trl.experimental.kto import KTOConfig, KTOTrainer

trainer = KTOTrainer(
    model=model,
    processing_class=tokenizer,
    args=KTOConfig(),
    train_dataset=dataset,
)
trainer.train()

ORPO

Status: Experimental
ORPO is a monolithic method that fuses SFT and preference optimization into a single training objective using an odds ratio penalty, without requiring a separate reference model.
When to use: When you want to do SFT and preference alignment in a single pass, with no reference model.
Expected dataset type: Preference (explicit prompt recommended)
from trl.experimental.orpo import ORPOConfig, ORPOTrainer

training_args = ORPOConfig(
    beta=0.1,
    learning_rate=5e-6,
)
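
The config above can then be passed to the trainer; a sketch assuming `ORPOTrainer` mirrors the calling convention of the other experimental trainers on this page:

```python
from trl.experimental.orpo import ORPOTrainer  # assumed to sit next to ORPOConfig

trainer = ORPOTrainer(
    model=model,  # no ref_model: ORPO is reference-free
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```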

Reward modeling

Reward models score model outputs and provide the feedback signal used by online training methods.

Reward

Status: Stable
Trains a reward model on paired preference data (chosen/rejected) using a Bradley-Terry cross-entropy loss. The trained reward model scores full completions.
When to use: When you need a reward model to score outputs for an online RL method (e.g., RLOO or PPO).
Expected dataset type: Preference (implicit prompt recommended)
from trl import RewardConfig, RewardTrainer

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(),
    train_dataset=dataset,
)
trainer.train()

PRM

Status: Experimental
Trains a process reward model (PRM) that provides per-step supervision, scoring each reasoning step rather than only the final answer. This is especially valuable for mathematical reasoning tasks.
When to use: When you need fine-grained step-level feedback rather than outcome-only feedback, particularly for chain-of-thought reasoning.
Expected dataset type: Stepwise supervision
from trl.experimental.prm import PRMConfig, PRMTrainer

training_args = PRMConfig(
    step_separator="\n",
    train_on_last_step_only=False,
)
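
The config above plugs into the trainer; a sketch assuming `PRMTrainer` is exported from the same module and that `model` is a token-classification model whose head scores each step:

```python
from trl.experimental.prm import PRMTrainer  # assumed to sit next to PRMConfig

trainer = PRMTrainer(
    model=model,            # token-classification model; one score per step
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,  # stepwise supervision: prompt, steps, per-step labels
)
trainer.train()
```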

Knowledge distillation

Knowledge distillation methods train a smaller student model to mimic the output distribution of a larger teacher model, rather than training on hard labels.

GKD

Status: Experimental
GKD addresses distribution mismatch in sequence-level knowledge distillation. Instead of training on teacher-generated sequences from a fixed dataset, the student generates its own completions and receives soft supervision from the teacher on those on-policy samples.
When to use: When you want to distill a large model into a smaller one and standard SFT on teacher-generated data leads to distribution shift.
Expected dataset type: Prompt-completion
from trl.experimental.gkd import GKDConfig, GKDTrainer

training_args = GKDConfig(
    lmbda=0.5,   # fraction of batches where student generates completions
    beta=0.5,    # interpolation between forward-KL (0) and reverse-KL (1)
    temperature=1.0,
)
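
The config above can then be used as follows; a sketch assuming `GKDTrainer` takes the student as `model` and the teacher via `teacher_model`:

```python
from trl.experimental.gkd import GKDTrainer  # assumed to sit next to GKDConfig

trainer = GKDTrainer(
    model=model,                  # student being trained
    teacher_model=teacher_model,  # larger teacher providing soft targets
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```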

MiniLLM

Status: Experimental
MiniLLM is an on-policy knowledge distillation method that minimizes the sequence-level reverse KL divergence between the teacher and student, optimized with reinforcement learning. It generalizes on-policy distillation and can optionally incorporate single-step distribution-level distillation signals.
When to use: When you want principled on-policy distillation with sequence-level reverse KL, especially when the student cannot fully match the teacher's distribution.
from trl.experimental.minillm import MiniLLMTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = MiniLLMTrainer(
    model="Qwen/Qwen3-0.6B",
    teacher_model="Qwen/Qwen3-1.7B",
    train_dataset=dataset,
)
trainer.train()

Choosing a method

Use the table below as a starting point. The right choice depends on your data, compute budget, and alignment goals.
Scenario → Recommended method
Instruction-following from demonstrations → SFTTrainer
Preference alignment with paired data (offline) → DPOTrainer
Preference alignment without a reference model → ORPOTrainer or CPOTrainer
Binary feedback (liked/disliked), no pairs → KTOTrainer or BCOTrainer
Online RL with rule-based rewards (e.g., math) → GRPOTrainer
Online RL with a reward model, critic-free → RLOOTrainer
Online RL with a full actor-critic setup → PPOTrainer
Scoring full completions → RewardTrainer
Scoring reasoning steps → PRMTrainer
Compressing a large model into a smaller one → GKDTrainer or MiniLLMTrainer
Trainers marked Experimental may have a less stable API and fewer guarantees around testing and consistency. They are fully functional but may change in future releases.
Several trainers support vLLM for accelerated rollout generation during online training. Trainers with vLLM support are noted in their sections above. To enable it, set use_vllm=True in the corresponding config.
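
For example, with GRPO (the same flag applies to the other vLLM-capable configs):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,  # generate rollouts with vLLM instead of the default generation path
)
```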
