TRL provides a full stack of trainers for post-training language models. Methods are organized into four categories:

Online methods

Generate completions during training and update the policy based on rewards.

Offline methods

Train on a fixed dataset of pre-collected completions or preferences.

Reward modeling

Train reward or process reward models to score model outputs.

Knowledge distillation

Transfer knowledge from a teacher model to a smaller student model.

Online methods

Online methods generate completions at training time and use those completions — along with a reward signal — to update the policy. They generally require more compute per step than offline methods but can achieve better alignment by training on the model’s own distribution.

GRPO

Status: Stable · vLLM supported
GRPO estimates advantages by comparing a group of completions sampled for the same prompt against each other, eliminating the need for a separate critic/value network. It was introduced in the DeepSeekMath paper and is the foundation of the DeepSeek-R1 training recipe.
When to use: When you want online RL with rule-based or model-based rewards and do not want to maintain a critic network. A strong choice for reasoning tasks (math, coding).
Expected dataset type: Prompt-only
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="grpo",
    num_generations=8,
    beta=0.001,
)
trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[reward_fn],
)
trainer.train()

RLOO

Status: Stable · vLLM supported
RLOO is a variance-reduced REINFORCE variant that uses leave-one-out baselines: each sample's advantage is estimated by comparing its reward to the average reward of all other samples drawn for the same prompt.
When to use: When you want a simple, critic-free online RL baseline with lower variance than standard REINFORCE.
Expected dataset type: Prompt-only
from trl import RLOOConfig, RLOOTrainer

training_args = RLOOConfig(
    num_generations=4,
    beta=0.03,
)
trainer = RLOOTrainer(
    model=model,
    reward_funcs=[reward_model],  # reward model or callable reward function(s)
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Online DPO

Status: Experimental · vLLM supported
Online DPO generates candidate completions online and uses a judge or reward model to create preference pairs on the fly, then applies the DPO objective.
When to use: When you want the simplicity of DPO but prefer online data collection to avoid off-policy issues in static datasets.
Expected dataset type: Prompt-only
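
A minimal usage sketch, assuming Online DPO follows the same experimental layout and judge-based calling convention as Nash-MD below; the `trl.experimental.online_dpo` import path is an assumption:

```python
from trl.experimental.judges import PairRMJudge
from trl.experimental.online_dpo import OnlineDPOConfig, OnlineDPOTrainer

trainer = OnlineDPOTrainer(
    model=model,
    judge=PairRMJudge(),      # a reward model can be passed instead of a judge
    args=OnlineDPOConfig(),
    processing_class=tokenizer,
    train_dataset=dataset,    # prompt-only dataset; pairs are built on the fly
)
trainer.train()
```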

Nash-MD

Status: Experimental · vLLM supported
Nash-MD frames preference learning as a two-player game and finds the Nash equilibrium policy using mirror descent. Instead of optimizing against a fixed reward model, it produces policies whose responses are consistently preferred over those of any competing policy.
When to use: When you want a game-theoretic alternative to RLHF that avoids reward over-optimization.
Expected dataset type: Prompt-only
from trl.experimental.judges import PairRMJudge
from trl.experimental.nash_md import NashMDConfig, NashMDTrainer

trainer = NashMDTrainer(
    model=model,
    judge=PairRMJudge(),
    args=NashMDConfig(),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()

PPO

Status: Experimental
PPO is a classic policy gradient algorithm that alternates between collecting rollouts and optimizing a clipped surrogate objective over multiple minibatch epochs. It requires a separate critic (value) network.
When to use: When you need a full actor-critic online RL setup and want a well-studied baseline.
Expected dataset type: Tokenized language modeling
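
A minimal sketch of the PPO setup, reflecting that PPO needs explicit reference, reward, and value models in addition to the policy; the `trl.experimental.ppo` import path is an assumption:

```python
from trl.experimental.ppo import PPOConfig, PPOTrainer  # import path is an assumption

trainer = PPOTrainer(
    args=PPOConfig(),
    processing_class=tokenizer,
    model=model,                # policy to optimize
    ref_model=ref_model,        # frozen reference for the KL penalty
    reward_model=reward_model,  # scores full completions
    value_model=value_model,    # critic required by PPO
    train_dataset=dataset,      # tokenized prompts
)
trainer.train()
```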

XPO

Status: Experimental · vLLM supported
XPO augments the online DPO objective with an exploration bonus that encourages the model to explore outside the support of the initial model and human feedback data.
When to use: When online DPO converges too quickly or fails to explore sufficiently.
Expected dataset type: Prompt-only
from trl.experimental.xpo import XPOConfig

training_args = XPOConfig(
    alpha=1e-5,  # exploration bonus weight
    beta=0.1,    # KL regularization coefficient
)
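
The config above can then be handed to the trainer; a sketch assuming `XPOTrainer` lives alongside `XPOConfig` in `trl.experimental.xpo` and accepts a judge the same way Nash-MD does:

```python
from trl.experimental.judges import PairRMJudge
from trl.experimental.xpo import XPOTrainer  # assumed to sit next to XPOConfig

trainer = XPOTrainer(
    model=model,
    judge=PairRMJudge(),
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```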

Offline methods

Offline methods train on a fixed, pre-collected dataset. They are computationally lighter than online methods and simpler to set up, but may suffer from distribution shift between the training data and the model’s own generation distribution.

SFT

Status: Stable
SFT is the standard recipe for teaching a base or pre-trained model to follow instructions or adopt a conversational style. The model is trained with a cross-entropy loss on the target completions.
When to use: As a first training step for base models, or whenever you have high-quality demonstration data.
Expected dataset type: Language modeling or prompt-completion
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(),
    train_dataset=dataset,
)
trainer.train()

DPO

Status: Stable
DPO directly optimizes a language model to align with human preferences using a binary cross-entropy loss over chosen/rejected pairs, without needing an explicit reward model or RL loop.
When to use: After SFT, when you have paired preference data (chosen/rejected). One of the most widely used alignment methods.
Expected dataset type: Preference (explicit prompt recommended)
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    loss_type="sigmoid",
    beta=0.1,
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

BCO

Status: Experimental
BCO reframes alignment as behavioral cloning from a reward-weighted distribution, yielding simple supervised objectives that avoid RL while remaining theoretically grounded. It works with both unpaired binary feedback and pairwise preference data.
When to use: When you have binary (liked/disliked) labels rather than ranked preference pairs, or when you want a simpler alternative to KTO.
Expected dataset type: Unpaired preference or preference (explicit prompt recommended)
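
A minimal sketch, assuming a `trl.experimental.bco` module that exposes `BCOConfig`/`BCOTrainer` with the same calling convention as KTO:

```python
from trl.experimental.bco import BCOConfig, BCOTrainer  # import path is an assumption

trainer = BCOTrainer(
    model=model,
    ref_model=ref_model,
    args=BCOConfig(),
    processing_class=tokenizer,
    train_dataset=dataset,  # unpaired examples with a boolean preference label
)
trainer.train()
```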

CPO

Status: Experimental
CPO trains the model to avoid adequate-but-imperfect completions, rather than just mimicking reference completions as in SFT. It also supports SimPO — a reference-free variant that uses average log probability as the implicit reward.
When to use: When you want preference optimization without a reference model, or specifically for translation tasks.
Expected dataset type: Preference (explicit prompt recommended)
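
A minimal sketch, assuming a `trl.experimental.cpo` module and that `loss_type="simpo"` selects the reference-free SimPO variant:

```python
from trl.experimental.cpo import CPOConfig, CPOTrainer  # import path is an assumption

training_args = CPOConfig(
    loss_type="simpo",  # "sigmoid" for standard CPO; "simpo" for the reference-free variant
    beta=0.1,
)
trainer = CPOTrainer(
    model=model,  # no ref_model: CPO is reference-free
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```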

KTO

Status: Experimental
KTO derives an alignment objective from prospect theory, learning directly from binary (liked/disliked) human feedback. It matches or surpasses DPO-style methods while handling imbalanced or noisy signals more gracefully.
When to use: When you have unpaired binary feedback rather than paired preference data, or when your preference data is imbalanced.
Expected dataset type: Unpaired preference or preference (explicit prompt recommended)
from trl.experimental.kto import KTOConfig, KTOTrainer

trainer = KTOTrainer(
    model=model,
    processing_class=tokenizer,
    args=KTOConfig(),
    train_dataset=dataset,
)
trainer.train()

ORPO

Status: Experimental
ORPO is a monolithic method that fuses SFT and preference optimization into a single training objective using an odds ratio penalty, without requiring a separate reference model.
When to use: When you want to do SFT and preference alignment in a single pass, with no reference model.
Expected dataset type: Preference (explicit prompt recommended)
from trl.experimental.orpo import ORPOConfig, ORPOTrainer

training_args = ORPOConfig(
    beta=0.1,
    learning_rate=5e-6,
)
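
The config above can then be passed to the trainer; a sketch assuming `ORPOTrainer` mirrors the calling convention of the other experimental trainers on this page:

```python
from trl.experimental.orpo import ORPOTrainer  # assumed to sit next to ORPOConfig

trainer = ORPOTrainer(
    model=model,  # no ref_model: ORPO is reference-free
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```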

Reward modeling

Reward models score model outputs and provide the feedback signal used by online training methods.

Reward

Status: Stable
Trains a reward model on paired preference data (chosen/rejected) using a Bradley-Terry cross-entropy loss. The trained reward model scores full completions.
When to use: When you need a reward model to score outputs for an online RL method (e.g., RLOO or PPO).
Expected dataset type: Preference (implicit prompt recommended)
from trl import RewardConfig, RewardTrainer

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(),
    train_dataset=dataset,
)
trainer.train()

PRM

Status: Experimental
Trains a process reward model (PRM) that provides per-step supervision, scoring each reasoning step rather than only the final answer. This is especially valuable for mathematical reasoning tasks.
When to use: When you need fine-grained step-level feedback rather than outcome-only feedback, particularly for chain-of-thought reasoning.
Expected dataset type: Stepwise supervision
from trl.experimental.prm import PRMConfig, PRMTrainer

training_args = PRMConfig(
    step_separator="\n",
    train_on_last_step_only=False,
)
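
The config above plugs into the trainer; a sketch assuming `PRMTrainer` is exported from the same module and that `model` is a token-classification model whose head scores each step:

```python
from trl.experimental.prm import PRMTrainer  # assumed to sit next to PRMConfig

trainer = PRMTrainer(
    model=model,            # token-classification model; one score per step
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,  # stepwise supervision: prompt, steps, per-step labels
)
trainer.train()
```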

Knowledge distillation

Knowledge distillation methods train a smaller student model to mimic the output distribution of a larger teacher model, rather than training on hard labels.

GKD

Status: Experimental
GKD addresses distribution mismatch in sequence-level knowledge distillation. Instead of training on teacher-generated sequences from a fixed dataset, the student generates its own completions and receives soft supervision from the teacher on those on-policy samples.
When to use: When you want to distill a large model into a smaller one and standard SFT on teacher-generated data leads to distribution shift.
Expected dataset type: Prompt-completion
from trl.experimental.gkd import GKDConfig, GKDTrainer

training_args = GKDConfig(
    lmbda=0.5,   # fraction of batches where student generates completions
    beta=0.5,    # interpolation between forward-KL (0) and reverse-KL (1)
    temperature=1.0,
)
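
The config above can then be used as follows; a sketch assuming `GKDTrainer` takes the student as `model` and the teacher via `teacher_model`:

```python
from trl.experimental.gkd import GKDTrainer  # assumed to sit next to GKDConfig

trainer = GKDTrainer(
    model=model,                  # student being trained
    teacher_model=teacher_model,  # larger teacher providing soft targets
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```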

MiniLLM

Status: Experimental
MiniLLM is an on-policy knowledge distillation method that minimizes the sequence-level reverse KL divergence between the teacher and student, optimized with reinforcement learning. It generalizes on-policy distillation and can optionally incorporate single-step distribution-level distillation signals.
When to use: When you want principled on-policy distillation with sequence-level reverse KL, especially when the student cannot fully match the teacher's distribution.
from trl.experimental.minillm import MiniLLMTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = MiniLLMTrainer(
    model="Qwen/Qwen3-0.6B",
    teacher_model="Qwen/Qwen3-1.7B",
    train_dataset=dataset,
)
trainer.train()

Choosing a method

Use the table below as a starting point. The right choice depends on your data, compute budget, and alignment goals.
Scenario → Recommended method
Instruction-following from demonstrations → SFTTrainer
Preference alignment with paired data (offline) → DPOTrainer
Preference alignment without a reference model → ORPOTrainer or CPOTrainer
Binary feedback (liked/disliked), no pairs → KTOTrainer or BCOTrainer
Online RL with rule-based rewards (e.g., math) → GRPOTrainer
Online RL with a reward model, critic-free → RLOOTrainer
Online RL with a full actor-critic setup → PPOTrainer
Scoring full completions → RewardTrainer
Scoring reasoning steps → PRMTrainer
Compressing a large model into a smaller one → GKDTrainer or MiniLLMTrainer
Trainers marked Experimental may have a less stable API and fewer guarantees around testing and consistency. They are fully functional but may change in future releases.
Several trainers support vLLM for accelerated rollout generation during online training. Trainers with vLLM support are noted in their sections above. To enable it, set use_vllm=True in the corresponding config.
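
For example, with GRPO (the same flag applies to the other vLLM-capable configs):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,  # generate rollouts with vLLM instead of the default generation path
)
```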
