- Online methods
- Offline methods
- Reward modeling
- Knowledge distillation
Online methods
Online methods generate completions at training time and use those completions — along with a reward signal — to update the policy. They generally require more compute per step than offline methods but can achieve better alignment by training on the model’s own distribution.
GRPOTrainer — Group Relative Policy Optimization
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — introduces GRPO.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — multi-stage pipeline using GRPO for reasoning.
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale — overlong filtering, clip-higher, soft overlong punishment, and token-level loss.
- Dr. GRPO: Understanding R1-Zero-Like Training: A Critical Perspective — length-debiased GRPO variant.
- It Takes Two: Your GRPO Is Secretly DPO — formal connection between GRPO and DPO; 2-GRPO with num_generations=2.
- Group Sequence Policy Optimization — sequence-level importance sampling.
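To make the group-relative idea concrete, here is a minimal sketch (an illustration of the formula, not GRPOTrainer's internals): the advantage of each completion is its reward normalized against the statistics of the group of completions sampled for the same prompt.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group: subtract the group
    mean and divide by the group standard deviation. This replaces the
    learned value baseline used in PPO-style methods."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one prompt, scored by a rule-based reward:
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that Dr. GRPO argues for dropping the standard-deviation division to remove bias; the sketch keeps the original formulation.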
RLOOTrainer — REINFORCE Leave-One-Out
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs — introduces RLOO.
- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models — global advantage normalization for training stability.
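The leave-one-out baseline can be sketched in a few lines (an illustration of the estimator, not the trainer's implementation): each sample's baseline is the mean reward of the other samples drawn for the same prompt.

```python
def rloo_advantages(rewards):
    """REINFORCE Leave-One-Out: each completion's baseline is the mean
    reward of the *other* completions sampled for the same prompt,
    giving an unbiased, critic-free advantage estimate."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because each baseline excludes the sample it corrects, the advantages always sum to zero across the group.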
OnlineDPOTrainer — Online Direct Preference Optimization
- Direct Language Model Alignment from Online AI Feedback — introduces online DPO with real-time AI feedback.
NashMDTrainer — Nash Mirror Descent
- Nash Learning from Human Feedback — introduces Nash-MD.
PPOTrainer — Proximal Policy Optimization
- Proximal Policy Optimization Algorithms — introduces PPO.
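As a rough sketch of the paper's clipped surrogate objective for a single token (not PPOTrainer's vectorized implementation):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate loss for one token. `ratio` is
    pi_new(a|s) / pi_old(a|s); clipping keeps the update close to the
    sampling policy. Returned negated, for gradient descent."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped * advantage)
```

DAPO's "clip-higher" trick, listed under GRPO above, modifies this objective by using a larger upper clip bound than lower bound to encourage exploration.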
XPOTrainer — Exploratory Preference Optimization
- Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF — introduces XPO.
Offline methods
Offline methods train on a fixed, pre-collected dataset. They are computationally lighter than online methods and simpler to set up, but may suffer from distribution shift between the training data and the model’s own generation distribution.
SFTTrainer — Supervised Fine-Tuning
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — introduces sequence packing for efficient SFT.
- Fewer Truncations Improve Language Modeling — Best Fit Decreasing packing strategy to minimize truncation.
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification — Dynamic Fine-Tuning (DFT) with gradient rescaling.
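The Best Fit Decreasing strategy from the second paper above can be sketched as the classic bin-packing heuristic (an illustration only; the trainer's actual packing code may differ):

```python
def best_fit_decreasing(lengths, capacity):
    """Pack sequence lengths into fixed token budgets: place each
    sequence, longest first, into the bin with the least remaining room
    that still fits it; open a new bin if none fits. Fuller bins mean
    fewer sequences need truncation to reach the budget."""
    bins = []  # each bin: [remaining_capacity, [packed_lengths]]
    for n in sorted(lengths, reverse=True):
        best = None
        for i, (remaining, _) in enumerate(bins):
            if n <= remaining and (best is None or remaining < bins[best][0]):
                best = i
        if best is None:
            bins.append([capacity - n, [n]])
        else:
            bins[best][0] -= n
            bins[best][1].append(n)
    return [packed for _, packed in bins]

groups = best_fit_decreasing([5, 4, 3, 2, 1], capacity=6)
```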
DPOTrainer — Direct Preference Optimization
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model — introduces DPO.
- A General Theoretical Paradigm to Understand Learning from Human Preferences — IPO loss variant to avoid preference overfitting.
- ORPO: Monolithic Preference Optimization without Reference Model — reference-free monolithic variant (see ORPOTrainer).
- Learn Your Reference Model for Real Good Alignment — Trust Region DPO with periodic reference model updates.
- Anchored Preference Optimization and Contrastive Revisions — APO objective for more contrastive preference pairs.
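The core DPO objective for one preference pair can be sketched as follows (a minimal illustration; DPOTrainer batches this over token-level log-probs):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1):
    """DPO loss for one preference pair. Inputs are summed
    log-probabilities of the chosen/rejected completions under the
    policy and the frozen reference model; beta scales the implicit
    reward. Loss: -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((policy_chosen - policy_rejected)
                     - (ref_chosen - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy's chosen-vs-rejected margin matches the reference model's, the loss sits at log 2; it falls as the policy widens the margin on the chosen response.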
BCOTrainer — Binary Classifier Optimization
- Binary Classifier Optimization for Large Language Model Alignment — introduces BCO.
CPOTrainer — Contrastive Preference Optimization
- Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation — introduces CPO.
- SimPO: Simple Preference Optimization with a Reference-Free Reward — reference-free CPO variant with target reward margin.
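SimPO's reference-free reward can be sketched like this (an illustration of the paper's formulation, not the trainer's code): the implicit reward is the length-normalized log-probability of a completion, and a target margin gamma separates chosen from rejected.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair. No reference model is needed:
    the reward is beta * (average per-token log-prob), and gamma
    enforces a target margin between chosen and rejected rewards."""
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    logits = r_chosen - r_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```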
KTOTrainer — Kahneman–Tversky Optimization
- KTO: Model Alignment as Prospect Theoretic Optimization — introduces KTO.
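A per-example sketch of the KTO loss, assuming the paper's formulation (not KTOTrainer's batched implementation): desirable and undesirable completions are scored separately against a KL reference point, so no preference pairs are required.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio, kl_baseline, desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO loss. `log_ratio` is log pi(y|x) - log pi_ref(y|x)
    for a single completion; `kl_baseline` estimates the policy/reference
    KL (the reference point z_0 in the paper). lambda_d / lambda_u weight
    desirable and undesirable examples independently."""
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_baseline)))
    return lambda_u * (1.0 - sigmoid(beta * (kl_baseline - log_ratio)))
```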
ORPOTrainer — Odds Ratio Preference Optimization
- ORPO: Monolithic Preference Optimization without Reference Model — introduces ORPO.
Reward modeling
Reward models score model outputs and provide the feedback signal used by online training methods.
RewardTrainer — Outcome reward modeling
- Llama 2: Open Foundation and Fine-Tuned Chat Models — margin-based reward loss for multi-level preference ratings.
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking — auxiliary centering loss to reduce underdetermination.
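The pairwise reward-model loss with Llama 2's margin term can be sketched as follows (an illustration of the formula, not RewardTrainer's code):

```python
import math

def reward_margin_loss(reward_chosen, reward_rejected, margin=0.0):
    """Bradley-Terry style pairwise loss for reward modeling:
    -log sigmoid(r_chosen - r_rejected - margin). Llama 2 sets the
    margin from how strongly annotators preferred the chosen response,
    pushing confidently-rated pairs further apart."""
    logits = reward_chosen - reward_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```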
PRMTrainer — Process reward modeling
- Solving math word problems with process- and outcome-based feedback — compares process-based vs outcome-based supervision; demonstrates the value of PRMs for reducing reasoning errors.
Knowledge distillation
Knowledge distillation methods train a smaller student model to mimic the output distribution of a larger teacher model, rather than training on hard labels.
GKDTrainer — Generalized Knowledge Distillation
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes — introduces GKD with flexible divergence losses and on-policy student sampling.
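The generalized Jensen–Shannon divergence used as GKD's flexible distillation loss can be sketched over two discrete distributions (an illustration; the trainer computes this over token logits):

```python
import math

def generalized_jsd(p, q, beta=0.5):
    """Generalized Jensen-Shannon divergence between two distributions:
    beta * KL(p || m) + (1 - beta) * KL(q || m), where the mixture is
    m = beta * p + (1 - beta) * q. beta interpolates between
    forward-KL-like and reverse-KL-like behavior."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [beta * x + (1.0 - beta) * y for x, y in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)
```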
MiniLLMTrainer — Sequence-level reverse KL distillation
- Knowledge Distillation of Large Language Models — introduces MiniLLM.
Choosing a method
Use the table below as a starting point. The right choice depends on your data, compute budget, and alignment goals.

| Scenario | Recommended method |
|---|---|
| Instruction-following from demonstrations | SFTTrainer |
| Preference alignment with paired data (offline) | DPOTrainer |
| Preference alignment without a reference model | ORPOTrainer or CPOTrainer |
| Binary feedback (liked/disliked), no pairs | KTOTrainer or BCOTrainer |
| Online RL with rule-based rewards (e.g., math) | GRPOTrainer |
| Online RL with a reward model, critic-free | RLOOTrainer |
| Online RL with a full actor-critic setup | PPOTrainer |
| Scoring full completions | RewardTrainer |
| Scoring reasoning steps | PRMTrainer |
| Compressing a large model into a smaller one | GKDTrainer or MiniLLMTrainer |
To speed up generation in the online trainers, set use_vllm=True in the corresponding config.