TRL is built on top of the Transformers Trainer and supports distributed training out of the box.
## Install TRL
## Trainers
### Supervised Fine-Tuning with SFTTrainer
SFTTrainer is the starting point for most post-training workflows. It fine-tunes a model on a dataset of demonstrations.
### Reinforcement learning with GRPOTrainer

GRPOTrainer implements Group Relative Policy Optimization (GRPO), a memory-efficient RL algorithm used to train DeepSeek-R1. It generates groups of completions for each prompt and optimizes them against a reward function. For reasoning models, use the reasoning_accuracy_reward() function for better results.
### Preference alignment with DPOTrainer

DPOTrainer implements Direct Preference Optimization (DPO), which trains the model directly on preference pairs without a separate reward model. DPO was used to post-train Llama 3 and many other models.
### Reward modeling with RewardTrainer

RewardTrainer trains a scalar reward model on preference data. Reward models provide the reward signal for online RL methods such as GRPO and RLOO.
## Command Line Interface

The `trl` CLI lets you run fine-tuning jobs directly from your terminal without writing any Python code.
SFT — supervised fine-tuning:
Run `trl --help`, or any subcommand with `--help`, to see all available options. See the CLI docs for the full reference.
## Troubleshooting
### Out of memory
Reduce the per-device batch size and use gradient accumulation to keep the effective batch size constant:
### Loss not decreasing

A learning rate that is too high or too low is a common cause. A good starting point for fine-tuning:
## Next steps

- SFT Trainer: full guide to supervised fine-tuning, including packing, chat templates, and LoRA
- GRPO Trainer: Group Relative Policy Optimization for reasoning and RL alignment
- Distributed training: scale to multi-GPU and multi-node with DeepSpeed and FSDP
- PEFT integration: train large models on consumer hardware with LoRA and QLoRA