Installation
Install TRL with pip or from source and set up your environment
Quickstart
Fine-tune your first model in minutes with SFT, DPO, GRPO, or a reward model
SFT Trainer
Supervised fine-tuning with packing, chat templates, and LoRA support
CLI
Fine-tune directly from your terminal without writing code
What is post-training?
Pre-trained language models learn general representations from large text corpora, but they require additional training to become useful assistants. Post-training adapts a foundation model to follow instructions, align with human preferences, and reason more accurately. TRL covers the full post-training pipeline.

Trainer taxonomy

TRL organizes its trainers into four broad categories:

Online methods
Online methods generate completions during training and optimize them with a reward signal. These methods are well-suited for tasks with verifiable answers or when a reward model is available.
- GRPOTrainer — Group Relative Policy Optimization. Trains the model by comparing groups of sampled completions against a reward function. Used to train DeepSeek-R1.
- RLOOTrainer — REINFORCE Leave-One-Out. A variance-reduced policy gradient algorithm.
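To make "optimize against a reward function" concrete, here is a minimal sketch of a reward function in the shape online trainers such as GRPOTrainer conventionally accept: a callable that receives a batch of completions and returns one score per completion. The exact-match target and the `target` keyword are illustrative assumptions, not part of the TRL API.

```python
import re

def accuracy_reward(completions, target="42", **kwargs):
    """Return one reward per completion: 1.0 if the completion contains
    the expected answer as a whole word, else 0.0.

    `target` is a hypothetical parameter for this sketch; real reward
    functions typically pull ground truth from dataset columns passed
    via **kwargs."""
    pattern = rf"\b{re.escape(target)}\b"
    return [1.0 if re.search(pattern, c) else 0.0 for c in completions]

scores = accuracy_reward(["The answer is 42.", "I think it's 41."])
```

During online training, each group of sampled completions is scored this way and the relative scores within the group drive the policy update.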
Offline methods
Offline methods train on pre-collected preference or demonstration data without generating new completions at training time.
- SFTTrainer — Supervised Fine-Tuning. The standard starting point: train on curated demonstrations.
- DPOTrainer — Direct Preference Optimization. Optimizes the model directly from preference pairs, without a separate reward model.
- KTOTrainer — Kahneman-Tversky Optimization. Aligns models using binary feedback (thumbs up / thumbs down) instead of preference pairs.
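The "preference pairs" that DPO-style training consumes follow a simple row layout: each example pairs one prompt with a preferred and a dispreferred completion. A sketch of one such row, with illustrative content:

```python
# One row of a preference dataset in the prompt/chosen/rejected layout
# that pairwise offline methods like DPO expect. The text is made up
# for illustration.
preference_row = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",      # preferred answer
    "rejected": "France's capital is Lyon.",          # dispreferred answer
}
```

A full training set is just a collection of such rows, typically loaded as a Hugging Face dataset with these three columns.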
Reward modeling
Reward models score completions and provide the signal used by online RL methods.
- RewardTrainer — Trains a scalar reward model on preference pairs.
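Training a scalar reward model on preference pairs typically means minimizing the standard Bradley-Terry (pairwise logistic) loss: the model is pushed to score the chosen completion above the rejected one. A self-contained sketch of that loss, stated here as background rather than as TRL's exact implementation:

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the reward margin between the chosen and
    rejected completions grows."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no margin the loss is log(2); a positive margin lowers it.
no_margin = pairwise_loss(0.0, 0.0)
wide_margin = pairwise_loss(2.0, 0.0)
```

In practice the scalar rewards come from a classification head on top of a language model, and this loss is averaged over a batch of preference pairs.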
Knowledge distillation
Distillation methods transfer capabilities from a larger teacher model to a smaller student model.
- BCOTrainer — Binary Classifier Optimization. Uses binary feedback for distillation.
Hugging Face ecosystem
TRL is built on top of and integrates natively with:
- Transformers — model loading, tokenization, and training infrastructure. Every TRL trainer is a lightweight wrapper around the Transformers Trainer.
- Accelerate — distributed training across single-GPU, multi-GPU (DDP), and multi-node (DeepSpeed ZeRO, FSDP) setups.
- PEFT — parameter-efficient fine-tuning via LoRA and QLoRA, enabling training of large models on modest hardware.
- Datasets — efficient dataset loading, processing, and streaming from the Hugging Face Hub.
All TRL trainers natively support distributed training methods including DDP, DeepSpeed ZeRO, and FSDP without any additional configuration.
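The PEFT integration mentioned above usually amounts to passing a LoRA configuration alongside the trainer's own config. A configuration-only sketch, assuming recent `peft` and `trl` APIs (hyperparameter values are illustrative, not recommendations):

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter configuration: train small low-rank matrices instead of
# the full weights. Values here are illustrative placeholders.
lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the update
    task_type="CAUSAL_LM",
)

# Trainer configuration; these objects would be passed to SFTTrainer
# as `args=sft_config, peft_config=lora_config`.
sft_config = SFTConfig(
    output_dir="sft-lora-output",
    per_device_train_batch_size=4,
)
```

With a config like this, only the adapter parameters receive gradients, which is what makes fine-tuning large models feasible on modest hardware.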