TRL provides dedicated trainer classes for every stage of the post-training pipeline. Each trainer is a lightweight wrapper around the Hugging Face Trainer and supports distributed training out of the box.

Install TRL

pip install trl

Trainers

1. Supervised Fine-Tuning with SFTTrainer

SFTTrainer is the starting point for most post-training workflows. It fine-tunes a model on a dataset of demonstrations.
from trl import SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
See the SFT Trainer docs for options like dataset packing, chat templates, and LoRA.
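SFTTrainer accepts both plain-text and conversational datasets. A conversational row can be sketched as a plain Python dict carrying a "messages" list of role/content turns (the content below is illustrative, not an actual Capybara example); the model's chat template is applied to these turns automatically:

```python
# A conversational SFT dataset row: a "messages" list of role/content turns.
# (Illustrative content; real rows come from the dataset itself.)
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# Each turn is a dict with exactly a role and its text content.
roles = [turn["role"] for turn in example["messages"]]
print(roles)  # ['user', 'assistant']
```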
2. Reinforcement Learning with GRPOTrainer

GRPOTrainer implements Group Relative Policy Optimization (GRPO) — a memory-efficient RL algorithm used to train DeepSeek-R1. It generates groups of completions and optimizes them against a reward function.
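The "group relative" part of the algorithm can be sketched in plain Python: each completion's advantage is its reward normalized against the mean and standard deviation of the rewards in its own group. This is a simplified sketch of the idea, not TRL's implementation:

```python
def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance over the group; eps guards against a zero std.
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 completions scored by a binary reward (e.g. correct/incorrect):
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat their group's average get a positive advantage and are reinforced; the per-group baseline removes the need for a separate value model.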
from datasets import load_dataset
from trl import GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
For reasoning models that emit <think> blocks, you can additionally pass think_format_reward to reward well-formed reasoning traces.
See the GRPO Trainer docs for reward function configuration and vLLM integration.
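Custom reward functions are plain callables that take the generated completions and return one score per completion. A minimal hypothetical example that rewards concise answers (the signature is simplified to plain strings; for conversational datasets each completion is a list of message dicts, and TRL passes additional dataset columns as keyword arguments):

```python
def conciseness_reward(completions, **kwargs):
    """Hypothetical reward: 1.0 up to 200 chars, decaying linearly to 0.0 at 1000."""
    scores = []
    for completion in completions:
        length = len(completion)
        if length <= 200:
            scores.append(1.0)
        else:
            scores.append(max(0.0, 1.0 - (length - 200) / 800))
    return scores

print(conciseness_reward(["Short answer.", "x" * 600]))  # [1.0, 0.5]
```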
3. Preference Alignment with DPOTrainer

DPOTrainer implements Direct Preference Optimization (DPO), which trains the model directly on preference pairs without a separate reward model. DPO was used to post-train Llama 3 and many other models.
from datasets import load_dataset
from trl import DPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
See the DPO Trainer docs for reference model configuration and loss variants.
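At its core, the standard sigmoid DPO loss compares how much more the policy prefers the chosen response over the rejected one, relative to the reference model. A minimal sketch for a single preference pair, assuming you already have summed log-probabilities for each response:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one pair: -log sigmoid(beta * margin of log-ratios)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When policy and reference agree, the margin is zero and the loss is log(2).
# The loss falls as the policy favors the chosen response more than the reference does.
```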
4. Reward Modeling with RewardTrainer

RewardTrainer trains a scalar reward model on preference data. Reward models are used as the reward signal for online RL methods like GRPO and RLOO.
from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
See the Reward Trainer docs for dataset format and evaluation.
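Preference datasets pair a preferred ("chosen") response with a dispreferred ("rejected") one. A row can be sketched as a plain dict in the conversational format (illustrative content, not an actual ultrafeedback_binarized example):

```python
# A preference row: two conversations sharing the same user prompt,
# one chosen and one rejected (illustrative content).
example = {
    "chosen": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7 is a prime number."},
    ],
    "rejected": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "9."},
    ],
}

# Both sides start from the same prompt; only the assistant reply differs.
same_prompt = example["chosen"][0] == example["rejected"][0]
```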

Command Line Interface

The trl CLI lets you run fine-tuning jobs directly from your terminal without writing any Python code.

SFT (supervised fine-tuning):
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --output_dir Qwen2.5-0.5B-SFT
DPO (preference alignment):
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO
Reward modeling:
trl reward --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized
Run trl --help or any subcommand with --help to see all available options. See the CLI docs for the full reference.

Troubleshooting

Out of memory

Reduce batch size and accumulate gradients to maintain an effective batch size:
from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
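The two settings above multiply together, so the optimizer still takes each step on the same number of samples. The effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs:

```python
def effective_batch_size(per_device, accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_gpus

# The config above on a single GPU: 1 sample/step, accumulated over 8 steps.
print(effective_batch_size(1, 8))  # 8
```

Halving per_device_train_batch_size and doubling gradient_accumulation_steps keeps training dynamics comparable while roughly halving activation memory.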
For more aggressive memory reduction, install PEFT and enable LoRA:
pip install "trl[peft,quantization]"
See the memory optimization guide and PEFT integration for details.

Loss not decreasing

A learning rate that is too high or too low is a common cause. A good starting point for fine-tuning:
from trl import SFTConfig

training_args = SFTConfig(learning_rate=2e-5)
For more help, open an issue on GitHub.

Next steps

SFT Trainer

Full guide to supervised fine-tuning: packing, chat templates, and LoRA

GRPO Trainer

Group Relative Policy Optimization for reasoning and RL alignment

Distributed training

Scale to multi-GPU and multi-node with DeepSpeed and FSDP

PEFT integration

Train large models on consumer hardware with LoRA and QLoRA
