TRL provides dedicated trainer classes for every stage of the post-training pipeline. Each trainer is a lightweight wrapper around the Hugging Face Trainer and supports distributed training out of the box.

Install TRL

pip install trl

Trainers

1. Supervised Fine-Tuning with SFTTrainer

SFTTrainer is the starting point for most post-training workflows. It fine-tunes a model on a dataset of demonstrations.
from trl import SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
See the SFT Trainer docs for options like dataset packing, chat templates, and LoRA.
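SFTTrainer accepts both plain-text and conversational datasets. A conversational row can be sketched as a plain Python dict carrying a "messages" list of role/content turns (the content below is illustrative, not an actual Capybara example); the model's chat template is applied to these turns automatically:

```python
# A conversational SFT dataset row: a "messages" list of role/content turns.
# (Illustrative content; real rows come from the dataset itself.)
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# Each turn is a dict with exactly a role and its text content.
roles = [turn["role"] for turn in example["messages"]]
print(roles)  # ['user', 'assistant']
```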
2. Reinforcement Learning with GRPOTrainer

GRPOTrainer implements Group Relative Policy Optimization (GRPO) — a memory-efficient RL algorithm used to train DeepSeek-R1. It generates groups of completions and optimizes them against a reward function.
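The "group relative" part of the algorithm can be sketched in plain Python: each completion's advantage is its reward normalized against the mean and standard deviation of the rewards in its own group. This is a simplified sketch of the idea, not TRL's implementation:

```python
def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance over the group; eps guards against a zero std.
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 completions scored by a binary reward (e.g. correct/incorrect):
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat their group's average get a positive advantage and are reinforced; the per-group baseline removes the need for a separate value model.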
from datasets import load_dataset
from trl import GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
For reasoning models that emit <think> blocks, you can additionally pass think_format_reward to reward well-formed reasoning traces.
See the GRPO Trainer docs for reward function configuration and vLLM integration.
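Custom reward functions are plain callables that take the generated completions and return one score per completion. A minimal hypothetical example that rewards concise answers (the signature is simplified to plain strings; for conversational datasets each completion is a list of message dicts, and TRL passes additional dataset columns as keyword arguments):

```python
def conciseness_reward(completions, **kwargs):
    """Hypothetical reward: 1.0 up to 200 chars, decaying linearly to 0.0 at 1000."""
    scores = []
    for completion in completions:
        length = len(completion)
        if length <= 200:
            scores.append(1.0)
        else:
            scores.append(max(0.0, 1.0 - (length - 200) / 800))
    return scores

print(conciseness_reward(["Short answer.", "x" * 600]))  # [1.0, 0.5]
```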
3. Preference Alignment with DPOTrainer

DPOTrainer implements Direct Preference Optimization (DPO), which trains the model directly on preference pairs without a separate reward model. DPO was used to post-train Llama 3 and many other models.
from datasets import load_dataset
from trl import DPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
See the DPO Trainer docs for reference model configuration and loss variants.
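At its core, the standard sigmoid DPO loss compares how much more the policy prefers the chosen response over the rejected one, relative to the reference model. A minimal sketch for a single preference pair, assuming you already have summed log-probabilities for each response:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one pair: -log sigmoid(beta * margin of log-ratios)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When policy and reference agree, the margin is zero and the loss is log(2).
# The loss falls as the policy favors the chosen response more than the reference does.
```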
4. Reward Modeling with RewardTrainer

RewardTrainer trains a scalar reward model on preference data. Reward models are used as the reward signal for online RL methods like GRPO and RLOO.
from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
See the Reward Trainer docs for dataset format and evaluation.
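Preference datasets pair a preferred ("chosen") response with a dispreferred ("rejected") one. A row can be sketched as a plain dict in the conversational format (illustrative content, not an actual ultrafeedback_binarized example):

```python
# A preference row: two conversations sharing the same user prompt,
# one chosen and one rejected (illustrative content).
example = {
    "chosen": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7 is a prime number."},
    ],
    "rejected": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "9."},
    ],
}

# Both sides start from the same prompt; only the assistant reply differs.
same_prompt = example["chosen"][0] == example["rejected"][0]
```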

Command Line Interface

The trl CLI lets you run fine-tuning jobs directly from your terminal without writing any Python code.

SFT (supervised fine-tuning):
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --output_dir Qwen2.5-0.5B-SFT
DPO (preference alignment):
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO
Reward modeling:
trl reward --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized
Run trl --help or any subcommand with --help to see all available options. See the CLI docs for the full reference.

Troubleshooting

Out of memory

Reduce batch size and accumulate gradients to maintain an effective batch size:
from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
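The two settings above multiply together, so the optimizer still takes each step on the same number of samples. The effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs:

```python
def effective_batch_size(per_device, accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_gpus

# The config above on a single GPU: 1 sample/step, accumulated over 8 steps.
print(effective_batch_size(1, 8))  # 8
```

Halving per_device_train_batch_size and doubling gradient_accumulation_steps keeps training dynamics comparable while roughly halving activation memory.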
For more aggressive memory reduction, install PEFT and enable LoRA:
pip install "trl[peft,quantization]"
See the memory optimization guide and PEFT integration for details.

Loss not decreasing

A learning rate that is too high or too low is a common cause. A good starting point for fine-tuning:
from trl import SFTConfig

training_args = SFTConfig(learning_rate=2e-5)
For more help, open an issue on GitHub.

Next steps

SFT Trainer

Full guide to supervised fine-tuning: packing, chat templates, and LoRA

GRPO Trainer

Group Relative Policy Optimization for reasoning and RL alignment

Distributed training

Scale to multi-GPU and multi-node with DeepSpeed and FSDP

PEFT integration

Train large models on consumer hardware with LoRA and QLoRA
