TRL provides a command-line interface (CLI) to fine-tune large language models using methods like SFT, DPO, GRPO, and more. The CLI abstracts away boilerplate so you can launch training jobs quickly and reproducibly.

Available commands

trl sft

Supervised fine-tuning

trl dpo

Direct Preference Optimization

trl grpo

Group Relative Policy Optimization

trl rloo

REINFORCE Leave-One-Out

trl kto

Kahneman-Tversky Optimization

trl reward

Reward model training
Other commands:
  • trl env — print system and dependency information
  • trl vllm-serve — start a vLLM generation server
  • trl skills — manage TRL agent skills

Basic usage

Specify the model and dataset directly as flags:
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb

Key flags

Model flags (ModelConfig)

| Flag | Default | Description |
|---|---|---|
| --model_name_or_path | | Model checkpoint or Hub ID |
| --model_revision | main | Branch, tag, or commit ID |
| --dtype | float32 | Model dtype: auto, bfloat16, float16, float32 |
| --attn_implementation | | Attention backend (e.g. flash_attention_2, kernels-community/flash-attn2) |
| --trust_remote_code | false | Allow custom model code from the Hub |
| --use_peft | false | Enable PEFT/LoRA training |
| --lora_r | 16 | LoRA rank |
| --lora_alpha | 32 | LoRA scaling factor |
| --lora_dropout | 0.05 | LoRA dropout |
| --lora_target_modules | | Modules to apply LoRA to |
| --load_in_4bit | false | Load the base model in 4-bit (QLoRA) |
| --load_in_8bit | false | Load the base model in 8-bit |
| --bnb_4bit_quant_type | nf4 | 4-bit quantization type: nf4 or fp4 |
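As a sketch, the PEFT and quantization flags above can be combined for QLoRA-style training; the model, dataset, and hyperparameter values here are illustrative, not recommendations:

```shell
# QLoRA sketch: 4-bit base model + LoRA adapters (all flags from the table above)
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --use_peft \
  --lora_r 16 \
  --lora_alpha 32 \
  --load_in_4bit \
  --bnb_4bit_quant_type nf4 \
  --output_dir Qwen2.5-0.5B-QLoRA
```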

Training flags (shared across trainers)

| Flag | Description |
|---|---|
| --output_dir | Directory to save the trained model |
| --learning_rate | Learning rate |
| --num_train_epochs | Number of training epochs |
| --max_steps | Maximum number of training steps (overrides epochs) |
| --per_device_train_batch_size | Batch size per GPU |
| --gradient_accumulation_steps | Steps to accumulate gradients before updating |
| --bf16 | Enable bfloat16 mixed precision |
| --fp16 | Enable float16 mixed precision |
| --eval_strategy | Evaluation strategy: no, steps, epoch |
| --eval_steps | Evaluate every N steps (when eval_strategy=steps) |
| --push_to_hub | Push the trained model to the Hugging Face Hub |
| --gradient_checkpointing | Enable gradient checkpointing |

SFT-specific flags

| Flag | Description |
|---|---|
| --max_length | Maximum sequence length for truncation |
| --packing | Enable sequence packing |
| --packing_strategy | Packing strategy: bfd, bfd_split, or wrapped |
| --eos_token | EOS token string (e.g. <\|im_end\|>) |
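For instance, truncation and packing can be set together in one run; the values below are illustrative only:

```shell
# SFT with a 2048-token cap and best-fit-decreasing packing
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --max_length 2048 \
  --packing \
  --packing_strategy bfd \
  --output_dir Qwen2.5-0.5B-SFT
```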

DPO-specific flags

| Flag | Description |
|---|---|
| --max_length | Maximum combined prompt+completion length |
| --beta | KL penalty coefficient |
| --loss_type | DPO loss type (e.g. sigmoid, hinge, ipo) |
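A minimal DPO invocation might look like the following; the preference dataset and the beta value are illustrative placeholders, not defaults to rely on:

```shell
# DPO sketch: sigmoid loss with a modest KL penalty
trl dpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --beta 0.1 \
  --loss_type sigmoid \
  --output_dir Qwen2.5-0.5B-DPO
```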

GRPO-specific flags

| Flag | Description |
|---|---|
| --reward_funcs | Built-in reward functions to use (e.g. accuracy_reward, think_format_reward) |
| --reward_model_name_or_path | External reward model Hub ID or local path |
| --use_vllm | Enable vLLM for fast generation |
| --vllm_mode | vLLM mode: server or colocate |
Built-in reward_funcs values for GRPO and RLOO:
  • accuracy_reward
  • reasoning_accuracy_reward
  • think_format_reward
  • get_soft_overlong_punishment
  • Any dotted import path (e.g. my_lib.rewards.custom_reward)
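Putting the pieces together, a GRPO run with a built-in reward might look like this sketch; the dataset name is a hypothetical placeholder you would replace with your own prompts:

```shell
# GRPO sketch using a built-in reward function from the list above
# (my-org/my-reasoning-dataset is a hypothetical dataset name)
trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --dataset_name my-org/my-reasoning-dataset \
  --reward_funcs think_format_reward \
  --output_dir Qwen2.5-0.5B-GRPO
```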

Using config files

Define all training arguments in a YAML config file for cleaner, reproducible runs:
# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
packing: true
output_dir: Qwen2.5-0.5B-SFT
Then launch training with:
trl sft --config sft_config.yaml
CLI flags passed alongside --config override values in the file.
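For example, to reuse sft_config.yaml but try a different learning rate without editing the file:

```shell
# The flag wins over the learning_rate value stored in the config file
trl sft --config sft_config.yaml --learning_rate 1.0e-5
```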

Multi-GPU and distributed training

The TRL CLI natively supports Accelerate. Pass any accelerate launch argument directly, such as --num_processes:
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --num_processes 4

Using --accelerate_config

The --accelerate_config flag selects a distributed training strategy. It accepts either a predefined profile name or a path to a custom Accelerate YAML config file. Predefined profiles:
| Name | Description |
|---|---|
| single_gpu | Single-GPU training |
| multi_gpu | Multi-GPU with DDP |
| fsdp1 | Fully Sharded Data Parallel (FSDP version 1) |
| fsdp2 | Fully Sharded Data Parallel (FSDP version 2) |
| zero1 | DeepSpeed ZeRO Stage 1 |
| zero2 | DeepSpeed ZeRO Stage 2 |
| zero3 | DeepSpeed ZeRO Stage 3 |
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --accelerate_config zero2
Or in a config file:
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
accelerate_config: zero2

Dataset mixtures

Combine multiple datasets into a single training dataset using the datasets key in your config file:
model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
  - path: stanfordnlp/imdb
  - path: roneneldan/TinyStories
See DatasetConfig and DatasetMixtureConfig for all available dataset mixture keywords.
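As a sketch, each entry can carry additional per-dataset keys; the split key used below is an assumption to verify against the DatasetConfig reference:

```yaml
# Mixture sketch: per-dataset split selection (split key assumed, see DatasetConfig)
model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
  - path: stanfordnlp/imdb
    split: train
  - path: roneneldan/TinyStories
    split: train
```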

LoRA training example

A complete SFT run with LoRA enabled via the CLI:
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT-LoRA \
  --push_to_hub

Getting system information

Print system and dependency versions for bug reports:
trl env
This outputs platform, Python, PyTorch, Transformers, Accelerate, TRL, and optional dependency versions.
