Available commands
| Command | Description |
|---|---|
| `trl sft` | Supervised fine-tuning |
| `trl dpo` | Direct Preference Optimization |
| `trl grpo` | Group Relative Policy Optimization |
| `trl rloo` | REINFORCE Leave-One-Out |
| `trl kto` | Kahneman-Tversky Optimization |
| `trl reward` | Reward model training |
| `trl env` | Print system and dependency information |
| `trl vllm-serve` | Start a vLLM generation server |
| `trl skills` | Manage TRL agent skills |
Basic usage
Specify the model and dataset directly as flags. The same pattern works for every trainer (SFT, DPO, GRPO, RLOO, KTO, and Reward).
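For example, a minimal SFT run, and the same shape for DPO (the model and dataset names below are illustrative placeholders, not prescribed defaults):

```shell
# Minimal SFT run; swap in your own model and dataset
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/Capybara \
  --output_dir ./sft-model

# The other trainers follow the same shape, e.g. DPO with a preference dataset
trl dpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir ./dpo-model
```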
Key flags
Model flags (ModelConfig)
| Flag | Default | Description |
|---|---|---|
| `--model_name_or_path` | — | Model checkpoint or Hub ID |
| `--model_revision` | `main` | Branch, tag, or commit ID |
| `--dtype` | `float32` | Model dtype: `auto`, `bfloat16`, `float16`, `float32` |
| `--attn_implementation` | — | Attention backend (e.g. `flash_attention_2`, `kernels-community/flash-attn2`) |
| `--trust_remote_code` | `false` | Allow custom model code from the Hub |
| `--use_peft` | `false` | Enable PEFT/LoRA training |
| `--lora_r` | `16` | LoRA rank |
| `--lora_alpha` | `32` | LoRA scaling factor |
| `--lora_dropout` | `0.05` | LoRA dropout |
| `--lora_target_modules` | — | Modules to apply LoRA to |
| `--load_in_4bit` | `false` | Load base model in 4-bit (QLoRA) |
| `--load_in_8bit` | `false` | Load base model in 8-bit |
| `--bnb_4bit_quant_type` | `nf4` | 4-bit quantization type: `nf4` or `fp4` |
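Putting the PEFT and quantization flags together, a QLoRA-style run might look like this (model and dataset names are illustrative):

```shell
# QLoRA: load the base model in 4-bit and train LoRA adapters on top
trl sft \
  --model_name_or_path Qwen/Qwen2.5-7B \
  --dataset_name trl-lib/Capybara \
  --use_peft \
  --lora_r 16 \
  --lora_alpha 32 \
  --load_in_4bit \
  --output_dir ./qlora-model
```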
Training flags (shared across trainers)
| Flag | Description |
|---|---|
| `--output_dir` | Directory to save the trained model |
| `--learning_rate` | Learning rate |
| `--num_train_epochs` | Number of training epochs |
| `--max_steps` | Maximum number of training steps (overrides epochs) |
| `--per_device_train_batch_size` | Batch size per GPU |
| `--gradient_accumulation_steps` | Steps to accumulate gradients before updating |
| `--bf16` | Enable bfloat16 mixed precision |
| `--fp16` | Enable float16 mixed precision |
| `--eval_strategy` | Evaluation strategy: `no`, `steps`, `epoch` |
| `--eval_steps` | Evaluate every N steps (when `eval_strategy=steps`) |
| `--push_to_hub` | Push trained model to the Hugging Face Hub |
| `--gradient_checkpointing` | Enable gradient checkpointing |
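A run combining the common training flags might look like this (hyperparameter values are examples, not defaults):

```shell
# Typical training hyperparameters; effective batch size is 4 x 8 = 32 per GPU
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --bf16 \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir ./sft-model
```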
SFT-specific flags
| Flag | Description |
|---|---|
| `--max_length` | Maximum sequence length for truncation |
| `--packing` | Enable sequence packing |
| `--packing_strategy` | Packing strategy: `bfd`, `bfd_split`, or `wrapped` |
| `--eos_token` | EOS token string (e.g. `<\|im_end\|>`) |
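For instance, an SFT run with packing enabled (values are illustrative):

```shell
# Pack multiple short examples into each sequence to improve throughput
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/Capybara \
  --max_length 4096 \
  --packing \
  --packing_strategy bfd \
  --output_dir ./sft-packed
```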
DPO-specific flags
| Flag | Description |
|---|---|
| `--max_length` | Maximum combined prompt+completion length |
| `--beta` | KL penalty coefficient |
| `--loss_type` | DPO loss type (e.g. `sigmoid`, `hinge`, `ipo`) |
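For example, a DPO run with an explicit beta and loss type (model and dataset names are illustrative):

```shell
# DPO on a preference dataset; beta controls divergence from the reference model
trl dpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --beta 0.1 \
  --loss_type sigmoid \
  --max_length 1024 \
  --output_dir ./dpo-model
```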
GRPO-specific flags
| Flag | Description |
|---|---|
| `--reward_funcs` | Built-in reward functions to use (e.g. `accuracy_reward`, `think_format_reward`) |
| `--reward_model_name_or_path` | External reward model Hub ID or local path |
| `--use_vllm` | Enable vLLM for fast generation |
| `--vllm_mode` | vLLM mode: `server` or `colocate` |
Accepted `reward_funcs` values for GRPO and RLOO:

- `accuracy_reward`
- `reasoning_accuracy_reward`
- `think_format_reward`
- `get_soft_overlong_punishment`
- Any dotted import path (e.g. `my_lib.rewards.custom_reward`)
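Combining these, a GRPO run with a built-in reward function and vLLM generation might look like this (model and dataset names are illustrative; in `server` mode, `trl vllm-serve` must be running separately):

```shell
# GRPO with a built-in reward function, generating through a vLLM server
trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/tldr \
  --reward_funcs think_format_reward \
  --use_vllm \
  --vllm_mode server \
  --output_dir ./grpo-model
```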
Using config files
Define all training arguments in a YAML config file for cleaner, reproducible runs. The same approach works for every trainer (SFT, DPO, GRPO, and the rest). Pass the file with `--config`; any flags given on the command line override the values in the file.
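For example, a minimal SFT config (the field names mirror the CLI flags; model and dataset values are placeholders):

```shell
# Write a config file and launch from it
cat > sft_config.yaml <<'EOF'
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
packing: true
output_dir: ./sft-from-config
EOF

trl sft --config sft_config.yaml
```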
Multi-GPU and distributed training
The TRL CLI natively supports Accelerate. Pass any `accelerate launch` argument directly, such as `--num_processes`.
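For example (model and dataset names are illustrative):

```shell
# --num_processes is forwarded to `accelerate launch`: run on 4 GPUs
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/Capybara \
  --num_processes 4
```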
Using --accelerate_config
The --accelerate_config flag selects a distributed training strategy. It accepts either a predefined profile name or a path to a custom Accelerate YAML config file.
Predefined profiles:
| Name | Description |
|---|---|
| `single_gpu` | Single-GPU training |
| `multi_gpu` | Multi-GPU with DDP |
| `fsdp1` | Fully Sharded Data Parallel (FSDP v1) |
| `fsdp2` | Fully Sharded Data Parallel (FSDP v2) |
| `zero1` | DeepSpeed ZeRO Stage 1 |
| `zero2` | DeepSpeed ZeRO Stage 2 |
| `zero3` | DeepSpeed ZeRO Stage 3 |
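For example, selecting a predefined profile or a custom file (the model, dataset, and `accelerate_config.yaml` path below are placeholders):

```shell
# Train a larger model with DeepSpeed ZeRO-3 via a predefined profile
trl sft \
  --model_name_or_path Qwen/Qwen2.5-7B \
  --dataset_name trl-lib/Capybara \
  --accelerate_config zero3

# Or point at your own Accelerate YAML config file
trl sft \
  --model_name_or_path Qwen/Qwen2.5-7B \
  --dataset_name trl-lib/Capybara \
  --accelerate_config path/to/accelerate_config.yaml
```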
Dataset mixtures
Combine multiple datasets into a single training dataset using the `datasets` key in your config file. Mixtures work with any trainer (SFT, DPO, GRPO, and the rest).
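A sketch of a mixture config, assuming per-dataset keys such as `path` and `split` (the dataset names are placeholders; check the config classes below for the exact field names):

```shell
# Example mixture config: two datasets combined into one training set
cat > mixture.yaml <<'EOF'
model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
  - path: trl-lib/Capybara
  - path: trl-lib/tldr
    split: train
output_dir: ./mixed-sft
EOF

trl sft --config mixture.yaml
```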
See `DatasetConfig` and `DatasetMixtureConfig` for all available dataset mixture keywords.