This guide covers methods to accelerate training in TRL. Each technique includes minimal examples with links to more comprehensive documentation.

vLLM for fast generation in online methods

Online methods such as GRPO or Online DPO require the model to generate completions, which is often the slowest step. vLLM speeds up generation significantly through PagedAttention and other optimizations. Install vLLM:
pip install trl[vllm]

1. Start a vLLM server:

trl vllm-serve --model <model_name>

2. Enable vLLM in your training config:

from trl import GRPOConfig

training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
Ensure that GPUs assigned for training and generation are separate to avoid resource conflicts. For example, with 8 GPUs total:
# GPUs 0–3 for vLLM generation
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name>

# GPUs 4–7 for training
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
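If dedicating separate GPUs to generation is not an option, recent TRL versions also offer a colocated mode that runs vLLM inside the training processes, sharing the same GPUs. A minimal sketch, assuming a TRL version where `vllm_mode="colocate"` is available:

```python
from trl import GRPOConfig

# Colocate mode: vLLM shares the training GPUs instead of running
# as a separate server process (no `trl vllm-serve` needed).
training_args = GRPOConfig(..., use_vllm=True, vllm_mode="colocate")
```

Colocation trades some generation throughput for simpler single-job deployment, since no GPUs sit idle between the generation and training phases.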
For full configuration options, see the vLLM Integration guide.

Optimized attention implementations

TRL supports optimized attention backends, such as FlashAttention, that speed up training while reducing memory usage. These backends work across all TRL trainers.
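For example, FlashAttention-2 can be requested through the model initialization kwargs. A sketch, assuming a model architecture that supports the `flash_attention_2` backend and that the `flash-attn` package is installed:

```python
from trl import SFTConfig

# Forward attn_implementation to the underlying Transformers model
# (requires the flash-attn package and a compatible GPU)
training_args = SFTConfig(
    ...,
    model_init_kwargs={"attn_implementation": "flash_attention_2"},
)
```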

Liger Kernel

Liger Kernel is a collection of Triton kernels designed for LLM training. It can increase multi-GPU throughput by 20% and reduce memory usage by 60%.
from trl import SFTConfig

training_args = SFTConfig(..., use_liger_kernel=True)
For more details, see the Liger Kernel Integration guide.

Mixed precision training

Mixed precision training using bf16 or fp16 can speed up training and reduce memory usage with minimal impact on model quality.
from trl import SFTConfig

# bfloat16 — recommended for Ampere (A100, RTX 30xx) or newer
training_args = SFTConfig(..., bf16=True)

# float16 — use for older GPUs
training_args = SFTConfig(..., fp16=True)
Mixed precision is supported across all TRL trainers.
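When the same training script must run on mixed hardware, the precision flag can be chosen at runtime instead of hard-coded. A minimal sketch using PyTorch's capability check:

```python
import torch
from trl import SFTConfig

# Prefer bf16 where the hardware supports it, otherwise fall back to fp16
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
training_args = SFTConfig(..., bf16=use_bf16, fp16=not use_bf16)
```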
