This guide covers methods to accelerate training in TRL. Each technique includes minimal examples with links to more comprehensive documentation.

vLLM for fast generation in online methods

Online methods such as GRPO or Online DPO require the model to generate completions, which is often the slowest step. vLLM speeds up generation significantly through PagedAttention and other optimizations. Install vLLM:
pip install trl[vllm]

1. Start a vLLM server:

trl vllm-serve --model <model_name>

2. Enable vLLM in your training config:

from trl import GRPOConfig

training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
Ensure that GPUs assigned for training and generation are separate to avoid resource conflicts. For example, with 8 GPUs total:
# GPUs 0–3 for vLLM generation
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name>

# GPUs 4–7 for training
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
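If dedicating separate GPUs to generation is not an option, recent TRL versions also offer a colocated mode that runs vLLM inside the training processes, sharing the same GPUs. A minimal sketch, assuming a TRL version where `vllm_mode="colocate"` is available:

```python
from trl import GRPOConfig

# Colocate mode: vLLM shares the training GPUs instead of running
# as a separate server process (no `trl vllm-serve` needed).
training_args = GRPOConfig(..., use_vllm=True, vllm_mode="colocate")
```

Colocation trades some generation throughput for simpler single-job deployment, since no GPUs sit idle between the generation and training phases.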
For full configuration options, see the vLLM Integration guide.

Optimized attention implementations

TRL supports optimized attention backends, such as FlashAttention, that speed up training while reducing memory usage. These backends work across all TRL trainers.
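For example, FlashAttention-2 can be requested through the model initialization kwargs. A sketch, assuming a model architecture that supports the `flash_attention_2` backend and that the `flash-attn` package is installed:

```python
from trl import SFTConfig

# Forward attn_implementation to the underlying Transformers model
# (requires the flash-attn package and a compatible GPU)
training_args = SFTConfig(
    ...,
    model_init_kwargs={"attn_implementation": "flash_attention_2"},
)
```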

Liger Kernel

Liger Kernel is a collection of Triton kernels designed for LLM training. It can increase multi-GPU throughput by 20% and reduce memory usage by 60%.
from trl import SFTConfig

training_args = SFTConfig(..., use_liger_kernel=True)
For more details, see the Liger Kernel Integration guide.

Mixed precision training

Mixed precision training using bf16 or fp16 can speed up training and reduce memory usage with minimal impact on model quality.
from trl import SFTConfig

# bfloat16 — recommended for Ampere (A100, RTX 30xx) or newer
training_args = SFTConfig(..., bf16=True)

# float16 — use for older GPUs
training_args = SFTConfig(..., fp16=True)
Mixed precision is supported across all TRL trainers.
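When the same training script must run on mixed hardware, the precision flag can be chosen at runtime instead of hard-coded. A minimal sketch using PyTorch's capability check:

```python
import torch
from trl import SFTConfig

# Prefer bf16 where the hardware supports it, otherwise fall back to fp16
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
training_args = SFTConfig(..., bf16=use_bf16, fp16=not use_bf16)
```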
