Training workflows can often be optimized to reduce memory consumption. TRL provides several built-in features to help achieve this. Experiment with different combinations to find the configuration that works best for your setup. Each method below includes examples for the supported trainers.

Truncation

Sequences in a dataset can vary widely in length. When batched, all sequences are padded to the longest one in the batch, which can cause high memory usage even when most sequences are short. To reduce this, truncate sequences to a reasonable length using the max_length parameter.
DPO truncation is controlled via max_length, which truncates the combined prompt+completion sequence.
```python
from trl import DPOConfig

training_args = DPOConfig(..., max_length=512)
```
The legacy max_prompt_length and max_completion_length parameters have been removed. Filter or pre-truncate overlong prompts and completions in your dataset before training.

Choosing the right max_length

  • Too small: important tokens at the end of sequences are discarded, reducing training quality.
  • Too large: memory spikes and potential out-of-memory (OOM) errors. Without packing or padding-free batching, many tokens will also be padding.
Visualize the sequence length distribution in your dataset to pick an appropriate value before training.
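A quick way to pick max_length is to inspect the token-length distribution directly. The sketch below uses toy lengths; in practice you would compute them with your tokenizer, and the commented expression is only an illustration of how:

```python
from statistics import quantiles

# Token counts per example. In practice, compute these with your tokenizer,
# e.g. lengths = [len(tokenizer(ex["text"])["input_ids"]) for ex in dataset].
lengths = [48, 95, 102, 130, 180, 210, 256, 310, 512, 1450]

# Cut points splitting the distribution into 10 equal buckets (deciles).
deciles = quantiles(lengths, n=10)

# Fraction of tokens kept if truncating at a candidate max_length.
max_length = 512
frac = sum(min(n, max_length) for n in lengths) / sum(lengths)

print(f"deciles: {deciles}")
print(f"tokens kept at max_length={max_length}: {frac:.1%}")
```

A value near a high percentile usually keeps most sequences intact while preventing a few outliers from driving up padding or memory.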

Packing

Packing is available only for SFT training, and it requires FlashAttention or one of its variants.
Packing mitigates truncation drawbacks by grouping multiple sequences into the same training row, filling each row up to max_length. TRL uses Best-Fit Decreasing (BFD) bin packing to group sequences efficiently. Three strategies are supported:

  • bfd — Uses Best-Fit Decreasing packing. If a sequence exceeds max_length, the overflow tokens are discarded.
  • bfd_split — Uses Best-Fit Decreasing packing, but long sequences are split into chunks of at most max_length before packing. This preserves all tokens.
  • wrapped — All tokens are concatenated into a stream and split into fixed-length blocks. Minimizes padding but may mix unrelated examples, which can hurt performance.
```python
from trl import SFTConfig

training_args = SFTConfig(
    ...,
    packing=True,
    packing_strategy="bfd",
    max_length=512,
)
```
If all sequences are shorter than max_length, bfd and bfd_split behave identically since no truncation or splitting is required.
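To make the BFD idea concrete, here is a small illustrative sketch of Best-Fit Decreasing bin packing (not TRL's actual implementation, and it assumes every sequence already fits within max_length):

```python
def pack_bfd(lengths, max_length):
    """Best-Fit Decreasing: place each sequence (longest first) into the
    row with the least remaining room that still fits it."""
    rows = []  # each row: list of sequence lengths packed together
    for length in sorted(lengths, reverse=True):
        best = None
        for row in rows:
            room = max_length - sum(row)
            if length <= room and (best is None or room < max_length - sum(best)):
                best = row
        if best is None:
            rows.append([length])  # no row fits: open a new one
        else:
            best.append(length)
    return rows

rows = pack_bfd([400, 300, 200, 100, 60, 40], max_length=512)
print(rows)  # → [[400, 100], [300, 200], [60, 40]]
```

Six sequences fit into three rows instead of six, with far fewer padding tokens per row.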

PEFT for parameter-efficient fine-tuning

Methods like LoRA train only a small number of adapter parameters instead of all model weights, significantly reducing memory requirements and enabling fine-tuning of larger models on limited hardware.
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = LoraConfig()

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    peft_config=peft_config,
)
```
PEFT can be combined with quantization (4-bit or 8-bit) for even greater memory savings. See the PEFT Integration guide for quantization examples.
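One possible way to combine LoRA with 4-bit quantization is to pass a BitsAndBytesConfig through model_init_kwargs; this sketch assumes bitsandbytes is installed, and the PEFT Integration guide has the canonical examples:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

# Load the frozen base model in 4-bit NF4; LoRA adapters remain in full precision.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

training_args = SFTConfig(..., model_init_kwargs={"quantization_config": bnb_config})

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(),
)
```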

Liger Kernel

Liger Kernel is a collection of Triton kernels for LLM training that can increase multi-GPU training throughput by up to 20% and reduce memory usage by up to 60%.
```python
from trl import SFTConfig

training_args = SFTConfig(..., use_liger_kernel=True)
```
For more details, see the Liger Kernel Integration guide.

Padding-free batching

Padding-free batching flattens a batch into a single sequence, avoiding padding entirely. Unlike packing, it keeps all sequences complete and intact.
Use padding-free batching only with FlashAttention 2 or FlashAttention 3, which compute attention per sequence within the flattened batch. With other attention implementations, tokens from different examples can attend to each other (batch contamination).
```python
from trl import DPOConfig

training_args = DPOConfig(
    ...,
    padding_free=True,
    model_init_kwargs={"attn_implementation": "kernels-community/flash-attn2"},
)
```
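To see what "flattening" means, here is a small illustrative sketch (toy token ids, not TRL's collator): the batch becomes one row, and per-sequence position_ids preserve the example boundaries that FlashAttention needs:

```python
# Three variable-length sequences of token ids.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 7, 102]]

# Padded batching: every row grows to the longest length (here 6),
# so 3 x 6 = 18 slots hold only 13 real tokens.
padded_slots = len(batch) * max(len(seq) for seq in batch)

# Padding-free: concatenate everything into one row, and restart
# position_ids at 0 for each sequence so boundaries are recoverable.
input_ids = [tok for seq in batch for tok in seq]
position_ids = [i for seq in batch for i in range(len(seq))]

print(len(input_ids), padded_slots)  # → 13 18
```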

Activation offloading

Activation offloading reduces GPU VRAM usage by moving activation tensors to CPU RAM during the forward pass and bringing them back during the backward pass. It reduces peak memory at the cost of slightly increased training time.
```python
from trl import SFTConfig

training_args = SFTConfig(..., activation_offloading=True)
```
Under the hood, TRL uses PyTorch’s saved_tensors_hooks to intercept activations during the forward pass. CUDA streams are used by default to overlap computation with CPU-GPU transfers.
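The mechanism can be illustrated with a minimal sketch using PyTorch's saved_tensors_hooks directly; this is not TRL's implementation (which also overlaps transfers with compute via CUDA streams), just the underlying hook API:

```python
import torch

def pack_to_cpu(tensor):
    # Forward pass: called for each tensor autograd would keep for backward;
    # remember the original device and move the tensor to CPU RAM.
    return tensor.device, tensor.to("cpu", non_blocking=True)

def unpack_from_cpu(packed):
    # Backward pass: move the saved tensor back to its original device.
    device, tensor = packed
    return tensor.to(device)

x = torch.randn(4, 4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x * x).sum()  # activations saved for backward go through the hooks
y.backward()

print(torch.allclose(x.grad, 2 * x))  # gradients are unchanged → True
```

Offloading changes where activations live between the forward and backward passes, not the gradients themselves.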

Gradient checkpointing

Gradient checkpointing trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them.
```python
from trl import SFTConfig

training_args = SFTConfig(..., gradient_checkpointing=True)
```
Gradient checkpointing is enabled by default in all TRL trainers. Disable it by setting gradient_checkpointing=False if needed.
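The trade-off can be seen with a back-of-envelope memory model (an illustration, not a measurement of any real model): checkpointing every k layers keeps n/k checkpoints plus at most k recomputed layers live at once, which is minimized around k ≈ √n:

```python
import math

n_layers = 64
a = 1.0  # activation memory per layer, arbitrary units

# Without checkpointing: every layer's activations are stored for backward.
full = n_layers * a

def checkpointed(k):
    # With checkpoints every k layers: n/k stored checkpoints, plus a
    # recompute window of at most k layers during the backward pass.
    return (n_layers / k + k) * a

k = round(math.sqrt(n_layers))  # memory-optimal segment size ~ sqrt(n)
print(full, checkpointed(k))  # → 64.0 16.0
```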
For more memory optimization techniques, see the Transformers performance guide.

Disabling model gathering for generation (DeepSpeed ZeRO-3)

When using DeepSpeed ZeRO-3, model weights are sharded across GPUs. Online methods gather weights onto a single GPU for generation, which can cause OOM errors on large models. Disable gathering with:
```python
from trl import GRPOConfig

training_args = GRPOConfig(..., ds3_gather_for_generation=False)
```
This avoids OOM errors but may result in slower generation speeds.

vLLM sleep mode

When using vLLM as the generation backend for online training, enable sleep mode to offload vLLM parameters and cache to CPU RAM during the optimization step and reload them when needed.
```python
from trl import GRPOConfig

training_args = GRPOConfig(..., vllm_enable_sleep_mode=True)
```
Offloading keeps GPU memory usage low, which is particularly useful when training large models or working with limited GPU resources. Waking the vLLM engine from sleep mode introduces some host–device transfer latency.

Padding sequences to a multiple

This technique is currently supported for SFT and Reward trainers.
Pad all sequences to a multiple of a given value to improve computational efficiency on hardware that benefits from aligned sequence lengths.
```python
from trl import SFTConfig

training_args = SFTConfig(..., pad_to_multiple_of=2048)
```
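The padded length is simply the sequence length rounded up to the nearest multiple. A small sketch (using an illustrative multiple of 64; the right value depends on your hardware):

```python
def pad_length(n, multiple):
    # Round n up to the nearest multiple (the resulting padded length).
    return ((n + multiple - 1) // multiple) * multiple

print([pad_length(n, 64) for n in (1, 64, 65, 500)])  # → [64, 64, 128, 512]
```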
