Truncation
Sequences in a dataset can vary widely in length. When batched, all sequences are padded to the longest one in the batch, which can cause high memory usage even when most sequences are short. To reduce this, truncate sequences to a reasonable length using the max_length parameter. It is supported in the following trainers:
- DPO
- SFT
DPO truncation is controlled via max_length, which truncates the combined prompt+completion sequence.

Choosing the right max_length
- Too small: important tokens at the end of sequences are discarded, reducing training quality.
- Too large: memory spikes and potential out-of-memory (OOM) errors. Without packing or padding-free batching, many tokens will also be padding.
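As a minimal sketch of how truncation is typically configured in TRL (the length value is illustrative, not a recommendation):

```python
from trl import DPOConfig, SFTConfig

# SFT: truncate each tokenized sequence to at most 1024 tokens.
sft_args = SFTConfig(max_length=1024)

# DPO: max_length bounds the combined prompt+completion sequence.
dpo_args = DPOConfig(max_length=1024)
```

Pick a value based on your data's length distribution, for example a high percentile of tokenized lengths, rather than the absolute maximum.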
Packing
Packing mitigates truncation drawbacks by grouping multiple sequences into the same training row, filling each row up to max_length. TRL uses Best-Fit Decreasing (BFD) bin packing to group sequences efficiently.
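Packing is enabled through the trainer configuration; a minimal sketch using SFTConfig (the max_length value is illustrative):

```python
from trl import SFTConfig

# Pack multiple short examples into each training row of up to 1024 tokens.
training_args = SFTConfig(packing=True, max_length=1024)
```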
Three strategies are supported:
bfd (default)
Uses Best-Fit Decreasing packing. If a sequence exceeds max_length, the overflow tokens are discarded.
bfd_split
Uses Best-Fit Decreasing packing, but long sequences are split into chunks of at most max_length before packing. This preserves all tokens.
wrapped
All tokens are concatenated into a stream and split into fixed-length blocks. Minimizes padding but may mix unrelated examples, which can hurt performance.
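To build intuition for the bfd strategy, here is a hypothetical pure-Python sketch of Best-Fit Decreasing bin packing over sequence lengths. It is not TRL's implementation; the function name and structure are invented for illustration:

```python
def bfd_pack(seq_lengths, max_length):
    """Best-Fit Decreasing: visit sequences longest-first, placing each
    into the fullest bin that still has room, else opening a new bin.
    Sequences longer than max_length are truncated, mirroring 'bfd'."""
    bins = []  # each bin is [remaining_capacity, [lengths...]]
    for length in sorted(seq_lengths, reverse=True):
        length = min(length, max_length)  # 'bfd' discards overflow tokens
        # best fit: among bins with enough room, pick the least remaining space
        best = min(
            (b for b in bins if b[0] >= length),
            key=lambda b: b[0],
            default=None,
        )
        if best is None:
            best = [max_length, []]
            bins.append(best)
        best[0] -= length
        best[1].append(length)
    return [contents for _, contents in bins]

rows = bfd_pack([7, 3, 5, 2, 9], max_length=10)
# Five sequences fit into three rows, each totaling at most 10 tokens.
```

Sorting longest-first is what makes the greedy placement effective: large sequences claim rows early, and small sequences fill the leftover gaps.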
If all sequences are shorter than max_length, bfd and bfd_split behave identically since no truncation or splitting is required.

PEFT for parameter-efficient fine-tuning
Methods like LoRA train only a small number of adapter parameters instead of all model weights, significantly reducing memory requirements and enabling fine-tuning of larger models on limited hardware.
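A minimal sketch using the peft library; the rank and alpha values are illustrative, not prescriptive:

```python
from peft import LoraConfig

# LoRA adapter configuration: train low-rank adapters on all linear layers.
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

# Pass it to a supporting trainer, e.g.:
# trainer = SFTTrainer(model=..., train_dataset=..., peft_config=peft_config)
```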
Liger Kernel
Liger Kernel is a collection of Triton kernels for LLM training that can increase multi-GPU throughput by 20% and reduce memory usage by 60%. It is supported in the following trainers:
- SFT
- DPO
- GRPO
- KTO
- GKD
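Enabling it is a single flag on the trainer configuration, sketched here with SFTConfig; the liger-kernel package must be installed separately:

```python
from trl import SFTConfig

# Swap in Liger's fused Triton kernels (requires the liger-kernel package).
training_args = SFTConfig(use_liger_kernel=True)
```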
Padding-free batching
Padding-free batching flattens a batch into a single sequence, avoiding padding entirely. Unlike packing, it keeps all sequences complete and intact. It is supported in the following trainers:
- DPO
- SFT
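A minimal sketch with SFTConfig; note that padding-free batching generally requires an attention implementation that supports variable-length sequences, and the FlashAttention-2 setting below is an assumption for illustration:

```python
from trl import SFTConfig

# Flatten each batch into one sequence; no padding tokens are added.
training_args = SFTConfig(
    padding_free=True,
    model_init_kwargs={"attn_implementation": "flash_attention_2"},
)
```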
Activation offloading
Activation offloading reduces GPU VRAM usage by moving activation tensors to CPU RAM during the forward pass and bringing them back during the backward pass. It reduces peak memory at the cost of slightly increased training time. Under the hood it uses PyTorch's saved_tensors_hooks to intercept activations during the forward pass; CUDA streams are used by default to overlap computation with CPU-GPU transfers.
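A minimal sketch of enabling it via the trainer configuration, assuming an activation_offloading flag as in recent TRL versions:

```python
from trl import SFTConfig

# Move activations to CPU RAM on the forward pass,
# fetch them back for the backward pass.
training_args = SFTConfig(activation_offloading=True)
```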
Gradient checkpointing
Gradient checkpointing trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them. It is enabled by default in all TRL trainers; disable it by setting gradient_checkpointing=False if needed.

Disabling model gathering for generation (DeepSpeed ZeRO-3)
When using DeepSpeed ZeRO-3, model weights are sharded across GPUs. Online methods gather weights onto a single GPU for generation, which can cause OOM errors on large models. Gathering can be disabled in the following trainers:
- GRPO
- RLOO
- Online DPO
- PPO
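A minimal sketch using GRPOConfig, assuming the ds3_gather_for_generation flag available in these trainers:

```python
from trl import GRPOConfig

# Keep weights sharded during generation instead of gathering them
# onto one GPU: lower peak memory, at the cost of slower generation.
training_args = GRPOConfig(ds3_gather_for_generation=False)
```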
vLLM sleep mode
When using vLLM as the generation backend for online training, enable sleep mode to offload vLLM parameters and cache to CPU RAM during the optimization step and reload them when needed. It is supported in the following trainers:
- GRPO
- RLOO
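A sketch with GRPOConfig; the sleep-mode flag name below is an assumption, so check the configuration reference for your TRL version:

```python
from trl import GRPOConfig

# Assumed flag name; verify against your TRL version's GRPOConfig.
training_args = GRPOConfig(use_vllm=True, vllm_enable_sleep_mode=True)
```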
Padding sequences to a multiple
Pad all sequences to a multiple of a given value to improve computational efficiency on hardware that benefits from aligned sequence lengths. It is supported in the following trainers:
- SFT
- Reward
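A minimal sketch with SFTConfig, assuming a pad_to_multiple_of parameter; the value 8 is illustrative (multiples of 8 or 16 align with Tensor Core tile sizes):

```python
from trl import SFTConfig

# Pad every batch to a multiple of 8 tokens for better kernel efficiency.
training_args = SFTConfig(pad_to_multiple_of=8)
```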