RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy. It works out of the box with FlashAttention, PyTorch FSDP, and Microsoft DeepSpeed.
With the memory reduction from Liger Kernel, you may be able to disable CPU offloading or gradient checkpointing, further boosting throughput.
Installation
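The kernels are distributed as the `liger-kernel` package on PyPI; a typical install (assuming a recent PyTorch with Triton support is already present) looks like:

```shell
pip install liger-kernel
```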
Supported trainers
Liger Kernel is supported in the following TRL trainers:

- SFT (Supervised Fine-Tuning)
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization)
- KTO (Kahneman-Tversky Optimization)
- GKD (Generalized Knowledge Distillation)
Usage
Set `use_liger_kernel=True` in your trainer config. No other changes are needed.
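As a minimal sketch for the SFT case (the `output_dir` value is illustrative; other trainers take the same flag in their respective config classes):

```python
from trl import SFTConfig

# Enabling Liger Kernel is a single flag on the trainer config.
training_args = SFTConfig(
    output_dir="my-model",      # illustrative path
    use_liger_kernel=True,      # swap in Liger's fused Triton kernels
)
```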
Performance benefits
| Metric | Improvement |
|---|---|
| Training throughput | +20% on multi-GPU setups |
| GPU memory usage | −60% |
| Achievable context length | Up to 4x longer |
FusedLinearCrossEntropy fuses the final linear projection with the cross-entropy loss, which removes the need to store the full vocabulary-sized logit tensor.
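To see why this matters, a rough back-of-the-envelope calculation (the shapes below are hypothetical and depend on your model and batch) shows how large the materialized logit tensor would otherwise be:

```python
# Hypothetical training shapes; actual values depend on your setup.
batch_size = 8
seq_len = 4096
vocab_size = 128_256
bytes_per_elem = 2  # bf16

# Size of the full [batch, seq, vocab] logit tensor if materialized.
logits_bytes = batch_size * seq_len * vocab_size * bytes_per_elem
print(f"{logits_bytes / 1e9:.1f} GB")  # about 8.4 GB for these shapes
```

Fusing the projection with the loss computes the cross-entropy in chunks, so this tensor never has to live in GPU memory all at once.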
Additional resources
Liger Kernel repository
Source code, benchmarks, and detailed documentation.