TRL supports PEFT (Parameter-Efficient Fine-Tuning) methods for memory-efficient model training. PEFT enables fine-tuning large language models by training only a small number of additional parameters while keeping the base model frozen, significantly reducing computational costs and memory requirements.
Installation
PEFT support requires the peft library:
pip install peft
For QLoRA support (4-bit and 8-bit quantization), also install bitsandbytes:
pip install bitsandbytes
Quick start
All TRL trainers support PEFT through the peft_config argument. The simplest way to enable PEFT is via the CLI with the --use_peft flag:
python trl/scripts/sft.py \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT-LoRA
Alternatively, pass a PEFT config directly in Python:
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
training_args = SFTConfig(
    learning_rate=2.0e-4,  # ~10x the full fine-tuning rate for LoRA
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
1. Use the --use_peft flag (CLI)
Use the --use_peft flag with TRL scripts. Best for quick experiments and standard LoRA configurations.
python trl/scripts/sft.py \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--output_dir Qwen2-0.5B-SFT-LoRA
Available CLI flags:
--use_peft — enable PEFT
--lora_r — LoRA rank (default: 16)
--lora_alpha — LoRA alpha (default: 32)
--lora_dropout — LoRA dropout (default: 0.05)
--lora_target_modules — target modules (space-separated)
--lora_modules_to_save — additional modules to train
--use_rslora — enable Rank-Stabilized LoRA
--use_dora — enable Weight-Decomposed LoRA (DoRA)
--load_in_4bit — enable 4-bit quantization (QLoRA)
--load_in_8bit — enable 8-bit quantization
2. Pass peft_config to trainer (recommended)
Pass a PEFT configuration directly to the trainer for full control over PEFT methods, including LoRA, Prompt Tuning, and others.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
3. Apply PEFT to the model directly (advanced)
Apply PEFT to the model before passing it to the trainer. Useful for custom architectures or complex setups.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# No peft_config argument needed: the model is already wrapped
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
Using ModelConfig and get_peft_config
For script-based workflows, TRL provides ModelConfig and the helper functions get_peft_config and get_quantization_config to build PEFT and quantization configs directly from CLI arguments.
from transformers import AutoModelForCausalLM
from trl import ModelConfig, get_peft_config, get_quantization_config, get_kbit_device_map
model_args = ModelConfig(
    model_name_or_path="Qwen/Qwen2-0.5B",
    use_peft=True,
    lora_r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    lora_target_modules=["q_proj", "v_proj"],
    load_in_4bit=True,
)
quantization_config = get_quantization_config(model_args)
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    quantization_config=quantization_config,
    device_map=get_kbit_device_map(),  # returns {"": local_process_index} on GPU
)
peft_config = get_peft_config(model_args)  # Returns a LoraConfig or None
get_peft_config reads the following ModelConfig fields:
| ModelConfig field | Default | Description |
|---|---|---|
| use_peft | False | Enable PEFT. Returns None when False. |
| lora_r | 16 | LoRA rank. |
| lora_alpha | 32 | LoRA scaling factor. |
| lora_dropout | 0.05 | Dropout probability for LoRA layers. |
| lora_target_modules | None | Modules to apply LoRA to. |
| lora_task_type | "CAUSAL_LM" | Task type ("SEQ_CLS" for reward modeling). |
| use_rslora | False | Use Rank-Stabilized LoRA. |
| use_dora | False | Use Weight-Decomposed LoRA (DoRA). |
| lora_modules_to_save | None | Additional modules to fully train. |
get_kbit_device_map
get_kbit_device_map() returns a device map appropriate for k-bit (4-bit or 8-bit) quantized models in multi-GPU environments. Returns {"": local_process_index} when a GPU is available, or None when running on CPU.
from transformers import AutoModelForCausalLM
from trl import get_kbit_device_map
device_map = get_kbit_device_map()
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",
    load_in_4bit=True,
    device_map=device_map,
)
Always pass device_map=get_kbit_device_map() when using 4-bit or 8-bit quantization in multi-GPU setups. This ensures each process loads the quantized model onto its own GPU.
PEFT with different trainers
SFT
python trl/scripts/sft.py \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT-LoRA
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)
training_args = SFTConfig(learning_rate=2.0e-4)
trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
DPO
When using PEFT with DPO, you do not need to provide a separate ref_model. The trainer automatically uses the frozen base model as the reference.
python trl/scripts/dpo.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--learning_rate 5.0e-6 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-DPO-LoRA
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
training_args = DPOConfig(learning_rate=5.0e-6)
trainer = DPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
GRPO
python trl/scripts/grpo.py \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/math-reasoning \
--learning_rate 1.0e-5 \
--per_device_train_batch_size 2 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-GRPO-LoRA
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
training_args = GRPOConfig(learning_rate=1.0e-5)

# GRPOTrainer also requires a reward signal (reward_funcs); a toy example:
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
Learning rate considerations
When using LoRA or other PEFT methods, use a higher learning rate (approximately 10x) compared to full fine-tuning. PEFT methods train only a small fraction of parameters, requiring a larger learning rate to achieve comparable updates.
| Trainer | Full fine-tuning | With LoRA (~10x) |
|---|---|---|
| SFT | 2.0e-5 | 2.0e-4 |
| DPO | 5.0e-7 | 5.0e-6 |
| GRPO | 1.0e-6 | 1.0e-5 |
| Prompt Tuning | N/A | 1.0e-2 to 3.0e-2 |
QLoRA: quantized low-rank adaptation
QLoRA combines 4-bit quantization with LoRA to enable fine-tuning of very large models on consumer hardware. This can reduce memory requirements by up to 4x compared to standard LoRA.
Load the model in 4-bit
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
Attach a LoRA adapter and train
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
training_args = SFTConfig(learning_rate=2.0e-4)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
The equivalent via CLI:
python trl/scripts/sft.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset_name trl-lib/Capybara \
--load_in_4bit \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--output_dir Llama-2-7b-QLoRA
BitsAndBytesConfig parameters
| Parameter | Description |
|---|---|
| bnb_4bit_quant_type | Quantization data type: "nf4" (recommended) or "fp4". |
| bnb_4bit_compute_dtype | Compute dtype for 4-bit layers. Use torch.bfloat16 for stability. |
| bnb_4bit_use_double_quant | Nested quantization; saves ~0.4 bits per parameter. |
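For intuition, 4-bit quantization stores each weight as one of 16 discrete levels plus a scale factor. The toy absmax quantizer below is illustrative only: the real NF4 scheme used by bitsandbytes places its 16 levels at normal-distribution quantiles and quantizes per block rather than per tensor.

```python
import torch

def quantize_absmax_4bit(x):
    # Map values onto the signed integer levels -7..7, scaled by the
    # tensor's absolute maximum (absmax scaling).
    scale = x.abs().max() / 7
    q = torch.clamp(torch.round(x / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

x = torch.randn(16)
q, scale = quantize_absmax_4bit(x)
# Round-trip error is bounded by half a quantization step
max_err = (x - dequantize(q, scale)).abs().max().item()
print(q.dtype, max_err <= scale.item() / 2 + 1e-6)
```

The coarse grid is why a higher-precision compute dtype (bnb_4bit_compute_dtype) matters: weights are dequantized to bfloat16 before each matmul.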
8-bit quantization
For slightly higher precision at reduced memory savings:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
LoRA configuration reference
from peft import LoraConfig
peft_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # LoRA scaling factor (typically 2x rank)
    lora_dropout=0.05,                    # Dropout probability
    bias="none",                          # Bias training strategy
    task_type="CAUSAL_LM",                # Task type
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA to
    modules_to_save=None,                 # Additional modules to fully train
)
| Parameter | Description |
|---|---|
| r | LoRA rank. Typical values: 8, 16, 32, 64. Higher rank = more parameters. |
| lora_alpha | Scaling factor, typically 2x the rank. Controls the magnitude of LoRA updates. |
| lora_dropout | Dropout probability for LoRA layers. Typical range: 0.05–0.1. |
| target_modules | Which linear layers to apply LoRA to (see below). |
| modules_to_save | Additional modules to fully train, e.g., ["embed_tokens", "lm_head"]. |
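As a sanity check on these sizes: LoRA replaces updates to a d_out x d_in weight with two rank-r factors, so each adapted layer adds r * (d_in + d_out) trainable parameters. A quick pure-Python estimate (4096 is a hypothetical hidden size, not taken from any model above):

```python
def lora_param_count(d_in, d_out, r):
    # LoRA factors: A is (r x d_in), B is (d_out x r)
    return r * (d_in + d_out)

full = 4096 * 4096                       # one full-rank 4096x4096 projection
lora = lora_param_count(4096, 4096, 16)  # the same layer with a rank-16 adapter
print(lora, f"{lora / full:.2%}")        # 131072 0.78%
```

Doubling r doubles the adapter's parameter count linearly, which is why ranks of 8-64 stay cheap even across every linear layer of a large model.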
Target module selection
# Minimal: most memory efficient
target_modules = ["q_proj", "v_proj"]

# Attention projections only
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# All linear layers: best performance, higher memory
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
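Valid target_modules strings are the leaf names of the model's nn.Linear layers, which you can discover by enumerating named_modules(). A minimal sketch on a toy attention block (a stand-in for a real checkpoint, which you would load with AutoModelForCausalLM instead):

```python
import torch.nn as nn

class ToyAttention(nn.Module):
    # Toy module using the projection names common to Llama/Qwen-style models
    def __init__(self, d=8):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.o_proj = nn.Linear(d, d)

def linear_module_names(model):
    # Leaf names of every nn.Linear: the strings target_modules is matched against
    return sorted({name.split(".")[-1] for name, mod in model.named_modules()
                   if isinstance(mod, nn.Linear)})

print(linear_module_names(ToyAttention()))  # ['k_proj', 'o_proj', 'q_proj', 'v_proj']
```

Running the same helper on a loaded checkpoint shows exactly which names your architecture exposes, since naming conventions differ between model families.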
Prompt tuning
Prompt tuning learns soft prompts (continuous embeddings) prepended to the input while keeping the entire model frozen. It is particularly parameter-efficient for large models.
from peft import PromptTuningConfig, PromptTuningInit, TaskType
from trl import SFTConfig, SFTTrainer
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    tokenizer_name_or_path="Qwen/Qwen2-0.5B",
)
training_args = SFTConfig(
    learning_rate=2.0e-2,  # Prompt tuning uses a higher LR (1e-2 to 3e-2)
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
| Feature | Prompt tuning | LoRA |
|---|---|---|
| Parameters trained | ~0.001% | ~0.1–1% |
| Memory usage | Minimal | Low |
| Training speed | Fastest | Fast |
| Model modification | None | Adapter layers |
| Best for | Large models, many tasks | General fine-tuning |
| Learning rate | Higher (1e-2 to 3e-2) | Standard (1e-4 to 3e-4) |
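Mechanically, prompt tuning just concatenates a small trainable embedding matrix in front of the frozen input embeddings; everything downstream of the embedding layer is unchanged. A torch-only sketch (dimensions are illustrative, not taken from any model above):

```python
import torch
import torch.nn as nn

num_virtual_tokens, d_model = 8, 16
# The only trainable parameters: one embedding vector per virtual token
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model))

input_embeds = torch.randn(1, 5, d_model)  # frozen embeddings of 5 real tokens
prompted = torch.cat(
    [soft_prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1), input_embeds],
    dim=1,
)
print(prompted.shape)  # torch.Size([1, 13, 16])
```

This is why num_virtual_tokens fully determines the parameter budget: num_virtual_tokens * d_model weights, regardless of model size.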
Saving and loading PEFT models
# Save adapter weights only (a few MB rather than several GB)
trainer.save_model("path/to/adapters")

# Load for inference
from transformers import AutoModelForCausalLM
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
model = PeftModel.from_pretrained(base_model, "path/to/adapters")

# Optionally merge adapters into the base model for faster inference
model = model.merge_and_unload()
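Numerically, merging folds the low-rank update delta_W = (lora_alpha / r) * B @ A into the base weight, so the merged model computes identical outputs with no extra matmuls at inference. A torch-only sketch with illustrative shapes:

```python
import torch

d, r, alpha = 6, 2, 4
W = torch.randn(d, d)   # frozen base weight
A = torch.randn(r, d)   # LoRA down-projection
B = torch.randn(d, r)   # LoRA up-projection
x = torch.randn(d)

adapter_out = W @ x + (alpha / r) * (B @ (A @ x))  # adapter path during training
merged_W = W + (alpha / r) * (B @ A)               # weight after merging
print(torch.allclose(adapter_out, merged_W @ x, atol=1e-5))  # True
```

The trade-off: a merged model is faster to serve but can no longer be "unmerged" into separate, swappable adapters.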
Pushing to Hub
# Share adapters on the Hugging Face Hub
model.push_to_hub("username/model-name-lora")

# Load from Hub
from peft import PeftModel
model = PeftModel.from_pretrained(base_model, "username/model-name-lora")
Multi-GPU training
PEFT works with TRL’s multi-GPU support through Accelerate:
accelerate config
accelerate launch trl/scripts/sft.py \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--use_peft \
--lora_r 32 \
--lora_alpha 16
For QLoRA across multiple GPUs, the quantized base model is automatically sharded:
accelerate launch trl/scripts/sft.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--load_in_4bit \
--use_peft \
--lora_r 32
Resources
SFT with LoRA/QLoRA notebook: a complete working example with both LoRA and QLoRA.
PEFT documentation: the official PEFT library documentation.
LoRA paper: the original LoRA methodology and results.
QLoRA paper: efficient fine-tuning of quantized language models.