TRL supports PEFT (Parameter-Efficient Fine-Tuning) methods for memory-efficient model training. PEFT enables fine-tuning large language models by training only a small number of additional parameters while keeping the base model frozen, significantly reducing computational costs and memory requirements.

Installation

pip install trl[peft]
For QLoRA support (4-bit and 8-bit quantization), also install:
pip install bitsandbytes

Quick start

All TRL trainers support PEFT through the peft_config argument. The simplest way to enable PEFT is via the CLI with the --use_peft flag:
python trl/scripts/sft.py \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/Capybara \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --output_dir Qwen2-0.5B-SFT-LoRA
Alternatively, pass a PEFT config directly in Python:
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="Qwen2-0.5B-SFT-LoRA",
    learning_rate=2.0e-4,  # ~10x the full fine-tuning rate for LoRA
)

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # SFTTrainer accepts a model ID string
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()

Three ways to configure PEFT

You can pass a peft_config directly to the trainer (as in the Quick start above), drive everything from CLI flags, or wrap the model with get_peft_model yourself.

Via CLI flags: use the --use_peft flag with TRL scripts. Best for quick experiments and standard LoRA configurations.
python trl/scripts/sft.py \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/Capybara \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --output_dir Qwen2-0.5B-SFT-LoRA
Available CLI flags:
  • --use_peft — enable PEFT
  • --lora_r — LoRA rank (default: 16)
  • --lora_alpha — LoRA alpha (default: 32)
  • --lora_dropout — LoRA dropout (default: 0.05)
  • --lora_target_modules — target modules (space-separated)
  • --lora_modules_to_save — additional modules to train
  • --use_rslora — enable Rank-Stabilized LoRA
  • --use_dora — enable Weight-Decomposed LoRA (DoRA)
  • --load_in_4bit — enable 4-bit quantization (QLoRA)
  • --load_in_8bit — enable 8-bit quantization
Via get_peft_model: apply PEFT to the model before passing it to the trainer. Useful for custom architectures or complex setups.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

dataset = load_dataset("trl-lib/Capybara", split="train")
training_args = SFTConfig(output_dir="Qwen2-0.5B-SFT-LoRA")

# No peft_config argument needed: the model is already wrapped
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

Using ModelConfig and get_peft_config

For script-based workflows, TRL provides ModelConfig and the helper functions get_peft_config and get_quantization_config to build PEFT and quantization configs directly from CLI arguments.
from transformers import AutoModelForCausalLM
from trl import ModelConfig, get_peft_config, get_quantization_config, get_kbit_device_map

model_args = ModelConfig(
    model_name_or_path="Qwen/Qwen2-0.5B",
    use_peft=True,
    lora_r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    lora_target_modules=["q_proj", "v_proj"],
    load_in_4bit=True,
)

quantization_config = get_quantization_config(model_args)
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    quantization_config=quantization_config,
    device_map=get_kbit_device_map(),  # returns {"": local_process_index} on GPU
)

peft_config = get_peft_config(model_args)  # Returns a LoraConfig or None
get_peft_config reads the following ModelConfig fields:
| ModelConfig field | Default | Description |
|---|---|---|
| use_peft | False | Enable PEFT. Returns None when False. |
| lora_r | 16 | LoRA rank. |
| lora_alpha | 32 | LoRA scaling factor. |
| lora_dropout | 0.05 | Dropout probability for LoRA layers. |
| lora_target_modules | None | Modules to apply LoRA to. |
| lora_task_type | "CAUSAL_LM" | Task type ("SEQ_CLS" for reward modeling). |
| use_rslora | False | Use Rank-Stabilized LoRA. |
| use_dora | False | Use Weight-Decomposed LoRA (DoRA). |
| lora_modules_to_save | None | Additional modules to fully train. |

get_kbit_device_map

get_kbit_device_map() returns a device map appropriate for k-bit (4-bit or 8-bit) quantized models in multi-GPU environments. Returns {"": local_process_index} when a GPU is available, or None when running on CPU.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import get_kbit_device_map

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map=get_kbit_device_map(),
)
Always pass device_map=get_kbit_device_map() when using 4-bit or 8-bit quantization in multi-GPU setups, so that each process loads the quantized model onto its own GPU.
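The helper's contract can be sketched as follows. This is a simplified mock for illustration, not TRL's actual implementation: the real get_kbit_device_map takes no arguments and queries torch/Accelerate for GPU availability and the local process index itself.

```python
from typing import Optional


def sketch_kbit_device_map(
    cuda_available: bool, local_process_index: int
) -> Optional[dict]:
    """Simplified mock of get_kbit_device_map's behavior: map the whole
    model ("" is the root module) onto this process's GPU, or return
    None so transformers falls back to default CPU placement."""
    if cuda_available:
        return {"": local_process_index}
    return None


# Process 1 of a 2-GPU run places its copy of the model on GPU 1:
print(sketch_kbit_device_map(True, 1))   # {'': 1}
print(sketch_kbit_device_map(False, 0))  # None
```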

PEFT with different trainers

Every TRL trainer accepts the same PEFT options. For example, supervised fine-tuning via the CLI:
python trl/scripts/sft.py \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-4 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --output_dir Qwen2-0.5B-SFT-LoRA
The same run in Python:
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

training_args = SFTConfig(
    output_dir="Qwen2-0.5B-SFT-LoRA",
    learning_rate=2.0e-4,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()

Learning rate considerations

When using LoRA or other PEFT methods, use a higher learning rate (approximately 10x) compared to full fine-tuning. PEFT methods train only a small fraction of parameters, requiring a larger learning rate to achieve comparable updates.
| Trainer | Full fine-tuning | With LoRA (~10x) |
|---|---|---|
| SFT | 2.0e-5 | 2.0e-4 |
| DPO | 5.0e-7 | 5.0e-6 |
| GRPO | 1.0e-6 | 1.0e-5 |
| Prompt Tuning | N/A | 1.0e-2 to 3.0e-2 |

QLoRA: quantized low-rank adaptation

QLoRA combines 4-bit quantization with LoRA to enable fine-tuning of very large models on consumer hardware. This can reduce memory requirements by up to 4x compared to standard LoRA.
1. Install bitsandbytes

pip install bitsandbytes

2. Load the model in 4-bit

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
3. Attach a LoRA adapter and train

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="Llama-2-7b-QLoRA",
    learning_rate=2.0e-4,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
The equivalent via CLI:
python trl/scripts/sft.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset_name trl-lib/Capybara \
    --load_in_4bit \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --output_dir Llama-2-7b-QLoRA

BitsAndBytesConfig parameters

| Parameter | Description |
|---|---|
| bnb_4bit_quant_type | Quantization data type: "nf4" (recommended) or "fp4". |
| bnb_4bit_compute_dtype | Compute dtype for 4-bit layers. Use torch.bfloat16 for stability. |
| bnb_4bit_use_double_quant | Nested quantization; saves ~0.4 bits per parameter. |
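As a back-of-the-envelope illustration of these savings (weight memory only; real usage also includes activations, optimizer state, LoRA adapters, and quantization constants):

```python
# Approximate weight-memory footprint of a 7B-parameter model.
params = 7e9

fp16_gb = params * 16 / 8 / 1e9                   # 2 bytes per parameter
four_bit_gb = params * 4 / 8 / 1e9                # 0.5 bytes per parameter
double_quant_saving_gb = params * 0.4 / 8 / 1e9   # ~0.4 bits/param saved

print(f"fp16 weights:       {fp16_gb:.1f} GB")                  # 14.0 GB
print(f"4-bit weights:      {four_bit_gb:.1f} GB")              # 3.5 GB
print(f"double-quant saves: {double_quant_saving_gb:.2f} GB")   # 0.35 GB
```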

8-bit quantization

For higher precision at the cost of smaller memory savings, load the model in 8-bit instead:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

LoRA configuration reference

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                          # LoRA rank
    lora_alpha=32,                 # LoRA scaling factor (typically 2x rank)
    lora_dropout=0.05,             # Dropout probability
    bias="none",                   # Bias training strategy
    task_type="CAUSAL_LM",         # Task type
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA to
    modules_to_save=None,          # Additional modules to fully train
)
| Parameter | Description |
|---|---|
| r | LoRA rank. Typical values: 8, 16, 32, 64. Higher rank = more parameters. |
| lora_alpha | Scaling factor, typically 2x the rank. Controls the magnitude of LoRA updates. |
| lora_dropout | Dropout probability for LoRA layers. Typical range: 0.05–0.1. |
| target_modules | Which linear layers to apply LoRA to (see below). |
| modules_to_save | Additional modules to fully train, e.g., ["embed_tokens", "lm_head"]. |
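The roles of r and lora_alpha are easiest to see in the LoRA update itself: the frozen weight W is augmented by a low-rank product scaled by alpha/r, i.e. W' = W + (alpha/r) * B @ A. A pure-Python sketch with toy 2x2 matrices (in real LoRA, B is initialized to zero so training starts from the base model; it is nonzero here only to show the arithmetic):

```python
def matmul(X, Y):
    """Naive matrix multiply for small nested lists."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

r, alpha = 1, 2                 # rank-1 adapter, scaling alpha/r = 2.0
W = [[1.0, 0.0], [0.0, 1.0]]    # frozen base weight (2x2)
B = [[1.0], [0.0]]              # d x r
A = [[0.5, 0.5]]                # r x d

delta = matmul(B, A)            # low-rank update, rank <= r
scale = alpha / r
W_adapted = [
    [W[i][j] + scale * delta[i][j] for j in range(2)] for i in range(2)
]
print(W_adapted)  # [[2.0, 1.0], [0.0, 1.0]]
```

Raising r increases the expressiveness (and parameter count) of delta; raising lora_alpha amplifies its effect on W without adding parameters.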

Target module selection

# Minimal — most memory efficient
target_modules=["q_proj", "v_proj"]

# Attention projections only
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]

# All linear layers — best performance, higher memory
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
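To make the trade-off concrete, here is a rough trainable-parameter count for each preset at rank 32. Each adapted d_in x d_out linear layer adds r * (d_in + d_out) LoRA parameters; the hidden sizes and layer count below are illustrative assumptions, not the actual shapes of any particular model (and models with grouped-query attention have smaller k/v projections than shown here).

```python
r = 32
hidden, intermediate, layers = 1024, 2816, 24   # assumed, illustrative sizes

def lora_params(shapes):
    """LoRA adds r * (d_in + d_out) params per adapted linear layer."""
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

attn = (hidden, hidden)             # q/k/v/o projections (assumed square)
mlp_in = (hidden, intermediate)     # gate/up projections
mlp_out = (intermediate, hidden)    # down projection

minimal = lora_params([attn] * 2)                         # q_proj, v_proj
attention = lora_params([attn] * 4)                       # + k_proj, o_proj
all_linear = lora_params([attn] * 4 + [mlp_in] * 2 + [mlp_out])

print(f"{minimal:,} vs {attention:,} vs {all_linear:,} trainable params")
# 3,145,728 vs 6,291,456 vs 15,138,816 trainable params
```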

Prompt tuning

Prompt tuning learns soft prompts (continuous embeddings) prepended to the input while keeping the entire model frozen. It is particularly parameter-efficient for large models.
from datasets import load_dataset
from peft import PromptTuningConfig, PromptTuningInit, TaskType
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    tokenizer_name_or_path="Qwen/Qwen2-0.5B",
)

training_args = SFTConfig(
    output_dir="Qwen2-0.5B-prompt-tuning",
    learning_rate=2.0e-2,  # Prompt tuning uses a higher LR (1e-2 to 3e-2)
)

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
| Feature | Prompt tuning | LoRA |
|---|---|---|
| Parameters trained | ~0.001% | ~0.1–1% |
| Memory usage | Minimal | Low |
| Training speed | Fastest | Fast |
| Model modification | None | Adapter layers |
| Best for | Large models, many tasks | General fine-tuning |
| Learning rate | Higher (1e-2 to 3e-2) | Standard (1e-4 to 3e-4) |
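The "parameters trained" row is easy to verify: prompt tuning trains only num_virtual_tokens * hidden_size embedding values. A quick sketch with an assumed hidden size of 1024 (the actual embedding dimension varies by model):

```python
num_virtual_tokens = 8
hidden_size = 1024     # assumed embedding dimension, varies by model
total_params = 0.5e9   # ~0.5B base model

trained = num_virtual_tokens * hidden_size
print(trained)                                  # 8192 trainable values
print(f"{100 * trained / total_params:.5f}%")   # a tiny fraction of the model
```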

Saving and loading PEFT models

# Save adapter weights only (~few MB rather than several GB)
trainer.save_model("path/to/adapters")

# Load for inference
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
model = PeftModel.from_pretrained(base_model, "path/to/adapters")

# Optionally merge adapters into the base model for faster inference
model = model.merge_and_unload()

Pushing to Hub

# Share adapters on the Hugging Face Hub
model.push_to_hub("username/model-name-lora")

# Load from Hub
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
model = PeftModel.from_pretrained(base_model, "username/model-name-lora")

Multi-GPU training

PEFT works with TRL’s multi-GPU support through Accelerate:
accelerate config
accelerate launch trl/scripts/sft.py \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/Capybara \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16
For QLoRA across multiple GPUs, the quantized base model is automatically sharded:
accelerate launch trl/scripts/sft.py \
    --model_name_or_path meta-llama/Llama-2-70b-hf \
    --load_in_4bit \
    --use_peft \
    --lora_r 32

Resources

  • SFT with LoRA/QLoRA notebook: complete working example with both LoRA and QLoRA.
  • PEFT documentation: official PEFT library documentation.
  • LoRA paper: original LoRA methodology and results.
  • QLoRA paper: efficient fine-tuning of quantized language models.
