LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains small adapter layers while keeping the pretrained model frozen. This dramatically reduces memory requirements and training time compared to full-parameter fine-tuning.

Overview

LoRA achieves efficient fine-tuning by:
  • Training only 0.1-1% of model parameters
  • Keeping original model weights frozen
  • Adding trainable low-rank decomposition matrices to attention layers
  • Enabling single-GPU training for 7B models
  • Allowing multiple adapters for different tasks
LoRA provides 90-95% of full fine-tuning performance with only 20-40% of the memory requirements.
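The core idea can be sketched in a few lines of NumPy (the dimensions below are illustrative and much smaller than Qwen's actual layer sizes):

```python
import numpy as np

d, r, alpha = 512, 8, 16            # hidden size, LoRA rank, scaling (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

x = rng.standard_normal(d)

# Forward pass: frozen path plus the scaled low-rank update
h = W @ x + (alpha / r) * (B @ (A @ x))

# Only A and B are trained: 2*d*r parameters instead of d*d per matrix
print(f"trainable fraction per matrix: {2 * d * r / (d * d):.2%}")
```

Because B starts at zero, the adapter initially contributes nothing, so training begins from the pretrained model's exact behavior.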

When to Use LoRA

Choose LoRA when:
  • You have single GPU with 16-40GB memory
  • You need fast iteration on different tasks
  • You want to maintain multiple task-specific adapters
  • You need quick deployment without merging weights
  • Your task requires moderate adaptation from pretrained behavior

Hardware Requirements

Memory Requirements

Qwen-7B LoRA Fine-tuning (Single A100-80GB):
| Sequence Length | LoRA Memory | LoRA (emb) Memory | Speed |
|---|---|---|---|
| 256 | 20.1GB | 33.7GB | 1.2s/iter |
| 512 | 20.4GB | 34.1GB | 1.5s/iter |
| 1024 | 21.5GB | 35.2GB | 2.8s/iter |
| 2048 | 23.8GB | 35.1GB | 5.2s/iter |
| 4096 | 29.7GB | 39.2GB | 10.1s/iter |
| 8192 | 36.6GB | 48.5GB | 21.3s/iter |
LoRA (emb) refers to training with embedding and output layers as trainable parameters, required when fine-tuning base models with new tokens.

GPU Recommendations by Model Size

| Model | Minimum GPU | Recommended GPU | Memory (LoRA) |
|---|---|---|---|
| Qwen-1.8B | RTX 3090 (24GB) | RTX 4090 (24GB) | 6.7GB |
| Qwen-7B | RTX A6000 (48GB) | A100 (40GB/80GB) | 20.1GB |
| Qwen-14B | A100 (40GB) | A100 (80GB) | ~35GB |
| Qwen-72B | A100 (80GB) × 4 | A100 (80GB) × 4 | Requires ZeRO-3 |

Installation

# Install base requirements
pip install -r requirements.txt

# Install PEFT for LoRA support
pip install "peft<0.8.0"

# Install DeepSpeed (for distributed training)
pip install deepspeed

# Optional: Flash Attention 2 for speed
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
Use peft<0.8.0 to avoid tokenizer loading issues; versions 0.8.0 and later have a known bug with the Qwen tokenizer.
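A small startup guard can catch the incompatible PEFT version before a long training run; the helper below is an illustrative sketch, not part of the Qwen codebase:

```python
def peft_version_ok(version: str) -> bool:
    """Return True if the PEFT version predates the 0.8.0 Qwen tokenizer bug."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (0, 8)

# Example: check the installed version at startup
# from importlib.metadata import version
# assert peft_version_ok(version("peft")), "install with: pip install 'peft<0.8.0'"
print(peft_version_ok("0.7.1"), peft_version_ok("0.8.0"))
```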

LoRA Configuration

LoRA adds trainable rank decomposition matrices to specific model layers:
# From finetune.py lines 335-343
lora_config = LoraConfig(
    r=64,                                           # Rank of decomposition
    lora_alpha=16,                                  # Scaling factor
    target_modules=["c_attn", "c_proj", "w1", "w2"], # Target layers
    lora_dropout=0.05,                              # Dropout rate
    bias="none",                                    # Bias handling
    task_type="CAUSAL_LM",
    modules_to_save=None                            # Additional trainable modules
)

Parameter Explanation

r
int
default:64
Rank of the low-rank decomposition matrices. Higher rank = more capacity but more memory.
  • r=8: Very efficient, good for simple tasks
  • r=16-32: Balanced, suitable for most tasks
  • r=64: Higher capacity, recommended default
  • r=128: Maximum capacity, for complex tasks
lora_alpha
int
default:16
Scaling factor for LoRA updates; effectively scales the adapter's contribution. Scaling = lora_alpha / r
  • Common pattern: lora_alpha = r/4 or r/2
  • Does not change the number of trainable parameters
  • Adjust if the model underfits or overfits
target_modules
list
required
Model layers where LoRA adapters are applied. For Qwen: ["c_attn", "c_proj", "w1", "w2"]
  • c_attn: Attention query, key, value projections
  • c_proj: Attention output projection
  • w1, w2: Feed-forward network layers
lora_dropout
float
Dropout probability for LoRA layers (regularization).
modules_to_save
list
Additional modules to train beyond LoRA adapters. For base models: ["wte", "lm_head"] (embedding and output layers). For chat models: None (not needed).
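Each adapted weight of shape (d_in, d_out) contributes r × (d_in + d_out) LoRA parameters, which makes the total easy to estimate. The layer shapes below are illustrative approximations of Qwen-7B, not values read from its config:

```python
def lora_param_count(r: int, shapes: dict, num_layers: int) -> int:
    """Total trainable LoRA params: r * (d_in + d_out) per adapted matrix."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Illustrative shapes: hidden=4096, fused QKV projection, FFN intermediate=11008
shapes = {
    "c_attn": (4096, 3 * 4096),
    "c_proj": (4096, 4096),
    "w1":     (4096, 11008),
    "w2":     (4096, 11008),
}
total = lora_param_count(r=64, shapes=shapes, num_layers=32)
print(f"~{total / 1e6:.0f}M trainable parameters")
```

Actual counts depend on the model's real layer dimensions and which modules PEFT wraps, so treat this as a rough sizing tool.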

Single-GPU Training

Basic Training Script

finetune/finetune_lora_single_gpu.sh
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

MODEL="Qwen/Qwen-7B-Chat"
DATA="path_to_data.json"

export CUDA_VISIBLE_DEVICES=0

python finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --bf16 True \
  --output_dir output_qwen \
  --num_train_epochs 5 \
  --per_device_train_batch_size 2 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1000 \
  --save_total_limit 10 \
  --learning_rate 3e-4 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --report_to "none" \
  --model_max_length 512 \
  --lazy_preprocess True \
  --gradient_checkpointing \
  --use_lora

Running Single-GPU Training

Step 1: Prepare Your Data

Create training data in JSON format:
[
  {
    "id": "example_1",
    "conversations": [
      {"from": "user", "value": "Hello, how are you?"},
      {"from": "assistant", "value": "I'm doing great! How can I help you today?"}
    ]
  }
]
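Before launching, it can help to validate the file against the expected schema. This checker is a sketch based on the fields shown above (it only accepts the roles used in this example):

```python
import json

def validate_sample(sample: dict) -> None:
    """Assert one training sample matches the id/conversations schema."""
    assert isinstance(sample.get("id"), str), "each sample needs a string id"
    turns = sample.get("conversations")
    assert isinstance(turns, list) and turns, "conversations must be a non-empty list"
    for turn in turns:
        assert turn.get("from") in ("user", "assistant"), f"unexpected role: {turn.get('from')}"
        assert isinstance(turn.get("value"), str) and turn["value"].strip()

data = json.loads("""[
  {"id": "example_1",
   "conversations": [
     {"from": "user", "value": "Hello, how are you?"},
     {"from": "assistant", "value": "I'm doing great! How can I help you today?"}]}
]""")
for sample in data:
    validate_sample(sample)
print(f"{len(data)} samples OK")
```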
Step 2: Launch Training

bash finetune/finetune_lora_single_gpu.sh \
  -m Qwen/Qwen-7B-Chat \
  -d train_data.json
Step 3: Monitor Training

Watch training progress:
***** Running training *****
  Num examples = 1000
  Num Epochs = 5
  Total train batch size = 16

{'loss': 1.234, 'learning_rate': 0.0003, 'epoch': 0.1}
{'loss': 0.876, 'learning_rate': 0.00029, 'epoch': 0.2}
The LoRA adapter is saved to output_qwen/.

Multi-GPU Training

For faster training or larger models, use distributed LoRA training:
finetune/finetune_lora_ds.sh
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="Qwen/Qwen-7B-Chat"
DATA="path_to_data.json"
DS_CONFIG_PATH="finetune/ds_config_zero2.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 3e-4 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --lazy_preprocess True \
    --use_lora \
    --gradient_checkpointing \
    --deepspeed ${DS_CONFIG_PATH}

DeepSpeed ZeRO-2 Configuration

finetune/ds_config_zero2.json
{
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
ZeRO-2 shards optimizer states and gradients across GPUs, but keeps model parameters replicated. This is ideal for LoRA since adapter parameters are small.
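To see why this pairing is cheap: AdamW keeps two fp32 moment buffers (8 bytes per trainable parameter), and ZeRO-2 shards them across GPUs. A back-of-the-envelope sketch, using the ~70M LoRA parameter count cited later in this guide:

```python
def adamw_optimizer_gb(trainable_params: int, num_gpus: int) -> float:
    """Per-GPU AdamW state: 8 bytes/param (two fp32 moments), sharded by ZeRO-2."""
    return trainable_params * 8 / num_gpus / 2**30

# LoRA trains ~70M params on Qwen-7B vs 7B for full fine-tuning
print(f"LoRA: {adamw_optimizer_gb(70_000_000, 4):.2f} GB/GPU")
print(f"Full: {adamw_optimizer_gb(7_000_000_000, 4):.2f} GB/GPU")
```

This ignores master weights, gradients, and activations, but it shows why optimizer state is negligible for LoRA even on a single GPU.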

Base Model vs Chat Model

Key differences when fine-tuning base models vs chat models:

Fine-tuning Chat Models (Recommended)

# Qwen-7B-Chat already knows ChatML format
python finetune.py \
  --model_name_or_path Qwen/Qwen-7B-Chat \
  --use_lora \
  --data_path data.json
Advantages:
  • Lower memory usage (no extra trainable parameters)
  • Compatible with DeepSpeed ZeRO-3
  • No special handling needed
  • Recommended for most use cases

Fine-tuning Base Models

# From finetune.py lines 331-334
is_chat_model = 'chat' in model_args.model_name_or_path.lower()
if training_args.use_lora and not lora_args.q_lora and not is_chat_model:
    modules_to_save = ["wte", "lm_head"]  # Embedding and output layers
When fine-tuning base models:
  • Automatically enables training of embedding (wte) and output (lm_head) layers
  • Required for model to learn ChatML special tokens
  • Higher memory usage (~13.6GB extra for Qwen-7B)
  • Cannot use ZeRO-3 (must use ZeRO-2)
Fine-tuning base models with LoRA requires significantly more memory. Consider using chat models instead.

Loading and Using LoRA Adapters

Load Adapter for Inference

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load model with LoRA adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    "output_qwen",  # Path to adapter directory
    device_map="auto",
    trust_remote_code=True
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "output_qwen",
    trust_remote_code=True
)

# Use the model
response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)

Merge Adapter with Base Model

For deployment, you can merge the adapter into the base model:
from peft import AutoPeftModelForCausalLM

# Load model with adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    "output_qwen",
    device_map="auto",
    trust_remote_code=True
)

# Merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    "merged_model",
    max_shard_size="2048MB",
    safe_serialization=True
)

# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "output_qwen",
    trust_remote_code=True
)
tokenizer.save_pretrained("merged_model")
After saving the merged model, manually copy the *.cu and *.cpp files from the source model directory if you need KV cache quantization support.

Switch Between Multiple Adapters

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

# Load adapter 1
model = PeftModel.from_pretrained(base_model, "adapter_1")
response1, _ = model.chat(tokenizer, "Test query", history=None)

# Switch to adapter 2
model.load_adapter("adapter_2")
response2, _ = model.chat(tokenizer, "Test query", history=None)

print(f"Adapter 1: {response1}")
print(f"Adapter 2: {response2}")

Hyperparameter Tuning

Learning Rate

--learning_rate 3e-4
LoRA uses higher learning rates than full fine-tuning:
  • Conservative: 1e-4 (safer for base models)
  • Standard: 3e-4 (recommended for chat models)
  • Aggressive: 5e-4 (fast convergence, watch for instability)
LoRA adapters benefit from higher learning rates because only a small subset of parameters is being trained.
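The interaction of --learning_rate, --warmup_ratio, and --lr_scheduler_type "cosine" can be sketched as follows (a simplified model of the schedule, not the exact Transformers implementation):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          warmup_ratio: float = 0.01) -> float:
    """Linear warmup for warmup_ratio of training, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 9, 500, 999):
    print(step, f"{lr_at(step, total_steps=1000):.2e}")
```

The learning rate climbs to the peak during the short warmup, then decays smoothly, which is why instability from an aggressive peak (5e-4) tends to show up early in training.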

LoRA Rank (r)

Adjust based on task complexity:
# Simple classification, minor style changes
--lora_r 8 \
--lora_alpha 4

# Most tasks: the default r=64, lora_alpha=16 is a good starting point

Batch Size Optimization

--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8
Adjust based on GPU memory:
| GPU Memory | Batch Size | Grad Accum | Effective Batch |
|---|---|---|---|
| 16GB | 1 | 16 | 16 |
| 24GB | 2 | 8 | 16 |
| 40GB | 4 | 4 | 16 |
| 80GB | 8 | 2 | 16 |
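Each row in the table above keeps the same effective batch size; a one-line helper (sketch) makes the relationship explicit:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Effective batch = per-device batch x gradient accumulation x GPU count."""
    return per_device * grad_accum * num_gpus

# Every row above targets an effective batch of 16 on a single GPU
for per_device, grad_accum in [(1, 16), (2, 8), (4, 4), (8, 2)]:
    assert effective_batch_size(per_device, grad_accum) == 16
print("all rows give effective batch 16")
```

Keeping the effective batch constant while trading batch size for accumulation preserves training dynamics, so hyperparameters like learning rate do not need to be retuned when memory forces a smaller per-device batch.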

Advanced Techniques

Custom Target Modules

Target specific layers for your use case:
# Fine-tune only attention layers
lora_config = LoraConfig(
    target_modules=["c_attn", "c_proj"],
    r=64,
    lora_alpha=16
)

# Fine-tune all linear layers
lora_config = LoraConfig(
    target_modules=["c_attn", "c_proj", "w1", "w2", "lm_head"],
    r=64,
    lora_alpha=16
)

LoRA with Custom Tokens

If adding new tokens to vocabulary:
# Add custom tokens
tokenizer.add_tokens(["[CUSTOM1]", "[CUSTOM2]"])
model.resize_token_embeddings(len(tokenizer))

# Configure LoRA to train new embeddings
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["c_attn", "c_proj", "w1", "w2"],
    modules_to_save=["wte", "lm_head"]  # Train embedding layer
)

Quantization After LoRA Training

Quantize your merged LoRA model for deployment:
# First merge the adapter
python merge_lora.py \
  --adapter_path output_qwen \
  --output_path merged_model

# Then quantize
python run_gptq.py \
  --model_name_or_path merged_model \
  --data_path calibration_data.json \
  --out_path merged_model_int4 \
  --bits 4
See Q-LoRA documentation for details.

Monitoring Training

TensorBoard Integration

python finetune.py \
  --use_lora \
  --report_to "tensorboard" \
  --logging_dir ./logs
View training metrics:
tensorboard --logdir ./logs

Weights & Biases Integration

pip install wandb
wandb login

python finetune.py \
  --use_lora \
  --report_to "wandb" \
  --run_name "qwen-lora-experiment"

Troubleshooting

Issue: ValueError: Tokenizer class QWenTokenizer does not exist
Solution: Downgrade PEFT:
pip install "peft<0.8.0"
Issue: CUDA out of memory during training
Solutions:
  1. Reduce per_device_train_batch_size to 1
  2. Reduce model_max_length (e.g., 512 → 256)
  3. Enable gradient checkpointing: --gradient_checkpointing
  4. Reduce LoRA rank: --lora_r 32 or --lora_r 16
  5. Use Q-LoRA instead (see Q-LoRA guide)
Issue: Model fails to learn the task (loss plateaus or output quality is poor)
Possible causes:
  • Learning rate too low: Try --learning_rate 5e-4
  • LoRA rank too small: Increase it, e.g. --lora_r 128
  • Data quality issues: Review training samples
  • Insufficient training: Increase the number of epochs
Debug:
# Check trainable parameters
model.print_trainable_parameters()
# Expected output: "trainable params: X || all params: Y || trainable%: Z%"
Issue: ZeRO-3 incompatible with base model LoRA
Solution: Use ZeRO-2 or switch to a chat model:
--deepspeed finetune/ds_config_zero2.json
# OR
--model_name_or_path Qwen/Qwen-7B-Chat  # Use chat model
Issue: *.cu and *.cpp files missing from the saved adapter
Solution: Manually copy them from the source model directory:
cp Qwen/Qwen-7B-Chat/*.cu output_qwen/
cp Qwen/Qwen-7B-Chat/*.cpp output_qwen/

Performance Comparison

LoRA vs Full-Parameter (Qwen-7B)

| Metric | Full-Parameter | LoRA | Difference |
|---|---|---|---|
| GPU Memory | ~80GB (2 GPUs) | 20.1GB (1 GPU) | 4x reduction |
| Training Speed | 2.5s/iter | 1.2s/iter | 2x faster |
| Trainable Params | 7B (100%) | 70M (1%) | 100x fewer |
| Final Performance | 100% | 90-95% | Minimal loss |

LoRA vs Q-LoRA

| Metric | LoRA | Q-LoRA |
|---|---|---|
| GPU Memory (7B) | 20.1GB | 11.5GB |
| Training Speed | 1.2s/iter | 3.0s/iter |
| Model Quality | Higher | Slightly lower |
| Use Case | Standard training | Memory-constrained |

Best Practices

Do’s
  • Use chat models when possible for lower memory usage
  • Start with default LoRA config (r=64, alpha=16)
  • Enable gradient checkpointing for memory savings
  • Monitor training loss to detect convergence
  • Save multiple checkpoints for checkpoint selection
Don’ts
  • Don’t use ZeRO-3 with base model LoRA (embedding trainable)
  • Don’t use excessively high learning rates (>5e-4)
  • Don’t skip validation data for complex tasks
  • Don’t merge adapters for Q-LoRA (not supported)
  • Don’t forget to copy support files (*.cu, *.cpp) when needed

Next Steps

Q-LoRA Training

Further reduce memory with quantization

Multi-node Training

Scale LoRA training across multiple machines
