## Overview
LoRA achieves efficient fine-tuning by:

- Training only 0.1-1% of model parameters
- Keeping original model weights frozen
- Adding trainable low-rank decomposition matrices to attention layers
- Enabling single-GPU training for 7B models
- Allowing multiple adapters for different tasks
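To see where the 0.1-1% figure comes from: a rank-r adapter on a frozen d_out × d_in weight trains only the two small factor matrices. A minimal sketch in plain Python (the 4096 dimensions are illustrative, roughly the hidden size of a 7B model; the overall fraction is lower still because most layers are left unadapted):

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA adapter.

    LoRA freezes the d_out x d_in weight W and trains two small matrices
    A (r x d_in) and B (d_out x r); the learned update is B @ A.
    """
    full = d_out * d_in          # every weight trainable
    lora = r * d_in + d_out * r  # only A and B trainable
    return full, lora

full, lora = lora_param_counts(d_in=4096, d_out=4096, r=64)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.1%}")
# full: 16,777,216  lora: 524,288  ratio: 3.1%
```

Per adapted layer the ratio is a few percent; averaged over the whole model (embeddings, unadapted layers) it drops into the 0.1-1% range the guide quotes.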
## When to Use LoRA
Choose LoRA when:

- You have a single GPU with 16-40GB memory
- You need fast iteration on different tasks
- You want to maintain multiple task-specific adapters
- You need quick deployment without merging weights
- Your task requires moderate adaptation from pretrained behavior
## Hardware Requirements

### Memory Requirements
Qwen-7B LoRA fine-tuning (single A100-80GB):

| Sequence Length | LoRA Memory | LoRA (emb) Memory | Speed |
|---|---|---|---|
| 256 | 20.1GB | 33.7GB | 1.2s/iter |
| 512 | 20.4GB | 34.1GB | 1.5s/iter |
| 1024 | 21.5GB | 35.2GB | 2.8s/iter |
| 2048 | 23.8GB | 35.1GB | 5.2s/iter |
| 4096 | 29.7GB | 39.2GB | 10.1s/iter |
| 8192 | 36.6GB | 48.5GB | 21.3s/iter |
*LoRA (emb)* refers to training with the embedding and output layers as additional trainable parameters; this is required when fine-tuning base models that need to learn new tokens.
### GPU Recommendations by Model Size
| Model | Minimum GPU | Recommended GPU | Memory (LoRA) |
|---|---|---|---|
| Qwen-1.8B | RTX 3090 (24GB) | RTX 4090 (24GB) | 6.7GB |
| Qwen-7B | RTX A6000 (48GB) | A100 (40GB/80GB) | 20.1GB |
| Qwen-14B | A100 (40GB) | A100 (80GB) | ~35GB |
| Qwen-72B | A100 (80GB) × 4 | A100 (80GB) × 4 | Requires ZeRO-3 |
## Installation

LoRA training for Qwen relies on the Hugging Face `peft` library, plus `deepspeed` for multi-GPU runs; these can typically be installed with `pip install peft deepspeed` (check the Qwen repository's requirements file for the exact pinned versions).
## LoRA Configuration
LoRA adds trainable rank decomposition matrices to specific model layers.

### Parameter Explanation

**`r`**
Rank of the low-rank decomposition matrices. Higher rank = more capacity but more memory.
- r=8: Very efficient, good for simple tasks
- r=16-32: Balanced, suitable for most tasks
- r=64: Higher capacity, recommended default
- r=128: Maximum capacity, for complex tasks
**`lora_alpha`**

Scaling factor for LoRA updates; affects the effective learning rate of the adapter. Scaling = `lora_alpha / r`.
- Common pattern: lora_alpha = r/4 or r/2
- Does not affect trainable parameters
- Adjust if model underfits or overfits
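The scaling relationship above, as a quick check in plain Python (using this guide's r=64, alpha=16 default):

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Scale applied to the adapter update: delta_W = (lora_alpha / r) * B @ A."""
    return lora_alpha / r

# The default in this guide (r=64, lora_alpha=16) gives a 0.25x update scale.
# Doubling lora_alpha doubles the update strength without adding any
# trainable parameters, which is why it is a cheap knob for under/overfitting.
print(lora_scaling(16, 64))
```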
**`target_modules`**

Model layers where LoRA adapters are applied. For Qwen:

`["c_attn", "c_proj", "w1", "w2"]`

- `c_attn`: attention query, key, and value projections
- `c_proj`: attention output projection
- `w1`, `w2`: feed-forward network layers
**`lora_dropout`**

Dropout probability for the LoRA layers (regularization).
**`modules_to_save`**

Additional modules to train beyond the LoRA adapters.

- For base models: `["wte", "lm_head"]` (embedding and output layers)
- For chat models: `None` (not needed)

## Single-GPU Training
### Basic Training Script

The single-GPU training script is provided at `finetune/finetune_lora_single_gpu.sh`.
### Running Single-GPU Training

Launch it with `bash finetune/finetune_lora_single_gpu.sh`, adjusting the model and data paths inside the script for your setup.
## Multi-GPU Training
For faster training or larger models, use the distributed LoRA script `finetune/finetune_lora_ds.sh`.
### DeepSpeed ZeRO-2 Configuration

The reference configuration is provided at `finetune/ds_config_zero2.json`.
ZeRO-2 shards optimizer states and gradients across GPUs, but keeps model parameters replicated. This is ideal for LoRA since adapter parameters are small.
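The core of such a file is the stage-2 section. A trimmed sketch, written as the equivalent Python dict (DeepSpeed also accepts a dict directly; key names follow the DeepSpeed config schema, but the values here are illustrative, not copied from the repository file):

```python
# Illustrative ZeRO-2 settings; "auto" lets the HF Trainer fill in values.
ds_config = {
    "zero_optimization": {
        "stage": 2,                    # shard optimizer states + gradients
        "overlap_comm": True,          # overlap gradient reduction with backward
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
    "bf16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```

Stage 2 is the key setting: optimizer states and gradients are sharded while model parameters stay replicated, which matches the explanation above.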
## Base Model vs Chat Model
Key differences when fine-tuning base models vs chat models:

### Fine-tuning Chat Models (Recommended)
- Lower memory usage (no extra trainable parameters)
- Compatible with DeepSpeed ZeRO-3
- No special handling needed
- Recommended for most use cases
### Fine-tuning Base Models
- Automatically enables training of the embedding (`wte`) and output (`lm_head`) layers
- Required for the model to learn the ChatML special tokens
- Higher memory usage (~13.6GB extra for Qwen-7B)
- Cannot use ZeRO-3 (must use ZeRO-2)
## Loading and Using LoRA Adapters
### Load Adapter for Inference
### Merge Adapter with Base Model
For deployment, you can merge the adapter into the base model (in `peft`, via `merge_and_unload()`), removing the runtime dependency on the adapter files.

### Switch Between Multiple Adapters
## Hyperparameter Tuning

### Learning Rate
- Conservative: 1e-4 (safer for base models)
- Standard: 3e-4 (recommended for chat models)
- Aggressive: 5e-4 (fast convergence, watch for instability)
### LoRA Rank (r)

Adjust based on task complexity; the `r` guidance in the configuration section above applies here.

### Batch Size Optimization
| GPU Memory | Batch Size | Grad Accum | Effective Batch |
|---|---|---|---|
| 16GB | 1 | 16 | 16 |
| 24GB | 2 | 8 | 16 |
| 40GB | 4 | 4 | 16 |
| 80GB | 8 | 2 | 16 |
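The table keeps the effective batch constant: effective batch = per-device batch × gradient-accumulation steps (× number of GPUs, for distributed runs). A quick check in Python:

```python
def effective_batch(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Effective global batch size seen by the optimizer per update step."""
    return per_device * grad_accum * num_gpus

# Rows from the table above (single GPU): all land on 16.
for bs, ga in [(1, 16), (2, 8), (4, 4), (8, 2)]:
    print(bs, ga, effective_batch(bs, ga))
```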
## Advanced Techniques
### Custom Target Modules
Target specific layers for your use case, for example adapting only the attention projections `c_attn` and `c_proj`.

### LoRA with Custom Tokens
If adding new tokens to the vocabulary, also train the embedding and output layers via `modules_to_save` (see the configuration section above).

### Quantization After LoRA Training
Quantize your merged LoRA model for deployment.

## Monitoring Training
### TensorBoard Integration

### Weights & Biases Integration
## Troubleshooting
### PEFT Version Errors
**Issue:** `ValueError: Tokenizer class QWenTokenizer does not exist`

**Solution:** Downgrade PEFT.

### Out of Memory During Training
**Solutions:**

- Reduce `per_device_train_batch_size` to 1
- Reduce `model_max_length` (e.g., 512 → 256)
- Enable gradient checkpointing: `--gradient_checkpointing`
- Reduce LoRA rank: `--lora_r 32` or `--lora_r 16`
- Use Q-LoRA instead (see the Q-LoRA guide)
### Adapter Not Learning (High Loss)
**Possible causes:**

- Learning rate too low: try `--learning_rate 5e-4`
- LoRA rank too small: increase to `--lora_r 128`
- Data quality issues: review training samples
- Insufficient training: increase epochs
### DeepSpeed Compatibility Issues
**Issue:** ZeRO-3 is incompatible with base-model LoRA (trainable embedding and output layers).

**Solution:** Use ZeRO-2 or switch to a chat model.
### Missing Files After Saving
**Issue:** `*.cu` and `*.cpp` files missing from the saved adapter

**Solution:** Manually copy them from the source model directory.

## Performance Comparison
### LoRA vs Full-Parameter (Qwen-7B)
| Metric | Full-Parameter | LoRA | Difference |
|---|---|---|---|
| GPU Memory | ~80GB (2 GPUs) | 20.1GB (1 GPU) | 4x reduction |
| Training Speed | 2.5s/iter | 1.2s/iter | 2x faster |
| Trainable Params | 7B (100%) | 70M (1%) | 100x fewer |
| Final Performance | 100% | 90-95% | Minimal loss |
### LoRA vs Q-LoRA
| Metric | LoRA | Q-LoRA |
|---|---|---|
| GPU Memory (7B) | 20.1GB | 11.5GB |
| Training Speed | 1.2s/iter | 3.0s/iter |
| Model Quality | Higher | Slightly lower |
| Use Case | Standard training | Memory-constrained |
## Best Practices

### Do’s
- Use chat models when possible for lower memory usage
- Start with default LoRA config (r=64, alpha=16)
- Enable gradient checkpointing for memory savings
- Monitor training loss to detect convergence
- Save multiple checkpoints for checkpoint selection
### Don’ts
- Don’t use ZeRO-3 with base model LoRA (embedding trainable)
- Don’t use excessively high learning rates (>5e-4)
- Don’t skip validation data for complex tasks
- Don’t merge adapters for Q-LoRA (not supported)
- Don’t forget to copy support files (*.cu, *.cpp) when needed
## Next Steps

- **Q-LoRA Training**: further reduce memory with quantization
- **Multi-node Training**: scale LoRA training across multiple machines