Overview
Q-LoRA achieves extreme memory efficiency through:

- 4-bit (Int4) quantization of base model weights
- 16-bit LoRA adapters for maintained training quality
- Paged optimizers to handle memory spikes
- Single GPU training of 7B models on 12GB GPUs
- Minimal performance degradation compared to full LoRA
When to Use Q-LoRA
Choose Q-LoRA when:

- You have limited GPU memory (12-24GB)
- You need cost-effective fine-tuning on consumer hardware
- You can tolerate 2-3x slower training than regular LoRA
- Your task has moderate quality requirements
- You want to fine-tune larger models on smaller GPUs
Hardware Requirements
Memory Requirements
Qwen-7B Q-LoRA Fine-tuning (Single GPU):

| Sequence Length | GPU Memory | Training Speed | Comparison to LoRA |
|---|---|---|---|
| 256 | 11.5GB | 3.0s/iter | 45% less memory |
| 512 | 11.5GB | 3.0s/iter | 44% less memory |
| 1024 | 12.3GB | 3.5s/iter | 43% less memory |
| 2048 | 13.9GB | 7.0s/iter | 42% less memory |
| 4096 | 16.9GB | 11.6s/iter | 43% less memory |
| 8192 | 23.5GB | 22.3s/iter | 36% less memory |
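As a rough cross-check on the table, the memory for the quantized base weights alone can be estimated with simple arithmetic; the table's figures are higher because they also include LoRA adapters, optimizer state, and activations. The helper below is an illustrative sketch, not part of the toolchain:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for model weights alone at the given bit-width, in GiB."""
    return n_params * bits / 8 / 1024**3

# Qwen-7B base weights: ~13.0 GiB at FP16 vs ~3.3 GiB at 4-bit
fp16 = weight_memory_gb(7e9, 16)
int4 = weight_memory_gb(7e9, 4)
print(f"FP16: {fp16:.1f} GiB, Int4: {int4:.1f} GiB")
```

This is why a 7B model fits on a 12GB card under Q-LoRA: the weights shrink by 4x, leaving headroom for adapters and activations.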
GPU Recommendations
| Model | Minimum GPU | Consumer GPU | Professional GPU |
|---|---|---|---|
| Qwen-1.8B | GTX 1080 Ti (11GB) | RTX 3060 (12GB) | RTX A4000 (16GB) |
| Qwen-7B | RTX 3060 (12GB) | RTX 3090 (24GB) | RTX A5000 (24GB) |
| Qwen-14B | RTX 3090 (24GB) | RTX 4090 (24GB) | A100 (40GB) |
| Qwen-72B | A100 (80GB) | A100 (80GB) | A100 (80GB) |
Minimum GPU requirements assume sequence length ≤ 1024. Longer sequences require more memory.
Installation
Version Compatibility Matrix
| PyTorch | AutoGPTQ | Transformers | Optimum | PEFT |
|---|---|---|---|---|
| 2.1.x | >=0.5.1 | >=4.35.0 | >=1.14.0 | >=0.6.1 |
| 2.0.x | <0.5.0 | <4.35.0 | <1.14.0 | >=0.5.0,<0.6.0 |
Q-LoRA Configuration
Q-LoRA is configured in the training script through two key parameters:

Key Parameters

- Quantization bit-width: fixed at 4-bit for Q-LoRA.
- ExLlama kernels: disabled for compatibility, since these kernels do not support training.
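In transformers terms, these settings map roughly onto a GPTQConfig as sketched below. Note this is an assumption about the exact flag: the name for disabling ExLlama has changed across transformers releases (older versions use `disable_exllama=True`, newer ones `use_exllama=False`):

```python
from transformers import GPTQConfig

gptq_config = GPTQConfig(
    bits=4,                # Q-LoRA is fixed at 4-bit
    disable_exllama=True,  # ExLlama kernels are inference-only; disable for training
)
```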
Single-GPU Training
Basic Training Script
finetune/finetune_qlora_single_gpu.sh
Running Single-GPU Q-LoRA
Prepare Quantized Model
Use the official Int4 quantized models (e.g., Qwen/Qwen-7B-Chat-Int4).

Only Chat models are available in Int4. Base models are not provided in quantized format.
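Loading the official Int4 chat model with transformers looks roughly like this (a sketch; it assumes auto-gptq and optimum are installed per the compatibility matrix above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Int4 checkpoint ships its own GPTQ quantization config.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,  # Qwen model code lives in the checkpoint repo
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True
)
```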
Multi-GPU Training
For faster Q-LoRA training, use the DeepSpeed multi-GPU script:

finetune/finetune_qlora_ds.sh
Loading Q-LoRA Adapters
Inference with Q-LoRA Adapter
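A minimal inference sketch using peft's AutoPeftModelForCausalLM, which loads the quantized base model recorded in the adapter's config automatically. The adapter directory name is illustrative:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# "output_qwen" is a hypothetical adapter directory produced by training.
model = AutoPeftModelForCausalLM.from_pretrained(
    "output_qwen",
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("output_qwen", trust_remote_code=True)

# Qwen chat models expose a chat() helper via trust_remote_code.
response, _ = model.chat(tokenizer, "Hello", history=None)
```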
Q-LoRA Constraints
What You Cannot Do
Cannot Merge Adapters
Unlike regular LoRA, Q-LoRA adapters cannot be merged into the base model.

Reason: The base model is quantized (4-bit) while the LoRA adapters are FP16; merging requires matching precision.

Workaround: Always load the adapter separately for inference.
Cannot Train Embedding Layers
Q-LoRA with Int4 models cannot make embedding/output layers trainable.

Impact: You cannot add new tokens during Q-LoRA training.

Solution: Use regular LoRA if you need to add custom tokens.
Must Use Int4 Chat Models
Q-LoRA requires official Int4 quantized chat models:
- ✓ Qwen/Qwen-7B-Chat-Int4 (supported)
- ✗ Qwen/Qwen-7B-Int4 (does not exist)
- ✗ Qwen/Qwen-7B (cannot be used directly)
Cannot Use BF16
Q-LoRA training must use FP16, not BF16.

Reason: AutoGPTQ quantization kernels are optimized for FP16 operations.
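In transformers' TrainingArguments terms, this corresponds to the following config fragment (the output path is illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output_qwen",  # illustrative path
    fp16=True,   # required: AutoGPTQ kernels expect FP16
    bf16=False,  # BF16 is not supported for Q-LoRA training
)
```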
Performance Considerations
Q-LoRA vs LoRA Comparison
Qwen-7B Training (Sequence Length 1024):

| Metric | LoRA | Q-LoRA | Difference |
|---|---|---|---|
| GPU Memory | 21.5GB | 12.3GB | 43% reduction |
| Training Speed | 2.8s/iter | 3.5s/iter | 25% slower |
| Trainable Params | 70M | 70M | Same |
| Model Quality | 100% | 95-98% | Slight degradation |
Speed-Memory Tradeoff
Q-LoRA trades speed for memory:

- 2-3x slower than regular LoRA
- 40-50% less memory than regular LoRA
- Ideal when memory is the bottleneck

To reduce the slowdown:

- Use Flash Attention 2 (if compatible)
- Enable gradient checkpointing
- Use --lazy_preprocess True
- Increase gradient_accumulation_steps to reduce per-step overhead
Hyperparameter Guide
Learning Rate
- Too high: Training loss oscillates or diverges
- Too low: Slow convergence, model doesn’t adapt
LoRA Configuration
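The comparison table above reports roughly 70M trainable parameters for Qwen-7B. You can sanity-check any LoRA configuration with the standard formula: each adapted weight W of shape (d_out, d_in) adds r * (d_in + d_out) parameters via its two low-rank factors. The helper and dimensions below are illustrative:

```python
def lora_trainable_params(r: int, shapes: list[tuple[int, int]]) -> int:
    """Parameters added by LoRA: each adapted weight W (d_out x d_in)
    gains A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out)."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# One hypothetical 4096x4096 projection at rank 64 adds ~0.5M parameters
print(lora_trainable_params(64, [(4096, 4096)]))
```

Higher rank increases capacity and memory proportionally, which is why reducing `lora_r` is one of the memory-saving levers listed below.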
Batch Size for Memory Constraints
If hitting memory limits, lower the per-device batch size first and compensate with gradient accumulation.

Sequence Length Optimization
| GPU Memory | Recommended Max Length | Batch Size |
|---|---|---|
| 12GB | 512 | 1 |
| 16GB | 1024 | 1 |
| 24GB | 2048 | 1-2 |
| 40GB+ | 4096+ | 2-4 |
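When shrinking the per-device batch size to fit the table above, the effective batch size can be held constant with gradient accumulation. A small illustrative helper:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus

# e.g., batch 1 with 16 accumulation steps behaves like batch 16
print(effective_batch_size(1, 16))
```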
Creating Custom Quantized Models
If you need to quantize a fine-tuned model, see Full-Parameter Fine-tuning for detailed quantization instructions.
Model Quality
Benchmark Results
Qwen-7B-Chat Performance:

| Quantization | MMLU | C-Eval | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 (baseline) | 55.8 | 59.7 | 50.3 | 37.2 |
| Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
| Int4 (Q-LoRA) | 55.1 | 59.2 | 49.7 | 29.9 |
When Quality Matters
Q-LoRA is suitable for:

- Domain adaptation
- Style transfer
- Instruction following
- Task-specific fine-tuning
- RAG applications

Avoid Q-LoRA for:

- Mathematical reasoning (use LoRA or full fine-tuning)
- Complex code generation
- Tasks requiring maximum accuracy
- Production models with strict quality requirements
Troubleshooting
AutoGPTQ Installation Failed
Issue: Cannot install auto-gptq, or compilation errors occur.

Solutions:
- Use pre-compiled wheels
- Check CUDA version compatibility
- Install build dependencies
Out of Memory on 12GB GPU
Solutions:
- Reduce the sequence length
- Reduce the batch size
- Reduce the LoRA rank
- Use a smaller model
Training Extremely Slow
Expected: Q-LoRA is inherently 2-3x slower than LoRA.

Optimizations:

- Increase gradient accumulation (reduces per-step overhead)
- Use lazy preprocessing (--lazy_preprocess True)
- Reduce logging frequency
- Disable evaluation during training
Cannot Load Quantized Model
Issue: KeyError or missing files when loading an Int4 model.

Solutions:

- Verify the model is actually Int4 quantized
- Install the required packages (auto-gptq, optimum)
- Copy any missing files manually from the original checkpoint
Loss Not Decreasing
Debugging steps:

- Verify data quality
- Increase the learning rate
- Increase the LoRA rank
- Train for more epochs
Advanced: Manual Quantization Configuration
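If you are producing your own GPTQ checkpoint, a minimal quantization config with auto-gptq can be sketched as follows. The parameter values shown are common defaults and are assumptions, not Qwen's official settings:

```python
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # match Q-LoRA's 4-bit requirement
    group_size=128,  # common GPTQ grouping granularity
    desc_act=False,  # activation ordering off for faster inference
)
# The config is passed to AutoGPTQForCausalLM.from_pretrained(...)
# together with calibration data when quantizing.
```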
Custom quantization settings (bit-width, group size, activation ordering) can be configured when creating a quantized model.

Best Practices
Do’s
- Use official Int4 chat models for Q-LoRA
- Always use FP16 precision, never BF16
- Enable gradient checkpointing for memory savings
- Use DeepSpeed even for single-GPU training
- Monitor GPU memory usage during training
- Start with shorter sequences (512 tokens)
Don’ts
- Don’t try to merge Q-LoRA adapters (not supported)
- Don’t use Q-LoRA if you need to add custom tokens
- Don’t expect same speed as regular LoRA
- Don’t use Q-LoRA for production models if quality is critical
- Don’t use base models with Q-LoRA (embedding layers need training)
Next Steps
LoRA Fine-tuning
Compare with regular LoRA for better quality
Multi-node Training
Scale Q-LoRA training across multiple machines