Available Fine-tuning Methods
Qwen provides three primary fine-tuning methods, each with different memory requirements and training characteristics:
Full-Parameter
Update all model parameters for maximum performance
LoRA
Efficient adapter-based training with low memory usage
Q-LoRA
LoRA on quantized models for minimal GPU requirements
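To see why LoRA cuts memory so sharply: instead of updating a full d_out × d_in weight matrix, LoRA trains only two low-rank factors B (d_out × r) and A (r × d_in) and leaves the base weight frozen. A minimal stdlib-Python sketch of the parameter-count arithmetic (the dimensions are illustrative, not Qwen's actual layer shapes):

```python
# Sketch: trainable-parameter count for full fine-tuning vs. LoRA.
# Dimensions below are illustrative, not Qwen's real layer sizes.

def full_params(d_out: int, d_in: int) -> int:
    """Full fine-tuning: every entry of the weight matrix W is trainable."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA: only the low-rank factors B (d_out x r) and A (r x d_in) are
    trainable; the frozen base weight W is untouched (W' = W + B @ A)."""
    return d_out * r + r * d_in

d = 4096  # hypothetical hidden size
r = 8     # a typical LoRA rank
full = full_params(d, d)     # 16,777,216 trainable values
lora = lora_params(d, d, r)  # 65,536 trainable values
print(f"LoRA trains {100 * lora / full:.2f}% of the full matrix's parameters")
```

Q-LoRA goes one step further and stores the frozen base weights in quantized form, which is why it has the smallest footprint in the comparison below.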
Method Comparison
Choose your fine-tuning approach based on available resources and requirements:

| Method | GPU Memory (7B) | Training Speed | Performance | Use Case |
|---|---|---|---|---|
| Full-Parameter | ~43.5GB (2 GPUs) | Moderate | Highest | Production models with ample resources |
| LoRA | ~20.1GB (1 GPU) | Fast | High | Balanced approach for most use cases |
| LoRA (emb) | ~33.7GB (1 GPU) | Fast | High | Fine-tuning base models with new tokens |
| Q-LoRA | ~11.5GB (1 GPU) | Slower | Good | Limited GPU memory scenarios |
Memory statistics are for Qwen-7B with sequence length 256. Requirements increase with longer sequences.
Model Size Considerations
Memory Requirements by Model Size
Approximate minimum GPU memory for each method, by model size (Q-LoRA is the most memory-efficient):
Qwen-1.8B
- Q-LoRA: 5.8GB GPU memory
- LoRA: 6.7GB GPU memory
- Full-parameter: 43.5GB GPU memory (single GPU)
- Suitable for consumer GPUs (RTX 3090, 4090)
Qwen-7B
- Q-LoRA: 11.5GB GPU memory
- LoRA: 20.1GB GPU memory
- Full-parameter: Requires 2x A100 GPUs
- Recommended for professional workstations
Qwen-14B
- Q-LoRA: 18.7GB GPU memory
- LoRA: Requires multiple GPUs or DeepSpeed ZeRO-3
- Full-parameter: Requires 4+ A100 GPUs
- Enterprise-grade hardware required
Qwen-72B
- Q-LoRA: 61.4GB GPU memory (A100-80GB)
- LoRA + DeepSpeed ZeRO-3: 4x A100-80GB GPUs
- Full-parameter: Requires 8+ A100 GPUs
- Large-scale training infrastructure needed
Key Features
Training Framework Support
All fine-tuning methods support:
- DeepSpeed: Distributed training with ZeRO optimization (stages 2 and 3)
- FSDP: Fully Sharded Data Parallel (alternative to DeepSpeed)
- Flash Attention 2: Accelerated training and reduced memory usage
- Gradient Checkpointing: Trade computation for memory savings
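The gradient-checkpointing trade-off in the last bullet can be made concrete with a toy calculation: storing an activation only every k layers shrinks peak activation memory by roughly a factor of k, at the cost of re-running the skipped layers during the backward pass. A stdlib-Python sketch (the layer counts are illustrative):

```python
import math

def checkpointing_costs(n_layers: int, k: int):
    """Toy model of gradient checkpointing: store one activation per segment
    of k layers, then recompute the rest of each segment during backward.
    Returns (stored_activations, extra_forward_recomputations)."""
    stored = math.ceil(n_layers / k)  # one checkpoint per segment
    recomputed = n_layers - stored    # everything else is re-run
    return stored, recomputed

# Without checkpointing: all 32 activations held in memory, no recompute.
print(checkpointing_costs(32, 1))  # (32, 0)
# Checkpoint every 8 layers: 4 activations stored, 28 forward steps redone.
print(checkpointing_costs(32, 8))  # (4, 28)
```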
Supported Precision
Mixed-precision training is supported in BF16 (recommended on Ampere or newer GPUs) and FP16.
Training Script Overview
Qwen provides production-ready training scripts in the finetune/ directory.
Data Format
All fine-tuning methods use the same JSON conversation format.
Quick Start
Prepare Training Data
Create your training data in JSON format following the conversation structure above.
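A minimal example of writing such a training file is sketched below. The field names (`id`, `conversations`, `from`, `value`) follow the format used in Qwen's finetune examples; the file name and text content are purely illustrative:

```python
import json

# Illustrative training sample in the conversation format expected by the
# finetune scripts (field names per Qwen's finetune examples; the text
# content here is made up).
samples = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "What is the capital of France?"},
            {"from": "assistant", "value": "The capital of France is Paris."},
        ],
    }
]

with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)

# Sanity-check that the file round-trips.
with open("train_data.json", encoding="utf-8") as f:
    assert json.load(f) == samples
```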
Choose Fine-tuning Method
Select the appropriate method based on your GPU memory and requirements:
- Limited GPU memory (< 12GB): Use Q-LoRA with smaller models
- Single GPU (16-40GB): Use LoRA
- Multiple GPUs: Use LoRA or Full-parameter with DeepSpeed
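The selection rules above can be encoded as a small helper. This is a rule of thumb derived from the bullets (thresholds assume a ~7B model), not an official sizing tool:

```python
def pick_method(gpu_memory_gb: float, num_gpus: int = 1) -> str:
    """Rule-of-thumb fine-tuning method selection for a ~7B model,
    following the guidance above. Not an official sizing tool."""
    if num_gpus > 1:
        return "LoRA or Full-parameter with DeepSpeed"
    if gpu_memory_gb < 12:
        return "Q-LoRA (consider a smaller model)"
    if gpu_memory_gb <= 40:
        return "LoRA"
    return "LoRA or Full-parameter with DeepSpeed"

print(pick_method(8))      # limited memory -> Q-LoRA
print(pick_method(24))     # single mid-range GPU -> LoRA
print(pick_method(80, 4))  # multi-GPU -> LoRA or Full-parameter with DeepSpeed
```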
Performance Benchmarks
Qwen-7B Fine-tuning Performance (Single A100-80GB)
| Sequence Length | LoRA Memory | LoRA Speed | Q-LoRA Memory | Q-LoRA Speed |
|---|---|---|---|---|
| 256 | 20.1GB | 1.2s/iter | 11.5GB | 3.0s/iter |
| 512 | 20.4GB | 1.5s/iter | 11.5GB | 3.0s/iter |
| 1024 | 21.5GB | 2.8s/iter | 12.3GB | 3.5s/iter |
| 2048 | 23.8GB | 5.2s/iter | 13.9GB | 7.0s/iter |
| 4096 | 29.7GB | 10.1s/iter | 16.9GB | 11.6s/iter |
| 8192 | 36.6GB | 21.3s/iter | 23.5GB | 22.3s/iter |
Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled
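One way to read the table: effective throughput is sequence length × batch size × gradient-accumulation steps divided by iteration time, so longer sequences process more tokens per second even though each iteration is slower. A quick check using the LoRA figures above (this assumes each reported iteration covers the full gradient-accumulation cycle, which is an interpretation, not stated in the table):

```python
def tokens_per_second(seq_len, iter_seconds, batch_size=1, grad_accum=8):
    """Tokens processed per second, assuming one reported iteration covers
    batch_size x grad_accum sequences (batch 1, accumulation 8 per the
    benchmark footnote; this interpretation is an assumption)."""
    return seq_len * batch_size * grad_accum / iter_seconds

# LoRA figures from the benchmark table.
short = tokens_per_second(256, 1.2)    # ~1707 tokens/s
long = tokens_per_second(8192, 21.3)   # ~3077 tokens/s
print(f"{short:.0f} vs {long:.0f} tokens/s")
```

Longer sequences amortize per-step overhead, which is why throughput rises even as per-iteration time grows.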
Special Considerations
Base Model vs Chat Model
When fine-tuning base models (e.g., Qwen-7B) with LoRA:
- The embedding (wte) and output (lm_head) layers are automatically set as trainable
- This is necessary for the model to learn ChatML format tokens
- Requires more memory than fine-tuning chat models
- Cannot use DeepSpeed ZeRO-3 with trainable embeddings

When fine-tuning chat models (e.g., Qwen-7B-Chat) with LoRA:
- No additional trainable parameters needed
- Lower memory requirements
- Compatible with DeepSpeed ZeRO-3
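The distinction above can be sketched as a small decision helper: for base models, the embedding and output layers join the trainable set, which in turn rules out DeepSpeed ZeRO-3; chat models keep only the LoRA adapters trainable. The module names wte and lm_head come from the text above; everything else is illustrative, not an actual training config:

```python
def lora_training_plan(is_base_model: bool) -> dict:
    """Illustrative summary of trainable modules and DeepSpeed
    compatibility as described above (not a real training config)."""
    plan = {
        "trainable": ["lora_A", "lora_B"],  # the LoRA adapter weights
        "zero3_compatible": True,
    }
    if is_base_model:
        # Base models must learn the ChatML tokens, so the embedding and
        # output layers become trainable as well...
        plan["trainable"] += ["wte", "lm_head"]
        # ...which is incompatible with DeepSpeed ZeRO-3.
        plan["zero3_compatible"] = False
    return plan

print(lora_training_plan(True))   # base model: wte/lm_head trainable, no ZeRO-3
print(lora_training_plan(False))  # chat model: adapters only, ZeRO-3 OK
```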
Next Steps
Data Preparation
Learn how to prepare high-quality training data
Multi-node Training
Scale training across multiple machines