Qwen supports multiple fine-tuning approaches to adapt the pretrained models to your specific tasks and domains. This guide provides an overview of the available methods and helps you choose the right approach for your use case.

Available Fine-tuning Methods

Qwen provides three primary fine-tuning methods, each with different memory requirements and training characteristics:

Full-Parameter

Update all model parameters for maximum performance

LoRA

Efficient adapter-based training with low memory usage

Q-LoRA

LoRA on quantized models for minimal GPU requirements

Method Comparison

Choose your fine-tuning approach based on available resources and requirements:
| Method | GPU Memory (7B) | Training Speed | Performance | Use Case |
|---|---|---|---|---|
| Full-Parameter | ~43.5GB (2 GPUs) | Moderate | Highest | Production models with ample resources |
| LoRA | ~20.1GB (1 GPU) | Fast | High | Balanced approach for most use cases |
| LoRA (emb) | ~33.7GB (1 GPU) | Fast | High | Fine-tuning base models with new tokens |
| Q-LoRA | ~11.5GB (1 GPU) | Slower | Good | Limited GPU memory scenarios |
Memory statistics are for Qwen-7B with sequence length 256. Requirements increase with longer sequences.

Model Size Considerations

Memory Requirements by Model Size

Approximate GPU memory requirements for each model size, from the most memory-efficient method (Q-LoRA) to full-parameter training:

Qwen-1.8B
  • Q-LoRA: 5.8GB GPU memory
  • LoRA: 6.7GB GPU memory
  • Full-parameter: 43.5GB GPU memory (single GPU)
  • Suitable for consumer GPUs (RTX 3090, 4090)

Qwen-7B
  • Q-LoRA: 11.5GB GPU memory
  • LoRA: 20.1GB GPU memory
  • Full-parameter: Requires 2x A100 GPUs
  • Recommended for professional workstations

Qwen-14B
  • Q-LoRA: 18.7GB GPU memory
  • LoRA: Requires multiple GPUs or DeepSpeed ZeRO-3
  • Full-parameter: Requires 4+ A100 GPUs
  • Enterprise-grade hardware required

Qwen-72B
  • Q-LoRA: 61.4GB GPU memory (A100-80GB)
  • LoRA + DeepSpeed ZeRO-3: 4x A100-80GB GPUs
  • Full-parameter: Requires 8+ A100 GPUs
  • Large-scale training infrastructure needed

Key Features

Training Framework Support

All fine-tuning methods support:
  • DeepSpeed: Distributed training with ZeRO optimization (stages 2 and 3)
  • FSDP: Fully Sharded Data Parallel (alternative to DeepSpeed)
  • Flash Attention 2: Accelerated training and reduced memory usage
  • Gradient Checkpointing: Trade computation for memory savings

Supported Precision

```shell
# Full-parameter and LoRA: BF16 recommended (consistent with pretraining)
--bf16 True

# Q-LoRA: FP16 required by AutoGPTQ quantization
--fp16 True
```
Q-LoRA must use FP16 due to AutoGPTQ quantization requirements. Full-parameter and LoRA can use either BF16 or FP16, but BF16 is recommended for consistency with pretraining.

Training Script Overview

Qwen provides production-ready training scripts in the finetune/ directory:
```
finetune/
├── finetune.py                      # Main training script
├── finetune_ds.sh                   # Full-parameter (multi-GPU)
├── finetune_lora_single_gpu.sh      # LoRA (single GPU)
├── finetune_lora_ds.sh              # LoRA (multi-GPU/multi-node)
├── finetune_qlora_single_gpu.sh     # Q-LoRA (single GPU)
├── finetune_qlora_ds.sh             # Q-LoRA (multi-GPU)
├── ds_config_zero2.json             # DeepSpeed ZeRO-2 config
└── ds_config_zero3.json             # DeepSpeed ZeRO-3 config
```

Data Format

All fine-tuning methods use the same JSON conversation format:
```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "Hello"
      },
      {
        "from": "assistant",
        "value": "I am a language model, and my name is Qwen (Tongyi Qianwen)."
      }
    ]
  }
]
```
The training script automatically applies the ChatML format with system prompts. You only need to provide the user and assistant messages.
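For larger datasets it is convenient to generate this file programmatically. A minimal Python sketch (the file name and the `validate` helper are illustrative, not part of the Qwen tooling):

```python
import json

# Build one training sample in the conversation format shown above.
sample = {
    "id": "identity_0",
    "conversations": [
        {"from": "user", "value": "Hello"},
        {"from": "assistant", "value": "I am a language model named Qwen."},
    ],
}

def validate(record: dict) -> None:
    """Check the fields the format requires (a sketch, not an official validator)."""
    assert isinstance(record["id"], str)
    for turn in record["conversations"]:
        assert turn["from"] in {"user", "assistant"}
        assert isinstance(turn["value"], str)

validate(sample)

# The training data file is a single JSON list of such records.
with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```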

Quick Start

1. Install Dependencies

Install required packages for your chosen method:
```shell
# For all methods
pip install -r requirements.txt

# For LoRA and Q-LoRA
pip install "peft<0.8.0" deepspeed

# For Q-LoRA quantization
pip install auto-gptq optimum
```
2. Prepare Training Data

Create your training data in JSON format following the conversation structure above.
3. Choose Fine-tuning Method

Select the appropriate method based on your GPU memory and requirements:
  • Limited GPU memory (< 12GB): Use Q-LoRA with smaller models
  • Single GPU (16-40GB): Use LoRA
  • Multiple GPUs: Use LoRA or Full-parameter with DeepSpeed
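These rules of thumb can be expressed as a small helper. The function below is a hypothetical sketch mirroring the guidance above, not part of the Qwen scripts:

```python
# Hypothetical helper mapping available hardware to a suggested method.
# Thresholds follow this guide's recommendations for 7B-class models.
def choose_method(gpu_memory_gb: float, num_gpus: int = 1) -> str:
    if num_gpus > 1:
        return "LoRA or full-parameter with DeepSpeed"
    if gpu_memory_gb < 12:
        return "Q-LoRA (consider a smaller model)"
    if gpu_memory_gb <= 40:
        return "LoRA"
    return "LoRA or full-parameter"

print(choose_method(11))                # Q-LoRA (consider a smaller model)
print(choose_method(24))                # LoRA
print(choose_method(80, num_gpus=2))    # LoRA or full-parameter with DeepSpeed
```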
4. Launch Training

Run the corresponding training script with your model and data paths.
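As a sketch, the launch command can be assembled like this, assuming the scripts accept `-m` (model path) and `-d` (data path) as in the Qwen repository; verify the flags against your checkout before running:

```python
# Construct (without executing) a single-GPU LoRA launch command.
# Paths are placeholders; -m/-d flags are assumed from the Qwen repo scripts.
model_path = "Qwen/Qwen-7B-Chat"
data_path = "train_data.json"

cmd = ["bash", "finetune/finetune_lora_single_gpu.sh", "-m", model_path, "-d", data_path]
print(" ".join(cmd))
# To actually launch (requires the Qwen repo checkout):
# import subprocess; subprocess.run(cmd, check=True)
```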

Performance Benchmarks

Qwen-7B Fine-tuning Performance (Single A100-80GB)

| Sequence Length | LoRA Memory | LoRA Speed | Q-LoRA Memory | Q-LoRA Speed |
|---|---|---|---|---|
| 256 | 20.1GB | 1.2s/iter | 11.5GB | 3.0s/iter |
| 512 | 20.4GB | 1.5s/iter | 11.5GB | 3.0s/iter |
| 1024 | 21.5GB | 2.8s/iter | 12.3GB | 3.5s/iter |
| 2048 | 23.8GB | 5.2s/iter | 13.9GB | 7.0s/iter |
| 4096 | 29.7GB | 10.1s/iter | 16.9GB | 11.6s/iter |
| 8192 | 36.6GB | 21.3s/iter | 23.5GB | 22.3s/iter |
Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled
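The iteration times above can be converted into effective throughput (tokens/second = sequence length × batch size × gradient accumulation ÷ seconds per iteration), which shows that longer sequences improve hardware utilization even as per-iteration time grows:

```python
# Effective throughput implied by the benchmark table above.
BATCH_SIZE, GRAD_ACCUM = 1, 8  # settings stated under the table

rows = [  # (seq_len, lora_s_per_iter, qlora_s_per_iter)
    (256, 1.2, 3.0), (512, 1.5, 3.0), (1024, 2.8, 3.5),
    (2048, 5.2, 7.0), (4096, 10.1, 11.6), (8192, 21.3, 22.3),
]

for seq_len, lora_t, qlora_t in rows:
    tokens_per_iter = seq_len * BATCH_SIZE * GRAD_ACCUM
    print(f"{seq_len:>5}: LoRA {tokens_per_iter / lora_t:7.0f} tok/s, "
          f"Q-LoRA {tokens_per_iter / qlora_t:7.0f} tok/s")
```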

Special Considerations

Base Model vs Chat Model

When fine-tuning base models (e.g., Qwen-7B) with LoRA:
  • The embedding (wte) and output (lm_head) layers are automatically set as trainable
  • This is necessary for the model to learn ChatML format tokens
  • Requires more memory than fine-tuning chat models
  • Cannot use DeepSpeed ZeRO-3 with trainable embeddings
Chat models (e.g., Qwen-7B-Chat) already understand ChatML format:
  • No additional trainable parameters needed
  • Lower memory requirements
  • Compatible with DeepSpeed ZeRO-3
For memory-constrained scenarios, prefer fine-tuning chat models with Q-LoRA rather than base models with LoRA.

Next Steps

Data Preparation

Learn how to prepare high-quality training data

Multi-node Training

Scale training across multiple machines
