
Overview

The Qwen fine-tuning script builds on Hugging Face Transformers' TrainingArguments and adds custom argument classes for model loading, data processing, and LoRA training.

Argument Classes

Four argument classes configure different aspects of training:
from dataclasses import dataclass, field
from transformers import HfArgumentParser

# ModelArguments, DataArguments, TrainingArguments, and LoraArguments are
# dataclasses defined in finetune.py; TrainingArguments subclasses
# transformers.TrainingArguments.
parser = HfArgumentParser((
    ModelArguments,
    DataArguments,
    TrainingArguments,
    LoraArguments,
))

model_args, data_args, training_args, lora_args = parser.parse_args_into_dataclasses()
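
The field names and defaults documented below map directly onto these dataclasses. As a rough sketch (reconstructed from this reference, not copied from finetune.py), ModelArguments and DataArguments look roughly like:

from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch only: fields and defaults mirror the reference below,
# not the exact definitions in finetune.py.
@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="Qwen/Qwen-7B")

@dataclass
class DataArguments:
    data_path: str = field(
        default=None, metadata={"help": "Path to the training data JSON file."}
    )
    eval_data_path: str = field(
        default=None, metadata={"help": "Path to the evaluation data JSON file."}
    )
    lazy_preprocess: bool = False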

ModelArguments

Specify which model to fine-tune.
model_name_or_path
str
default:"Qwen/Qwen-7B"
Hugging Face model ID or local path to model checkpoint:
--model_name_or_path Qwen/Qwen-7B-Chat

DataArguments

Configure training and evaluation data.
data_path
str
required
Path to training data JSON file:
--data_path ./data/train.json
eval_data_path
str
default:"None"
Path to evaluation data JSON file (optional):
--eval_data_path ./data/eval.json
lazy_preprocess
bool
default:"False"
Use lazy data loading to reduce memory usage:
--lazy_preprocess
Enable for very large datasets that don’t fit in memory.
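
For reference, data_path and eval_data_path point at conversation-style JSON files. The snippet below writes a minimal single-example file; the exact field names ("id", "conversations", "from", "value") are taken from the Qwen fine-tuning README and should be verified against the repository before use.

import json

# Hedged sketch of the expected training data layout; verify the exact
# schema against the Qwen fine-tuning README.
sample = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "What is the capital of France?"},
            {"from": "assistant", "value": "The capital of France is Paris."},
        ],
    }
]

with open("./data/train.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)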

TrainingArguments

Extends standard Hugging Face TrainingArguments with Qwen-specific options.

Core Training Parameters

output_dir
str
required
Directory for saving model checkpoints and outputs:
--output_dir ./output/qwen-finetuned
num_train_epochs
int
default:"3"
Number of training epochs:
--num_train_epochs 5
per_device_train_batch_size
int
default:"8"
Batch size per GPU during training:
--per_device_train_batch_size 4
per_device_eval_batch_size
int
default:"8"
Batch size per GPU during evaluation:
--per_device_eval_batch_size 8
gradient_accumulation_steps
int
default:"1"
Number of steps to accumulate gradients before updating:
--gradient_accumulation_steps 4
Effective batch size = per_device_train_batch_size × gradient_accumulation_steps × number of GPUs (see the worked example below).
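
For instance, with the values used in the Complete Example at the end of this page and a hypothetical 4-GPU node:

# Effective global batch size for the Complete Example below,
# assuming a hypothetical 4-GPU machine.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 4  # assumption for illustration

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 64 samples per optimizer step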

Learning Rate

learning_rate
float
default:"5e-5"
Initial learning rate:
--learning_rate 1e-4
lr_scheduler_type
str
default:"linear"
Learning rate schedule:
  • linear: Linear decay
  • cosine: Cosine annealing
  • constant: No decay
--lr_scheduler_type cosine
warmup_steps
int
default:"0"
Number of warmup steps:
--warmup_steps 100
warmup_ratio
float
default:"0.0"
Fraction of total training steps used for warmup (alternative to warmup_steps):
--warmup_ratio 0.1
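In the Hugging Face Trainer, warmup_steps takes precedence when it is nonzero; otherwise warmup_ratio is converted into a step count. A quick sanity check with hypothetical numbers:

import math

# Rough conversion from warmup_ratio to warmup steps; the total step count
# here is a hypothetical number. In practice the Trainer derives it from the
# dataset size, effective batch size, and num_train_epochs.
total_training_steps = 1000  # assumption for illustration
warmup_ratio = 0.1

warmup_steps = math.ceil(total_training_steps * warmup_ratio)
print(warmup_steps)  # 100 warmup steps before the scheduler starts decaying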

Optimization

optim
str
default:"adamw_torch"
Optimizer to use:
  • adamw_torch: PyTorch AdamW
  • adamw_hf: Hugging Face AdamW
  • adafactor: Adafactor (memory efficient)
--optim adamw_torch
weight_decay
float
default:"0.0"
Weight decay coefficient:
--weight_decay 0.01
adam_beta1
float
default:"0.9"
Adam beta1 parameter
adam_beta2
float
default:"0.999"
Adam beta2 parameter
max_grad_norm
float
default:"1.0"
Maximum gradient norm for clipping:
--max_grad_norm 1.0

Model Configuration

model_max_length
int
default:"8192"
Maximum sequence length (input + output):
--model_max_length 4096
use_lora
bool
default:"False"
Enable LoRA fine-tuning:
--use_lora
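Passing --use_lora switches training to the LoRA path configured by the LoraArguments class parsed earlier. As a hedged sketch (field names and defaults here are assumptions based on common LoRA setups, not taken from this reference), it looks roughly like:

from dataclasses import dataclass, field
from typing import List

# Assumed sketch of LoraArguments; check finetune.py for the real fields
# and defaults.
@dataclass
class LoraArguments:
    lora_r: int = 64                  # LoRA rank
    lora_alpha: int = 16              # scaling factor
    lora_dropout: float = 0.05
    lora_target_modules: List[str] = field(
        default_factory=lambda: ["c_attn", "c_proj", "w1", "w2"]
    )
    lora_bias: str = "none"
    q_lora: bool = False              # quantized LoRA (QLoRA)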
cache_dir
str
default:"None"
Directory for caching downloaded models:
--cache_dir ./cache

Checkpointing

save_strategy
str
default:"steps"
When to save checkpoints:
  • steps: Every save_steps
  • epoch: Every epoch
  • no: No saving
--save_strategy steps
save_steps
int
default:"500"
Save checkpoint every N steps:
--save_steps 200
save_total_limit
int
default:"None"
Maximum number of checkpoints to keep:
--save_total_limit 3

Evaluation

evaluation_strategy
str
default:"no"
When to run evaluation:
  • steps: Every eval_steps
  • epoch: Every epoch
  • no: No evaluation
--evaluation_strategy steps
eval_steps
int
default:"None"
Evaluate every N steps:
--eval_steps 100

Logging

logging_steps
int
default:"500"
Log metrics every N steps:
--logging_steps 10
logging_dir
str
default:"None"
TensorBoard log directory:
--logging_dir ./logs
report_to
str | list
default:"all"
Reporting integrations:
  • tensorboard
  • wandb
  • none
--report_to tensorboard

Performance

fp16
bool
default:"False"
Use FP16 mixed precision:
--fp16
bf16
bool
default:"False"
Use BF16 mixed precision (recommended on Ampere or newer GPUs):
--bf16
gradient_checkpointing
bool
default:"False"
Enable gradient checkpointing to reduce memory:
--gradient_checkpointing
deepspeed
str
default:"None"
Path to DeepSpeed config file:
--deepspeed ./ds_config.json
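The repository typically ships reference DeepSpeed configs; the sketch below builds a minimal ZeRO stage-2 config in which the "auto" values defer to the corresponding TrainingArguments. It is an illustrative example, not the file shipped with the repo.

import json

# Minimal ZeRO stage-2 config sketch; "auto" lets the Hugging Face Trainer
# fill in values from TrainingArguments. Illustrative only.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)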

Distributed Training

local_rank
int
default:"-1"
Local rank for distributed training (set automatically)
ddp_find_unused_parameters
bool
default:"False"
Find unused parameters in DDP:
--ddp_find_unused_parameters

Complete Example

python finetune.py \
  --model_name_or_path Qwen/Qwen-7B \
  --data_path ./data/train.json \
  --eval_data_path ./data/eval.json \
  --output_dir ./output/qwen-finetuned \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --per_device_eval_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy steps \
  --eval_steps 100 \
  --save_strategy steps \
  --save_steps 200 \
  --save_total_limit 3 \
  --learning_rate 1e-4 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --logging_steps 10 \
  --model_max_length 2048 \
  --gradient_checkpointing \
  --bf16 \
  --deepspeed ds_config.json
