Full-parameter fine-tuning updates all model weights during training, providing the most comprehensive adaptation to your task. This method achieves the best performance but requires significant GPU resources.

Overview

Full-parameter fine-tuning is recommended when:
  • You have access to multiple high-memory GPUs (A100 80GB or similar)
  • You need maximum model performance for production deployment
  • Your task requires significant deviation from the pretrained behavior
  • You can afford longer training times and higher computational costs
Full-parameter fine-tuning of Qwen-7B requires at least 2x A100-80GB GPUs. Single-GPU training will result in out-of-memory errors.

Hardware Requirements

Memory Requirements by Model Size

| Model | GPUs Required | Memory per GPU | Total GPU Memory |
|---|---|---|---|
| Qwen-1.8B | 1x A100 | 43.5GB | 43.5GB |
| Qwen-7B | 2x A100 | ~40GB | ~80GB |
| Qwen-14B | 4x A100 | ~30GB | ~120GB |
| Qwen-72B | 8x A100 | ~80GB | ~640GB |
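As a rough sanity check on these numbers: with Adam in mixed precision, the model states alone take about 16 bytes per parameter (BF16 weights and gradients, plus FP32 master weights, momentum, and variance), and ZeRO-3 shards them evenly across GPUs. The sketch below (`full_finetune_memory_gb` is a hypothetical helper, not part of the repo) estimates that upper bound; it ignores activations, buffers, and fragmentation, so measured usage will differ from this figure.

```python
def full_finetune_memory_gb(num_params_billion, num_gpus=1, bytes_per_param=16):
    """Rough per-GPU memory (GB) for model states under ZeRO-3.

    bytes_per_param=16 assumes BF16 weights (2) + BF16 gradients (2)
    + FP32 Adam states: master weights, momentum, variance (12).
    Activations, buffers, and fragmentation are NOT included.
    """
    total_gb = num_params_billion * 1e9 * bytes_per_param / 1024**3
    return total_gb / num_gpus
```

For example, `full_finetune_memory_gb(7, num_gpus=2)` gives roughly 52 GB of model states per GPU before any DeepSpeed memory optimizations, which is why techniques like gradient checkpointing (covered below) matter in practice.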

Performance Benchmarks

Qwen-1.8B on Single A100-80GB:
| Sequence Length | GPU Memory | Speed |
|---|---|---|
| 256 | 43.5GB | 2.1s/iter |
| 512 | 43.5GB | 2.2s/iter |
| 1024 | 43.5GB | 2.2s/iter |
| 2048 | 43.5GB | 2.3s/iter |
| 4096 | 47.1GB | 2.8s/iter |
| 8192 | 48.3GB | 5.6s/iter |
Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled, BF16 precision

Installation

Install the required dependencies:
# Install base requirements
pip install -r requirements.txt

# Install DeepSpeed for distributed training
pip install deepspeed

# Install Flash Attention 2 (optional but recommended)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
Flash Attention 2 significantly reduces memory usage and improves training speed. Highly recommended for full-parameter training.

Training Configuration

Basic Training Script

The finetune/finetune_ds.sh script provides a complete configuration for distributed full-parameter training:
finetune/finetune_ds.sh
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

# Number of GPUs per node
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')

# Multi-node settings (for single node, use defaults)
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="Qwen/Qwen-7B"
DATA="path_to_data.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed finetune/ds_config_zero3.json

Running Training

Step 1: Prepare Your Data

Create your training data in the required JSON format:
[
  {
    "id": "sample_1",
    "conversations": [
      {"from": "user", "value": "Your question here"},
      {"from": "assistant", "value": "Expected response here"}
    ]
  }
]
See Data Preparation for details.
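Before launching a long training run, it can be worth checking that every sample matches this schema. The snippet below is a minimal validation sketch (`validate_dataset` is a hypothetical helper, not part of the repo), assuming the JSON format shown above:

```python
import json

def validate_dataset(path):
    """Check that a training file matches the expected conversation schema.

    Raises AssertionError on the first malformed sample; returns the
    sample count on success.
    """
    with open(path) as f:
        samples = json.load(f)
    for i, sample in enumerate(samples):
        assert "id" in sample and "conversations" in sample, f"sample {i}: missing keys"
        for turn in sample["conversations"]:
            assert turn["from"] in ("user", "assistant"), f"sample {i}: unknown role"
            assert isinstance(turn["value"], str) and turn["value"], f"sample {i}: empty value"
    return len(samples)
```

Running this once before training is far cheaper than discovering a corrupted sample mid-run.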
Step 2: Launch Training

Run the training script with your model and data paths. Note that the script shown above hardcodes MODEL and DATA, so either edit those variables directly or pass them in if your copy of finetune_ds.sh parses the -m/-d flags:
bash finetune/finetune_ds.sh \
  -m Qwen/Qwen-7B \
  -d /path/to/your/data.json
Step 3: Monitor Progress

Training logs will show loss and learning rate:
{'loss': 2.345, 'learning_rate': 1e-05, 'epoch': 0.1}
{'loss': 1.876, 'learning_rate': 9.5e-06, 'epoch': 0.2}
Checkpoints are saved to output_qwen/ every 1000 steps.
Step 4: Load Fine-tuned Model

After training completes, load your model:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "output_qwen",
    device_map="auto",
    trust_remote_code=True
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "output_qwen",
    trust_remote_code=True
)

response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)

DeepSpeed Configuration

Full-parameter training uses DeepSpeed ZeRO-3 to distribute model parameters across GPUs:
finetune/ds_config_zero3.json
{
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

ZeRO-3 Features

  • Parameter Sharding: Distributes all model parameters across GPUs
  • Gradient Sharding: Distributes gradients across GPUs
  • Optimizer State Sharding: Distributes optimizer states across GPUs
  • Communication Overlap: Overlaps communication with computation
ZeRO-3 enables training of much larger models than would fit on a single GPU, but requires high inter-GPU bandwidth for best performance.

Hyperparameter Guide

Learning Rate

--learning_rate 1e-5
Full-parameter fine-tuning uses a lower learning rate (1e-5) compared to LoRA (3e-4) because all parameters are being updated.
--lr_scheduler_type "cosine" \
--warmup_ratio 0.01
  • Cosine decay with 1% warmup steps
  • Gradually reduces learning rate over training
  • Helps achieve better convergence
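The schedule produced by these two flags can be sketched in a few lines. The helper below (`cosine_lr` is a hypothetical name, not the Trainer's internal implementation) shows the shape: linear warmup over the first 1% of steps, then cosine decay from the peak learning rate toward zero:

```python
import math

def cosine_lr(step, total_steps, peak_lr=1e-5, warmup_ratio=0.01, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For a 1000-step run, the rate climbs to 1e-5 by step 10, then falls smoothly to ~0 by the final step, matching the decaying `learning_rate` values visible in the training logs.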

Batch Size and Gradient Accumulation

--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16
Effective batch size = per_device_batch_size × gradient_accumulation_steps × num_gpus. For 2 GPUs: 1 × 16 × 2 = 32.
Increasing per_device_train_batch_size beyond 1 may cause OOM errors. Adjust gradient_accumulation_steps instead.
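The arithmetic above can be captured in a one-line helper (`effective_batch_size` is a hypothetical name for illustration), which is handy when tuning gradient_accumulation_steps to hold the global batch size constant across different GPU counts:

```python
def effective_batch_size(per_device=1, grad_accum=16, num_gpus=2):
    """Global batch size seen by the optimizer per update step."""
    return per_device * grad_accum * num_gpus

# The 2-GPU configuration above: 1 x 16 x 2 = 32
```

For example, moving from 2 to 4 GPUs while halving grad_accum to 8 keeps the effective batch size at 32, so the learning rate does not need retuning.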

Sequence Length

--model_max_length 512
Longer sequences require more memory:
| Max Length | Memory Impact | Use Case |
|---|---|---|
| 512 | Baseline | Short conversations |
| 1024 | +10-20% | Standard conversations |
| 2048 | +30-50% | Long conversations |
| 4096 | +100%+ | Very long contexts |
| 8192 | +200%+ | Maximum context |

Training Duration

Number of Epochs

--num_train_epochs 5
  • Small (less than 1K samples): 10-20 epochs
  • Medium (1K-10K samples): 3-5 epochs
  • Large (more than 10K samples): 1-3 epochs
Watch for these signs of overfitting:
  • Training loss continues decreasing while validation loss increases
  • Model memorizes training examples verbatim
  • Poor generalization to new inputs
Solutions:
  • Reduce number of epochs
  • Increase dataset size
  • Add regularization (weight decay)

Checkpointing

--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10
  • Saves checkpoint every 1000 steps
  • Keeps only the last 10 checkpoints
  • Automatically deletes older checkpoints to save disk space

Checkpoint Structure

output_qwen/
├── checkpoint-1000/
│   ├── pytorch_model.bin
│   ├── config.json
│   ├── trainer_state.json
│   └── optimizer.pt
├── checkpoint-2000/
└── ...

Advanced Options

Gradient Checkpointing

--gradient_checkpointing True
Trades computation for memory by recomputing activations during the backward pass:
  • Memory savings: 30-50% reduction
  • Speed impact: 20-30% slower training
  • Recommended: Always enable for full-parameter training

Mixed Precision Training

--bf16 True
BF16 advantages:
  • Wider dynamic range than FP16
  • No loss scaling required
  • Consistent with Qwen pretraining
  • Requires Ampere GPUs or newer (A100, RTX 30xx+)

Troubleshooting

Out-of-Memory (OOM) Errors

Solutions:
  1. Reduce model_max_length
  2. Enable gradient_checkpointing
  3. Reduce per_device_train_batch_size to 1
  4. Add more GPUs
  5. Use DeepSpeed ZeRO-3 with CPU offloading:
"offload_param": {
    "device": "cpu",
    "pin_memory": true
}
Loss Not Decreasing or NaN

Causes and solutions:
  • Learning rate too high: Reduce to 5e-6
  • Gradient explosion: Enable gradient clipping (automatic with DeepSpeed)
  • Data quality issues: Check for corrupted samples
  • Mixed precision issues: Try BF16 instead of FP16
Slow Training Speed

Optimizations:
  1. Enable Flash Attention 2
  2. Use --lazy_preprocess True
  3. Increase gradient_accumulation_steps, reduce save_steps
  4. Ensure high-bandwidth inter-GPU connection (NVLink)
  5. Profile with:
--report_to "tensorboard" \
--logging_dir ./logs
Distributed Training Errors

Common fixes:
  • Install compatible versions: torch>=2.0, deepspeed>=0.10
  • Check CUDA version compatibility
  • Verify all GPUs are accessible: nvidia-smi
  • Ensure consistent PyTorch versions across all nodes (for multi-node)

Inference After Training

Load and use your fine-tuned model:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "output_qwen",
    device_map="auto",
    trust_remote_code=True,
    bf16=True
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "output_qwen",
    trust_remote_code=True
)

# Single-turn conversation
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=None
)
print(response)

# Multi-turn conversation
response, history = model.chat(
    tokenizer,
    "Tell me more about that",
    history=history
)
print(response)

Next Steps

LoRA Fine-tuning

Learn about memory-efficient LoRA training

Multi-node Training

Scale to multiple machines for even larger models
