Overview
Full-parameter fine-tuning is recommended when:
- You have access to multiple high-memory GPUs (A100 80GB or similar)
- You need maximum model performance for production deployment
- Your task requires significant deviation from the pretrained behavior
- You can afford longer training times and higher computational costs
Hardware Requirements
Memory Requirements by Model Size
| Model | GPUs Required | Memory per GPU | Total GPU Memory |
|---|---|---|---|
| Qwen-1.8B | 1x A100 | 43.5GB | 43.5GB |
| Qwen-7B | 2x A100 | ~40GB | ~80GB |
| Qwen-14B | 4x A100 | ~30GB | ~120GB |
| Qwen-72B | 8x A100 | ~80GB | ~640GB |
Performance Benchmarks
Qwen-1.8B on Single A100-80GB:
| Sequence Length | GPU Memory | Speed |
|---|---|---|
| 256 | 43.5GB | 2.1s/iter |
| 512 | 43.5GB | 2.2s/iter |
| 1024 | 43.5GB | 2.2s/iter |
| 2048 | 43.5GB | 2.3s/iter |
| 4096 | 47.1GB | 2.8s/iter |
| 8192 | 48.3GB | 5.6s/iter |
Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled, BF16 precision
Installation
Install the required dependencies:
Training Configuration
Basic Training Script
The `finetune/finetune_ds.sh` script provides a complete configuration for distributed full-parameter training:
finetune/finetune_ds.sh
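The script itself is not reproduced here; as a rough sketch, a torchrun + DeepSpeed launch of this kind typically looks like the following. Flag names follow the Hugging Face `TrainingArguments` / Qwen `finetune.py` conventions, and the specific values are assumptions drawn from the hyperparameters discussed below, not the shipped script:

```shell
# Illustrative sketch of a distributed full-parameter launch.
# Values (model, data path, epochs) are placeholders, not the
# repository's actual script contents.
GPUS_PER_NODE=2

torchrun --nproc_per_node $GPUS_PER_NODE finetune.py \
    --model_name_or_path Qwen/Qwen-7B \
    --data_path data.json \
    --bf16 True \
    --output_dir output_qwen \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-5 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --model_max_length 2048 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed finetune/ds_config_zero3.json
```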
Running Training
Prepare Your Data
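Qwen's `finetune.py` expects a JSON list of multi-turn conversations. A minimal, hypothetical example (file name `example_data.json` and the exact schema are assumptions based on the official examples):

```python
import json

# Hypothetical minimal training file in the conversations format
# expected by Qwen's finetune.py (schema assumed, not normative).
samples = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "What is the capital of France?"},
            {"from": "assistant", "value": "The capital of France is Paris."},
        ],
    }
]

with open("example_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

Pass the resulting file to the training script via `--data_path`.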
Monitor Progress
Training logs will show loss and learning rate.
Checkpoints are saved to output_qwen/ every 1000 steps.
DeepSpeed Configuration
Full-parameter training uses DeepSpeed ZeRO-3 to distribute model parameters across GPUs:
finetune/ds_config_zero3.json
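The full config file lives in the repository; its core ZeRO-3 settings look roughly like the sketch below. The keys are standard DeepSpeed options, but the specific values here are illustrative, not a copy of the shipped file:

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```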
ZeRO-3 Features
- Parameter Sharding: Distributes all model parameters across GPUs
- Gradient Sharding: Distributes gradients across GPUs
- Optimizer State Sharding: Distributes optimizer states across GPUs
- Communication Overlap: Overlaps communication with computation
Hyperparameter Guide
Learning Rate
Recommended Learning Rates
- Conservative: 5e-6 (safer, slower convergence)
- Standard: 1e-5 (recommended starting point)
- Aggressive: 2e-5 (faster convergence, risk of instability)
Learning Rate Scheduling
- Cosine decay with 1% warmup steps
- Gradually reduces learning rate over training
- Helps achieve better convergence
Batch Size and Gradient Accumulation
Effective batch size = per_device_batch_size × gradient_accumulation_steps × num_gpus
For 2 GPUs: 1 × 16 × 2 = 32 effective batch size
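As a quick arithmetic check of the formula above:

```python
# Effective batch size = per-device batch × grad-accum steps × GPUs.
per_device_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 2

effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 32
```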
Sequence Length
| Max Length | Memory Impact | Use Case |
|---|---|---|
| 512 | Baseline | Short conversations |
| 1024 | +10-20% | Standard conversations |
| 2048 | +30-50% | Long conversations |
| 4096 | +100% or more | Very long contexts |
| 8192 | +200% or more | Maximum context |
Training Duration
Number of Epochs
Dataset Size Guidelines
- Small (less than 1K samples): 10-20 epochs
- Medium (1K-10K samples): 3-5 epochs
- Large (more than 10K samples): 1-3 epochs
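These epoch counts translate into optimizer steps via the effective batch size from the previous section. A quick sanity check (the dataset size here is illustrative):

```python
import math

dataset_size = 5000        # a "medium" dataset per the guidelines
effective_batch_size = 32  # from the batch-size section above

# One optimizer step consumes one effective batch.
steps_per_epoch = math.ceil(dataset_size / effective_batch_size)
print(steps_per_epoch)  # 157
```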
Monitoring Overfitting
Watch for these signs of overfitting:
- Training loss continues decreasing while validation loss increases
- Model memorizes training examples verbatim
- Poor generalization to new inputs
If you observe overfitting:
- Reduce the number of epochs
- Increase dataset size
- Add regularization (weight decay)
Checkpointing
- Saves checkpoint every 1000 steps
- Keeps only the last 10 checkpoints
- Automatically deletes older checkpoints to save disk space
Checkpoint Structure
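A checkpoint directory saved by the HF Trainer typically looks like the listing below. Exact file names vary by transformers version and model size, so treat this as illustrative:

```text
output_qwen/
└── checkpoint-1000/
    ├── config.json
    ├── generation_config.json
    ├── model-00001-of-00002.safetensors
    ├── model.safetensors.index.json
    ├── tokenizer_config.json
    └── trainer_state.json
```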
Advanced Options
Gradient Checkpointing
- Memory savings: 30-50% reduction
- Speed impact: 20-30% slower training
- Recommended: Always enable for full-parameter training
Mixed Precision Training
BF16 is the recommended precision for Qwen:
- Wider dynamic range than FP16
- No loss scaling required
- Consistent with Qwen pretraining
- Requires Ampere GPUs or newer (A100, RTX 30xx+)
Troubleshooting
Out of Memory Errors
Solutions:
- Reduce `model_max_length`
- Enable `gradient_checkpointing`
- Reduce `per_device_train_batch_size` to 1
- Add more GPUs
- Use DeepSpeed ZeRO-3 with CPU offloading:
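CPU offloading is enabled in the DeepSpeed config. The relevant ZeRO-3 keys are standard DeepSpeed options; the fragment below is a sketch, not the repository's actual offload config:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```

Offloading trades GPU memory for host-memory bandwidth, so expect slower iterations in exchange for fitting larger models.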
Training Divergence (Loss → NaN)
Causes and solutions:
- Learning rate too high: Reduce to 5e-6
- Gradient explosion: Enable gradient clipping (automatic with DeepSpeed)
- Data quality issues: Check for corrupted samples
- Mixed precision issues: Try BF16 instead of FP16
Slow Training Speed
Optimizations:
- Enable Flash Attention 2
- Use `--lazy_preprocess True`
- Increase `gradient_accumulation_steps`, reduce `save_steps`
- Ensure high-bandwidth inter-GPU connections (NVLink)
- Profile with:
DeepSpeed Initialization Errors
Common fixes:
- Install compatible versions: `torch>=2.0`, `deepspeed>=0.10`
- Check CUDA version compatibility
- Verify all GPUs are accessible: `nvidia-smi`
- Ensure consistent PyTorch versions across all nodes (for multi-node)
Inference After Training
Load and use your fine-tuned model:
Next Steps
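A minimal loading sketch with Hugging Face transformers, assuming the `output_qwen` directory from the training config above. The `chat()` helper is provided by the Qwen model class when `trust_remote_code=True`:

```python
# Illustrative sketch: load the fine-tuned weights from the
# training output directory and run a single chat turn.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "output_qwen", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "output_qwen", device_map="auto", trust_remote_code=True
).eval()

# chat() returns (response, updated_history) for Qwen models.
response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
```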
LoRA Fine-tuning
Learn about memory-efficient LoRA training
Multi-node Training
Scale to multiple machines for even larger models