Overview
Full-parameter fine-tuning is recommended when:
- You have access to multiple high-memory GPUs (A100 80GB or similar)
- You need maximum model performance for production deployment
- Your task requires significant deviation from the pretrained behavior
- You can afford longer training times and higher computational costs
Hardware Requirements
Memory Requirements by Model Size
| Model | GPUs Required | Memory per GPU | Total GPU Memory |
|---|---|---|---|
| Qwen-1.8B | 1x A100 | 43.5GB | 43.5GB |
| Qwen-7B | 2x A100 | ~40GB | ~80GB |
| Qwen-14B | 4x A100 | ~30GB | ~120GB |
| Qwen-72B | 8x A100 | ~80GB | ~640GB |
Performance Benchmarks
Qwen-1.8B on Single A100-80GB:
| Sequence Length | GPU Memory | Speed |
|---|---|---|
| 256 | 43.5GB | 2.1s/iter |
| 512 | 43.5GB | 2.2s/iter |
| 1024 | 43.5GB | 2.2s/iter |
| 2048 | 43.5GB | 2.3s/iter |
| 4096 | 47.1GB | 2.8s/iter |
| 8192 | 48.3GB | 5.6s/iter |
Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled, BF16 precision
Installation
Install the required dependencies:
Training Configuration
Basic Training Script
The `finetune/finetune_ds.sh` script provides a complete configuration for distributed full-parameter training:
finetune/finetune_ds.sh
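The script itself is not reproduced here; as a rough sketch, a torchrun + DeepSpeed launch of this kind typically looks like the following. Flag names follow the Hugging Face `TrainingArguments` / Qwen `finetune.py` conventions, and the specific values are assumptions drawn from the hyperparameters discussed below, not the shipped script:

```shell
# Illustrative sketch of a distributed full-parameter launch.
# Values (model, data path, epochs) are placeholders, not the
# repository's actual script contents.
GPUS_PER_NODE=2

torchrun --nproc_per_node $GPUS_PER_NODE finetune.py \
    --model_name_or_path Qwen/Qwen-7B \
    --data_path data.json \
    --bf16 True \
    --output_dir output_qwen \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-5 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --model_max_length 2048 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed finetune/ds_config_zero3.json
```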
Running Training
Prepare Your Data
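Qwen's `finetune.py` expects a JSON list of multi-turn conversations. A minimal, hypothetical example (file name `example_data.json` and the exact schema are assumptions based on the official examples):

```python
import json

# Hypothetical minimal training file in the conversations format
# expected by Qwen's finetune.py (schema assumed, not normative).
samples = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "What is the capital of France?"},
            {"from": "assistant", "value": "The capital of France is Paris."},
        ],
    }
]

with open("example_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

Pass the resulting file to the training script via `--data_path`.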
Monitor Progress
Training logs will show loss and learning rate.
Checkpoints are saved to output_qwen/ every 1000 steps.
DeepSpeed Configuration
Full-parameter training uses DeepSpeed ZeRO-3 to distribute model parameters across GPUs:
finetune/ds_config_zero3.json
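The full config file lives in the repository; its core ZeRO-3 settings look roughly like the sketch below. The keys are standard DeepSpeed options, but the specific values here are illustrative, not a copy of the shipped file:

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```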
ZeRO-3 Features
- Parameter Sharding: Distributes all model parameters across GPUs
- Gradient Sharding: Distributes gradients across GPUs
- Optimizer State Sharding: Distributes optimizer states across GPUs
- Communication Overlap: Overlaps communication with computation
Hyperparameter Guide
Learning Rate
Recommended Learning Rates
- Conservative: 5e-6 (safer, slower convergence)
- Standard: 1e-5 (recommended starting point)
- Aggressive: 2e-5 (faster convergence, risk of instability)
Learning Rate Scheduling
- Cosine decay with 1% warmup steps
- Gradually reduces learning rate over training
- Helps achieve better convergence
Batch Size and Gradient Accumulation
Effective batch size = per_device_batch_size × gradient_accumulation_steps × num_gpus
For 2 GPUs: 1 × 16 × 2 = 32 effective batch size
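As a quick arithmetic check of the formula above:

```python
# Effective batch size = per-device batch × grad-accum steps × GPUs.
per_device_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 2

effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 32
```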
Sequence Length
| Max Length | Memory Impact | Use Case |
|---|---|---|
| 512 | Baseline | Short conversations |
| 1024 | +10-20% | Standard conversations |
| 2048 | +30-50% | Long conversations |
| 4096 | +100% or more | Very long contexts |
| 8192 | +200% or more | Maximum context |
Training Duration
Number of Epochs
Dataset Size Guidelines
- Small (less than 1K samples): 10-20 epochs
- Medium (1K-10K samples): 3-5 epochs
- Large (more than 10K samples): 1-3 epochs
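These epoch counts translate into optimizer steps via the effective batch size from the previous section. A quick sanity check (the dataset size here is illustrative):

```python
import math

dataset_size = 5000        # a "medium" dataset per the guidelines
effective_batch_size = 32  # from the batch-size section above

# One optimizer step consumes one effective batch.
steps_per_epoch = math.ceil(dataset_size / effective_batch_size)
print(steps_per_epoch)  # 157
```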
Monitoring Overfitting
Watch for these signs of overfitting:
- Training loss continues decreasing while validation loss increases
- Model memorizes training examples verbatim
- Poor generalization to new inputs
If you observe overfitting:
- Reduce the number of epochs
- Increase dataset size
- Add regularization (weight decay)
Checkpointing
- Saves checkpoint every 1000 steps
- Keeps only the last 10 checkpoints
- Automatically deletes older checkpoints to save disk space
Checkpoint Structure
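A checkpoint directory saved by the HF Trainer typically looks like the listing below. Exact file names vary by transformers version and model size, so treat this as illustrative:

```text
output_qwen/
└── checkpoint-1000/
    ├── config.json
    ├── generation_config.json
    ├── model-00001-of-00002.safetensors
    ├── model.safetensors.index.json
    ├── tokenizer_config.json
    └── trainer_state.json
```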
Advanced Options
Gradient Checkpointing
- Memory savings: 30-50% reduction
- Speed impact: 20-30% slower training
- Recommended: Always enable for full-parameter training
Mixed Precision Training
BF16 is the recommended precision for Qwen:
- Wider dynamic range than FP16
- No loss scaling required
- Consistent with Qwen pretraining
- Requires Ampere GPUs or newer (A100, RTX 30xx+)
Troubleshooting
Out of Memory Errors
Solutions:
- Reduce `model_max_length`
- Enable `gradient_checkpointing`
- Reduce `per_device_train_batch_size` to 1
- Add more GPUs
- Use DeepSpeed ZeRO-3 with CPU offloading:
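CPU offloading is enabled in the DeepSpeed config. The relevant ZeRO-3 keys are standard DeepSpeed options; the fragment below is a sketch, not the repository's actual offload config:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```

Offloading trades GPU memory for host-memory bandwidth, so expect slower iterations in exchange for fitting larger models.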
Training Divergence (Loss → NaN)
Causes and solutions:
- Learning rate too high: Reduce to 5e-6
- Gradient explosion: Enable gradient clipping (automatic with DeepSpeed)
- Data quality issues: Check for corrupted samples
- Mixed precision issues: Try BF16 instead of FP16
Slow Training Speed
Optimizations:
- Enable Flash Attention 2
- Use `--lazy_preprocess True`
- Increase `gradient_accumulation_steps`, reduce `save_steps`
- Ensure high-bandwidth inter-GPU connections (NVLink)
- Profile with:
DeepSpeed Initialization Errors
Common fixes:
- Install compatible versions: `torch>=2.0`, `deepspeed>=0.10`
- Check CUDA version compatibility
- Verify all GPUs are accessible: `nvidia-smi`
- Ensure consistent PyTorch versions across all nodes (for multi-node)
Inference After Training
Load and use your fine-tuned model:
Next Steps
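A minimal loading sketch with Hugging Face transformers, assuming the `output_qwen` directory from the training config above. The `chat()` helper is provided by the Qwen model class when `trust_remote_code=True`:

```python
# Illustrative sketch: load the fine-tuned weights from the
# training output directory and run a single chat turn.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "output_qwen", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "output_qwen", device_map="auto", trust_remote_code=True
).eval()

# chat() returns (response, updated_history) for Qwen models.
response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
```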
LoRA Fine-tuning
Learn about memory-efficient LoRA training
Multi-node Training
Scale to multiple machines for even larger models