This guide covers everything you need to install and configure Qwen models, from basic dependencies to advanced optimizations.
System Requirements
Minimum Requirements
Python: 3.8 or higher
PyTorch: 1.12+ (2.0+ recommended)
CUDA: 11.4+ (for GPU users)
GPU Memory: varies by model size (see table below)
GPU Memory Requirements
Minimum GPU memory needed for inference (generating 2048 tokens):
Model       BF16/FP16           Int8                Int4
Qwen-1.8B   4.23GB              3.48GB              2.91GB
Qwen-7B     16.99GB             11.20GB             8.21GB
Qwen-14B    30.15GB             18.81GB             13.01GB
Qwen-72B    144.69GB (2xA100)   81.27GB (2xA100)    48.86GB
For fine-tuning, memory requirements are higher. Q-LoRA requires, at minimum:
Qwen-1.8B: 5.8GB
Qwen-7B: 11.5GB
Qwen-14B: 18.7GB
Qwen-72B: 61.4GB
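Before downloading weights, you can compare these figures against the memory you actually have. A minimal sketch in Python (the numbers are copied from the inference table above; `INFERENCE_GB` and `fits` are hypothetical names, not part of Qwen):

```python
# Inference memory requirements (GB) copied from the table above.
# (Qwen-72B is omitted because it spans multiple GPUs.)
INFERENCE_GB = {
    ("Qwen-1.8B", "bf16"): 4.23, ("Qwen-1.8B", "int8"): 3.48, ("Qwen-1.8B", "int4"): 2.91,
    ("Qwen-7B", "bf16"): 16.99, ("Qwen-7B", "int8"): 11.20, ("Qwen-7B", "int4"): 8.21,
    ("Qwen-14B", "bf16"): 30.15, ("Qwen-14B", "int8"): 18.81, ("Qwen-14B", "int4"): 13.01,
}

def fits(model: str, precision: str, available_gb: float) -> bool:
    """True if the table says `model` at `precision` fits in `available_gb`."""
    return INFERENCE_GB[(model, precision)] <= available_gb

print(fits("Qwen-7B", "int4", 12.0))   # True: a 12GB card can serve Qwen-7B Int4
print(fits("Qwen-14B", "bf16", 24.0))  # False: 24GB is not enough for Qwen-14B BF16
```

On a real machine you would pass `torch.cuda.get_device_properties(0).total_memory / 1024**3` as `available_gb`.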
Basic Installation
Step 1: Install Core Dependencies
Install the required Python packages from the requirements file:
pip install "transformers>=4.32.0,<4.38.0" accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
Understanding the Dependencies
transformers: Hugging Face library for loading and running models
accelerate: Efficient model loading and distributed inference
tiktoken: Fast tokenization library
einops: Tensor operations for attention mechanisms
transformers_stream_generator: Streaming text generation support
scipy: Scientific computing utilities
Step 2: Verify Installation
Test your installation with this simple script:
import torch
import transformers
from transformers import AutoTokenizer

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
Flash Attention (Recommended)
Flash Attention significantly improves inference speed and reduces memory usage. Installation is optional but highly recommended.
Check Compatibility
Flash Attention requires:
GPU with FP16 or BF16 support
CUDA 11.4 or higher
PyTorch 1.12 or higher
Verify your setup:

import torch

if torch.cuda.is_available():
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
    print(f"FP16 supported: {torch.cuda.get_device_capability()[0] >= 7}")
else:
    print("CUDA not available")
Install Flash Attention
Qwen supports Flash Attention 2 for optimal performance:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
pip install .
This installation may take 10-30 minutes as it compiles CUDA kernels. Ensure you have sufficient disk space (~5GB for build files).
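Because the build consumes scratch space, a quick stdlib check of your temp directory can catch an out-of-space failure before the 10-30 minute compile starts (a sketch; the 5GB threshold comes from the note above):

```python
import shutil
import tempfile

# The flash-attention build needs roughly 5 GB of temporary space.
tmp = tempfile.gettempdir()
free_gb = shutil.disk_usage(tmp).free / 1024**3
print(f"Free space in {tmp}: {free_gb:.1f} GB")
if free_gb < 5:
    print("Consider pointing TMPDIR at a larger volume before building")
```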
Install Optional Components (Flash Attention v2.1.1 and below)
For older versions of Flash Attention, you may optionally install additional components:

# Optional: layer norm optimization
pip install csrc/layer_norm
# Optional: rotary embedding optimization (not needed for flash-attn > 2.1.1)
pip install csrc/rotary

These are optional and may slow down the installation. Skip both if your flash-attention version is higher than 2.1.1.
Verify Flash Attention
Test that Flash Attention is working:

try:
    import flash_attn
    print(f"Flash Attention version: {flash_attn.__version__}")
    print("Flash Attention installed successfully!")
except ImportError:
    print("Flash Attention not available")
With Flash Attention enabled, you can typically expect:
Roughly 40% faster batch inference
20-30% lower memory usage
Support for longer sequences without OOM errors
Docker Installation
Using Docker is the fastest way to get started with Qwen, as it includes all dependencies pre-configured.
Pre-built Docker Images
Qwen provides official Docker images that skip most environment setup steps:
# Pull the official Qwen Docker image
docker pull qwenllm/qwen:latest
# Run the container with GPU support
docker run --gpus all -it qwenllm/qwen:latest
Custom Dockerfile
If you need a custom setup, create your own Dockerfile:
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
# Install Python and pip
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*
# Install dependencies
RUN pip3 install --no-cache-dir \
    "transformers>=4.32.0,<4.38.0" \
    accelerate \
    tiktoken \
    einops \
    transformers_stream_generator==0.0.4 \
    scipy \
    "torch>=2.0.0"
# Optional: Install Flash Attention
RUN git clone https://github.com/Dao-AILab/flash-attention && \
    cd flash-attention && \
    pip install . && \
    cd .. && \
    rm -rf flash-attention
# Set working directory
WORKDIR /workspace
CMD ["/bin/bash"]
Build and run:
docker build -t qwen-custom .
docker run --gpus all -it -v $(pwd):/workspace qwen-custom
Quantization Dependencies
To use quantized models (Int4/Int8), install additional libraries:
AutoGPTQ Installation
pip install auto-gptq optimum
Version Compatibility: AutoGPTQ packages are highly dependent on your PyTorch and CUDA versions. If you encounter installation issues:
For PyTorch 2.1: auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
For PyTorch 2.0: auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
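If you script your environment setup, the two pin sets above can live in a small lookup keyed by the PyTorch major.minor version — a sketch (`PINS` and `pins_for` are hypothetical helpers):

```python
# Requirement pins from the compatibility notes above.
PINS = {
    "2.1": ["auto-gptq>=0.5.1", "transformers>=4.35.0", "optimum>=1.14.0", "peft>=0.6.1"],
    "2.0": ["auto-gptq<0.5.0", "transformers<4.35.0", "optimum<1.14.0", "peft>=0.5.0,<0.6.0"],
}

def pins_for(torch_version: str) -> list:
    """Map a full version string (e.g. torch.__version__) to its pin set."""
    return PINS[".".join(torch_version.split(".")[:2])]

print(" ".join(pins_for("2.1.0")))
```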
If pre-compiled wheels don't work, build from source:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install -e .
Verify Quantization Support
try:
    from auto_gptq import AutoGPTQForCausalLM
    print("AutoGPTQ installed successfully!")
except ImportError as e:
    print(f"AutoGPTQ not available: {e}")
Fine-tuning Dependencies
For training and fine-tuning, install additional packages:
LoRA and Q-LoRA
peft>=0.8.0 has a known issue with loading Qwen tokenizers. Use peft<0.8.0 until the issue is resolved.
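A training script can verify the pin up front instead of failing mid-run. A minimal sketch (`version_below` is a hypothetical helper that compares only major.minor):

```python
def version_below(version: str, bound: tuple) -> bool:
    """True if `version`'s (major, minor) is strictly below `bound`."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return parts < bound

# peft<0.8.0 is the known-good range for Qwen tokenizers
print(version_below("0.7.1", (0, 8)))  # True: compatible
print(version_below("0.8.0", (0, 8)))  # False: affected by the tokenizer issue
```

In practice you would pass `peft.__version__` as the first argument.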
DeepSpeed (for distributed training)
Pydantic Compatibility: DeepSpeed may conflict with pydantic>=2.0. If you encounter errors, ensure pydantic<2.0:

pip install "pydantic<2.0"
Full Fine-tuning Requirements
# Install all training dependencies
pip install "peft<0.8.0" deepspeed "pydantic<2.0" tensorboard
Intel CPU/GPU (OpenVINO)
For Intel Core/Xeon processors or Arc GPUs, use OpenVINO for optimized inference:
pip install openvino openvino-dev
See the OpenVINO notebooks for Qwen-specific examples.
Ascend NPU
For Huawei Ascend 910 NPU:
# Install Ascend toolkit first, then:
pip install torch-npu
Refer to the ascend-support directory in the Qwen repository for detailed instructions.
Hygon DCU
For Hygon DCU acceleration:
# Follow DCU-specific installation
# See dcu-support directory for details
Installing from Source
To get the latest development version or contribute to Qwen:
Clone the Repository
git clone https://github.com/QwenLM/Qwen.git
cd Qwen
Install Dependencies
pip install -r requirements.txt
Run from Source
# You can now import and use local model files
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load from local directory
model_path = "./Qwen-7B-Chat"  # Your local model path
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True
).eval()
Downloading Models
From Hugging Face
Models are automatically downloaded when you first load them:
from transformers import AutoModelForCausalLM, AutoTokenizer
# This will download from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()
From ModelScope
For users with better access to ModelScope:
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
# Download to local directory
model_dir = snapshot_download('qwen/Qwen-7B-Chat')

# Load from local directory
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True
).eval()
Manual Download
You can also manually download model files:
Using Git LFS:

# Install git-lfs
git lfs install

# Clone the model repository
git clone https://huggingface.co/Qwen/Qwen-7B-Chat

Using Hugging Face CLI:

# Install huggingface-cli
pip install huggingface_hub

# Download model
huggingface-cli download Qwen/Qwen-7B-Chat --local-dir ./Qwen-7B-Chat
Environment Variables
Configure these environment variables for optimal performance:
# Set cache directory for models
export HF_HOME=/path/to/cache
export TRANSFORMERS_CACHE=/path/to/cache

# Enable offline mode (use cached models only)
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

# Set number of threads for PyTorch
export OMP_NUM_THREADS=8

# CUDA optimizations
export CUDA_LAUNCH_BLOCKING=0
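The same settings can be applied from Python, as long as they run before transformers is imported — the library reads them at import time. A sketch (the cache path is a placeholder):

```python
import os

# Must run BEFORE `import transformers`.
os.environ["HF_HOME"] = "/path/to/cache"     # model cache location (placeholder)
os.environ["TRANSFORMERS_OFFLINE"] = "1"     # use cached models only
os.environ["OMP_NUM_THREADS"] = "8"          # CPU threads for PyTorch ops

print(os.environ["HF_HOME"])
```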
Verification
Run this complete verification script to ensure everything is working:
import sys
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

def verify_installation():
    print("=" * 50)
    print("Qwen Installation Verification")
    print("=" * 50)

    # Python version
    print(f"\nPython version: {sys.version}")

    # PyTorch
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"cuDNN version: {torch.backends.cudnn.version()}")
        print(f"Number of GPUs: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
            print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1024**3:.2f} GB")

    # Transformers
    print(f"\nTransformers version: {transformers.__version__}")

    # Flash Attention
    try:
        import flash_attn
        print(f"Flash Attention version: {flash_attn.__version__}")
    except ImportError:
        print("Flash Attention: Not installed")

    # AutoGPTQ
    try:
        from auto_gptq import AutoGPTQForCausalLM
        print("AutoGPTQ: Installed")
    except ImportError:
        print("AutoGPTQ: Not installed")

    # PEFT
    try:
        import peft
        print(f"PEFT version: {peft.__version__}")
    except ImportError:
        print("PEFT: Not installed")

    # DeepSpeed
    try:
        import deepspeed
        print(f"DeepSpeed version: {deepspeed.__version__}")
    except ImportError:
        print("DeepSpeed: Not installed")

    print("\n" + "=" * 50)
    print("Verification complete!")
    print("=" * 50)

if __name__ == "__main__":
    verify_installation()
Save this as verify_installation.py and run:
python verify_installation.py
Troubleshooting
Installation fails with CUDA errors
Ensure your CUDA version matches PyTorch requirements:

# Check CUDA version
nvcc --version

# Install matching PyTorch version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Flash Attention compilation fails
Common solutions:
Ensure you have CUDA development tools: sudo apt-get install cuda-toolkit-11-8
Update your GCC compiler: sudo apt-get install build-essential
Set environment variables:
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
Try installing from PyPI: pip install flash-attn --no-build-isolation
AutoGPTQ version conflicts
If you see errors about incompatible versions:

# Uninstall existing versions
pip uninstall auto-gptq optimum transformers peft

# Install compatible versions
pip install torch==2.1.0
pip install "auto-gptq>=0.5.1" "transformers>=4.35.0" "optimum>=1.14.0" "peft>=0.6.1"
ImportError: trust_remote_code
Update transformers to a compatible version: pip install "transformers>=4.32.0,<4.38.0"
Out of disk space during installation
Flash Attention compilation requires ~5GB of temporary space:

# Set temporary directory to a location with more space
export TMPDIR=/path/to/large/tmp
pip install flash-attn
Model download is slow or fails
Try these solutions:
Use ModelScope instead of Hugging Face (see above)
Set a mirror:
export HF_ENDPOINT=https://hf-mirror.com
Download manually with git-lfs or huggingface-cli
Resume interrupted downloads by re-running the same command
Next Steps
Quickstart Get started with Qwen in under 5 minutes
Model Selection Choose the right model for your use case
Inference Learn about inference options and optimizations
Docker Setup Deploy Qwen with Docker in production