Prerequisites
Before you begin, ensure you have:

- Python 3.8 or higher
- PyTorch 1.12 or higher (2.0+ recommended)
- CUDA 11.4 or higher (for GPU users)
- At least 8GB GPU memory for Qwen-7B (16GB+ recommended)
Installation
Install the required dependencies. For detailed installation instructions, including flash-attention setup, see the Installation guide.
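The dependency install command was not preserved in this page; a typical setup (an assumption based on common Qwen requirements, not confirmed by this guide) looks like:

```shell
# Core packages: transformers for the model API, accelerate for device_map,
# tiktoken and einops for Qwen's tokenizer and modeling code
pip install "transformers>=4.32.0" accelerate tiktoken einops
```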
Quick Start with Qwen-Chat
The fastest way to start using Qwen is with the chat models. Here’s a simple example.

Load Model and Tokenizer
Load the Qwen-7B-Chat model and tokenizer:
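A minimal sketch of loading and querying the chat model, assuming the `Qwen/Qwen-7B-Chat` checkpoint on Hugging Face (the first run downloads roughly 15 GB):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required: Qwen ships its tokenizer and
# modeling code alongside the checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",        # place layers across available devices automatically
    trust_remote_code=True,
).eval()

# Qwen chat checkpoints expose a convenience .chat() method via remote code;
# it returns the reply plus the updated conversation history
response, history = model.chat(tokenizer, "Hello! How are you?", history=None)
print(response)
```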
The `device_map="auto"` parameter automatically handles device placement and multi-GPU distribution.

Available Models
Qwen provides models in various sizes to suit different needs:

Qwen-1.8B-Chat
Smallest model, fastest inference, lowest memory requirements (2.9GB GPU)
Qwen-7B-Chat
Balanced performance and efficiency (8.2GB GPU)
Qwen-14B-Chat
High performance for complex tasks (13.0GB GPU)
Qwen-72B-Chat
Best performance, highest capabilities (requires 2xA100 or 48.9GB with Int4)
Model Precision Options
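Precision is selected when the model is loaded; a minimal sketch, assuming the `bf16`/`fp16` keyword arguments accepted by Qwen's remote modeling code:

```python
from transformers import AutoModelForCausalLM

# BF16 (recommended on Ampere or newer GPUs); pass fp16=True instead on
# older GPUs. These flags are handled by Qwen's remote code, which is why
# trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    bf16=True,
).eval()
```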
Qwen supports multiple precision formats for different hardware and performance needs.

Using Base Models
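A base (non-chat) checkpoint is driven through `generate()` rather than `.chat()`; a sketch, assuming the `Qwen/Qwen-7B` base checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", device_map="auto", trust_remote_code=True
).eval()

# Plain text completion: no chat template, the model just continues the prompt
inputs = tokenizer("Mount Everest is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```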
If you need the base language model (without chat alignment) for completion tasks, load a base checkpoint such as Qwen/Qwen-7B instead of a -Chat variant.

Using ModelScope (Alternative)
For users in regions with better access to ModelScope, the same models can be downloaded and loaded from there.

Running the CLI Demo
Qwen includes a command-line interactive chat demo.
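Assuming you have cloned the Qwen repository, the demo script is typically launched with:

```shell
# Run from the root of the cloned Qwen repository
python cli_demo.py
```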
CLI Demo Commands
The CLI demo supports several commands:
- `:help` or `:h` - Show help message
- `:exit` or `:q` - Exit the demo
- `:clear` or `:cl` - Clear screen
- `:clear-his` or `:clh` - Clear conversation history
- `:history` or `:his` - Show conversation history
- `:seed <N>` - Set random seed
- `:conf` - Show current generation config
- `:conf <key>=<value>` - Change generation config
- `:reset-conf` - Reset generation config
CLI Demo Options
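The concrete flag list was not preserved in this page; assuming a standard argparse interface, the demo's options can be printed with:

```shell
python cli_demo.py --help
```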
Using Quantized Models
For lower memory requirements and faster inference, use the quantized models. They achieve minimal performance degradation while significantly reducing memory requirements. See the Quantization section for detailed performance comparisons.
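Loading a quantized checkpoint looks the same as BF16, just with the Int4/Int8 model id; a sketch, assuming the `Qwen/Qwen-7B-Chat-Int4` checkpoint and the `auto-gptq` and `optimum` packages:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPTQ-quantized checkpoint: the quantization config ships with the model,
# so loading is identical apart from the model id
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True
).eval()

response, _ = model.chat(tokenizer, "Hi!", history=None)
```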
Memory Requirements
Here’s a quick reference for GPU memory requirements when generating 2048 tokens:

| Model Size | BF16 | Int8 | Int4 |
|---|---|---|---|
| Qwen-1.8B | 4.23GB | 3.48GB | 2.91GB |
| Qwen-7B | 16.99GB | 11.20GB | 8.21GB |
| Qwen-14B | 30.15GB | 18.81GB | 13.01GB |
| Qwen-72B | 144.69GB (2xA100) | 81.27GB (2xA100) | 48.86GB |
Troubleshooting
ImportError: trust_remote_code
Make sure you’re using `transformers>=4.32.0`; if needed, upgrade it with pip.
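The upgrade is a one-liner:

```shell
pip install -U "transformers>=4.32.0"
```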
CUDA Out of Memory
Try these solutions:
- Use a smaller model (e.g., Qwen-1.8B or Qwen-7B)
- Use quantized models (Int4 or Int8)
- Reduce the maximum sequence length
- Enable gradient checkpointing for training
Slow Inference Speed
To improve inference speed:
- Install flash-attention (see Installation)
- Use quantized models for faster generation
- Consider using vLLM for production deployments
- Use GPU instead of CPU
Network Issues Downloading Models
If you have trouble downloading from Hugging Face, use a mirror endpoint or fall back to ModelScope (see the ModelScope section above).
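The original workaround was not preserved in this page; one common option (an assumption, not confirmed by this guide) is pointing `huggingface_hub` at a mirror via the `HF_ENDPOINT` environment variable:

```shell
# Assumption: hf-mirror.com is a reachable community mirror of the HF Hub
export HF_ENDPOINT=https://hf-mirror.com
# Subsequent transformers / huggingface_hub downloads now use the mirror
```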
Next Steps
Installation
Complete installation guide with flash-attention and Docker setup
Model Selection
Choose the right model for your use case
Inference Guide
Learn advanced inference techniques
Fine-tuning
Customize Qwen for your specific tasks