The complete technical report is available at https://arxiv.org/abs/2309.16609
Model Series
We release pretrained and human-aligned language models in multiple sizes:
- Qwen-1.8B / Qwen-1.8B-Chat - 1.8 billion parameters
- Qwen-7B / Qwen-7B-Chat - 7 billion parameters
- Qwen-14B / Qwen-14B-Chat - 14 billion parameters
- Qwen-72B / Qwen-72B-Chat - 72 billion parameters
- Base pretrained models (Qwen-*B) for general language modeling
- Chat models (Qwen-*B-Chat) aligned with human intent through supervised fine-tuning
Pretraining
Architecture
Qwen is built with a transformer-based decoder-only architecture similar to the LLaMA series, with the following key modifications:
- Untied embedding - Separate input and output embedding matrices
- Rotary positional embedding (RoPE) - Efficient position encoding
- No biases - Removed in most layers; kept in the attention QKV projections to improve extrapolation
- RMSNorm - Instead of LayerNorm for normalization
- SwiGLU activation - Instead of ReLU
- Flash Attention - For accelerated training and inference
Representative configuration (Qwen-7B):
- 32 transformer layers
- 4096 embedding dimensions
- 32 attention heads
- Context length: 2048 tokens (expandable to 8192+ with training-free methods)
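Two of the modifications above, RMSNorm and the SwiGLU feed-forward, can be sketched in a few lines. This is a minimal NumPy illustration under small toy dimensions, not the production implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the activations.
    # Unlike LayerNorm, there is no mean subtraction and no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: a SiLU-gated linear unit replaces the ReLU MLP.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d, d_ff = 8, 16
x = np.random.randn(4, d)
out = swiglu(rms_norm(x, np.ones(d)),
             np.random.randn(d, d_ff),
             np.random.randn(d, d_ff),
             np.random.randn(d_ff, d))
print(out.shape)  # (4, 8)
```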
Training Data
Qwen models are pretrained on 2.2 to 3.0 trillion tokens of multilingual data, depending on model size.

Data Sources:
- Web documents from publicly available sources
- Code files
- Multilingual content with focus on English and Chinese
- Mathematical reasoning data from gsm8k-ScRel
Preprocessing:
- Ensemble filtering to exclude low-quality and NSFW content
- Global fuzzy deduplication
- Data-mix optimization through extensive ablation experiments
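Fuzzy deduplication can be illustrated with a shingle-based Jaccard similarity check. This is a hypothetical stand-in (the report does not specify the exact algorithm; production pipelines typically use MinHash/LSH to make this scale):

```python
def shingles(text, n=5):
    # Character n-gram shingles used to compare documents approximately.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def near_duplicate(a, b, threshold=0.8):
    # Two documents count as near-duplicates when the Jaccard
    # similarity of their shingle sets exceeds the threshold.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) >= threshold

print(near_duplicate("the quick brown fox jumps",
                     "the quick brown fox jumped"))  # True
```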
Tokenization
Qwen uses a custom tokenizer with 151,851 tokens (151,643 regular + 208 control tokens):
- Built on BPE tokenization over UTF-8 bytes
- Uses the tiktoken library
- Optimized for Chinese, English, and code
- Multilingual-friendly without vocabulary expansion
- Numbers segmented by single digits
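The single-digit segmentation of numbers can be illustrated with a small pre-tokenization sketch (a hypothetical helper, not Qwen's actual rule set):

```python
import re

def pretokenize(text):
    # Isolate every digit as its own piece so that BPE merges can never
    # fuse a multi-digit number like "2048" into one token.
    return re.findall(r"\d|[^\d]+", text)

print(pretokenize("Qwen has 2048 tokens"))
# ['Qwen has ', '2', '0', '4', '8', ' tokens']
```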
Training Details
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- ε = 10⁻⁶
- Sequence length: 2048
- Batch size: 2048
- ~4 million tokens per optimization step
- Cosine schedule with warm-up
- Warm-up steps: 2000
- Peak learning rate: 3 × 10⁻⁴
- Minimum learning rate: 10% of peak
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision training with bfloat16
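The learning-rate schedule above can be sketched as follows; the warm-up length, peak, and floor come from this section, while the total-step count is an illustrative placeholder, not a figure from the report:

```python
import math

def lr_at(step, peak=3e-4, warmup=2000, total=500_000, min_ratio=0.1):
    # Linear warm-up to the peak LR, then cosine decay to 10% of peak.
    # `total` is an assumed placeholder for the full training length.
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak * (min_ratio + (1.0 - min_ratio) * cosine)

print(lr_at(2000))     # peak: 3e-4
print(lr_at(500_000))  # floor: 10% of peak, 3e-5
```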
Fine-tuning
Qwen-Chat models embody our practice in alignment with human intent, internalized safety, and intelligent agent capabilities.

Alignment Data
The fine-tuning data includes:

Instruction Data:
- Writing and creative content
- Question answering
- Brainstorming and planning
- Content understanding and summarization
- Natural language processing tasks
- Code generation and analysis
Safety Data:
- Prevention of harmful content generation
- Inappropriate content filtering
- Substantial annotation efforts for safety
Agent Data:
- Tool usage patterns
- External system integration
- Parseable conversation patterns for API calls
Data Formatting
Conversations are formatted using ChatML (Chat Markup Language), with three roles:
- system - System instructions and context
- user - User messages
- assistant - Model responses
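A minimal formatter following the ChatML convention; the `<|im_start|>`/`<|im_end|>` markers are the standard ChatML delimiters, while the helper itself is illustrative:

```python
def to_chatml(messages):
    # Wrap each (role, content) turn in ChatML start/end markers.
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>"
             for role, content in messages]
    return "\n".join(parts)

print(to_chatml([
    ("system", "You are a helpful assistant."),
    ("user", "Hello!"),
]))
```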
Training Configuration
Objective: Causal language modeling, with user-turn tokens excluded from the loss

Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- ε = 10⁻⁶
- Sequence length: 2048
- Batch size: 128
- Training steps: 4000
- Warm-up steps: 1430
- Peak learning rate: 1 × 10⁻⁵
- Weight decay: 0.1
- Dropout: 0.1
- Gradient clipping: 1.0
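Excluding user-turn tokens from the training objective is typically done by masking the labels; a sketch, assuming the common convention of `-100` as the cross-entropy ignore index:

```python
IGNORE = -100  # label value skipped by the cross-entropy loss

def build_labels(token_spans):
    # token_spans: list of (role, token_ids) per conversation turn.
    # Only assistant tokens keep their ids as labels; system and user
    # tokens are masked out so they contribute no gradient.
    labels = []
    for role, ids in token_spans:
        labels.extend(ids if role == "assistant" else [IGNORE] * len(ids))
    return labels

spans = [("system", [1, 2]), ("user", [3, 4, 5]), ("assistant", [6, 7])]
print(build_labels(spans))  # [-100, -100, -100, -100, -100, 6, 7]
```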
Model Capabilities
Strong Base Performance
Qwen models achieve competitive or superior performance compared to similar-sized models across:
- Natural Language Understanding: MMLU, C-Eval, CMMLU
- Mathematical Reasoning: GSM8K, MATH
- Code Generation: HumanEval, MBPP
- General Reasoning: BBH
- Translation: WMT22
Chat Model Features
Conversational AI:
- Multi-turn dialogue with context awareness
- Content creation and brainstorming
- Information extraction and summarization
- Translation capabilities
Tool Use and Agents:
- ReAct prompting support
- Plugin/API integration
- External system coordination
- HuggingFace Agent compatibility
Coding:
- Code generation and completion
- Code understanding and analysis
- Debugging assistance
- Multiple programming languages
Long Context Inference
Qwen supports training-free context extension from 2048 to 8192+ tokens through:
- NTK-aware interpolation (dynamic_ntk)
- LogN attention scaling
- Local window attention
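The first two techniques can be sketched numerically. The formulas below follow the commonly used NTK-aware base-rescaling and LogN-scaling definitions; the helper names are illustrative:

```python
import math

def ntk_base(base, scale, head_dim):
    # NTK-aware interpolation: enlarge the RoPE frequency base so that
    # low-frequency dimensions are interpolated while high-frequency
    # dimensions are mostly preserved.
    return base * scale ** (head_dim / (head_dim - 2))

def logn_scale(seq_len, train_len=2048):
    # LogN attention scaling: rescale query-key logits beyond the
    # trained context length to keep attention entropy stable.
    return max(1.0, math.log(seq_len) / math.log(train_len))

print(ntk_base(10000.0, 4.0, 128))  # enlarged RoPE base for 4x context
print(round(logn_scale(8192), 3))   # ~1.182 at 8192 tokens
```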
Model Variants
Quantized Models
We provide Int4 and Int8 quantized models using AutoGPTQ:
- Near-lossless performance
- Reduced memory footprint
- Improved inference speed
- Available for all model sizes
- Qwen-7B-Chat-Int4: ~8.2GB memory (vs 17GB BF16)
- Qwen-72B-Chat-Int4: ~49GB memory (vs 145GB BF16)
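The memory figures above are roughly consistent with weight-only arithmetic plus runtime overhead; a back-of-the-envelope sketch:

```python
def weight_memory_gb(n_params_b, bits):
    # Weight-only footprint in GiB; real usage adds activations,
    # the KV cache, and framework overhead on top of this.
    return n_params_b * 1e9 * bits / 8 / 2**30

print(round(weight_memory_gb(72, 16), 1))  # ~134.1 GiB of BF16 weights alone
print(round(weight_memory_gb(72, 4), 1))   # ~33.5 GiB at Int4
```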
System Prompt Enhancement
Qwen-72B-Chat and Qwen-1.8B-Chat feature enhanced system prompt capabilities for better instruction following and role-playing.

Technical Innovations
Tokenizer Design
Unlike tokenizers based on Unicode codepoints with UTF-8 fallback, Qwen operates directly on UTF-8 byte sequences:
- Efficient encoding across languages
- No unknown tokens
- Vocabulary expansion support
- Injection attack prevention
KV Cache Quantization
Optional Int8 quantization of the attention KV cache:
- Higher sample throughput
- Reduced memory for long sequences
- Larger batch sizes
- Minimal performance degradation
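Symmetric per-tensor Int8 quantization of a KV block, as a simplified NumPy illustration (the actual kernels may quantize per-channel or per-token):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map the float range to [-127, 127].
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(2, 4, 8).astype(np.float32)
q, s = quantize_int8(kv)
err = np.abs(dequantize(q, s) - kv).max()
print(q.nbytes, kv.nbytes)  # 64 vs 256 bytes: 4x smaller cache
print(err < s)              # True: error bounded by one quantization step
```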
Flash Attention Integration
Native Flash Attention 2 support provides:
- Faster training and inference
- Lower memory consumption
- Improved batch processing efficiency
Training Infrastructure
Qwen models are trained on:
- NVIDIA A100 GPUs
- PyTorch 2.0+
- DeepSpeed for distributed training
- Mixed precision (bfloat16)
- CUDA 11.4+
Deployment Options
Multiple deployment configurations are supported:
- Single GPU: BF16/FP16/Int8/Int4
- Multi-GPU: Native pipeline parallelism or vLLM
- CPU: Direct inference or qwen.cpp
- Cloud API: DashScope service
- Edge Devices: Quantized models with reduced requirements
Safety and Alignment
Safety Measures
- Security-oriented training data
- NSFW content filtering
- Harmful content prevention
- Extensive red teaming
Responsible Development
Model Release Philosophy
Our goal is to enable the community to:
- Analyze and improve model safety
- Understand quantization and fine-tuning techniques
- Explore training-free long-context inference
- Build service-oriented applications with tool usage
- Establish responsible LLM development practices
Future Directions
Ongoing research and development includes:
- RLHF (Reinforcement Learning from Human Feedback)
- Extended context lengths
- Multimodal capabilities
- Improved tool usage and agent behaviors
- Enhanced safety and alignment techniques