Base Pretrained Models
Qwen base models are pretrained language models that serve as the foundation for various downstream tasks. These models have been trained on massive amounts of multilingual text data and are ideal for further fine-tuning.

Model Architecture
Qwen models use a transformer-based, decoder-only architecture with several optimizations:

Core Components
- Architecture Type: Decoder-only Transformer (similar to LLaMA)
- Positional Encoding: Rotary Position Embedding (RoPE)
- Activation Function: SwiGLU (instead of ReLU)
- Normalization: RMSNorm (instead of LayerNorm)
- Attention Mechanism: Flash Attention 2 support
- Embeddings: Untied input and output embeddings
- Bias: No biases except for QKV in attention
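Two of the components above can be made concrete in a few lines. The following is an illustrative NumPy sketch of RMSNorm and a SwiGLU feed-forward block, not Qwen's actual implementation (which is a PyTorch model); the weight shapes and the `eps` value are assumptions for illustration.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the root-mean-square of the features.
    # Unlike LayerNorm there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(z):
    # SiLU (swish): z * sigmoid(z), the gate nonlinearity in SwiGLU.
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected down.
    # No bias terms, matching the "no biases" design note above.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

Note how the gate path multiplies the up-projection elementwise before the down-projection; this gating is what distinguishes SwiGLU from a plain ReLU MLP.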
Technical Specifications
- Qwen-7B
- Qwen-1.8B
- Qwen-14B
- Qwen-72B
Model Configuration (figures below are for Qwen-7B)
- Parameters: 7 billion
- Layers: 32
- Hidden Dimension: 4096
- Attention Heads: 32
- Vocabulary Size: 151,851 tokens
- Context Length: 2048 at initial release, later 8192; extendable toward 32K with training-free methods (see Long Context Support)
- Training Tokens: 2.4 trillion
- Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁶)
- Batch Size: 2048 sequences (4M+ tokens per step)
- Learning Rate: Peak 3×10⁻⁴, cosine schedule
- Warm-up Steps: 2000
- Weight Decay: 0.1
- Gradient Clipping: 1.0
- Precision: BFloat16 mixed precision
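The learning-rate schedule above (2,000 warm-up steps, peak 3×10⁻⁴, cosine decay) can be sketched as a function of the step index. The total step count and the final learning-rate floor of 10% of peak are illustrative assumptions, not figures from the Qwen report.

```python
import math

def learning_rate(step, peak=3e-4, warmup=2000, total_steps=500_000, floor_ratio=0.1):
    # Linear warm-up to the peak, then cosine decay to floor_ratio * peak.
    # total_steps and floor_ratio are assumptions for illustration.
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    floor = peak * floor_ratio
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * min(progress, 1.0)))
```

For example, the rate climbs linearly to 3×10⁻⁴ at step 2,000, then follows a half-cosine down to the floor by the final step.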
Training Data
Qwen models are pretrained on a diverse multilingual dataset:

Data Sources
- Web Documents: Publicly available web content
- Code Files: Programming repositories and code samples
- Mathematical Data: Including RFT data from gsm8k-ScRel
- Total Volume: 2.2T to 3.0T tokens (model-dependent)
Language Coverage
Primary focus on Chinese and English, with support for multilingual content including Japanese, Korean, Arabic, Thai, Vietnamese, Indonesian, Polish, Russian, Dutch, Portuguese, Italian, German, Spanish, and French.
Data Processing
- Quality Filtering: Ensemble of models to exclude low-quality content
- Safety Filtering: NSFW content removal
- Deduplication: Global fuzzy deduplication
- Optimization: Multiple ablation experiments to optimize data mix
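As a toy illustration of fuzzy deduplication: the production pipeline uses large-scale techniques (e.g. MinHash-style sketching), but the core idea of dropping near-duplicate documents by n-gram similarity can be shown with a small greedy Jaccard sketch. The threshold and shingle size here are assumptions for illustration.

```python
def shingles(text, n=5):
    # Character n-grams ("shingles") of the document.
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def jaccard(a, b):
    # Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def fuzzy_dedup(docs, threshold=0.8):
    # Greedy pass: keep a document only if it is not near-duplicate
    # (similarity >= threshold) of anything already kept.
    kept = []
    for d in docs:
        if all(jaccard(d, k) < threshold for k in kept):
            kept.append(d)
    return kept
```

This pairwise scan is quadratic; at trillion-token scale the same idea is approximated with hashing so that candidate pairs are found without comparing every pair.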
Benchmark Performance
Chinese Language Understanding
C-Eval (5-shot) - Comprehensive Chinese evaluation suite spanning 52 subjects:

| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA2-7B | 32.5 | - | - | - | - |
| ChatGLM2-6B | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 53.4 | 48.0 | 67.4 | 55.4 | 45.8 |
| Qwen-7B | 63.5 | 52.8 | 74.1 | 63.1 | 55.2 |
| Qwen-14B | 72.1 | - | - | - | - |
| Qwen-72B | 83.3 | - | - | - | - |
English Language Understanding
MMLU (5-shot) - Comprehensive English evaluation across 57 academic subjects:

| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| LLaMA2-7B | 46.8 | 36.4 | 51.2 | 42.9 | 52.2 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Qwen-7B | 58.2 | 47.6 | 65.9 | 51.5 | 64.7 |
| Qwen-14B | 66.3 | - | - | - | - |
| Qwen-72B | 77.4 | - | - | - | - |
Coding Capability
HumanEval (Pass@1) - Python coding benchmark:

| Model | Pass@1 |
|---|---|
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| Qwen-7B | 29.9 |
| Qwen-14B | 32.3 |
| Qwen-72B | 35.4 |
Mathematical Reasoning
GSM8K (8-shot) - Grade school math problems:

| Model | Accuracy |
|---|---|
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 16.7 |
| Baichuan-7B | 9.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| Qwen-7B | 51.7 |
| Qwen-14B | 61.3 |
| Qwen-72B | 78.9 |
MATH - Competition-level mathematics problems:

| Model | Accuracy |
|---|---|
| LLaMA2-7B | 3.3 |
| InternLM-7B | 6.3 |
| ChatGLM2-6B | 6.5 |
| Qwen-7B | 11.6 |
| Qwen-14B | 24.8 |
| Qwen-72B | 35.2 |
Translation
WMT22 (5-shot BLEU) - Translation quality:

| Model | Average | zh→en | en→zh |
|---|---|---|---|
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| Qwen-7B | 27.5 | 24.3 | 30.6 |
Long Context Support
Qwen base models support extended context lengths through training-free methods:

Extension Techniques
Dynamic NTK-aware Interpolation

Dynamically adjusts rotary position embeddings to support longer sequences without additional training. Configuration: set use_dynamic_ntk=true in config.json.

LogN Attention Scaling

Applies logarithmic scaling to attention scores for improved long-context performance. Configuration: set use_logn_attn=true in config.json.

Local Window Attention

Reduces memory usage by limiting attention to local windows for very long sequences. Configuration: enabled via inference parameters.
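To make the first two techniques concrete, here is an illustrative sketch of the arithmetic behind them. The exact scaling formulas in Qwen's implementation may differ; this follows the commonly used dynamic-NTK base rescaling and the log-n query scaling described above, with the training length and head dimension as assumed parameters.

```python
import math

def dynamic_ntk_base(seq_len, train_len=2048, base=10000.0, head_dim=128):
    # Dynamic NTK: once the sequence exceeds the training length, enlarge
    # the RoPE base so the rotary frequencies stretch to cover it.
    # (Common recipe; Qwen's exact constants may differ.)
    if seq_len <= train_len:
        return base
    scale = seq_len / train_len
    return base * (2 * scale - 1) ** (head_dim / (head_dim - 2))

def logn_query_scale(position, train_len=2048):
    # LogN attention scaling: scale queries past the training length by
    # log(position) / log(train_len) to damp attention-entropy growth.
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)
```

Both functions are identity transforms inside the training window, which is why the perplexity columns at 1024 and 2048 in the table below are unchanged when the flags are enabled.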
Perplexity on arXiv (Qwen-7B)
| Method | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| Baseline | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
With all techniques enabled, Qwen-7B can extend from 2048 to 16384 tokens with minimal perplexity degradation.
Tokenizer
Qwen uses a custom tokenizer optimized for multilingual efficiency:

Features
- Library: Based on tiktoken (OpenAI’s tokenizer)
- Vocabulary Size: 151,851 tokens
- Number Encoding: Single-digit segmentation for better arithmetic
- Efficiency: High compression rate for Chinese, English, and code
- Multilingual: Native support for 100+ languages without vocabulary expansion
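The single-digit number encoding can be illustrated with a toy pre-tokenization rule. The real tokenizer is tiktoken-based BPE; this sketch only mimics the digit-splitting behavior.

```python
import re

def split_digits(text):
    # Pre-tokenization sketch: every digit becomes its own piece, so
    # "1234" is seen as 1|2|3|4 rather than one opaque token.
    # Keeping place value explicit tends to help arithmetic.
    return re.findall(r"\d|\D+", text)
```

For example, `split_digits("price: 1234")` yields `["price: ", "1", "2", "3", "4"]`.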
Compression Efficiency
Compared to XLM-R (baseline = 1.0), Qwen achieves high compression rates:

- Chinese: ~1.5× more efficient
- English: ~1.3× more efficient
- Code: ~1.4× more efficient
- Other Languages: 1.2-1.5× more efficient (Thai, Hebrew, Arabic, Korean, Vietnamese, Japanese, etc.)
Usage Example
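Base-model inference can be sketched with the Hugging Face transformers API, as the official model cards do. This assumes transformers is installed and the checkpoint can be downloaded from the Hub; `trust_remote_code=True` is required because Qwen1 checkpoints ship custom modeling code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base models are not chat-aligned: prompt them as text completers,
# not with conversational turns.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",       # place layers across available GPUs
    trust_remote_code=True,
).eval()

inputs = tokenizer("Mount Everest is the", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For conversational use, prefer the aligned Chat variants described under Next Steps rather than prompting a base model directly.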
Hardware Requirements
Inference Memory Usage (Generating 2048 tokens)
- Qwen-1.8B
- Qwen-7B
- Qwen-14B
- Qwen-72B

The table below shows figures for Qwen-1.8B; see the individual model cards for the other sizes:
| Precision | GPU Memory | Speed (tokens/s) |
|---|---|---|
| BF16 | 4.23GB | 54.09 |
| Int8 | 3.48GB | 55.56 |
| Int4 | 2.91GB | 71.07 |
Profiling conducted on A100-SXM4-80G GPU with PyTorch 2.0.1, CUDA 11.8, and Flash Attention 2.
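As a rough sanity check on the table above, the weights-only footprint can be estimated from parameter count and precision; measured usage is higher because generation also holds the KV cache and activations. A small sketch, with nominal parameter counts as assumptions:

```python
def weight_memory_gb(n_params, bits_per_param):
    # Weights-only footprint in GiB; runtime adds KV cache + activations.
    return n_params * bits_per_param / 8 / 1024 ** 3

for name, n in [("Qwen-1.8B", 1.8e9), ("Qwen-7B", 7e9)]:
    for label, bits in [("BF16", 16), ("Int8", 8), ("Int4", 4)]:
        print(f"{name} {label}: {weight_memory_gb(n, bits):.2f} GiB")
```

Halving the bits roughly halves the weight footprint, which is why Int8 and Int4 quantization (covered under Next Steps) are the main lever for fitting larger models on a single GPU.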
Model Downloads
Qwen-1.8B
🤗 Hugging Face | 🤖 ModelScope
Qwen-7B
🤗 Hugging Face | 🤖 ModelScope
Qwen-14B
🤗 Hugging Face | 🤖 ModelScope
Qwen-72B
🤗 Hugging Face | 🤖 ModelScope
Next Steps

- Chat Models: Explore conversation-aligned models
- Fine-tuning: Learn how to fine-tune base models
- Quantization: Reduce memory usage with quantization
- Model Selection: Choose the right model size