Qwen models outperform baseline models of similar size across a series of benchmark datasets that evaluate natural language understanding, mathematical problem solving, coding, and more.
Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.
Base Model Benchmarks
Performance comparison across major benchmarks for all Qwen base models:
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|---|---|---|---|---|---|---|---|---|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| Qwen-1.8B | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| Qwen-72B | 77.4 | 83.3 | 78.9 | 35.2 | 35.4 | 52.2 | 67.7 | 83.6 |
For all compared models, we report the better of their officially reported results and the scores from OpenCompass.
World Knowledge
C-Eval is a comprehensive benchmark for testing the common-sense capability of models in Chinese, covering 52 subjects across humanities, social sciences, STEM, and other specialties.
Qwen-7B on C-Eval Validation Set:
| Model | Average |
|---|---|
| Alpaca-7B | 28.9 |
| Vicuna-7B | 31.2 |
| ChatGLM-6B | 37.1 |
| Baichuan-7B | 42.7 |
| ChatGLM2-6B | 50.9 |
| InternLM-7B | 53.4 |
| ChatGPT | 53.5 |
| Claude-v1.3 | 55.5 |
| Qwen-7B | 60.8 |
Qwen-7B on C-Eval Test Set:
| Model | Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|---|
| ChatGLM-6B | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| Chinese-Alpaca-Plus-13B | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| ChatGLM2-6B | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| Qwen-7B | 59.6 | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 |
MMLU evaluates English comprehension abilities across 57 subtasks spanning different academic fields and difficulty levels.
5-shot MMLU Accuracy:
| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| LLaMA2-7B | 45.3 | 36.4 | 51.2 | 42.9 | 52.2 |
| LLaMA-13B | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Baichuan-13B | 51.6 | 41.6 | 60.9 | 47.4 | 58.5 |
| LLaMA2-13B | 54.8 | 44.1 | 62.6 | 52.8 | 61.1 |
| ChatGLM2-12B | 56.2 | 48.2 | 65.1 | 52.6 | 60.9 |
| Qwen-7B | 56.7 | 47.6 | 65.9 | 51.5 | 64.7 |
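For reference, 5-shot evaluation prepends five answered examples from the same subject to each test question. Below is a minimal sketch of the common prompt layout; the header wording and choice formatting follow the usual open-source convention and are assumptions about the exact harness used here.

```python
# Hypothetical 5-shot MMLU prompt assembly; the exact template behind the
# scores above may differ.
CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    # One question block: the stem, the lettered options, and the answer line.
    lines = [question] + [f"{c}. {o}" for c, o in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, shots, test_question, test_options):
    # 'shots' is a list of (question, options, answer) demonstrations.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    demos = "\n\n".join(format_example(q, o, a) for q, o, a in shots)
    return header + demos + "\n\n" + format_example(test_question, test_options)
```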
Coding Capabilities
HumanEval
Zero-shot Pass@1 on the HumanEval benchmark:
| Model | Pass@1 |
|---|---|
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-13B | 12.8 |
| LLaMA-13B | 15.8 |
| MPT-7B | 18.3 |
| LLaMA2-13B | 18.3 |
| Qwen-7B | 24.4 |
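Pass@1 is the fraction of problems whose generated completion passes all unit tests. When more than one sample per problem is drawn, the standard unbiased pass@k estimator from the HumanEval paper applies; a minimal sketch follows (the scores above are assumed to come from single-completion, zero-shot decoding).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: completions sampled per problem
    c: completions that pass all unit tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 3 of 10 sampled completions pass, so pass@1 = c / n = 0.3.
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
```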
Mathematical Reasoning
GSM8K
8-shot accuracy on the GSM8K benchmark:
| Model | Accuracy |
|---|---|
| MPT-7B | 6.8 |
| Falcon-7B | 6.8 |
| Baichuan-7B | 9.7 |
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 14.6 |
| LLaMA-13B | 17.8 |
| Baichuan-13B | 26.6 |
| LLaMA2-13B | 28.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| ChatGLM2-12B | 40.9 |
| Qwen-7B | 51.6 |
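Accuracy is determined by comparing the model's final number against the reference answer, which GSM8K stores after a `####` marker. A minimal extraction sketch, assuming the grading harness here works similarly (the regex and normalization are illustrative):

```python
import re
from typing import Optional

def extract_final_number(text: str) -> Optional[str]:
    """Pull the final numeric answer from a chain-of-thought solution.

    GSM8K reference answers end with '#### <number>'; for free-form model
    output we fall back to the last number that appears in the text.
    """
    marked = re.search(r"####\s*(-?[\d,\.]+)", text)
    if marked:
        candidates = [marked.group(1)]
    else:
        candidates = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not candidates:
        return None
    return candidates[-1].replace(",", "").rstrip(".")

# A prediction is counted as correct when the extracted numbers match.
assert extract_final_number("... so she has 18 eggs left. #### 18") == "18"
```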
Translation Capabilities
WMT22
5-shot BLEU scores on the WMT22 zh-en and en-zh translation tasks:
| Model | Average | zh-en | en-zh |
|---|---|---|---|
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA-13B | 15.8 | 19.5 | 12.0 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Bloom-7B | 20.3 | 19.1 | 21.4 |
| LLaMA2-13B | 23.3 | 22.4 | 24.2 |
| PolyLM-13B | 23.6 | 20.2 | 27.0 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| Qwen-7B | 27.5 | 24.3 | 30.6 |
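BLEU can be computed with a standard scorer such as sacrebleu; here is a minimal sketch with made-up sentences, assuming one reference per source segment (the exact tokenization settings behind the table are not specified here).

```python
import sacrebleu

# Hypothetical system outputs and references, one per source sentence.
hypotheses = ["The cat sat on the mat.", "It is raining heavily today."]
references = ["The cat sat on the mat.", "It rained heavily today."]

# sacrebleu expects a list of reference streams (one list per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```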
World Knowledge (Chat)
Zero-shot C-Eval Validation Set:
| Model | Avg. Acc. |
|---|---|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 40.6 |
| Chinese-Alpaca-2-7B | 41.3 |
| Chinese-Alpaca-Plus-13B | 43.3 |
| Baichuan-13B-Chat | 50.4 |
| ChatGLM2-6B-Chat | 50.7 |
| InternLM-7B-Chat | 53.2 |
| Qwen-7B-Chat | 54.2 |
Zero-shot MMLU:
| Model | Avg. Acc. |
|---|---|
| ChatGLM2-6B-Chat | 45.5 |
| LLaMA2-7B-Chat | 47.0 |
| InternLM-7B-Chat | 50.8 |
| Baichuan-13B-Chat | 52.1 |
| ChatGLM2-12B-Chat | 52.1 |
| Qwen-7B-Chat | 53.9 |
Coding (Chat)
Zero-shot Pass@1 on HumanEval:
| Model | Pass@1 |
|---|---|
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.0 |
| Baichuan-13B-Chat | 16.5 |
| LLaMA2-13B-Chat | 18.9 |
| Qwen-7B-Chat | 24.4 |
Math (Chat)
GSM8K performance:
| Model | Zero-shot Acc. | 4-shot Acc. |
|---|---|---|
| ChatGLM2-6B-Chat | - | 28.0 |
| LLaMA2-7B-Chat | 20.4 | 28.2 |
| LLaMA2-13B-Chat | 29.4 | 36.7 |
| InternLM-7B-Chat | 32.6 | 34.5 |
| Baichuan-13B-Chat | - | 36.3 |
| ChatGLM2-12B-Chat | - | 38.1 |
| Qwen-7B-Chat | 41.1 | 43.5 |
Quantization
Quantized models maintain near-lossless performance while improving memory efficiency:
| Model (quantization) | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| Qwen-1.8B-Chat (BF16) | 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8) | 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4) | 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
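As a usage note, the quantized checkpoints load the same way as the BF16 ones; a minimal sketch with the released Int4 chat model (assumes `transformers` plus the GPTQ dependencies such as `auto-gptq` and `optimum` are installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Int4 (GPTQ) checkpoint is loaded like the BF16 model; trust_remote_code
# pulls in Qwen's custom modeling and chat code.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
).eval()

response, history = model.chat(tokenizer, "Tell me about large language models.", history=None)
print(response)
```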
Tool Usage
Qwen-7B-Chat performance on tool selection and usage:
ReAct Prompting Evaluation:
| Model | Tool Selection (Acc.) ↑ | Tool Input (ROUGE-L) ↑ | False Positive Error ↓ |
|---|---|---|---|
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B | 99% | 0.89 | 9.7% |
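The ReAct setting drives tool calls through a Thought / Action / Action Input / Observation loop. Below is a minimal sketch of such a prompt; the tool and its description are made up for illustration, and Qwen's repository ships its own template.

```python
# Illustrative ReAct-style prompt skeleton; the tool below is hypothetical.
TOOLS = [{
    "name": "web_search",
    "description": "Searches the web and returns the top result.",
    "parameters": '{"query": "the search keywords"}',
}]

def build_react_prompt(question: str) -> str:
    # Describe the available tools, then spell out the ReAct trace format.
    tool_lines = "\n".join(
        f"{t['name']}: {t['description']} Parameters: {t['parameters']}" for t in TOOLS
    )
    tool_names = ", ".join(t["name"] for t in TOOLS)
    return (
        "Answer the following questions as best you can. "
        "You have access to the following tools:\n\n"
        f"{tool_lines}\n\n"
        "Use the following format:\n\n"
        "Question: the input question you must answer\n"
        "Thought: you should always think about what to do\n"
        f"Action: the action to take, should be one of [{tool_names}]\n"
        "Action Input: the input to the action\n"
        "Observation: the result of the action\n"
        "... (this Thought/Action/Action Input/Observation can repeat N times)\n"
        "Thought: I now know the final answer\n"
        "Final Answer: the final answer to the original input question\n\n"
        f"Question: {question}"
    )

print(build_react_prompt("What is the tallest mountain in the world?"))
```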
HuggingFace Agent Benchmark:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|---|---|---|---|
| GPT-4 | 100.00 | 100.00 | 97.41 |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| Qwen-7B | 90.74 | 92.59 | 74.07 |
The plugins in the evaluation set do not appear in Qwen’s training set, demonstrating genuine generalization capability.
Long Context Performance
Perplexity (PPL, lower is better) on the arXiv dataset at context lengths from 1024 to 16384 tokens:
| Model | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
Qwen supports training-free long-context inference, extending the usable context from 2048 to over 8192 tokens with dynamic NTK-aware interpolation (dynamic_ntk), LogN attention scaling (logn), and local window attention (local_attn), corresponding to the rows above.
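In the released checkpoints the first two mechanisms are controlled through the model config; a minimal sketch follows (the `use_dynamic_ntk` and `use_logn_attn` flag names follow Qwen's config.json, while window attention is handled inside the custom modeling code):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Toggle dynamic NTK-aware RoPE interpolation and LogN attention scaling
# through the model config before loading (flag names per Qwen's config.json).
config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
config.use_dynamic_ntk = True   # NTK-aware interpolation, applied at inference time
config.use_logn_attn = True     # LogN attention scaling for sequences beyond training length

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()
```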
Additional Resources
For detailed model performance on additional benchmark datasets, please refer to the technical report.