Qwen models outperform baseline models of similar size across a series of benchmark datasets that evaluate natural language understanding, mathematical problem solving, coding, and more.

Overall Performance

Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.

[Figure: Qwen-72B performance radar chart]

Base Model Benchmarks

Performance comparison across major benchmarks for all Qwen base models:
| Model | MMLU (5-shot) | C-Eval (5-shot) | GSM8K (8-shot) | MATH (4-shot) | HumanEval (0-shot) | MBPP (3-shot) | BBH (3-shot) | CMMLU (5-shot) |
|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| Qwen-1.8B | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| Qwen-72B | 77.4 | 83.3 | 78.9 | 35.2 | 35.4 | 52.2 | 67.7 | 83.6 |
For all compared models, we report the better of their officially reported results and their OpenCompass scores.

World Knowledge

C-Eval Performance

C-Eval is a comprehensive benchmark testing common-sense capability in Chinese, covering 52 subjects across humanities, social sciences, STEM, and other specialties.

Qwen-7B on the C-Eval validation set:
| Model | Average |
|---|---|
| Alpaca-7B | 28.9 |
| Vicuna-7B | 31.2 |
| ChatGLM-6B | 37.1 |
| Baichuan-7B | 42.7 |
| ChatGLM2-6B | 50.9 |
| InternLM-7B | 53.4 |
| ChatGPT | 53.5 |
| Claude-v1.3 | 55.5 |
| Qwen-7B | 60.8 |
Qwen-7B on the C-Eval test set:
| Model | Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|---|
| ChatGLM-6B | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| Chinese-Alpaca-Plus-13B | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| ChatGLM2-6B | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| Qwen-7B | 59.6 | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 |

MMLU Performance

MMLU evaluates English comprehension abilities across 57 subtasks spanning different academic fields and difficulty levels.

5-shot MMLU accuracy:
| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| LLaMA2-7B | 45.3 | 36.4 | 51.2 | 42.9 | 52.2 |
| LLaMA-13B | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Baichuan-13B | 51.6 | 41.6 | 60.9 | 47.4 | 58.5 |
| LLaMA2-13B | 54.8 | 44.1 | 62.6 | 52.8 | 61.1 |
| ChatGLM2-12B | 56.2 | 48.2 | 65.1 | 52.6 | 60.9 |
| Qwen-7B | 56.7 | 47.6 | 65.9 | 51.5 | 64.7 |

Coding Capabilities

HumanEval

Zero-shot Pass@1 performance on the HumanEval benchmark:
| Model | Pass@1 |
|---|---|
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-13B | 12.8 |
| LLaMA-13B | 15.8 |
| MPT-7B | 18.3 |
| LLaMA2-13B | 18.3 |
| Qwen-7B | 24.4 |
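For reference, Pass@1 here is a special case of the unbiased pass@k estimator introduced with HumanEval. The sketch below shows that estimator; the function name and the single-sample setting are illustrative rather than the exact evaluation harness used for these numbers.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total completions sampled for a problem
    c: completions that pass all unit tests
    k: budget of completions considered
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With one sample per problem, pass@1 reduces to plain accuracy,
# which is how the zero-shot Pass@1 scores in the table are reported.
print(pass_at_k(n=1, c=1, k=1))  # 1.0
print(pass_at_k(n=1, c=0, k=1))  # 0.0
```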

Mathematical Reasoning

GSM8K

8-shot accuracy on the GSM8K benchmark:
| Model | Accuracy |
|---|---|
| MPT-7B | 6.8 |
| Falcon-7B | 6.8 |
| Baichuan-7B | 9.7 |
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 14.6 |
| LLaMA-13B | 17.8 |
| Baichuan-13B | 26.6 |
| LLaMA2-13B | 28.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| ChatGLM2-12B | 40.9 |
| Qwen-7B | 51.6 |
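"8-shot" means the model is conditioned on eight worked examples before the test question. A minimal sketch of assembling such a prompt is below; the exemplar text is a placeholder, not the actual evaluation exemplars, which would normally be drawn from the GSM8K training split.

```python
# Illustrative few-shot prompt assembly. In a real 8-shot evaluation,
# EXEMPLARS would hold eight (question, worked solution) pairs.
EXEMPLARS = [
    ("If there are 3 cars and each car has 4 wheels, how many wheels are there?",
     "3 cars * 4 wheels per car = 12 wheels. The answer is 12."),
]

def build_fewshot_prompt(question: str) -> str:
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in EXEMPLARS)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

print(build_fewshot_prompt("A farmer has 15 sheep and buys 9 more. How many sheep does he have now?"))
```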

Translation Capabilities

WMT22

5-shot BLEU scores on WMT22 Chinese-English translation tasks:
| Model | Average | zh-en | en-zh |
|---|---|---|---|
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA-13B | 15.8 | 19.5 | 12.0 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Bloom-7B | 20.3 | 19.1 | 21.4 |
| LLaMA2-13B | 23.3 | 22.4 | 24.2 |
| PolyLM-13B | 23.6 | 20.2 | 27.0 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| Qwen-7B | 27.5 | 24.3 | 30.6 |
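These are corpus-level BLEU scores. A minimal sketch of computing them with sacrebleu is shown below, assuming `hypotheses` holds the model's translations and `references` the WMT22 references; the toy sentences are placeholders.

```python
# Minimal corpus-level BLEU sketch with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["The cat sat on the mat."]           # model translations, one per source sentence
references = [["The cat is sitting on the mat."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # same 0-100 scale as the table above
```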

Chat Model Performance

World Knowledge (Chat)

Zero-shot accuracy on the C-Eval validation set:
| Model | Avg. Acc. |
|---|---|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 40.6 |
| Chinese-Alpaca-2-7B | 41.3 |
| Chinese-Alpaca-Plus-13B | 43.3 |
| Baichuan-13B-Chat | 50.4 |
| ChatGLM2-6B-Chat | 50.7 |
| InternLM-7B-Chat | 53.2 |
| Qwen-7B-Chat | 54.2 |
Zero-shot MMLU accuracy:
| Model | Avg. Acc. |
|---|---|
| ChatGLM2-6B-Chat | 45.5 |
| LLaMA2-7B-Chat | 47.0 |
| InternLM-7B-Chat | 50.8 |
| Baichuan-13B-Chat | 52.1 |
| ChatGLM2-12B-Chat | 52.1 |
| Qwen-7B-Chat | 53.9 |

Coding (Chat)

Zero-shot Pass@1 on HumanEval:
| Model | Pass@1 |
|---|---|
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.0 |
| Baichuan-13B-Chat | 16.5 |
| LLaMA2-13B-Chat | 18.9 |
| Qwen-7B-Chat | 24.4 |

Math (Chat)

GSM8K performance:
| Model | Zero-shot Acc. | 4-shot Acc. |
|---|---|---|
| ChatGLM2-6B-Chat | - | 28.0 |
| LLaMA2-7B-Chat | 20.4 | 28.2 |
| LLaMA2-13B-Chat | 29.4 | 36.7 |
| InternLM-7B-Chat | 32.6 | 34.5 |
| Baichuan-13B-Chat | - | 36.3 |
| ChatGLM2-12B-Chat | - | 38.1 |
| Qwen-7B-Chat | 41.1 | 43.5 |

Quantized Model Performance

Quantized models maintain near-lossless accuracy while reducing memory usage:
| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| Qwen-1.8B-Chat (BF16) | 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8) | 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4) | 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
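A minimal sketch of running one of the Int4 checkpoints with Hugging Face Transformers is shown below. The model ID and the `chat()` helper follow the published Qwen model cards, but treat the exact arguments and package requirements (e.g. auto-gptq) as assumptions to verify against the model card you use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the GPTQ Int4 chat checkpoint is published as "Qwen/Qwen-7B-Chat-Int4".
model_id = "Qwen/Qwen-7B-Chat-Int4"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # place the quantized weights on available GPUs
    trust_remote_code=True,  # Qwen ships custom modeling code
).eval()

# chat() is provided by Qwen's remote modeling code for the chat checkpoints.
response, history = model.chat(tokenizer, "What does GSM8K evaluate?", history=None)
print(response)
```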

Tool Usage Capabilities

Qwen-7B-Chat performance on tool selection and usage.

ReAct Prompting Evaluation:
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
|---|---|---|---|
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B | 99% | 0.89 | 9.7% |
HuggingFace Agent Benchmark:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|---|---|---|---|
| GPT-4 | 100.00 | 100.00 | 97.41 |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| Qwen-7B | 90.74 | 92.59 | 74.07 |
The plugins in the evaluation set do not appear in Qwen’s training set, demonstrating genuine generalization capability.
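In the ReAct setting, tool selection is driven entirely by the prompt: the model sees a list of available tools and must emit a Thought / Action / Action Input sequence. The skeleton below is an illustrative ReAct-style prompt builder, not Qwen's exact evaluation template; the tool name, description, and parameters are placeholders.

```python
# Illustrative ReAct-style prompt skeleton; the tool entries are placeholders,
# not the actual plugins used in the evaluation set.
TOOLS = [
    {"name": "image_gen",
     "description": "Generate an image from a text prompt.",
     "parameters": '{"prompt": "text describing the image"}'},
]

def build_react_prompt(query: str) -> str:
    tool_lines = "\n".join(
        f"{t['name']}: {t['description']} Parameters: {t['parameters']}" for t in TOOLS
    )
    tool_names = ", ".join(t["name"] for t in TOOLS)
    return (
        "Answer the following question using the tools below when helpful.\n\n"
        f"{tool_lines}\n\n"
        "Use this format:\n"
        "Question: the input question\n"
        "Thought: reason about what to do next\n"
        f"Action: one of [{tool_names}]\n"
        "Action Input: JSON arguments for the action\n"
        "Observation: the result returned by the tool\n"
        "... (Thought/Action/Action Input/Observation can repeat)\n"
        "Final Answer: the answer to the original question\n\n"
        f"Question: {query}\n"
    )

print(build_react_prompt("Draw a cat wearing a hat."))
```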

Long Context Performance

Perplexity (PPL) on arXiv dataset with extended context lengths:
| Model | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
Qwen supports training-free extension of the context length from 2048 to over 8192 tokens by combining NTK-aware interpolation, LogN attention scaling, and local window attention.
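A minimal sketch of turning these mechanisms on when loading the base model is shown below. The `use_dynamic_ntk` and `use_logn_attn` flags follow the fields exposed in Qwen's released configuration, but confirm the exact field names (and how local window attention is enabled) against the checkpoint's config.json.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B"  # assumption: base checkpoint ID on the Hugging Face Hub

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
# NTK-aware interpolation and LogN attention scaling for inputs beyond 2048 tokens.
config.use_dynamic_ntk = True
config.use_logn_attn = True

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, device_map="auto", trust_remote_code=True
).eval()
```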

Additional Resources

For detailed model performance on additional benchmark datasets, please refer to the technical report.
