Qwen models outperform baseline models of similar size across a series of benchmark datasets that evaluate natural language understanding, mathematical problem solving, coding, and more.
Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.
Base Model Benchmarks
Performance comparison across major benchmarks for all Qwen base models:
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|---|---|---|---|---|---|---|---|---|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| Qwen-1.8B | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| Qwen-72B | 77.4 | 83.3 | 78.9 | 35.2 | 35.4 | 52.2 | 67.7 | 83.6 |
For all compared models, we report the better of their officially reported results and the scores from OpenCompass.
World Knowledge
C-Eval is a comprehensive benchmark for testing the common-sense capability of models in Chinese, covering 52 subjects across humanities, social sciences, STEM, and other specialties.
Qwen-7B on C-Eval Validation Set:
| Model | Average |
|---|---|
| Alpaca-7B | 28.9 |
| Vicuna-7B | 31.2 |
| ChatGLM-6B | 37.1 |
| Baichuan-7B | 42.7 |
| ChatGLM2-6B | 50.9 |
| InternLM-7B | 53.4 |
| ChatGPT | 53.5 |
| Claude-v1.3 | 55.5 |
| Qwen-7B | 60.8 |
Qwen-7B on C-Eval Test Set:
| Model | Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|---|
| ChatGLM-6B | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| Chinese-Alpaca-Plus-13B | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| ChatGLM2-6B | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| Qwen-7B | 59.6 | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 |
MMLU evaluates English comprehension abilities across 57 subtasks spanning different academic fields and difficulty levels.
5-shot MMLU Accuracy:
| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| LLaMA2-7B | 45.3 | 36.4 | 51.2 | 42.9 | 52.2 |
| LLaMA-13B | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Baichuan-13B | 51.6 | 41.6 | 60.9 | 47.4 | 58.5 |
| LLaMA2-13B | 54.8 | 44.1 | 62.6 | 52.8 | 61.1 |
| ChatGLM2-12B | 56.2 | 48.2 | 65.1 | 52.6 | 60.9 |
| Qwen-7B | 56.7 | 47.6 | 65.9 | 51.5 | 64.7 |
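For reference, 5-shot evaluation prepends five answered examples from the same subject to each test question. Below is a minimal sketch of the common prompt layout; the header wording and choice formatting follow the usual open-source convention and are assumptions about the exact harness used here.

```python
# Hypothetical 5-shot MMLU prompt assembly; the exact template behind the
# scores above may differ.
CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    # One question block: the stem, the lettered options, and the answer line.
    lines = [question] + [f"{c}. {o}" for c, o in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, shots, test_question, test_options):
    # 'shots' is a list of (question, options, answer) demonstrations.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    demos = "\n\n".join(format_example(q, o, a) for q, o, a in shots)
    return header + demos + "\n\n" + format_example(test_question, test_options)
```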
Coding Capabilities
HumanEval
Zero-shot Pass@1 on the HumanEval benchmark:
| Model | Pass@1 |
|---|---|
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-13B | 12.8 |
| LLaMA-13B | 15.8 |
| MPT-7B | 18.3 |
| LLaMA2-13B | 18.3 |
| Qwen-7B | 24.4 |
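Pass@1 is the fraction of problems whose generated completion passes all unit tests. When more than one sample per problem is drawn, the standard unbiased pass@k estimator from the HumanEval paper applies; a minimal sketch follows (the scores above are assumed to come from single-completion, zero-shot decoding).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: completions sampled per problem
    c: completions that pass all unit tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 3 of 10 sampled completions pass, so pass@1 = c / n = 0.3.
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
```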
Mathematical Reasoning
GSM8K
8-shot accuracy on the GSM8K benchmark:
| Model | Accuracy |
|---|---|
| MPT-7B | 6.8 |
| Falcon-7B | 6.8 |
| Baichuan-7B | 9.7 |
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 14.6 |
| LLaMA-13B | 17.8 |
| Baichuan-13B | 26.6 |
| LLaMA2-13B | 28.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| ChatGLM2-12B | 40.9 |
| Qwen-7B | 51.6 |
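Accuracy is determined by comparing the model's final number against the reference answer, which GSM8K stores after a `####` marker. A minimal extraction sketch, assuming the grading harness here works similarly (the regex and normalization are illustrative):

```python
import re
from typing import Optional

def extract_final_number(text: str) -> Optional[str]:
    """Pull the final numeric answer from a chain-of-thought solution.

    GSM8K reference answers end with '#### <number>'; for free-form model
    output we fall back to the last number that appears in the text.
    """
    marked = re.search(r"####\s*(-?[\d,\.]+)", text)
    if marked:
        candidates = [marked.group(1)]
    else:
        candidates = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not candidates:
        return None
    return candidates[-1].replace(",", "").rstrip(".")

# A prediction is counted as correct when the extracted numbers match.
assert extract_final_number("... so she has 18 eggs left. #### 18") == "18"
```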
Translation Capabilities
WMT22
5-shot BLEU scores on the WMT22 zh-en and en-zh translation tasks:
| Model | Average | zh-en | en-zh |
|---|---|---|---|
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA-13B | 15.8 | 19.5 | 12.0 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Bloom-7B | 20.3 | 19.1 | 21.4 |
| LLaMA2-13B | 23.3 | 22.4 | 24.2 |
| PolyLM-13B | 23.6 | 20.2 | 27.0 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| Qwen-7B | 27.5 | 24.3 | 30.6 |
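BLEU can be computed with a standard scorer such as sacrebleu; here is a minimal sketch with made-up sentences, assuming one reference per source segment (the exact tokenization settings behind the table are not specified here).

```python
import sacrebleu

# Hypothetical system outputs and references, one per source sentence.
hypotheses = ["The cat sat on the mat.", "It is raining heavily today."]
references = ["The cat sat on the mat.", "It rained heavily today."]

# sacrebleu expects a list of reference streams (one list per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```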
World Knowledge (Chat)
Zero-shot C-Eval Validation Set:
| Model | Avg. Acc. |
|---|---|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 40.6 |
| Chinese-Alpaca-2-7B | 41.3 |
| Chinese-Alpaca-Plus-13B | 43.3 |
| Baichuan-13B-Chat | 50.4 |
| ChatGLM2-6B-Chat | 50.7 |
| InternLM-7B-Chat | 53.2 |
| Qwen-7B-Chat | 54.2 |
Zero-shot MMLU:
| Model | Avg. Acc. |
|---|---|
| ChatGLM2-6B-Chat | 45.5 |
| LLaMA2-7B-Chat | 47.0 |
| InternLM-7B-Chat | 50.8 |
| Baichuan-13B-Chat | 52.1 |
| ChatGLM2-12B-Chat | 52.1 |
| Qwen-7B-Chat | 53.9 |
Coding (Chat)
Zero-shot Pass@1 on HumanEval:
| Model | Pass@1 |
|---|---|
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.0 |
| Baichuan-13B-Chat | 16.5 |
| LLaMA2-13B-Chat | 18.9 |
| Qwen-7B-Chat | 24.4 |
Math (Chat)
GSM8K performance:
| Model | Zero-shot Acc. | 4-shot Acc. |
|---|---|---|
| ChatGLM2-6B-Chat | - | 28.0 |
| LLaMA2-7B-Chat | 20.4 | 28.2 |
| LLaMA2-13B-Chat | 29.4 | 36.7 |
| InternLM-7B-Chat | 32.6 | 34.5 |
| Baichuan-13B-Chat | - | 36.3 |
| ChatGLM2-12B-Chat | - | 38.1 |
| Qwen-7B-Chat | 41.1 | 43.5 |
Quantization
Quantized models maintain near-lossless performance while improving memory efficiency:
| Model (quantization) | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| Qwen-1.8B-Chat (BF16) | 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8) | 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4) | 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
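As a usage note, the quantized checkpoints load the same way as the BF16 ones; a minimal sketch with the released Int4 chat model (assumes `transformers` plus the GPTQ dependencies such as `auto-gptq` and `optimum` are installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Int4 (GPTQ) checkpoint is loaded like the BF16 model; trust_remote_code
# pulls in Qwen's custom modeling and chat code.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
).eval()

response, history = model.chat(tokenizer, "Tell me about large language models.", history=None)
print(response)
```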
Tool Usage
Qwen-7B-Chat performance on tool selection and usage:
ReAct Prompting Evaluation:
| Model | Tool Selection (Acc.) ↑ | Tool Input (ROUGE-L) ↑ | False Positive Error ↓ |
|---|---|---|---|
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B | 99% | 0.89 | 9.7% |
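The ReAct setting drives tool calls through a Thought / Action / Action Input / Observation loop. Below is a minimal sketch of such a prompt; the tool and its description are made up for illustration, and Qwen's repository ships its own template.

```python
# Illustrative ReAct-style prompt skeleton; the tool below is hypothetical.
TOOLS = [{
    "name": "web_search",
    "description": "Searches the web and returns the top result.",
    "parameters": '{"query": "the search keywords"}',
}]

def build_react_prompt(question: str) -> str:
    # Describe the available tools, then spell out the ReAct trace format.
    tool_lines = "\n".join(
        f"{t['name']}: {t['description']} Parameters: {t['parameters']}" for t in TOOLS
    )
    tool_names = ", ".join(t["name"] for t in TOOLS)
    return (
        "Answer the following questions as best you can. "
        "You have access to the following tools:\n\n"
        f"{tool_lines}\n\n"
        "Use the following format:\n\n"
        "Question: the input question you must answer\n"
        "Thought: you should always think about what to do\n"
        f"Action: the action to take, should be one of [{tool_names}]\n"
        "Action Input: the input to the action\n"
        "Observation: the result of the action\n"
        "... (this Thought/Action/Action Input/Observation can repeat N times)\n"
        "Thought: I now know the final answer\n"
        "Final Answer: the final answer to the original input question\n\n"
        f"Question: {question}"
    )

print(build_react_prompt("What is the tallest mountain in the world?"))
```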
HuggingFace Agent Benchmark:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|---|---|---|---|
| GPT-4 | 100.00 | 100.00 | 97.41 |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| Qwen-7B | 90.74 | 92.59 | 74.07 |
The plugins in the evaluation set do not appear in Qwen’s training set, demonstrating genuine generalization capability.
Long Context Performance
Perplexity (PPL, lower is better) on the arXiv dataset at context lengths from 1024 to 16384 tokens:
| Model | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
Qwen supports training-free long-context inference, extending the usable context from 2048 to over 8192 tokens with dynamic NTK-aware interpolation (dynamic_ntk), LogN attention scaling (logn), and local window attention (local_attn), corresponding to the rows above.
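In the released checkpoints the first two mechanisms are controlled through the model config; a minimal sketch follows (the `use_dynamic_ntk` and `use_logn_attn` flag names follow Qwen's config.json, while window attention is handled inside the custom modeling code):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Toggle dynamic NTK-aware RoPE interpolation and LogN attention scaling
# through the model config before loading (flag names per Qwen's config.json).
config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
config.use_dynamic_ntk = True   # NTK-aware interpolation, applied at inference time
config.use_logn_attn = True     # LogN attention scaling for sequences beyond training length

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()
```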
Additional Resources
For detailed model performance on additional benchmark datasets, please refer to the technical report.