
Overview

This page provides comprehensive performance benchmarks for Qwen models across different quantization methods, model sizes, and hardware configurations. All measurements are from production testing on real hardware.
Test Environment: A100-SXM4-80G GPU, PyTorch 2.0.1, CUDA 11.8, Flash-Attention 2 (except where noted)

Memory and Speed Comparison

Qwen-1.8B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 4.23 GB | 54.09 | - | - |
| Int8 | 3.48 GB | 55.56 | -18% | +2.7% |
| Int4 | 2.91 GB | 71.07 | -31% | +31.4% |

Qwen-7B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 16.99 GB | 40.93 | - | - |
| Int8 | 11.20 GB | 37.47 | -34% | -8.5% |
| Int4 | 8.21 GB | 50.09 | -52% | +22.4% |

Qwen-14B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 30.15 GB | 32.22 | - | - |
| Int8 | 18.81 GB | 29.28 | -38% | -9.1% |
| Int4 | 13.01 GB | 38.72 | -57% | +20.2% |

Qwen-72B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 144.69 GB (2xA100) | 8.48 | - | - |
| Int8 | 81.27 GB (2xA100) | 9.05 | -44% | +6.7% |
| Int4 | 48.86 GB | 11.32 | -66% | +33.5% |
| BF16 + vLLM | 2xA100 | 17.60 | - | +107.5% |
Key Finding: Int4 quantization provides the best overall value, offering 50-66% memory reduction with 20-33% speed improvement and minimal quality loss (< 2% on most benchmarks).
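The "vs BF16" columns above follow directly from the raw measurements. A minimal sketch of the arithmetic, using the Qwen-7B numbers:

```python
def vs_bf16(bf16_value, quant_value):
    """Percent change of a quantized measurement relative to the BF16 baseline."""
    return (quant_value - bf16_value) / bf16_value * 100

# Qwen-7B Int4 vs BF16, from the table above
mem_change = vs_bf16(16.99, 8.21)     # GPU memory: about -52%
speed_change = vs_bf16(40.93, 50.09)  # speed: about +22%
```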

KV Cache Quantization Impact

Batch Size Scaling

Peak memory for Qwen-7B (BF16) generating 1024 tokens:

| Batch Size | Without KV Cache Quantization | With KV Cache Quantization | Savings | Batch Size Increase |
|------------|-------------------------------|----------------------------|---------|---------------------|
| 1 | 16.3 GB | 15.5 GB | 0.8 GB | - |
| 4 | 24.1 GB | 17.2 GB | 6.9 GB | - |
| 16 | 31.7 GB | 22.3 GB | 9.4 GB | - |
| 32 | 48.7 GB | 30.2 GB | 18.5 GB | - |
| 64 | OOM | 48.2 GB | - | 2x enabled |
| 100 | OOM | 72.4 GB | - | 3x enabled |
KV cache quantization enables 2-3x larger batch sizes, dramatically improving throughput for multi-user serving scenarios.
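The savings grow with batch size because the KV cache itself grows linearly in it. A rough size estimate, assuming Qwen-7B's published configuration (32 layers, 32 heads, head dimension 128); actual peak usage also includes weights and temporary buffers:

```python
def kv_cache_gib(batch, seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_val=2):
    """Approximate KV cache size: a key and a value for every layer, head, and position."""
    vals = 2 * n_layers * n_heads * head_dim * seq_len * batch  # 2 = key + value
    return vals * bytes_per_val / 1024**3

cache_b1 = kv_cache_gib(1, 1024)    # ~0.5 GiB at batch 1
cache_b32 = kv_cache_gib(32, 1024)  # ~16 GiB at batch 32, the bulk of the growth above
```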

Sequence Length Scaling

Peak memory for Qwen-7B (BF16) with batch size 1:

| Sequence Length | Without KV Cache Quantization | With KV Cache Quantization | Savings | % Reduction |
|-----------------|-------------------------------|----------------------------|---------|-------------|
| 512 | 15.2 GB | 15.0 GB | 0.2 GB | 1% |
| 1024 | 16.3 GB | 15.5 GB | 0.8 GB | 5% |
| 2048 | 17.6 GB | 15.8 GB | 1.8 GB | 10% |
| 4096 | 19.5 GB | 16.6 GB | 2.9 GB | 15% |
| 8192 | 23.2 GB | 17.6 GB | 5.6 GB | 24% |
Memory savings scale with sequence length. For 8K+ token generation, KV cache quantization reduces memory by 20-30%.

Combined Quantization

Optimal memory efficiency combines GPTQ weight quantization with KV cache quantization:

Qwen-7B Memory Breakdown

| Configuration | GPU Memory | vs BF16 | Notes |
|---------------|------------|---------|-------|
| BF16 baseline | 16.99 GB | - | No quantization |
| BF16 + KV cache | 15.5 GB | -9% | KV cache only |
| Int8 | 11.20 GB | -34% | GPTQ only |
| Int8 + KV cache | ~10.5 GB | -38% | Combined |
| Int4 | 8.21 GB | -52% | GPTQ only |
| Int4 + KV cache | ~7.5 GB | -56% | Maximum efficiency |
For production deployments with high batch sizes or long sequences, combining Int4 GPTQ with KV cache quantization provides optimal memory efficiency.

Fine-tuning Memory Requirements

Memory usage and speed for fine-tuning Qwen-7B with different methods on a single A100-80GB (cells show peak GPU memory / time per training iteration):

| Method | 256 tokens | 512 tokens | 1024 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|--------|------------|------------|-------------|-------------|-------------|-------------|
| LoRA | 20.1 GB / 1.2s | 20.4 GB / 1.5s | 21.5 GB / 2.8s | 23.8 GB / 5.2s | 29.7 GB / 10.1s | 36.6 GB / 21.3s |
| LoRA (emb) | 33.7 GB / 1.4s | 34.1 GB / 1.6s | 35.2 GB / 2.9s | 35.1 GB / 5.3s | 39.2 GB / 10.3s | 48.5 GB / 21.7s |
| Q-LoRA | 11.5 GB / 3.0s | 11.5 GB / 3.0s | 12.3 GB / 3.5s | 13.9 GB / 7.0s | 16.9 GB / 11.6s | 23.5 GB / 22.3s |
| Full-parameter | 43.5 GB / 2.1s | 43.5 GB / 2.2s | 43.5 GB / 2.2s | 43.5 GB / 2.3s | 47.1 GB / 2.8s | 48.3 GB / 5.6s |
LoRA (emb) includes trainable embedding and output layers, required when introducing new special tokens. Q-LoRA provides the best memory efficiency for fine-tuning.
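LoRA's small footprint follows from how few parameters it trains: each adapted weight matrix gains only two low-rank factors. A back-of-the-envelope sketch (the rank and target-matrix count here are illustrative assumptions, not Qwen's published recipe):

```python
def lora_trainable_params(d_model, rank, n_layers, matrices_per_layer):
    """Each adapted d_model x d_model weight gains A (d x r) and B (r x d) factors."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

# e.g. rank 64 on 4 attention projections per layer of a 32-layer, d_model=4096 model:
n = lora_trainable_params(4096, 64, 32, 4)  # ~67M trainable params, under 1% of 7B
```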

Hardware Recommendations

GPU Selection Guide

RTX 3090 / RTX 4090 (24GB)

Recommended Models:
  • Qwen-7B-Chat-Int4 (8.2GB) ✅
  • Qwen-14B-Chat-Int4 (13GB) ✅
  • Qwen-7B-Chat-Int8 (11.2GB) ✅
Not Recommended:
  • Qwen-14B BF16 (30GB) ❌
  • Qwen-72B any variant ❌
Tip: Use Int4 + KV cache for optimal batch processing

RTX 3060 / RTX 3070 (8-12GB)

Recommended Models:
  • Qwen-1.8B-Chat-Int4 (2.9GB) ✅
  • Qwen-7B-Chat-Int4 (8.2GB) ✅ (tight fit)
Not Recommended:
  • Qwen-7B-Chat-Int8 (11.2GB) ❌
  • Qwen-14B any variant ❌
Tip: Stick to smaller models or use CPU offloading
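The recommendations above can be turned into a quick sanity check. A sketch using the memory figures from the benchmark tables on this page (footprints are inference-time weights only, so the headroom term accounts for KV cache and activations; the default of 4 GB is an assumption, not a measured value):

```python
# Approximate GPU memory footprints (GB) from the benchmark tables above
FOOTPRINT_GB = {
    "Qwen-1.8B-Chat-Int4": 2.91,
    "Qwen-7B-Chat-Int4": 8.21,
    "Qwen-7B-Chat-Int8": 11.20,
    "Qwen-7B-Chat": 16.99,
    "Qwen-14B-Chat-Int4": 13.01,
    "Qwen-14B-Chat": 30.15,
}

def fits(model_name, vram_gb, headroom_gb=4.0):
    """True if the model's weights plus KV-cache/activation headroom fit in VRAM."""
    return FOOTPRINT_GB[model_name] + headroom_gb <= vram_gb

# On a 24 GB RTX 4090:
# fits("Qwen-14B-Chat-Int4", 24) -> True; fits("Qwen-14B-Chat", 24) -> False
```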

Optimization Strategies

For Memory Efficiency

1. Start with Int4 GPTQ

Int4 provides 50-66% memory reduction with minimal quality loss:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
)
```
2. Enable KV cache quantization for batches and long sequences

Add KV cache quantization when running larger batches or longer sequences:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False  # KV cache quantization is incompatible with Flash Attention
)
```
3. Profile your workload

Measure actual memory usage and adjust batch size and sequence length accordingly.

For Speed

1. Use Int4 for best speed

Int4 provides a 20-33% speed improvement over BF16.
2. Enable Flash Attention (if not using KV cache quantization)

Flash Attention 2 improves speed by 30-40% for long sequences:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=True
)
```
3. Use vLLM for production serving

vLLM provides roughly 2x throughput improvement with PagedAttention.
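A minimal vLLM serving sketch, using the same model name as the examples above (sampling values are illustrative; check the vLLM documentation for the flags your version supports):

```python
from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in blocks, so many requests batch efficiently
llm = LLM(model="Qwen/Qwen-7B-Chat", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Give me a short introduction to Qwen."], params)
print(outputs[0].outputs[0].text)
```

For Qwen-72B across two A100s (the BF16 + vLLM row above), vLLM's `tensor_parallel_size=2` option applies.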
4. Optimize batch size

Find the sweet spot between throughput and latency for your hardware.

For Quality

1. Use BF16 baseline for critical applications

No quantization for maximum quality.

2. Use Int8 for quality-sensitive production

Less than 1% accuracy drop on most benchmarks.

3. Evaluate Int4 on your domain

Int4 works well for most tasks but may degrade code generation (HumanEval).

4. Fine-tune after quantization if needed

Q-LoRA fine-tuning can recover some of the quality loss.

Common Bottlenecks

Symptoms: CUDA out-of-memory errors.
Solutions:
  1. Use Int4 quantization (52-66% memory reduction)
  2. Enable KV cache quantization
  3. Reduce batch size or sequence length
  4. Use gradient checkpointing (training)
  5. Consider a smaller model variant

Symptoms: Slow tokens/second, high latency.
Solutions:
  1. Use Int4 quantization (20-33% faster)
  2. Enable Flash Attention 2 (if not using KV cache quantization)
  3. Use vLLM for batch inference
  4. Increase batch size to amortize overhead
  5. Check for CPU-GPU transfer bottlenecks

Symptoms: Poor outputs, hallucinations, reduced accuracy.
Solutions:
  1. Switch from Int4 to Int8 or BF16
  2. Use more calibration data during quantization
  3. Fine-tune after quantization (Q-LoRA)
  4. Evaluate on domain-specific benchmarks
  5. Consider task-specific quantization

Symptoms: Low GPU utilization, poor multi-GPU scaling.
Solutions:
  1. Use tensor parallelism (vLLM, DeepSpeed)
  2. Increase batch size per GPU
  3. Use pipeline parallelism for very large models
  4. Check network bandwidth between GPUs
  5. Avoid DeepSpeed ZeRO 3 for multi-node setups (slow)

Benchmark Methodology

Inference Benchmarks

Setup:
  • Hardware: A100-SXM4-80G GPU (single GPU unless noted)
  • Software: PyTorch 2.0.1, CUDA 11.8, Flash-Attention 2
  • Task: Generate 2048 tokens from a short prompt
  • Metric: Average tokens/second (includes prompt processing)
  • Script: profile.py

Quality Benchmarks

Benchmarks:
  • MMLU: 5-shot multiple-choice questions (57 subjects)
  • C-Eval: 5-shot Chinese evaluation (52 subjects)
  • GSM8K: 8-shot grade school math problems
  • HumanEval: 0-shot Python code generation
Evaluation: Best scores from official reports and OpenCompass

Memory Benchmarks

Measurement:
  • Peak GPU memory during generation (via torch.cuda.max_memory_allocated())
  • Includes model weights, KV cache, and temporary buffers
  • Single-batch inference unless otherwise noted

Next Steps

GPTQ Quantization

Implement GPTQ quantization based on these benchmarks

KV Cache Quantization

Enable KV cache quantization for your use case

Deployment Guide

Deploy optimized models in production

Fine-tuning

Fine-tune quantized models for your domain
