
Overview

This page provides comprehensive performance benchmarks for Qwen models across different quantization methods, model sizes, and hardware configurations. All measurements are from production testing on real hardware.
Test Environment: A100-SXM4-80G GPU, PyTorch 2.0.1, CUDA 11.8, Flash-Attention 2 (except where noted)

Memory and Speed Comparison

Qwen-1.8B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 4.23 GB | 54.09 | - | - |
| Int8 | 3.48 GB | 55.56 | -18% | +2.7% |
| Int4 | 2.91 GB | 71.07 | -31% | +31.4% |

Qwen-7B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 16.99 GB | 40.93 | - | - |
| Int8 | 11.20 GB | 37.47 | -34% | -8.5% |
| Int4 | 8.21 GB | 50.09 | -52% | +22.4% |

Qwen-14B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 30.15 GB | 32.22 | - | - |
| Int8 | 18.81 GB | 29.28 | -38% | -9.1% |
| Int4 | 13.01 GB | 38.72 | -57% | +20.2% |

Qwen-72B

| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|-----------|------------|------------------|----------------|---------------|
| BF16 | 144.69 GB (2xA100) | 8.48 | - | - |
| Int8 | 81.27 GB (2xA100) | 9.05 | -44% | +6.7% |
| Int4 | 48.86 GB | 11.32 | -66% | +33.5% |
| BF16 + vLLM | 2xA100 | 17.60 | - | +107.5% |
Key Finding: Int4 quantization provides the best overall value, offering 50-66% memory reduction with 20-33% speed improvement and minimal quality loss (< 2% on most benchmarks).
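The "vs BF16" columns above follow directly from the raw measurements. A minimal sketch of the arithmetic, using the Qwen-7B numbers:

```python
def vs_bf16(bf16_value, quant_value):
    """Percent change of a quantized measurement relative to the BF16 baseline."""
    return (quant_value - bf16_value) / bf16_value * 100

# Qwen-7B Int4 vs BF16, from the table above
mem_change = vs_bf16(16.99, 8.21)     # GPU memory: about -52%
speed_change = vs_bf16(40.93, 50.09)  # speed: about +22%
```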

KV Cache Quantization Impact

Batch Size Scaling

Peak memory for Qwen-7B (BF16) generating 1024 tokens:

| Batch Size | Without KV Cache Quantization | With KV Cache Quantization | Savings | Batch Size Increase |
|------------|-------------------------------|----------------------------|---------|---------------------|
| 1 | 16.3 GB | 15.5 GB | 0.8 GB | - |
| 4 | 24.1 GB | 17.2 GB | 6.9 GB | - |
| 16 | 31.7 GB | 22.3 GB | 9.4 GB | - |
| 32 | 48.7 GB | 30.2 GB | 18.5 GB | - |
| 64 | OOM | 48.2 GB | - | 2x enabled |
| 100 | OOM | 72.4 GB | - | 3x enabled |
KV cache quantization enables 2-3x larger batch sizes, dramatically improving throughput for multi-user serving scenarios.
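The savings grow with batch size because the KV cache itself grows linearly in it. A rough size estimate, assuming Qwen-7B's published configuration (32 layers, 32 heads, head dimension 128); actual peak usage also includes weights and temporary buffers:

```python
def kv_cache_gib(batch, seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_val=2):
    """Approximate KV cache size: a key and a value for every layer, head, and position."""
    vals = 2 * n_layers * n_heads * head_dim * seq_len * batch  # 2 = key + value
    return vals * bytes_per_val / 1024**3

cache_b1 = kv_cache_gib(1, 1024)    # ~0.5 GiB at batch 1
cache_b32 = kv_cache_gib(32, 1024)  # ~16 GiB at batch 32, the bulk of the growth above
```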

Sequence Length Scaling

Peak memory for Qwen-7B (BF16) with batch size 1:

| Sequence Length | Without KV Cache Quantization | With KV Cache Quantization | Savings | % Reduction |
|-----------------|-------------------------------|----------------------------|---------|-------------|
| 512 | 15.2 GB | 15.0 GB | 0.2 GB | 1% |
| 1024 | 16.3 GB | 15.5 GB | 0.8 GB | 5% |
| 2048 | 17.6 GB | 15.8 GB | 1.8 GB | 10% |
| 4096 | 19.5 GB | 16.6 GB | 2.9 GB | 15% |
| 8192 | 23.2 GB | 17.6 GB | 5.6 GB | 24% |
Memory savings scale with sequence length. For 8K+ token generation, KV cache quantization reduces memory by 20-30%.

Combined Quantization

Optimal memory efficiency combines GPTQ weight quantization with KV cache quantization:

Qwen-7B Memory Breakdown

| Configuration | GPU Memory | vs BF16 | Notes |
|---------------|------------|---------|-------|
| BF16 baseline | 16.99 GB | - | No quantization |
| BF16 + KV cache | 15.5 GB | -9% | KV cache only |
| Int8 | 11.20 GB | -34% | GPTQ only |
| Int8 + KV cache | ~10.5 GB | -38% | Combined |
| Int4 | 8.21 GB | -52% | GPTQ only |
| Int4 + KV cache | ~7.5 GB | -56% | Maximum efficiency |
For production deployments with high batch sizes or long sequences, combining Int4 GPTQ with KV cache quantization provides optimal memory efficiency.

Fine-tuning Memory Requirements

Memory usage and speed for fine-tuning Qwen-7B with different methods on a single A100-80GB (cells show peak GPU memory / time per training iteration):

| Method | 256 tokens | 512 tokens | 1024 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|--------|------------|------------|-------------|-------------|-------------|-------------|
| LoRA | 20.1 GB / 1.2s | 20.4 GB / 1.5s | 21.5 GB / 2.8s | 23.8 GB / 5.2s | 29.7 GB / 10.1s | 36.6 GB / 21.3s |
| LoRA (emb) | 33.7 GB / 1.4s | 34.1 GB / 1.6s | 35.2 GB / 2.9s | 35.1 GB / 5.3s | 39.2 GB / 10.3s | 48.5 GB / 21.7s |
| Q-LoRA | 11.5 GB / 3.0s | 11.5 GB / 3.0s | 12.3 GB / 3.5s | 13.9 GB / 7.0s | 16.9 GB / 11.6s | 23.5 GB / 22.3s |
| Full-parameter | 43.5 GB / 2.1s | 43.5 GB / 2.2s | 43.5 GB / 2.2s | 43.5 GB / 2.3s | 47.1 GB / 2.8s | 48.3 GB / 5.6s |
LoRA (emb) includes trainable embedding and output layers, required when introducing new special tokens. Q-LoRA provides the best memory efficiency for fine-tuning.
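LoRA's small footprint follows from how few parameters it trains: each adapted weight matrix gains only two low-rank factors. A back-of-the-envelope sketch (the rank and target-matrix count here are illustrative assumptions, not Qwen's published recipe):

```python
def lora_trainable_params(d_model, rank, n_layers, matrices_per_layer):
    """Each adapted d_model x d_model weight gains A (d x r) and B (r x d) factors."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

# e.g. rank 64 on 4 attention projections per layer of a 32-layer, d_model=4096 model:
n = lora_trainable_params(4096, 64, 32, 4)  # ~67M trainable params, under 1% of 7B
```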

Hardware Recommendations

GPU Selection Guide

RTX 3090 / RTX 4090 (24GB)

Recommended Models:
  • Qwen-7B-Chat-Int4 (8.2GB) ✅
  • Qwen-14B-Chat-Int4 (13GB) ✅
  • Qwen-7B-Chat-Int8 (11.2GB) ✅
Not Recommended:
  • Qwen-14B BF16 (30GB) ❌
  • Qwen-72B any variant ❌
Tip: Use Int4 + KV cache for optimal batch processing

RTX 3060 / RTX 3070 (8-12GB)

Recommended Models:
  • Qwen-1.8B-Chat-Int4 (2.9GB) ✅
  • Qwen-7B-Chat-Int4 (8.2GB) ✅ (tight fit)
Not Recommended:
  • Qwen-7B-Chat-Int8 (11.2GB) ❌
  • Qwen-14B any variant ❌
Tip: Stick to smaller models or use CPU offloading
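The recommendations above can be turned into a quick sanity check. A sketch using the memory figures from the benchmark tables on this page (footprints are inference-time weights only, so the headroom term accounts for KV cache and activations; the default of 4 GB is an assumption, not a measured value):

```python
# Approximate GPU memory footprints (GB) from the benchmark tables above
FOOTPRINT_GB = {
    "Qwen-1.8B-Chat-Int4": 2.91,
    "Qwen-7B-Chat-Int4": 8.21,
    "Qwen-7B-Chat-Int8": 11.20,
    "Qwen-7B-Chat": 16.99,
    "Qwen-14B-Chat-Int4": 13.01,
    "Qwen-14B-Chat": 30.15,
}

def fits(model_name, vram_gb, headroom_gb=4.0):
    """True if the model's weights plus KV-cache/activation headroom fit in VRAM."""
    return FOOTPRINT_GB[model_name] + headroom_gb <= vram_gb

# On a 24 GB RTX 4090:
# fits("Qwen-14B-Chat-Int4", 24) -> True; fits("Qwen-14B-Chat", 24) -> False
```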

Optimization Strategies

For Memory Efficiency

1. Start with Int4 GPTQ

Int4 provides 50-66% memory reduction with minimal quality loss:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
)
```
2. Enable KV cache quantization for batches and long sequences

Add KV cache quantization when running larger batches or longer sequences:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False  # KV cache quantization is incompatible with Flash Attention
)
```
3. Profile your workload

Measure actual memory usage and adjust batch size and sequence length accordingly.

For Speed

1. Use Int4 for best speed

Int4 provides a 20-33% speed improvement over BF16.
2. Enable Flash Attention (if not using KV cache quantization)

Flash Attention 2 improves speed by 30-40% for long sequences:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=True
)
```
3. Use vLLM for production serving

vLLM provides roughly 2x throughput improvement with PagedAttention.
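A minimal vLLM serving sketch, using the same model name as the examples above (sampling values are illustrative; check the vLLM documentation for the flags your version supports):

```python
from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in blocks, so many requests batch efficiently
llm = LLM(model="Qwen/Qwen-7B-Chat", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Give me a short introduction to Qwen."], params)
print(outputs[0].outputs[0].text)
```

For Qwen-72B across two A100s (the BF16 + vLLM row above), vLLM's `tensor_parallel_size=2` option applies.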
4. Optimize batch size

Find the sweet spot between throughput and latency for your hardware.

For Quality

1. Use BF16 baseline for critical applications

No quantization for maximum quality.

2. Use Int8 for quality-sensitive production

Less than 1% accuracy drop on most benchmarks.

3. Evaluate Int4 on your domain

Int4 works well for most tasks but may degrade code generation (HumanEval).

4. Fine-tune after quantization if needed

Q-LoRA fine-tuning can recover some of the quality loss.

Common Bottlenecks

Symptoms: CUDA out-of-memory errors.
Solutions:
  1. Use Int4 quantization (52-66% memory reduction)
  2. Enable KV cache quantization
  3. Reduce batch size or sequence length
  4. Use gradient checkpointing (training)
  5. Consider a smaller model variant

Symptoms: Slow tokens/second, high latency.
Solutions:
  1. Use Int4 quantization (20-33% faster)
  2. Enable Flash Attention 2 (if not using KV cache quantization)
  3. Use vLLM for batch inference
  4. Increase batch size to amortize overhead
  5. Check for CPU-GPU transfer bottlenecks

Symptoms: Poor outputs, hallucinations, reduced accuracy.
Solutions:
  1. Switch from Int4 to Int8 or BF16
  2. Use more calibration data during quantization
  3. Fine-tune after quantization (Q-LoRA)
  4. Evaluate on domain-specific benchmarks
  5. Consider task-specific quantization

Symptoms: Low GPU utilization, poor multi-GPU scaling.
Solutions:
  1. Use tensor parallelism (vLLM, DeepSpeed)
  2. Increase batch size per GPU
  3. Use pipeline parallelism for very large models
  4. Check network bandwidth between GPUs
  5. Avoid DeepSpeed ZeRO 3 for multi-node setups (slow)

Benchmark Methodology

Inference Benchmarks

Setup:
  • Hardware: A100-SXM4-80G GPU (single GPU unless noted)
  • Software: PyTorch 2.0.1, CUDA 11.8, Flash-Attention 2
  • Task: Generate 2048 tokens from a short prompt
  • Metric: Average tokens/second (includes prompt processing)
  • Script: profile.py

Quality Benchmarks

Benchmarks:
  • MMLU: 5-shot multiple-choice questions (57 subjects)
  • C-Eval: 5-shot Chinese evaluation (52 subjects)
  • GSM8K: 8-shot grade school math problems
  • HumanEval: 0-shot Python code generation
Evaluation: Best scores from official reports and OpenCompass

Memory Benchmarks

Measurement:
  • Peak GPU memory during generation (via torch.cuda.max_memory_allocated())
  • Includes model weights, KV cache, and temporary buffers
  • Single-batch inference unless otherwise noted

Next Steps

GPTQ Quantization

Implement GPTQ quantization based on these benchmarks

KV Cache Quantization

Enable KV cache quantization for your use case

Deployment Guide

Deploy optimized models in production

Fine-tuning

Fine-tune quantized models for your domain
