Overview
This page provides comprehensive performance benchmarks for Qwen models across different quantization methods, model sizes, and hardware configurations. All measurements are from production testing on real hardware.
Test Environment: A100-SXM4-80G GPU, PyTorch 2.0.1, CUDA 11.8, Flash-Attention 2 (except where noted)
Memory and Speed Comparison
Qwen-1.8B
| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|---|---|---|---|---|
| BF16 | 4.23 GB | 54.09 | - | - |
| Int8 | 3.48 GB | 55.56 | -18% | +2.7% |
| Int4 | 2.91 GB | 71.07 | -31% | +31.4% |
Qwen-7B
| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|---|---|---|---|---|
| BF16 | 16.99 GB | 40.93 | - | - |
| Int8 | 11.20 GB | 37.47 | -34% | -8.5% |
| Int4 | 8.21 GB | 50.09 | -52% | +22.4% |
Qwen-14B
| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|---|---|---|---|---|
| BF16 | 30.15 GB | 32.22 | - | - |
| Int8 | 18.81 GB | 29.28 | -38% | -9.1% |
| Int4 | 13.01 GB | 38.72 | -57% | +20.2% |
Qwen-72B
| Precision | GPU Memory | Speed (tokens/s) | vs BF16 Memory | vs BF16 Speed |
|---|---|---|---|---|
| BF16 | 144.69 GB (2xA100) | 8.48 | - | - |
| Int8 | 81.27 GB (2xA100) | 9.05 | -44% | +6.7% |
| Int4 | 48.86 GB | 11.32 | -66% | +33.5% |
| BF16 + vLLM | 2xA100 | 17.60 | - | +107.5% |
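Choosing among these rows for a given card can be done mechanically. Below is a sketch (the names `best_precision`, `QWEN_7B`, and the 0.9 headroom factor are our own, not from the benchmark scripts) that picks the fastest precision fitting a VRAM budget, using the Qwen-7B figures from the table above.

```python
# (peak memory GB, tokens/s) pairs measured for Qwen-7B in the table above.
QWEN_7B = {
    "BF16": (16.99, 40.93),
    "Int8": (11.20, 37.47),
    "Int4": (8.21, 50.09),
}

def best_precision(rows, budget_gb, headroom=0.9):
    """Fastest precision whose peak memory stays under headroom * budget_gb.

    The 0.9 headroom factor is an assumption, leaving room for the CUDA
    context and KV cache growth. Returns None if nothing fits.
    """
    fitting = [p for p, (mem_gb, _) in rows.items() if mem_gb <= budget_gb * headroom]
    if not fitting:
        return None
    return max(fitting, key=lambda p: rows[p][1])

print(best_precision(QWEN_7B, 24))  # prints: Int4
print(best_precision(QWEN_7B, 8))   # prints: None
```

Note that Int4 wins even when BF16 fits, because on this hardware it is both smaller and faster for single-stream decoding.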
KV Cache Quantization Impact
Batch Size Scaling
Performance of Qwen-7B (BF16) generating 1024 tokens:
| Batch Size | Without KV Cache Quant. | With KV Cache Quant. | Savings | Batch Size Increase |
|---|---|---|---|---|
| 1 | 16.3 GB | 15.5 GB | 0.8 GB | - |
| 4 | 24.1 GB | 17.2 GB | 6.9 GB | - |
| 16 | 31.7 GB | 22.3 GB | 9.4 GB | - |
| 32 | 48.7 GB | 30.2 GB | 18.5 GB | - |
| 64 | OOM | 48.2 GB | - | 2x enabled |
| 100 | OOM | 72.4 GB | - | 3x enabled |
KV cache quantization enables 2-3x larger batch sizes, dramatically improving throughput for multi-user serving scenarios.
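These savings can be sanity-checked with a back-of-envelope cache-size formula. The sketch below assumes Qwen-7B's architecture (32 layers, 32 attention heads of dimension 128; verify against the model's config.json) and counts only the K and V tensors, so it is a lower bound: the measured savings in the table also include transient buffers freed alongside the cache.

```python
def kv_cache_gib(batch, seq_len, n_layers=32, n_heads=32, head_dim=128,
                 bytes_per_elem=2):
    """Rough KV cache size in GiB: one K and one V tensor per layer.

    Defaults assume Qwen-7B's architecture; bytes_per_elem is 2 for
    fp16/bf16 and 1 for an int8-quantized cache.
    """
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem  # K + V
    return batch * seq_len * per_token / 1024**3

fp16 = kv_cache_gib(32, 1024)                    # batch 32, 1024 tokens
int8 = kv_cache_gib(32, 1024, bytes_per_elem=1)
print(fp16, fp16 - int8)  # prints: 16.0 8.0
```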
Sequence Length Scaling
Performance of Qwen-7B (BF16) with batch size 1:
| Sequence Length | Without KV Cache Quant. | With KV Cache Quant. | Savings | % Reduction |
|---|---|---|---|---|
| 512 | 15.2 GB | 15.0 GB | 0.2 GB | 1% |
| 1024 | 16.3 GB | 15.5 GB | 0.8 GB | 5% |
| 2048 | 17.6 GB | 15.8 GB | 1.8 GB | 10% |
| 4096 | 19.5 GB | 16.6 GB | 2.9 GB | 15% |
| 8192 | 23.2 GB | 17.6 GB | 5.6 GB | 24% |
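The "% Reduction" column is derived directly from the two memory columns; a one-liner (the function name is ours) reproduces it:

```python
def pct_reduction(without_gb, with_gb):
    """Percent memory saved by KV cache quantization, rounded as in the table."""
    return round(100 * (without_gb - with_gb) / without_gb)

print(pct_reduction(23.2, 17.6))  # the 8192-token row: prints 24
```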
Combined Quantization
Optimal memory efficiency combines GPTQ weight quantization with KV cache quantization.
Qwen-7B Memory Breakdown
| Configuration | GPU Memory | vs BF16 | Notes |
|---|---|---|---|
| BF16 baseline | 16.99 GB | - | No quantization |
| BF16 + KV cache | 15.5 GB | -9% | KV cache only |
| Int8 | 11.20 GB | -34% | GPTQ only |
| Int8 + KV cache | ~10.5 GB | -38% | Combined |
| Int4 | 8.21 GB | -52% | GPTQ only |
| Int4 + KV cache | ~7.5 GB | -56% | Maximum efficiency |
Fine-tuning Memory Requirements
Memory usage and speed for fine-tuning Qwen-7B with different methods (single A100-80GB); each cell shows peak GPU memory / time per training iteration:
| Method | 256 tokens | 512 tokens | 1024 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|---|---|---|---|---|---|---|
| LoRA | 20.1GB / 1.2s | 20.4GB / 1.5s | 21.5GB / 2.8s | 23.8GB / 5.2s | 29.7GB / 10.1s | 36.6GB / 21.3s |
| LoRA (emb) | 33.7GB / 1.4s | 34.1GB / 1.6s | 35.2GB / 2.9s | 35.1GB / 5.3s | 39.2GB / 10.3s | 48.5GB / 21.7s |
| Q-LoRA | 11.5GB / 3.0s | 11.5GB / 3.0s | 12.3GB / 3.5s | 13.9GB / 7.0s | 16.9GB / 11.6s | 23.5GB / 22.3s |
| Full-parameter | 43.5GB / 2.1s | 43.5GB / 2.2s | 43.5GB / 2.2s | 43.5GB / 2.3s | 47.1GB / 2.8s | 48.3GB / 5.6s |
LoRA (emb) includes trainable embedding and output layers, required when introducing new special tokens. Q-LoRA provides the best memory efficiency for fine-tuning.
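A small lookup over the table above can answer "which methods fit my card?". The dictionary and function names below are hypothetical; the figures are the measured peaks copied from the table.

```python
# Peak fine-tuning memory (GB) per (method, max sequence length),
# copied from the measured table above.
FT_MEMORY_GB = {
    "LoRA":           {256: 20.1, 512: 20.4, 1024: 21.5, 2048: 23.8, 4096: 29.7, 8192: 36.6},
    "LoRA (emb)":     {256: 33.7, 512: 34.1, 1024: 35.2, 2048: 35.1, 4096: 39.2, 8192: 48.5},
    "Q-LoRA":         {256: 11.5, 512: 11.5, 1024: 12.3, 2048: 13.9, 4096: 16.9, 8192: 23.5},
    "Full-parameter": {256: 43.5, 512: 43.5, 1024: 43.5, 2048: 43.5, 4096: 47.1, 8192: 48.3},
}

def methods_that_fit(seq_len, vram_gb):
    """Fine-tuning methods whose measured peak memory at seq_len fits vram_gb."""
    return sorted(m for m, by_seq in FT_MEMORY_GB.items() if by_seq[seq_len] <= vram_gb)

print(methods_that_fit(8192, 24))  # prints: ['Q-LoRA']
```

On a 24 GB consumer card, Q-LoRA is the only method that survives 8192-token sequences, which matches the memory-efficiency claim above.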
Hardware Recommendations
GPU Selection Guide
RTX 3090 / RTX 4090 (24GB)
Recommended Models:
- Qwen-7B-Chat-Int4 (8.2GB) ✅
- Qwen-14B-Chat-Int4 (13GB) ✅
- Qwen-7B-Chat-Int8 (11.2GB) ✅
- Qwen-14B BF16 (30GB) ❌
- Qwen-72B any variant ❌
RTX 3060 / RTX 3070 (8-12GB)
Recommended Models:
- Qwen-1.8B-Chat-Int4 (2.9GB) ✅
- Qwen-7B-Chat-Int4 (8.2GB) ✅ (tight fit)
- Qwen-7B-Chat-Int8 (11.2GB) ❌
- Qwen-14B any variant ❌
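The ✅/❌ marks above follow mechanically from the inference tables earlier on this page. A sketch of that check (the names and the 2 GB headroom figure are assumptions on our part):

```python
# Measured peak inference memory (GB) from the tables earlier on this page.
MODEL_MEMORY_GB = {
    "Qwen-1.8B-Chat-Int4": 2.91,
    "Qwen-7B-Chat-Int4": 8.21,
    "Qwen-7B-Chat-Int8": 11.20,
    "Qwen-14B-Chat-Int4": 13.01,
    "Qwen-14B-Chat-BF16": 30.15,
}

def fits(model, vram_gb, headroom_gb=2.0):
    """True if the model's measured peak leaves headroom_gb spare.

    The 2 GB headroom is an assumed margin for the CUDA context and
    KV cache growth at longer sequences.
    """
    return MODEL_MEMORY_GB[model] + headroom_gb <= vram_gb

print(fits("Qwen-14B-Chat-Int4", 24))  # prints: True  (3090/4090)
print(fits("Qwen-7B-Chat-Int8", 12))   # prints: False (3060 12 GB)
```

With this margin, Qwen-7B-Chat-Int4 still passes on a 12 GB card but not on 8 GB, consistent with the "tight fit" note above.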
Optimization Strategies
For Memory Efficiency
Use Int4 (GPTQ) quantization first; it cuts weight memory by 52-66% per the tables above
Add KV cache quantization for larger batches or longer sequences
For Speed
Enable Flash Attention 2 (not compatible with KV cache quantization)
Flash Attention 2 improves speed by 30-40% for long sequences
For Quality
Evaluate Int4 on your domain
Int4 works well for most tasks but may degrade code generation (HumanEval)
Common Bottlenecks
OOM during inference
Symptoms: CUDA out of memory errors
Solutions:
- Use Int4 quantization (52-66% memory reduction)
- Enable KV cache quantization
- Reduce batch size or sequence length
- Use gradient checkpointing (training)
- Consider smaller model variant
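The batch-size reduction step can be automated with an OOM fallback loop. This is a sketch: in real PyTorch code you would catch `torch.cuda.OutOfMemoryError` (a `RuntimeError` subclass), while the stand-in `fake_generate` below mimics the Qwen-7B table above, where batch 64 OOMs without KV cache quantization but batch 32 fits.

```python
def generate_with_fallback(generate, batch, min_batch=1):
    """Call generate(batch), halving the batch size after each OOM-style
    RuntimeError, until it succeeds or min_batch is exhausted."""
    while batch >= min_batch:
        try:
            return batch, generate(batch)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # not an OOM: re-raise unchanged
            batch //= 2
    raise RuntimeError("out of memory even at min_batch")

# Stand-in for model.generate; OOMs above batch 32, like the table above.
def fake_generate(batch):
    if batch > 32:
        raise RuntimeError("CUDA out of memory")
    return ["output"] * batch

print(generate_with_fallback(fake_generate, 64)[0])  # prints: 32
```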
Slow generation speed
Symptoms: Slow tokens/second, high latency
Solutions:
- Use Int4 quantization (20-33% faster)
- Enable Flash Attention 2 (not compatible with KV cache quantization)
- Use vLLM for batch inference
- Increase batch size to amortize overhead
- Check for CPU-GPU bottlenecks
Quality degradation
Symptoms: Poor outputs, hallucinations, reduced accuracy
Solutions:
- Switch from Int4 to Int8 or BF16
- Use more calibration data during quantization
- Fine-tune after quantization (Q-LoRA)
- Evaluate on domain-specific benchmarks
- Consider task-specific quantization
Multi-GPU inefficiency
Symptoms: Low GPU utilization, poor scaling
Solutions:
- Use tensor parallelism (vLLM, DeepSpeed)
- Increase batch size per GPU
- Use pipeline parallelism for very large models
- Check network bandwidth between GPUs
- Avoid DeepSpeed ZeRO 3 for multi-node (slow)
Benchmark Methodology
Inference Benchmarks
Setup:
- Hardware: A100-SXM4-80G GPU (single GPU unless noted)
- Software: PyTorch 2.0.1, CUDA 11.8, Flash-Attention 2
- Task: Generate 2048 tokens from a short prompt
- Metric: Average tokens/second (includes prompt processing)
- Script: profile.py
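The tokens/second metric can be reproduced with a simple end-to-end timer. In the sketch below, `run_generation` is a hypothetical stand-in for a model.generate call so the code runs without a GPU; timing the whole call means prompt processing is included, matching the methodology.

```python
import time

def tokens_per_second(run_generation, n_tokens):
    """Average tokens/s for one end-to-end generation call,
    prompt processing included."""
    start = time.perf_counter()
    run_generation(n_tokens)
    return n_tokens / (time.perf_counter() - start)

# Stand-in generation step (sleeps instead of running a model).
speed = tokens_per_second(lambda n: time.sleep(0.01), 2048)
print(speed > 0)  # prints: True
```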
Quality Benchmarks
Benchmarks:
- MMLU: 5-shot multiple-choice questions (57 subjects)
- C-Eval: 5-shot Chinese evaluation (52 subjects)
- GSM8K: 8-shot grade school math problems
- HumanEval: 0-shot Python code generation
Memory Benchmarks
Measurement:
- Peak GPU memory during generation (via torch.cuda.max_memory_allocated())
- Includes model weights, KV cache, and temporary buffers
- Single-batch inference unless otherwise noted
Next Steps
- GPTQ Quantization: implement GPTQ quantization based on these benchmarks
- KV Cache Quantization: enable KV cache quantization for your use case
- Deployment Guide: deploy optimized models in production
- Fine-tuning: fine-tune quantized models for your domain