What is Quantization?
Quantization reduces the precision of model weights and activations from floating-point (BF16/FP32) to lower-bit integers (Int8/Int4), significantly decreasing memory usage and improving inference speed with minimal impact on model performance.
Why Use Quantization?
Qwen models can be quantized to run efficiently on hardware with limited GPU memory:
Reduced Memory
Int4 models use ~75% less memory than BF16
Faster Inference
Up to 1.4x speed improvement with Int4
Minimal Quality Loss
Less than 2 points of accuracy drop on most benchmarks
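The memory figure follows directly from parameter count and bit width. A back-of-the-envelope sketch (parameter counts are nominal, and real quantized checkpoints carry small extra overhead for scales and zero-points):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage in GB for a given precision."""
    return n_params * bits / 8 / 1e9

params_7b = 7e9
bf16 = weight_memory_gb(params_7b, 16)  # ~14 GB at 2 bytes/param
int4 = weight_memory_gb(params_7b, 4)   # ~3.5 GB at 0.5 bytes/param
saving = 1 - int4 / bf16                # 0.75 -> the "~75% less memory" figure
```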
Quantization Methods
Qwen supports two complementary quantization techniques:
GPTQ Quantization
GPTQ quantizes model weights to Int4 or Int8, reducing the model size while maintaining accuracy. It uses the AutoGPTQ library for post-training quantization.
GPTQ Details
Learn how to use GPTQ quantization for Int4 and Int8 models
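GPTQ itself minimizes layer-wise reconstruction error using second-order information, which is beyond a short snippet, but the core idea of mapping floats onto a small integer grid can be shown with a toy round-to-nearest sketch (this illustrates weight quantization in general, not the GPTQ algorithm):

```python
def quantize_int8(weights):
    """Symmetric round-to-nearest quantization to signed 8-bit."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid scale=0 for all-zero input
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integers back to approximate float values."""
    return [qi * scale for qi in q]

w = [0.31, -1.27, 0.05, 0.92]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # close to w, within half a quantization step
```

Each value is stored as a single signed byte plus one shared scale, which is where the 4x memory reduction over FP32 (2x over BF16) comes from.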
KV Cache Quantization
KV cache quantization compresses the key-value attention cache from FP16 to Int8, enabling larger batch sizes and longer sequences without running out of memory.
KV Cache Details
Learn how to enable KV cache quantization for memory-efficient inference
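The payoff is easy to quantify: cache size scales linearly with bytes per element, so going from FP16 to Int8 halves it. A sketch using the standard cache-size formula (the layer/head numbers below are illustrative, roughly 7B-scale, not an exact Qwen configuration):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem):
    """Size of the attention KV cache; the factor 2 covers keys and values."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-scale shape: 32 layers, 32 heads, head dim 128
fp16 = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=4, bytes_per_elem=2)
int8 = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=4, bytes_per_elem=1)
# int8 is exactly half of fp16, freeing memory for larger batches
# or longer sequences at the same budget
```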
Available Quantized Models
Qwen provides pre-quantized models for immediate use:
- Qwen-1.8B
- Qwen-7B
- Qwen-14B
- Qwen-72B
Performance measured on A100-SXM4-80G GPU with PyTorch 2.0.1, CUDA 11.8, and Flash-Attention 2.
Model Quality Comparison
Quantization maintains strong performance across benchmarks:
| Model | Quantization | MMLU | C-Eval | GSM8K | HumanEval |
|---|---|---|---|---|---|
| Qwen-7B-Chat | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat | Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat | Int4 | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat | BF16 | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat | Int8 | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat | Int4 | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat | BF16 | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat | Int8 | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat | Int4 | 73.4 | 80.1 | 75.3 | 61.6 |
Quick Start
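A minimal sketch of loading an Int4 GPTQ checkpoint through Hugging Face `transformers` (this assumes the `auto-gptq` and `optimum` packages are installed; `model.chat` is provided by Qwen's remote code, hence `trust_remote_code=True`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat-Int4"  # pre-quantized GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # place layers across available GPUs
    trust_remote_code=True,
).eval()

response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
```

No conversion step is needed; the quantized weights are dequantized on the fly by the GPTQ kernels during inference.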
Pre-quantized checkpoints load through the standard `transformers` API; no separate conversion step is needed.
Combining Techniques
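To stack both techniques, a sketch of pairing GPTQ Int4 weights with an Int8 KV cache; the `use_cache_quantization`, `use_cache_kernel`, and `use_flash_attn` flags are defined in Qwen's remote modeling code (pulled in via `trust_remote_code=True`), not in `transformers` itself:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",     # GPTQ Int4 weights
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,  # Int8 KV cache (Qwen remote-code flag)
    use_cache_kernel=True,        # custom kernel for the quantized cache
    use_flash_attn=False,         # Qwen's examples disable flash attention here
).eval()
```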
For maximum memory efficiency, combine GPTQ and KV cache quantization; the two compress different memory consumers (model weights versus the attention cache), so their savings stack.
Next Steps
GPTQ Quantization
Detailed guide on using and creating GPTQ quantized models
KV Cache Quantization
Enable KV cache quantization for larger batch sizes
Performance Benchmarks
Complete performance analysis and optimization tips
Quickstart Guide
Get started with Qwen models