
What is Quantization?

Quantization reduces the precision of model weights and activations from floating-point (BF16/FP32) to lower-bit integers (Int8/Int4), significantly decreasing memory usage and improving inference speed with minimal impact on model performance.
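The core idea can be sketched in a few lines of plain Python (a generic symmetric per-tensor Int8 scheme for illustration, not Qwen's exact implementation):

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map floats onto the integer range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximation of the original floats."""
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 2.54]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

Each weight now occupies 8 bits instead of 16 or 32, at the cost of a small, bounded rounding error; real schemes apply this per channel or per group rather than per tensor.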

Why Use Quantization?

Qwen models can be quantized to run efficiently on hardware with limited GPU memory:

Reduced Memory

Int4 models use ~75% less memory than BF16

Faster Inference

Up to 1.4x speed improvement with Int4

Minimal Quality Loss

Less than 2% accuracy drop on benchmarks
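The ~75% memory figure follows directly from the bit widths: BF16 stores 16 bits per weight while Int4 stores 4 (plus a small overhead for scales, ignored here). A quick back-of-the-envelope calculation:

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight memory, ignoring scale/zero-point overhead."""
    return num_params * bits_per_weight / 8 / 1e9

params_7b = 7e9  # parameter count for a 7B-class model
bf16_gb = weight_memory_gb(params_7b, 16)  # 14.0 GB
int4_gb = weight_memory_gb(params_7b, 4)   # 3.5 GB
savings = 1 - int4_gb / bf16_gb            # 0.75
```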

Quantization Methods

Qwen supports two complementary quantization techniques:

GPTQ Quantization

GPTQ quantizes model weights to Int4 or Int8, reducing the model size while maintaining accuracy. It uses the AutoGPTQ library for post-training quantization.
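GPTQ's error-compensating weight updates are beyond a short snippet, but the storage format it produces — Int4 values plus a per-group floating-point scale — can be sketched with a simplified round-to-nearest version (the group size here is illustrative; real GPTQ models typically use 128):

```python
def quantize_int4_grouped(weights, group_size=4):
    """Round-to-nearest int4 with one scale per group.
    (GPTQ additionally applies error-compensating updates, omitted here.)"""
    groups = []
    for i in range(0, len(weights), group_size):
        block = weights[i:i + group_size]
        scale = max(abs(w) for w in block) / 7.0  # symmetric int4 range: [-7, 7]
        q = [max(-7, min(7, round(w / scale))) for w in block]
        groups.append((q, scale))
    return groups

def dequantize(groups):
    out = []
    for q, scale in groups:
        out.extend(v * scale for v in q)
    return out

w = [0.8, -0.3, 0.1, 0.5, -1.2, 0.9, 0.2, -0.4]
packed = quantize_int4_grouped(w)
approx = dequantize(packed)
```

Per-group scales are why Int4 models are slightly larger than a pure 4-bits-per-weight count would suggest, and why accuracy holds up: outliers in one group don't inflate the quantization step everywhere else.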

GPTQ Details

Learn how to use GPTQ quantization for Int4 and Int8 models

KV Cache Quantization

KV cache quantization compresses the key-value attention cache from FP16 to Int8, enabling larger batch sizes and longer sequences without running out of memory.
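The memory at stake grows linearly with batch size and sequence length. A rough estimate using the generic transformer cache formula (the layer count and hidden size below are illustrative for a 7B-class model, not Qwen-specific):

```python
def kv_cache_gb(batch, seq_len, layers, hidden_size, bytes_per_value):
    """Rough KV cache size: keys + values for every layer and token."""
    return 2 * batch * seq_len * layers * hidden_size * bytes_per_value / 1e9

# Illustrative 7B-class shape: 32 layers, hidden size 4096.
fp16_cache = kv_cache_gb(batch=8, seq_len=8192, layers=32,
                         hidden_size=4096, bytes_per_value=2)  # ~34.4 GB
int8_cache = kv_cache_gb(batch=8, seq_len=8192, layers=32,
                         hidden_size=4096, bytes_per_value=1)  # ~17.2 GB
```

Halving the cache means the same memory budget fits roughly twice the batch size or twice the context length.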

KV Cache Details

Learn how to enable KV cache quantization for memory-efficient inference

Available Quantized Models

Qwen provides pre-quantized models for immediate use:
| Model | Memory (2048 tokens) | Speed (tokens/s) |
|---|---|---|
| Qwen-1.8B-Chat-Int4 | 2.91 GB | 71.07 |
| Qwen-1.8B-Chat-Int8 | 3.48 GB | 55.56 |
| Qwen-1.8B-Chat (BF16) | 4.23 GB | 54.09 |
Performance measured on A100-SXM4-80G GPU with PyTorch 2.0.1, CUDA 11.8, and Flash-Attention 2.

Model Quality Comparison

Quantization maintains strong performance across benchmarks:
| Model | Quantization | MMLU | C-Eval | GSM8K | HumanEval |
|---|---|---|---|---|---|
| Qwen-7B-Chat | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat | Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat | Int4 | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat | BF16 | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat | Int8 | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat | Int4 | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat | BF16 | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat | Int8 | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat | Int4 | 73.4 | 80.1 | 75.3 | 61.6 |
Int8 quantization offers the best balance between memory savings and accuracy retention, typically losing about one point or less on most benchmarks.
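As a quick sanity check on that claim, the BF16-to-Int8 drops in the MMLU column of the table above can be computed directly:

```python
# MMLU scores from the table above (BF16 vs Int8).
mmlu = {
    "Qwen-7B-Chat":  {"BF16": 55.8, "Int8": 55.4},
    "Qwen-14B-Chat": {"BF16": 64.6, "Int8": 63.6},
    "Qwen-72B-Chat": {"BF16": 74.4, "Int8": 73.5},
}
drops = {name: s["BF16"] - s["Int8"] for name, s in mmlu.items()}
# Largest drop is the 14B model at ~1.0 points; the others stay below that.
max_drop = max(drops.values())
```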

Quick Start

Load a pre-quantized model:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()

response, history = model.chat(tokenizer, "你好", history=None)  # "你好" means "Hello"
print(response)

Combining Techniques

For maximum memory efficiency, combine GPTQ and KV cache quantization:
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False  # KV cache quantization incompatible with flash attention
).eval()
KV cache quantization and Flash Attention cannot be enabled simultaneously. Flash Attention is automatically disabled when KV cache quantization is enabled.

Next Steps

GPTQ Quantization

Detailed guide on using and creating GPTQ quantized models

KV Cache Quantization

Enable KV cache quantization for larger batch sizes

Performance Benchmarks

Complete performance analysis and optimization tips

Quickstart Guide

Get started with Qwen models
