Overview

GPTQ (from the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers") is a post-training weight quantization method that compresses model weights to 4-bit or 8-bit integers. Qwen uses AutoGPTQ for GPTQ quantization, achieving near-lossless compression with significant memory savings.

Benefits

Memory Reduction

Int4: ~4x smaller than BF16
Int8: ~2x smaller than BF16
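
These ratios follow directly from the bit widths. A back-of-envelope estimate for a 7B-parameter model (weights only; activations and KV cache add overhead on top):

```python
# Rough weight-memory estimate for a 7B-parameter model.
# Weights only; runtime overhead (activations, KV cache) is extra.
params = 7e9

for name, bits in (("BF16", 16), ("Int8", 8), ("Int4", 4)):
    gib = params * bits / 8 / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")
```

This prints roughly 13.0 GiB for BF16, 6.5 GiB for Int8, and 3.3 GiB for Int4, consistent with the 2x and 4x ratios above. Actual on-GPU figures are higher (see the benchmark table later in this page).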

Speed Improvement

Int4: Up to 40% faster
Int8: Similar or slightly slower

Quality Preservation

Less than 2% accuracy drop on most benchmarks

Easy to Use

Load pre-quantized models directly or quantize your own

Using Pre-Quantized Models

Installation

Install the required packages:
pip install auto-gptq optimum
AutoGPTQ depends on specific versions of PyTorch and CUDA. See version compatibility below.

Version Compatibility

# For torch==2.1
pip install torch==2.1.0
pip install "auto-gptq>=0.5.1"
pip install "transformers>=4.35.0"
pip install "optimum>=1.14.0"
pip install "peft>=0.6.1"
If you encounter installation issues with auto-gptq, check the official repository for pre-compiled wheels matching your environment.
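
To confirm what is actually installed in your environment, a small check can help (package names here match the pip packages above):

```python
# Print the installed versions of the packages listed above; reports
# anything missing instead of raising.
from importlib import metadata

def installed_version(pkg: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

for pkg in ("torch", "auto-gptq", "transformers", "optimum", "peft"):
    print(pkg, installed_version(pkg) or "NOT INSTALLED")
```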

Loading Quantized Models

Load and use a pre-quantized model:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    trust_remote_code=True
)

# Load Int4 model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()

# Use the model
response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
The API is identical to the full-precision model. Simply change the model name to the Int4 or Int8 variant.

Available Models

All Qwen chat models are available in Int4 and Int8 variants (Int8 models follow the same naming with an -Int8 suffix):

| Model | HuggingFace | ModelScope |
| --- | --- | --- |
| Qwen-1.8B-Chat-Int4 | 🤗 Link | 🤖 Link |
| Qwen-7B-Chat-Int4 | 🤗 Link | 🤖 Link |
| Qwen-14B-Chat-Int4 | 🤗 Link | 🤖 Link |
| Qwen-72B-Chat-Int4 | 🤗 Link | 🤖 Link |

Quantizing Your Own Models

Quantize a fine-tuned or custom Qwen model using the provided run_gptq.py script.

Prerequisites

  1. A fine-tuned model (or base model)
  2. Calibration data in JSON format
  3. GPU with sufficient memory
If you fine-tuned with LoRA, merge the adapter weights before quantization. Q-LoRA models are already quantized and do not need this step.
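
As a sketch, merging LoRA weights with peft might look like the following. The paths and function name are placeholders; `AutoPeftModelForCausalLM` is peft's convenience loader for saved adapters.

```python
# Sketch: fold LoRA adapter weights into the base model so the merged
# checkpoint can be fed to run_gptq.py. Paths are placeholders.

def merge_lora_adapter(adapter_dir: str, out_dir: str) -> None:
    from peft import AutoPeftModelForCausalLM  # requires `pip install peft`

    model = AutoPeftModelForCausalLM.from_pretrained(
        adapter_dir,
        device_map="auto",
        trust_remote_code=True,
    )
    merged = model.merge_and_unload()  # bake LoRA deltas into the base weights
    merged.save_pretrained(out_dir, safe_serialization=True)

# merge_lora_adapter("/path/to/lora-adapter", "/path/to/merged-model")
```

Remember to also copy the tokenizer files into the merged-model directory before quantizing.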

Calibration Data Format

Prepare calibration data in the same format as fine-tuning data:
[
  {
    "id": "sample_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "你好!有什么我可以帮助你的吗?"
      }
    ]
  },
  {
    "id": "sample_1",
    "conversations": [
      {
        "from": "user",
        "value": "What is machine learning?"
      },
      {
        "from": "assistant",
        "value": "Machine learning is a subset of artificial intelligence..."
      }
    ]
  }
]
You can reuse your fine-tuning data or create a representative sample of diverse prompts (100-1000 samples recommended).
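
Before running quantization, it can save time to sanity-check the file against this format. A minimal validator (the function name is ours, not part of run_gptq.py):

```python
# Minimal sanity check for calibration data in the conversation format
# shown above. Raises ValueError on the first malformed sample.
import json

def validate_calibration(samples: list) -> int:
    for i, sample in enumerate(samples):
        turns = sample.get("conversations")
        if not isinstance(turns, list) or not turns:
            raise ValueError(f"sample {i}: missing 'conversations' list")
        for turn in turns:
            if turn.get("from") not in ("user", "assistant"):
                raise ValueError(f"sample {i}: bad 'from' field")
            if not isinstance(turn.get("value"), str) or not turn["value"]:
                raise ValueError(f"sample {i}: empty 'value'")
    return len(samples)

# with open("/path/to/calibration_data.json") as f:
#     print(validate_calibration(json.load(f)), "samples OK")

demo = [{"id": "sample_0", "conversations": [
    {"from": "user", "value": "你好"},
    {"from": "assistant", "value": "你好!有什么我可以帮助你的吗?"},
]}]
print(validate_calibration(demo))  # → 1
```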

Running Quantization

Use the run_gptq.py script from the source repository:
python run_gptq.py \
    --model_name_or_path /path/to/your/model \
    --data_path /path/to/calibration_data.json \
    --out_path /path/to/output \
    --bits 4  # 4 for Int4, 8 for Int8

Script Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --model_name_or_path | Path to the model to quantize | Required |
| --data_path | Path to calibration data JSON file | Required |
| --out_path | Output directory for quantized model | Required |
| --bits | Quantization bits (4 or 8) | 4 |
| --group-size | Group size for quantization | 128 |
| --max_len | Maximum sequence length for calibration | 8192 |
Quantization requires GPU and may take several hours depending on model size and calibration data. For Qwen-7B with 1000 samples, expect ~2-3 hours on an A100 GPU.

Post-Quantization Steps

After quantization completes:
  1. Copy support files to the output directory:
    cp /path/to/source/*.py /path/to/output/
    cp /path/to/source/*.cu /path/to/output/
    cp /path/to/source/*.cpp /path/to/output/
    cp /path/to/source/generation_config.json /path/to/output/
    
  2. Update config.json by copying from the corresponding official quantized model:
    # For a 7B model quantized to Int4:
    wget https://huggingface.co/Qwen/Qwen-7B-Chat-Int4/raw/main/config.json \
         -O /path/to/output/config.json
    
  3. Rename the checkpoint:
    mv /path/to/output/gptq.safetensors /path/to/output/model.safetensors
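
The copy and rename steps above can be scripted with the standard library alone; a sketch (the config.json download stays manual, since the correct file depends on your model size and bit width):

```python
# Sketch of the post-quantization cleanup: copy support files from the
# source model directory, then rename the GPTQ checkpoint. config.json
# should still be fetched from the matching official quantized model.
import shutil
from pathlib import Path

def finalize_quantized_dir(source_dir: str, out_dir: str) -> None:
    src, dst = Path(source_dir), Path(out_dir)
    for pattern in ("*.py", "*.cu", "*.cpp", "generation_config.json"):
        for f in src.glob(pattern):
            shutil.copy(f, dst / f.name)
    ckpt = dst / "gptq.safetensors"
    if ckpt.exists():
        ckpt.rename(dst / "model.safetensors")
```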
    

Testing the Quantized Model

Load and test your quantized model:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/output",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/output",
    device_map="auto",
    trust_remote_code=True
).eval()

response, history = model.chat(tokenizer, "你好", history=None)
print(response)

Quantization Configuration

The run_gptq.py script uses the following GPTQ configuration:
quantize_config = BaseQuantizeConfig(
    bits=4,                    # 4 or 8
    group_size=128,            # Group size for quantization
    damp_percent=0.01,         # Dampening for numerical stability
    desc_act=False,            # Disable for faster inference
    static_groups=False,       # Dynamic group selection
    sym=True,                  # Symmetric quantization
    true_sequential=True,      # Sequential quantization
    model_file_base_name="model"
)
desc_act=False significantly speeds up inference with minimal perplexity increase. Keep this setting unless accuracy is critical.
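
For reference, the end-to-end flow inside such a script follows AutoGPTQ's standard API. A hedged sketch, with calibration handling simplified (run_gptq.py applies the chat template to the conversations before tokenizing, which is omitted here):

```python
# Sketch of a GPTQ pass with AutoGPTQ's standard API. Calibration text
# handling is simplified; the real script templates the conversations.

def gptq_quantize(model_dir: str, texts: list, out_dir: str, bits: int = 4) -> None:
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    config = BaseQuantizeConfig(
        bits=bits,
        group_size=128,
        damp_percent=0.01,
        desc_act=False,
        sym=True,
        true_sequential=True,
        model_file_base_name="model",
    )
    model = AutoGPTQForCausalLM.from_pretrained(
        model_dir, config, trust_remote_code=True
    )
    examples = [tokenizer(t, return_tensors="pt") for t in texts]
    model.quantize(examples)       # runs the GPTQ algorithm layer by layer
    model.save_quantized(out_dir, use_safetensors=True)
    tokenizer.save_pretrained(out_dir)
```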

Performance Impact

GPTQ quantization maintains excellent accuracy:

Benchmark Comparison (Qwen-7B-Chat)

| Precision | MMLU | C-Eval | GSM8K | HumanEval | Memory | Speed |
| --- | --- | --- | --- | --- | --- | --- |
| BF16 | 55.8 | 59.7 | 50.3 | 37.2 | 16.99GB | 40.93 tok/s |
| Int8 | 55.4 (-0.4) | 59.4 (-0.3) | 48.3 (-2.0) | 34.8 (-2.4) | 11.20GB | 37.47 tok/s |
| Int4 | 55.1 (-0.7) | 59.2 (-0.5) | 49.7 (-0.6) | 29.9 (-7.3) | 8.21GB | 50.09 tok/s |
Key Findings:
  • Int8 preserves 99% of BF16 quality with 34% memory reduction
  • Int4 achieves 52% memory reduction with minimal quality loss on most tasks
  • HumanEval (code generation) sees larger degradation with Int4
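
The memory-reduction percentages in the findings follow directly from the table:

```python
# Derive the memory-reduction figures quoted above from the benchmark
# table (GB values for Qwen-7B-Chat).
bf16, int8, int4 = 16.99, 11.20, 8.21

int8_saving = round((1 - int8 / bf16) * 100)  # → 34
int4_saving = round((1 - int4 / bf16) * 100)  # → 52
print(f"Int8 saves {int8_saving}%, Int4 saves {int4_saving}%")
```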

Speed Considerations

Known Issue: Models loaded via AutoModelForCausalLM.from_pretrained run ~20% slower than models loaded directly through AutoGPTQ. This is a known issue reported to the HuggingFace team.

Troubleshooting

Installation errors with auto-gptq

Solution: Check your auto-gptq version and ensure compatibility with your PyTorch version:
pip show auto-gptq torch
Refer to the version compatibility table above.

Out-of-memory errors during quantization

Solution: Reduce the calibration data size, shorten the maximum sequence length, or use a GPU with more memory:
# Reduce the maximum calibration sequence length
python run_gptq.py --max_len 4096 ...
Note that quantization requires more memory than inference.

Quantized model fails to load

Solution: Ensure you completed all post-quantization steps:
  1. Copied all .py, .cu, and .cpp files
  2. Updated config.json from the official quantized model
  3. Renamed gptq.safetensors to model.safetensors

Quantized model runs slower than expected

Solution: This is a known issue with models loaded via Transformers. The model should still be faster than BF16 in most cases. For optimal speed, consider using vLLM for deployment.

Best Practices

  1. Choose the right precision
     • Int8 for production systems requiring high accuracy
     • Int4 for memory-constrained environments or experimentation
  2. Use diverse calibration data
     Include samples representative of your use case (100-1000 samples)
  3. Validate after quantization
     Test the quantized model on your evaluation set before deployment
  4. Consider your use case
     Code generation tasks may see larger quality drops with Int4

Next Steps

KV Cache Quantization

Further reduce memory usage with KV cache quantization

Performance Benchmarks

Detailed performance analysis and optimization tips

Fine-tuning

Fine-tune Qwen models before quantization

Deployment

Deploy quantized models in production
