Overview
GPTQ is a post-training weight quantization method that compresses model weights to 4-bit or 8-bit integers. Qwen uses AutoGPTQ for GPTQ quantization, achieving near-lossless compression with significant memory savings.

Benefits
Memory Reduction
Int4: ~4x smaller than BF16
Int8: ~2x smaller than BF16
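As a sanity check on these ratios, here is a back-of-envelope calculation of weight memory for a 7B-parameter model (weights only; KV cache and activations add real-world overhead, which is why the benchmark figures further down are higher):

```python
# Back-of-envelope weight memory for a 7B-parameter model.
PARAMS = 7_000_000_000

def weight_gib(bits_per_param: float) -> float:
    """Approximate weight memory in GiB at the given precision."""
    return PARAMS * bits_per_param / 8 / 1024**3

print(f"BF16: {weight_gib(16):.1f} GiB")  # ~13.0 GiB
print(f"Int8: {weight_gib(8):.1f} GiB")   # ~6.5 GiB
print(f"Int4: {weight_gib(4):.1f} GiB")   # ~3.3 GiB
```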
Speed Improvement
Int4: Up to 40% faster
Int8: Similar or slightly slower
Quality Preservation
Less than 2% accuracy drop on most benchmarks
Easy to Use
Load pre-quantized models directly or quantize your own
Using Pre-Quantized Models
Installation
Install the required packages. Note that AutoGPTQ depends on specific versions of PyTorch and CUDA; see the version compatibility table below.
Version Compatibility
- PyTorch 2.1
- PyTorch 2.0
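Before debugging compatibility issues, it can help to confirm which versions are actually installed. A small sketch using the standard library (the inclusion of optimum in the list is an assumption, since AutoGPTQ is often used alongside it):

```python
# Print installed versions of the relevant packages; any that are
# missing are reported rather than raising an error.
from importlib.metadata import version, PackageNotFoundError

installed = {}
for pkg in ("torch", "auto-gptq", "optimum", "transformers"):
    try:
        installed[pkg] = version(pkg)
    except PackageNotFoundError:
        installed[pkg] = "not installed"
    print(f"{pkg}: {installed[pkg]}")
```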
Loading Quantized Models
Load and use a pre-quantized model:

Available Models
All Qwen chat models are available in Int4 and Int8 variants:

- Int4 Models
- Int8 Models
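For the loading step, a minimal sketch with Transformers (the model ID follows the official Int4 naming; trust_remote_code is needed because Qwen ships custom modeling code; transformers is imported inside the function so the sketch can be read standalone):

```python
def load_quantized(model_id: str = "Qwen/Qwen-7B-Chat-Int4"):
    """Load a pre-quantized Qwen checkpoint with Transformers.

    AutoGPTQ kernels are picked up automatically when the model
    config declares GPTQ quantization.
    """
    # Imported lazily so the sketch stays self-contained.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",       # place layers across available GPUs
        trust_remote_code=True,  # Qwen uses custom modeling code
    ).eval()
    return tokenizer, model

# Usage (downloads weights on first call):
# tokenizer, model = load_quantized()
# response, _ = model.chat(tokenizer, "Hello!", history=None)
```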
Quantizing Your Own Models
Quantize a fine-tuned or custom Qwen model using the provided run_gptq.py script.
Prerequisites
- A fine-tuned model (or base model)
- Calibration data in JSON format
- GPU with sufficient memory
Calibration Data Format
Prepare calibration data in the same format as fine-tuning data:

Running Quantization
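First prepare the calibration file. A sketch assuming the conversation schema used by Qwen's fine-tuning scripts (the field names are taken from that format; the sample content is purely illustrative):

```python
import json

# Illustrative calibration samples in Qwen's fine-tuning format
# (the "conversations" schema is assumed from the fine-tuning docs).
samples = [
    {
        "id": f"calib_{i}",
        "conversations": [
            {"from": "user", "value": q},
            {"from": "assistant", "value": a},
        ],
    }
    for i, (q, a) in enumerate([
        ("What is GPTQ?",
         "GPTQ is a post-training weight quantization method."),
        ("Name a benefit of Int4 quantization.",
         "It reduces weight memory by roughly 4x versus BF16."),
    ])
]

with open("calibration_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```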
Use the run_gptq.py script from the source repository:
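An invocation might look like the following (all paths are placeholders; the flags mirror the parameter table below):

```shell
# Hypothetical paths; flags correspond to the Script Parameters table.
python run_gptq.py \
  --model_name_or_path /path/to/finetuned-qwen \
  --data_path calibration_data.json \
  --out_path /path/to/quantized-output \
  --bits 4 \
  --group-size 128 \
  --max_len 8192
```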
Script Parameters
| Parameter | Description | Default |
|---|---|---|
--model_name_or_path | Path to the model to quantize | Required |
--data_path | Path to calibration data JSON file | Required |
--out_path | Output directory for quantized model | Required |
--bits | Quantization bits (4 or 8) | 4 |
--group-size | Group size for quantization | 128 |
--max_len | Maximum sequence length for calibration | 8192 |
Quantization requires GPU and may take several hours depending on model size and calibration data. For Qwen-7B with 1000 samples, expect ~2-3 hours on an A100 GPU.
Post-Quantization Steps
After quantization completes:

1. Copy support files to the output directory.
2. Update config.json by copying from the corresponding official quantized model.
3. Rename the checkpoint.
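The steps above can be sketched as shell commands (all paths are hypothetical; adjust them to your setup):

```shell
SRC=/path/to/original-qwen-model   # the model you quantized
REF=/path/to/official-int4-model   # e.g. a downloaded official Int4 checkpoint
OUT=/path/to/quantized-output      # run_gptq.py output directory

# 1. Copy support files (.py, .cu, .cpp) into the output directory
cp "$SRC"/*.py "$SRC"/*.cu "$SRC"/*.cpp "$OUT"/

# 2. Replace config.json with the official quantized model's version
cp "$REF"/config.json "$OUT"/config.json

# 3. Rename the checkpoint
mv "$OUT"/gptq.safetensors "$OUT"/model.safetensors
```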
Testing the Quantized Model
Load and test your quantized model:

Quantization Configuration
The run_gptq.py script uses the following GPTQ configuration:
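A typical configuration, using the field names of auto-gptq's BaseQuantizeConfig (the bits and group-size values match the script defaults above; the remaining values are common choices, not confirmed defaults of run_gptq.py):

```python
# Field names follow auto-gptq's BaseQuantizeConfig; values other than
# bits/group_size are assumptions, not confirmed script defaults.
gptq_config = {
    "bits": 4,             # --bits default
    "group_size": 128,     # --group-size default
    "damp_percent": 0.01,  # Hessian dampening factor
    "desc_act": False,     # activation-order quantization off (faster inference)
    "sym": True,           # symmetric quantization
    "true_sequential": True,
}
```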
Performance Impact
GPTQ quantization maintains excellent accuracy:

Benchmark Comparison (Qwen-7B-Chat)
| Precision | MMLU | C-Eval | GSM8K | HumanEval | Memory | Speed |
|---|---|---|---|---|---|---|
| BF16 | 55.8 | 59.7 | 50.3 | 37.2 | 16.99GB | 40.93 tok/s |
| Int8 | 55.4 (-0.4) | 59.4 (-0.3) | 48.3 (-2.0) | 34.8 (-2.4) | 11.20GB | 37.47 tok/s |
| Int4 | 55.1 (-0.7) | 59.2 (-0.5) | 49.7 (-0.6) | 29.9 (-7.3) | 8.21GB | 50.09 tok/s |
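A quick check of the memory reductions implied by the table's Memory column:

```python
bf16, int8, int4 = 16.99, 11.20, 8.21  # GB, from the table above

def reduction(x: float) -> float:
    """Percent memory saved relative to BF16."""
    return (1 - x / bf16) * 100

print(f"Int8: {reduction(int8):.0f}% smaller")  # 34% smaller
print(f"Int4: {reduction(int4):.0f}% smaller")  # 52% smaller
```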
Key Findings:
- Int8 preserves 99% of BF16 quality with 34% memory reduction
- Int4 achieves 52% memory reduction with minimal quality loss on most tasks
- HumanEval (code generation) sees larger degradation with Int4
Speed Considerations

As the benchmark table shows, Int4 decoding (50.09 tok/s) is noticeably faster than BF16 (40.93 tok/s), while Int8 (37.47 tok/s) can be slightly slower. For the best throughput, pair quantized models with an optimized inference engine such as vLLM.
Troubleshooting
ImportError: cannot import name 'AutoGPTQForCausalLM'

Solution: Check your auto-gptq version and ensure it is compatible with your installed PyTorch version. Refer to the version compatibility table above.

CUDA out of memory during quantization
Solution: Reduce the calibration data size or use a GPU with more memory. Keep in mind that quantization requires more memory than inference does.
Model outputs are gibberish after quantization
Solution: Ensure you completed all post-quantization steps:

- Copied all .py, .cu, and .cpp support files
- Updated config.json from the official quantized model
- Renamed gptq.safetensors to model.safetensors
Quantized model is slower than expected
Solution: This is a known issue with models loaded via Transformers. The model should still be faster than BF16 in most cases. For optimal speed, consider using vLLM for deployment.
Best Practices
Choose the right precision
- Int8 for production systems requiring high accuracy
- Int4 for memory-constrained environments or experimentation
Next Steps
- KV Cache Quantization: further reduce memory usage with KV cache quantization
- Performance Benchmarks: detailed performance analysis and optimization tips
- Fine-tuning: fine-tune Qwen models before quantization
- Deployment: deploy quantized models in production