Why quantization?
Quantization provides several benefits:
- Reduced memory usage - Fit larger models or longer contexts in available GPU memory
- Faster inference - Lower precision arithmetic can be faster on modern hardware
- Lower costs - Serve models on smaller/fewer GPUs
- Higher throughput - Process more requests per second with the same hardware
For the best quantization experience, we recommend LLM Compressor, a library for optimizing models for vLLM deployment that supports FP8, INT8, INT4, and other formats.
Supported quantization methods
vLLM supports the following quantization formats.
Popular methods
FP8
8-bit floating point (W8A8)
Best for: Ada/Hopper GPUs, AMD GPUs
Minimal accuracy loss with significant speedup
INT8
8-bit integer (W8A8)
Best for: Turing+ GPUs, CPUs
Good balance of compression and accuracy
INT4
4-bit integer (W4A16)
Best for: Maximizing compression
Highest compression ratio
All supported methods
| Method | Description | Key Use Case |
|---|---|---|
| AutoAWQ | Activation-aware Weight Quantization | Balanced INT4 quantization |
| GPTQ | GPT Quantization | INT4 weights, widely supported |
| FP8 | 8-bit floating point | Ada/Hopper GPU acceleration |
| INT8 | 8-bit integer | Broad hardware support |
| INT4 | 4-bit integer | Maximum compression |
| BitsAndBytes | Dynamic quantization | Easy to use, no calibration |
| GGUF | GPT-Generated Unified Format | llama.cpp compatibility |
| Quantized KV Cache | KV cache compression | Longer context windows |
| NVIDIA TensorRT-Model Optimizer | TensorRT optimizations | NVIDIA GPU optimization |
| AMD Quark | AMD-optimized quantization | AMD GPU optimization |
| Intel Neural Compressor | Intel optimization | Intel CPU/GPU optimization |
| TorchAO | PyTorch native quantization | Experimental PyTorch integration |
Hardware compatibility
The table below shows quantization method compatibility with different hardware:
| Method | Volta (SM 7.0) | Turing (SM 7.5) | Ampere (SM 8.0/8.6) | Ada (SM 8.9) | Hopper (SM 9.0) | AMD GPU | Intel GPU | x86 CPU |
|---|---|---|---|---|---|---|---|---|
| AWQ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| Marlin (GPTQ/AWQ/FP8/FP4) | ❌ | ✅* | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| BitsAndBytes | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| GGUF | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
*Turing does not support Marlin MXFP4.
For Google TPU quantization support, see the TPU-Inference documentation.
Quick start
FP8 quantization (recommended for Ada/Hopper)
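A minimal sketch of serving an FP8 checkpoint. The model name is illustrative (any FP8 checkpoint produced by LLM Compressor works), and the actual `LLM` call is shown in a comment because it needs a supported GPU:

```python
# Sketch: engine arguments for serving an FP8 checkpoint with vLLM.
# The checkpoint name below is illustrative.
engine_args = {
    "model": "neuralmagic/Meta-Llama-3-8B-Instruct-FP8",  # illustrative checkpoint
    "quantization": "fp8",
}
# On a machine with vLLM and an Ada/Hopper GPU:
#   from vllm import LLM
#   llm = LLM(**engine_args)
#   print(llm.generate(["Why quantize?"])[0].outputs[0].text)
print(sorted(engine_args))  # → ['model', 'quantization']
```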
GPTQ quantization
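GPTQ checkpoints store their quantization settings in the model config, so vLLM can usually detect the method automatically; passing `quantization` pins it explicitly. A sketch (checkpoint name illustrative, heavy call commented):

```python
# Sketch: serving a GPTQ checkpoint with vLLM.
engine_args = {
    "model": "TheBloke/Llama-2-7B-Chat-GPTQ",  # illustrative GPTQ checkpoint
    "quantization": "gptq",                    # optional: auto-detected if omitted
}
# from vllm import LLM
# llm = LLM(**engine_args)
print(engine_args["quantization"])  # → gptq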
AWQ quantization
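The AWQ path mirrors GPTQ. On Ampere and newer GPUs, vLLM may route AWQ weights through faster Marlin kernels automatically. A sketch (checkpoint name illustrative, heavy call commented):

```python
# Sketch: serving an AWQ checkpoint with vLLM.
engine_args = {
    "model": "TheBloke/Llama-2-7B-Chat-AWQ",  # illustrative AWQ checkpoint
    "quantization": "awq",
}
# from vllm import LLM
# llm = LLM(**engine_args)
print(engine_args["quantization"])  # → awq
```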
Creating quantized models
Using LLM Compressor
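A hedged sketch of LLM Compressor's one-shot flow: the recipe below requests dynamic FP8 weight-and-activation quantization, and the commented call applies it. The recipe schema and `oneshot` signature follow the llmcompressor documentation; treat exact names as assumptions.

```python
# Hedged sketch of an LLM Compressor one-shot recipe for dynamic FP8
# quantization of all Linear layers except the output head.
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: FP8_DYNAMIC
"""
# With llmcompressor installed:
#   from llmcompressor import oneshot
#   oneshot(model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
#           recipe=recipe,
#           output_dir="Meta-Llama-3-8B-Instruct-FP8")
print("scheme: FP8_DYNAMIC" in recipe)  # → True
```

The output directory is a normal Hugging Face checkpoint and can be passed to vLLM as the model name.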
LLM Compressor is the recommended tool for quantizing models for vLLM.
Using AutoGPTQ
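AutoGPTQ drives GPTQ quantization through a small config. A hedged sketch of the common knobs; the field names mirror auto_gptq's `BaseQuantizeConfig` and should be treated as assumptions, with the heavy calls commented out:

```python
# Hedged sketch of AutoGPTQ's main 4-bit knobs.
quantize_config = {
    "bits": 4,          # quantized weight width
    "group_size": 128,  # one scale per 128 weights; -1 means per-column
    "desc_act": False,  # activation-order quantization: slower, often more accurate
}
# With auto-gptq installed:
#   from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
#   model = AutoGPTQForCausalLM.from_pretrained(
#       "meta-llama/Llama-2-7b-hf", BaseQuantizeConfig(**quantize_config))
#   model.quantize(calibration_samples)  # calibration_samples: tokenized examples
#   model.save_quantized("llama-2-7b-gptq")
print(quantize_config["bits"])  # → 4
```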
Quantized KV cache
Quantize the KV cache to support longer context windows. Supported values:
- fp8 - 8-bit floating point
- fp8_e5m2 - FP8 with 5 exponent bits, 2 mantissa bits
- fp8_e4m3 - FP8 with 4 exponent bits, 3 mantissa bits
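A sketch of enabling the FP8 KV cache through engine arguments (the model name is illustrative and the `LLM` call is commented out since it needs a GPU), plus the maximum finite values that explain the e4m3/e5m2 trade-off:

```python
# Sketch: enabling FP8 KV cache quantization in vLLM.
engine_args = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
    "kv_cache_dtype": "fp8",                         # or "fp8_e5m2" / "fp8_e4m3"
}
# from vllm import LLM
# llm = LLM(**engine_args)

# The two FP8 variants trade range for precision:
max_e4m3 = 1.75 * 2**8    # largest finite e4m3 value: more mantissa, less range
max_e5m2 = 1.75 * 2**15   # largest finite e5m2 value: more range, less mantissa
print(max_e4m3, max_e5m2)  # → 448.0 57344.0
```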
Performance comparison
Typical memory and performance characteristics:
| Quantization | Memory vs FP16 | Speed vs FP16 | Accuracy vs FP16 |
|---|---|---|---|
| FP8 (W8A8) | ~50% | 1.5-2x faster | Minimal loss (<1%) |
| INT8 (W8A8) | ~50% | 1.3-1.5x faster | Minimal loss (~1%) |
| INT4 (W4A16) | ~25% | 1.1-1.3x faster | Small loss (2-5%) |
| GPTQ (4-bit) | ~25% | 1.2-1.4x faster | Small loss (2-5%) |
| AWQ (4-bit) | ~25% | 1.2-1.4x faster | Minimal loss (<2%) |
Actual performance varies based on model size, hardware, batch size, and sequence length. Always benchmark your specific use case.
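The memory column can be sanity-checked with back-of-envelope arithmetic. This counts weights only; the KV cache and activations add overhead on top:

```python
# Back-of-envelope check of the "Memory vs FP16" column for a 7B-parameter model.
params = 7e9
bytes_per_param = {"FP16": 2.0, "FP8/INT8": 1.0, "INT4": 0.5}
weight_gib = {name: params * b / 2**30 for name, b in bytes_per_param.items()}
for name, gib in weight_gib.items():
    print(f"{name}: {gib:.1f} GiB")
print(weight_gib["INT4"] / weight_gib["FP16"])  # → 0.25, i.e. ~25% of FP16
```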
Choosing a quantization method
For Ada/Hopper GPUs (RTX 4090, H100, etc.)
Recommended: FP8
- Native FP8 tensor cores provide excellent performance
- Minimal accuracy loss
- Easy to use with LLM Compressor
For Ampere GPUs (A100, RTX 3090, etc.)
Recommended: INT8
- Good balance of speed and accuracy
- Broad support across models
For AMD GPUs
Recommended: FP8
- AMD GPUs have good FP8 support
- Use AMD Quark for optimized quantization
For CPUs
Recommended: INT8
- Good CPU performance
- Use Intel Neural Compressor for Intel CPUs
For maximum compression
Recommended: INT4 or GPTQ/AWQ
- ~4x smaller than FP16
- Good for very large models
- Accept some accuracy tradeoff
Custom quantization plugins
vLLM supports registering custom quantization methods using the @register_quantization_config decorator.
Example: Custom quantization plugin
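Since the real plugin needs vLLM installed, here is a self-contained illustration of the registration mechanism itself: a decorator that files a config class under a method name, mirroring how @register_quantization_config registers a QuantizationConfig subclass with vLLM. The registry and class names here are invented for the example:

```python
# Self-contained illustration of the plugin-registration pattern
# (names are invented; vLLM's real decorator works the same way).
QUANTIZATION_REGISTRY: dict[str, type] = {}

def register_quantization_config(name: str):
    """Decorator factory: file the decorated class under `name`."""
    def wrap(cls: type) -> type:
        QUANTIZATION_REGISTRY[name] = cls
        return cls
    return wrap

@register_quantization_config("my_quant")
class MyQuantConfig:
    """Stand-in for a custom quantization config class."""
    def get_name(self) -> str:
        return "my_quant"

print("my_quant" in QUANTIZATION_REGISTRY)  # → True
```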
Best practices
Quantization workflow
- Evaluate baseline: Test the full-precision model first
- Choose method: Based on hardware and compression needs
- Quantize: Use LLM Compressor or model-specific tools
- Validate: Check accuracy on representative tasks
- Benchmark: Measure throughput and latency improvements
- Iterate: Adjust quantization parameters if needed
Accuracy preservation
- Use calibration data similar to your inference distribution
- FP8 and INT8 typically have <1% accuracy loss
- INT4 may require more careful tuning
- Consider per-channel vs per-tensor quantization
- Test on downstream tasks, not just perplexity
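The per-channel vs per-tensor point can be made concrete with a toy round trip: a row of small weights loses almost all precision when it must share one INT8 scale with a row of large weights.

```python
# Toy demonstration: per-tensor vs per-channel INT8 weight quantization.

def quant_dequant(values, scale):
    """Symmetric INT8 round trip: quantize to [-127, 127], then dequantize."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def max_abs_err(orig, approx):
    return max(abs(a - b) for a, b in zip(orig, approx))

rows = [[0.01, -0.02, 0.015],   # small-magnitude channel
        [5.0, -4.0, 3.0]]       # large-magnitude channel

# Per-tensor: a single scale derived from the global max.
g_scale = max(abs(v) for row in rows for v in row) / 127
per_tensor = [max_abs_err(r, quant_dequant(r, g_scale)) for r in rows]

# Per-channel: an independent scale per row.
per_channel = [max_abs_err(r, quant_dequant(r, max(abs(v) for v in r) / 127))
               for r in rows]

print(per_tensor[0] > per_channel[0])  # → True: the small row suffered
```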
Performance optimization
- Combine weight quantization with KV cache quantization
- Use appropriate batch sizes for your hardware
- Profile to identify bottlenecks
- Consider Marlin kernels for GPTQ/AWQ on supported GPUs
- Enable tensor parallelism for large models
Troubleshooting
Model accuracy is poor after quantization
- Try a higher precision method (INT8 instead of INT4)
- Use more calibration data
- Check if the model architecture is well-supported
- Try AWQ instead of GPTQ for better activation-aware quantization
Quantized model is slow
- Ensure you’re using the right hardware for the quantization method
- Check that optimized kernels (Marlin, etc.) are being used
- Verify batch size is appropriate
- Profile to identify bottlenecks
Out of memory with quantized model
- Enable KV cache quantization
- Reduce batch size or max sequence length
- Try higher compression (INT4 instead of INT8)
- Enable tensor parallelism to distribute across GPUs
Next steps
LLM Compressor
Quantize models for vLLM deployment
FP8 quantization
High-performance 8-bit floating point
GPTQ quantization
Popular 4-bit quantization method
Quantized KV cache
Extend context length with cache quantization