Quantization Methods
TensorRT-LLM supports the following quantization recipes:

FP8 Quantization
- FP8 Per Tensor
- FP8 Block Scaling
- FP8 Rowwise
- FP8 KV Cache
FP4 Quantization
- NVFP4 (NVIDIA FP4)
- MXFP4 (MX Format FP4)
- NVFP4 KV Cache
INT4 Quantization
- W4A16 AWQ
- W4A8 AWQ
- W4A16 GPTQ
- W4A8 GPTQ
INT8 Quantization
- W8A16 Weight-Only
- W8A8 SmoothQuant
- INT8 KV Cache
Quick Start
Running Pre-Quantized Models
The simplest way to use quantization is to load a pre-quantized model from the NVIDIA Model Optimizer collection. TensorRT-LLM can run such checkpoints directly, without any additional configuration: the quantization settings are detected automatically from the checkpoint.
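As a configuration sketch, loading a pre-quantized checkpoint through the LLM API might look like the following; the checkpoint id is illustrative, and a CUDA-capable machine with `tensorrt_llm` installed is assumed:

```python
from tensorrt_llm import LLM

# The quantization recipe (here FP8) is read from the checkpoint's
# metadata; no extra quantization flags are needed.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")  # illustrative model id

for output in llm.generate(["What is quantization?"]):
    print(output.outputs[0].text)
```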
FP8 KV Cache
You can enable FP8 KV cache manually, even for checkpoints that don't have it enabled by default.

NVFP4 KV Cache
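A configuration sketch for requesting a quantized KV cache explicitly; the `dtype` field of `KvCacheConfig` is assumed to accept these literals as in recent TensorRT-LLM releases, and the model id is illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# "nvfp4" requests the NVFP4 KV cache; "fp8" would request the
# FP8 (E4M3) variant instead.
llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",        # illustrative model id
    kv_cache_config=KvCacheConfig(dtype="nvfp4"),
)
```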
Offline Quantization with ModelOpt
If a pre-quantized model is not available on Hugging Face, you can quantize a model offline using NVIDIA Model Optimizer.

FP8 Quantization
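A minimal sketch of offline FP8 quantization with ModelOpt's `mtq.quantize` API; the model id and calibration text are placeholders, and a real deployment would use a proper calibration dataset:

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative model id
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tok = AutoTokenizer.from_pretrained(model_id)
calib_texts = ["TensorRT-LLM supports FP8 quantization."]  # use a real calib set

def forward_loop(m):
    # ModelOpt collects activation ranges (amax) during these forward
    # passes to compute the FP8 scales.
    for text in calib_texts:
        m(**tok(text, return_tensors="pt").to(m.device))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```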
NVFP4 KV Cache Quantization
Currently, TensorRT-LLM supports NVFP4 KV cache only in combination with FP8 weight/activation quantization, so
--quant fp8 is required.

Hardware Support Matrix
| GPU Architecture | NVFP4 | MXFP4 | FP8 (per tensor) | FP8 (block scaling) | FP8 (rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blackwell (sm120) | ✓ | ✓ | ✓ | - | - | ✓ | - | - | - | - | - |
| Blackwell (sm100/103) | ✓ | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Hopper | - | - | ✓ | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ | ✓ |
| Ada Lovelace | - | - | ✓ | - | - | ✓ | - | ✓ | ✓ | ✓ | ✓ |
| Ampere | - | - | - | - | - | ✓ | - | - | ✓ | - | ✓ |
FP8 blockwise scaling GEMM kernels for sm100/103 use the MXFP8 recipe (E4M3 act/weight and UE8M0 act/weight scale), which differs from SM90’s FP8 recipe (E4M3 act/weight and FP32 act/weight scale).
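The difference between the two scale formats can be illustrated numerically. This sketch compares only the scale computation, not full FP8 rounding; E4M3's largest finite value of 448 comes from the FP8 format, while the round-up direction for UE8M0 is an illustrative choice that keeps quantized values in range:

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp32_scale(amax: float) -> float:
    """SM90-style scale: an arbitrary FP32 value."""
    return amax / E4M3_MAX

def ue8m0_scale(amax: float) -> float:
    """MXFP8-style scale: an unsigned power of two (UE8M0), rounded up
    so that amax / scale still fits in E4M3."""
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

amax = 10.0
print(fp32_scale(amax))    # exact ratio amax / 448
print(ue8m0_scale(amax))   # nearest power of two at or above that ratio
```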
Model Support Matrix
Quantization support varies by model architecture. Here are some examples:
- LLaMA Models
- Qwen Models
- Other Models
| Model | NVFP4 | FP8 | FP8 KV Cache | W4A16 AWQ | W4A16 GPTQ |
|---|---|---|---|---|---|
| LLaMA | ✓ | ✓ | ✓ | - | ✓ |
| LLaMA-v2 | ✓ | ✓ | ✓ | ✓ | ✓ |
| LLaMA 3 | - | - | ✓ | ✓ | - |
| LLaMA 4 | ✓ | ✓ | ✓ | - | - |
For multimodal models (BLIP2, LLaVA, VILA, Nougat), the vision component uses FP16 by default. The language component determines which quantization methods are supported.
Quantization Techniques
AWQ (Activation-aware Weight Quantization)
AWQ quantizes weights to 4 bits while using activation statistics to identify and protect the most salient weight channels:
- W4A16 AWQ: 4-bit weights, 16-bit activations
- W4A8 AWQ: 4-bit weights, 8-bit activations
- Per-group quantization for better accuracy
- Minimal accuracy loss compared to FP16
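For illustration, here is a round-to-nearest per-group 4-bit baseline in NumPy; the activation-aware channel rescaling that distinguishes AWQ from plain round-to-nearest is omitted for brevity:

```python
import numpy as np

def quantize_w4_per_group(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit per-group weight quantization (round-to-nearest
    baseline; AWQ additionally rescales salient channels first)."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_w4_per_group(w, group_size=64)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())  # error is bounded by half a scale step
```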
SmoothQuant
SmoothQuant balances quantization difficulty between weights and activations:
- W8A8 quantization (8-bit weights and activations)
- Per-channel or per-tensor scaling
- Dynamic per-token quantization option
- Better accuracy than naive INT8 quantization
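The core scale-migration idea can be sketched numerically; alpha = 0.5 is the commonly used default, and the per-channel maxima below are made-up numbers:

```python
import numpy as np

def smooth_scales(x_absmax, w_absmax, alpha=0.5):
    """Per-channel smoothing factors s_j = max|X_j|^a / max|W_j|^(1-a).
    Activations are divided by s and weights multiplied by s, so
    Y = (X / s) @ (s * W) is mathematically unchanged, but activation
    outliers shrink and become easier to quantize."""
    return x_absmax ** alpha / w_absmax ** (1.0 - alpha)

x_absmax = np.array([60.0, 1.0, 8.0])   # per-channel activation outliers
w_absmax = np.array([0.5, 0.4, 0.6])
s = smooth_scales(x_absmax, w_absmax, alpha=0.5)
print(x_absmax / s)   # smoothed activation ranges, far more uniform
```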
GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ is a one-shot weight quantization method based on approximate second-order information:
- W4A16 GPTQ: 4-bit weights, 16-bit activations
- W4A8 GPTQ: 4-bit weights, 8-bit activations
- Per-group quantization
- Layer-wise quantization for optimal accuracy
Python API Reference
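A configuration sketch of quantizing at load time via `QuantConfig`, instead of using a pre-quantized checkpoint; the field names assume a recent TensorRT-LLM release, and the model id is illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Apply FP8 to weights/activations and to the KV cache at load time.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",        # illustrative model id
    quant_config=QuantConfig(
        quant_algo=QuantAlgo.FP8,
        kv_cache_quant_algo=QuantAlgo.FP8,
    ),
)
```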
Best Practices
Choosing the Right Quantization Method
- For maximum throughput: FP8 quantization on Hopper/Blackwell GPUs
- For memory-constrained scenarios: W4A16 AWQ or GPTQ
- For balanced performance: W4A8 AWQ with FP8 KV cache
- For minimal accuracy loss: FP8 per-tensor or SmoothQuant
KV Cache Quantization
- FP8 KV cache reduces memory usage by 2x vs FP16
- NVFP4 KV cache reduces memory usage by 4x vs FP16
- Minimal impact on generation quality
- Essential for long-context applications
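To see why this matters for long contexts, here is a back-of-the-envelope KV-cache size calculation; the model shapes are illustrative (roughly Llama-3.1-8B-like), and the NVFP4 figure ignores the small per-block scale overhead:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # The leading 2 accounts for both the K and the V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative shapes: 32 layers, 8 KV heads, head_dim 128, 32K context
args = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768, batch=1)
for name, nbytes in [("FP16", 2), ("FP8", 1), ("NVFP4", 0.5)]:
    gib = kv_cache_bytes(**args, bytes_per_elem=nbytes) / 2**30
    print(f"{name}: {gib:.1f} GiB")  # FP16: 4.0, FP8: 2.0, NVFP4: 1.0 GiB
```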
Performance Tuning
- FP8 quantization provides best performance on Hopper+ GPUs
- Use pre-quantized models when available (faster loading)
- Enable KV cache quantization for long sequences
- Test accuracy on your specific tasks before deployment
Additional Resources
Pre-quantized Models
Browse NVIDIA’s collection of pre-quantized models
Model Optimizer
Quantize your own models with NVIDIA Model Optimizer
ModelOpt Support Matrix
Check which models and quantization methods are supported