Quantization reduces the memory footprint and computational cost of LLM inference by converting a model's weights and activations from high-precision floating-point formats (such as BF16) to lower-precision data types such as INT8, FP8, or FP4. TensorRT-LLM offers a variety of quantization recipes, supporting both weight-only and weight-activation quantization methods.
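To make the memory savings concrete, here is a rough back-of-the-envelope sketch of the weight footprint at each precision (the 8B parameter count is illustrative, not tied to any particular checkpoint):

```python
# Approximate weight-memory footprint at different precisions
params = 8_000_000_000  # e.g., an 8B-parameter model (illustrative)
bytes_per_param = {"BF16": 2.0, "FP8/INT8": 1.0, "FP4/INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB of weights")
```

Halving the bits roughly halves the weight memory, which is why 4-bit recipes are attractive for memory-constrained deployments.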

Quantization Methods

TensorRT-LLM supports the following quantization recipes:

FP8 Quantization

  • FP8 Per Tensor
  • FP8 Block Scaling
  • FP8 Rowwise
  • FP8 KV Cache

FP4 Quantization

  • NVFP4 (NVIDIA FP4)
  • MXFP4 (MX Format FP4)
  • NVFP4 KV Cache

INT4 Quantization

  • W4A16 AWQ
  • W4A8 AWQ
  • W4A16 GPTQ
  • W4A8 GPTQ

INT8 Quantization

  • W8A16 Weight-Only
  • W8A8 SmoothQuant
  • INT8 KV Cache

Quick Start

Running Pre-Quantized Models

The simplest way to use quantization is to load pre-quantized models from the NVIDIA Model Optimizer collection:
from tensorrt_llm import LLM

# Load an FP8-quantized model directly
llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
outputs = llm.generate("Hello, my name is")

TensorRT-LLM can directly run pre-quantized models without any additional configuration. The quantization settings are automatically detected from the checkpoint.

FP8 KV Cache

You can enable FP8 KV cache manually, even for checkpoints that don’t have it enabled by default:
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model='/path/to/model',
    kv_cache_config=KvCacheConfig(dtype='fp8')
)
outputs = llm.generate("Hello, my name is")

NVFP4 KV Cache

NVFP4 KV cache requires offline quantization with ModelOpt. See the Offline Quantization section below.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model='/path/to/model',
    kv_cache_config=KvCacheConfig(dtype='nvfp4')
)
outputs = llm.generate("Hello, my name is")

Offline Quantization with ModelOpt

If a pre-quantized model is not available on HuggingFace, you can quantize it offline using NVIDIA Model Optimizer.

FP8 Quantization

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8

NVFP4 KV Cache Quantization

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --kv_cache_quant nvfp4

Currently, TensorRT-LLM supports NVFP4 KV cache only in combination with FP8 weight/activation quantization, so --quant fp8 is required.

Hardware Support Matrix

Quantization support varies by GPU architecture across NVFP4, MXFP4, the FP8 recipes (per tensor, block scaling, rowwise), FP8 and NVFP4 KV cache, and the AWQ/GPTQ INT4 methods (W4A8/W4A16). In general: NVFP4, MXFP4, and NVFP4 KV cache require Blackwell (sm100/103; sm120 supports a smaller subset); the FP8 recipes require Hopper or newer, with FP8 per tensor also available on Ada Lovelace; and the W4A16 weight-only AWQ/GPTQ variants extend back to Ampere.
FP8 blockwise scaling GEMM kernels for sm100/103 use the MXFP8 recipe (E4M3 act/weight and UE8M0 act/weight scale), which differs from SM90’s FP8 recipe (E4M3 act/weight and FP32 act/weight scale).
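UE8M0 stores only an unsigned power-of-two exponent, so block scales on sm100/103 must themselves be powers of two. A minimal sketch of how such a block scale might be derived (the function name and the round-up choice are illustrative assumptions; 448 is the E4M3 maximum):

```python
import math

def ue8m0_block_scale(block_absmax: float, fp8_max: float = 448.0) -> float:
    """Pick a power-of-two scale so block values fit in E4M3 range.

    UE8M0 can only represent powers of two, so the ideal scale
    (absmax / fp8_max) is rounded *up* to the next power of two
    to avoid overflow after scaling.
    """
    if block_absmax == 0.0:
        return 1.0
    ideal = block_absmax / fp8_max
    return 2.0 ** math.ceil(math.log2(ideal))

# A block whose largest magnitude is 100.0 gets scale 0.25,
# and 100.0 / 0.25 = 400.0 fits within the E4M3 max of 448.
```

An FP32 scale (the SM90 recipe) can represent the ideal scale exactly; the power-of-two restriction is the price of the more compact UE8M0 encoding.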

Model Support Matrix

Quantization support varies by model architecture. Within the LLaMA family, for example, support for NVFP4, FP8, FP8 KV cache, W4A16 AWQ, and W4A16 GPTQ differs across generations (LLaMA, LLaMA-v2, LLaMA 3, LLaMA 4).
For multimodal models (BLIP2, LLaVA, VILA, Nougat), the vision component uses FP16 by default. The language component determines which quantization methods are supported.

Quantization Techniques

AWQ (Activation-aware Weight Quantization)

AWQ quantizes weights to 4 bits while protecting the weight channels that matter most to the activations:
  • W4A16 AWQ: 4-bit weights, 16-bit activations
  • W4A8 AWQ: 4-bit weights, 8-bit activations
  • Per-group quantization for better accuracy
  • Minimal accuracy loss compared to FP16
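The per-group idea can be sketched in plain Python (a simplified symmetric scheme; real AWQ additionally rescales salient channels before quantizing, which this sketch omits):

```python
def quantize_per_group(weights, group_size=128, bits=4):
    """Symmetric per-group quantization: each group of weights
    shares one scale, so outliers only affect their own group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    qvals, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        qvals.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return qvals, scales

def dequantize_per_group(qvals, scales, group_size=128):
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]
```

With a group size of 128 (a common choice), each group stores 128 4-bit integers plus one higher-precision scale, which is why per-group scaling recovers accuracy that a single per-tensor scale would lose.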

SmoothQuant

SmoothQuant balances quantization difficulty between weights and activations:
  • W8A8 quantization (8-bit weights and activations)
  • Per-channel or per-tensor scaling
  • Dynamic per-token quantization option
  • Better accuracy than naive INT8 quantization
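The core smoothing step can be illustrated in a few lines of plain Python (alpha = 0.5 is the common default; the toy matrices are made-up values): per-channel factors divide down activation outliers while the weights absorb the difference, leaving the layer output mathematically unchanged.

```python
def smooth_scales(act_absmax, wt_absmax, alpha=0.5):
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha): channels with large
    # activations get divided down; the weights absorb the scale.
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, wt_absmax)]

def matmul(X, W):
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

# Toy layer: X has an outlier channel (column 0); W is laid out [in, out].
X = [[20.0, 0.1], [-40.0, 0.3]]
W = [[0.5, -0.2], [1.5, 0.8]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(2)]
wt_absmax = [max(abs(w) for w in W[j]) for j in range(2)]
s = smooth_scales(act_absmax, wt_absmax)

X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]  # easier to quantize
W_s = [[s[j] * w for w in W[j]] for j in range(2)]          # absorbs the scale
# matmul(X_s, W_s) matches matmul(X, W) up to floating-point error
```

After smoothing, both X_s and W_s have moderate dynamic range, which is what makes 8-bit quantization of both sides viable.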

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is a one-shot weight quantization method:
  • W4A16 GPTQ: 4-bit weights, 16-bit activations
  • W4A8 GPTQ: 4-bit weights, 8-bit activations
  • Per-group quantization
  • Layer-wise quantization for optimal accuracy
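A heavily simplified sketch of GPTQ's error-compensation idea (real GPTQ distributes each weight's quantization error across the remaining weights via the inverse Hessian of the layer inputs; here the residual is simply carried to the next weight):

```python
def quantize_with_error_feedback(w_row, scale, qmax=7):
    """Quantize weights one at a time, folding each weight's
    rounding error into the next weight before it is quantized."""
    qvals, carry = [], 0.0
    for w in w_row:
        v = w + carry
        q = max(-qmax, min(qmax, round(v / scale)))
        qvals.append(q)
        carry = v - q * scale  # residual error passed forward
    return qvals
```

Compensating rather than discarding the rounding error keeps the cumulative dequantized output close to the original, which is the intuition behind GPTQ's one-shot accuracy.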

Python API Reference

from tensorrt_llm.quantization.mode import QuantAlgo, QuantMode

# Available quantization algorithms
quant_algos = [
    QuantAlgo.FP8,                    # FP8 per-tensor
    QuantAlgo.FP8_PER_CHANNEL_PER_TOKEN,  # FP8 rowwise
    QuantAlgo.FP8_BLOCK_SCALES,       # FP8 block scaling
    QuantAlgo.NVFP4,                  # NVFP4
    QuantAlgo.W4A16_AWQ,              # AWQ 4-bit weights, 16-bit acts
    QuantAlgo.W4A8_AWQ,               # AWQ 4-bit weights, 8-bit acts
    QuantAlgo.W4A16_GPTQ,             # GPTQ 4-bit weights, 16-bit acts
    QuantAlgo.W8A8_SQ_PER_CHANNEL,    # SmoothQuant per-channel
]

# Create QuantMode from algorithm
quant_mode = QuantMode.from_quant_algo(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8
)

Best Practices

Choosing a Method

  • For maximum throughput: FP8 quantization on Hopper/Blackwell GPUs
  • For memory-constrained scenarios: W4A16 AWQ or GPTQ
  • For balanced performance: W4A8 AWQ with FP8 KV cache
  • For minimal accuracy loss: FP8 per-tensor or SmoothQuant

KV Cache Quantization

  • FP8 KV cache reduces KV cache memory usage by 2x vs FP16; NVFP4 KV cache by 4x
  • Minimal impact on generation quality
  • Essential for long-context applications

General Recommendations

  • Use pre-quantized models when available (faster loading)
  • Enable KV cache quantization for long sequences
  • Test accuracy on your specific tasks before deployment

Additional Resources

Pre-quantized Models

Browse NVIDIA’s collection of pre-quantized models

Model Optimizer

Quantize your own models with NVIDIA Model Optimizer

ModelOpt Support Matrix

Check which models and quantization methods are supported
