Quantization trades model precision for a smaller memory footprint, allowing large models to run on a wider range of devices and improving throughput.

Why quantization?

Quantization provides several benefits:
  • Reduced memory usage - Fit larger models or longer contexts in available GPU memory
  • Faster inference - Lower precision arithmetic can be faster on modern hardware
  • Lower costs - Serve models on smaller/fewer GPUs
  • Higher throughput - Process more requests per second with the same hardware
For the best quantization experience, we recommend LLM Compressor, a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other formats.

Supported quantization methods

vLLM supports the following quantization formats:

FP8

8-bit floating point (W8A8). Best for: Ada/Hopper GPUs, AMD GPUs. Minimal accuracy loss with significant speedup.

INT8

8-bit integer (W8A8). Best for: Turing+ GPUs, CPUs. Good balance of compression and accuracy.

INT4

4-bit integer (W4A16). Best for: maximizing compression. Highest compression ratio.

All supported methods

| Method | Description | Key Use Case |
| --- | --- | --- |
| AutoAWQ | Activation-aware Weight Quantization | Balanced INT4 quantization |
| GPTQ | GPT Quantization | INT4 weights, widely supported |
| FP8 | 8-bit floating point | Ada/Hopper GPU acceleration |
| INT8 | 8-bit integer | Broad hardware support |
| INT4 | 4-bit integer | Maximum compression |
| BitsAndBytes | Dynamic quantization | Easy to use, no calibration |
| GGUF | GPT-Generated Unified Format | llama.cpp compatibility |
| Quantized KV Cache | KV cache compression | Longer context windows |
| NVIDIA TensorRT-Model Optimizer | TensorRT optimizations | NVIDIA GPU optimization |
| AMD Quark | AMD-optimized quantization | AMD GPU optimization |
| Intel Neural Compressor | Intel optimization | Intel CPU/GPU optimization |
| TorchAO | PyTorch native quantization | Experimental PyTorch integration |

Hardware compatibility

The table below shows quantization method compatibility with different hardware:
Method | Volta (SM 7.0) | Turing (SM 7.5) | Ampere (SM 8.0/8.6) | Ada (SM 8.9) | Hopper (SM 9.0) | AMD GPU | Intel GPU | x86 CPU
AWQ
GPTQ
Marlin (GPTQ/AWQ/FP8/FP4)✅*
INT8 (W8A8)
FP8 (W8A8)
BitsAndBytes
DeepSpeedFP
GGUF
*Turing does not support Marlin MXFP4.

For Google TPU quantization support, see the TPU-Inference documentation.

Quick start

from vllm import LLM

# Load FP8 quantized model
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8")
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)

GPTQ quantization

from vllm import LLM

# Load GPTQ quantized model
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ")
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)

AWQ quantization

from vllm import LLM

# Load AWQ quantized model
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)

Creating quantized models

Using LLM Compressor

LLM Compressor is the recommended tool for quantizing models for vLLM:
pip install llmcompressor
from llmcompressor.transformers import oneshot
from transformers import AutoTokenizer

# Define the model
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure quantization
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: "float"
                        strategy: "tensor"
                    input_activations:
                        num_bits: 8
                        type: "float"
                        strategy: "tensor"
"""

# Apply quantization
oneshot(
    model=MODEL_ID,
    recipe=recipe,
    output_dir="./Meta-Llama-3-8B-Instruct-FP8",
    tokenizer=tokenizer,
)

Using AutoGPTQ

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare calibration examples: a list of tokenized samples
# representative of your inference workload
calibration_dataset = [
    tokenizer(
        "vLLM is a high-throughput and memory-efficient inference engine.",
        return_tensors="pt",
    )
]

# Quantize
model.quantize(calibration_dataset)

# Save
model.save_quantized("./Llama-2-7b-GPTQ")
tokenizer.save_pretrained("./Llama-2-7b-GPTQ")

Quantized KV cache

Quantize the KV cache to support longer context windows:
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    kv_cache_dtype="fp8",
)
Supported KV cache dtypes:
  • fp8 - 8-bit floating point (typically an alias for fp8_e4m3)
  • fp8_e5m2 - FP8 with 5 exponent bits, 2 mantissa bits
  • fp8_e4m3 - FP8 with 4 exponent bits, 3 mantissa bits
See Quantized KV Cache for more details.
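The two explicit variants trade dynamic range for precision: e5m2 covers a much wider range, while e4m3 keeps more mantissa bits. A quick sketch of their maximum finite values, assuming the common OCP definitions (e4m3fn and e5m2):

```python
# Max finite value of a float format: (largest mantissa) * 2**(max_exponent - bias)

# e5m2 uses IEEE-style semantics: bias 15, top exponent (11111) reserved for inf/NaN,
# so the largest normal exponent field is 11110 = 30
e5m2_max = (2 - 2**-2) * 2 ** (30 - 15)

# e4m3 (the "fn" variant) has bias 7 and reuses the top exponent (1111) for finite
# values; only mantissa 111 with that exponent is NaN, so the largest mantissa is 110
e4m3_max = (1 + 6 / 8) * 2 ** (15 - 7)

print(e5m2_max)  # 57344.0
print(e4m3_max)  # 448.0
```

This arithmetic is only illustrative of the range/precision tradeoff; which variant `fp8` maps to on your platform is determined by vLLM, not by this sketch.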

Performance comparison

Typical memory and performance characteristics:
| Quantization | Memory vs FP16 | Speed vs FP16 | Accuracy vs FP16 |
| --- | --- | --- | --- |
| FP8 (W8A8) | ~50% | 1.5-2x faster | Minimal loss (<1%) |
| INT8 (W8A8) | ~50% | 1.3-1.5x faster | Minimal loss (~1%) |
| INT4 (W4A16) | ~25% | 1.1-1.3x faster | Small loss (2-5%) |
| GPTQ (4-bit) | ~25% | 1.2-1.4x faster | Small loss (2-5%) |
| AWQ (4-bit) | ~25% | 1.2-1.4x faster | Minimal loss (<2%) |
Actual performance varies based on model size, hardware, batch size, and sequence length. Always benchmark your specific use case.
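As a rough sanity check for the memory column: weight memory scales linearly with bits per parameter. A toy estimate that ignores activations, the KV cache, and quantization metadata such as scales and zero-points (the 8B parameter count is illustrative):

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB for a given precision."""
    return num_params * bits_per_param / 8 / 1024**3

params = 8e9  # e.g. an 8B-parameter model

fp16 = weight_memory_gib(params, 16)
int8 = weight_memory_gib(params, 8)
int4 = weight_memory_gib(params, 4)

print(f"FP16: {fp16:.1f} GiB, INT8: {int8:.1f} GiB, INT4: {int4:.1f} GiB")
# INT8 is ~50% of FP16 and INT4 is ~25%, matching the table above
```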

Choosing a quantization method

For NVIDIA Ada/Hopper GPUs - Recommended: FP8
  • Native FP8 tensor cores provide excellent performance
  • Minimal accuracy loss
  • Easy to use with LLM Compressor
  • Alternative: INT4/INT8 for maximum compression

For older NVIDIA GPUs (Turing/Ampere) - Recommended: INT8
  • Good balance of speed and accuracy
  • Broad support across models
  • Alternative: AWQ/GPTQ for higher compression

For AMD GPUs - Recommended: FP8
  • AMD GPUs have good FP8 support
  • Use AMD Quark for optimized quantization
  • Alternative: GGUF for compatibility

For CPUs - Recommended: INT8
  • Good CPU performance
  • Use Intel Neural Compressor for Intel CPUs
  • Alternative: GPTQ/AWQ for memory-constrained systems

For memory-constrained deployments - Recommended: INT4 or GPTQ/AWQ
  • ~4x smaller than FP16
  • Good for very large models
  • Accept some accuracy tradeoff
  • Combine with quantized KV cache for even longer contexts
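The guidance above can be condensed into a small lookup helper. This is a toy sketch that mirrors the recommendations in this section; it is not a vLLM API, and the profile names are made up for illustration:

```python
# Hypothetical mapping from deployment profile to a suggested starting point,
# following the hardware-specific recommendations above
RECOMMENDATIONS = {
    "nvidia_ada_hopper": ("fp8", "INT4/INT8 for maximum compression"),
    "nvidia_older": ("int8", "AWQ/GPTQ for higher compression"),
    "amd_gpu": ("fp8", "GGUF for compatibility"),
    "cpu": ("int8", "GPTQ/AWQ for memory-constrained systems"),
    "memory_constrained": ("int4", "combining with quantized KV cache"),
}

def suggest_quantization(profile: str) -> str:
    method, alternative = RECOMMENDATIONS[profile]
    return f"Start with {method.upper()}; consider {alternative}."

print(suggest_quantization("nvidia_ada_hopper"))
# Start with FP8; consider INT4/INT8 for maximum compression.
```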

Custom quantization plugins

vLLM supports registering custom quantization methods using the @register_quantization_config decorator.
import torch
from vllm.model_executor.layers.quantization import (
    register_quantization_config,
)
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig,
)

@register_quantization_config("my_quant")
class MyQuantConfig(QuantizationConfig):
    def get_name(self) -> str:
        return "my_quant"

    def get_supported_act_dtypes(self) -> list:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        return -1  # No GPU restriction

    @staticmethod
    def get_config_filenames() -> list[str]:
        return ["my_quant_config.json"]

    @classmethod
    def from_config(cls, config: dict) -> "MyQuantConfig":
        return cls()

    def get_quant_method(self, layer, prefix):
        # Return quantization method for layer
        ...
Use your custom quantization:
import my_quant_plugin
from vllm import LLM

llm = LLM(model="your-model", quantization="my_quant")
See the Plugin System documentation for more details.
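For vLLM to discover the plugin at startup, it typically needs to be packaged and exposed through an entry-point group. The package name, module path, and function below are hypothetical; check the Plugin System documentation for the exact group name and registration function expected by your vLLM version:

```toml
# pyproject.toml of the hypothetical my_quant_plugin package
[project]
name = "my-quant-plugin"
version = "0.1.0"

[project.entry-points."vllm.general_plugins"]
my_quant = "my_quant_plugin:register"
```

Here `my_quant_plugin.register` would be a function that applies the `@register_quantization_config("my_quant")` decorator shown above when the plugin loads.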

Best practices

  1. Evaluate baseline: Test the full-precision model first
  2. Choose method: Based on hardware and compression needs
  3. Quantize: Use LLM Compressor or model-specific tools
  4. Validate: Check accuracy on representative tasks
  5. Benchmark: Measure throughput and latency improvements
  6. Iterate: Adjust quantization parameters if needed
Accuracy tips:
  • Use calibration data similar to your inference distribution
  • FP8 and INT8 typically have <1% accuracy loss
  • INT4 may require more careful tuning
  • Consider per-channel vs per-tensor quantization
  • Test on downstream tasks, not just perplexity
Performance tips:
  • Combine weight quantization with KV cache quantization
  • Use appropriate batch sizes for your hardware
  • Profile to identify bottlenecks
  • Consider Marlin kernels for GPTQ/AWQ on supported GPUs
  • Enable tensor parallelism for large models
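To illustrate the per-channel vs per-tensor point: per-channel scales adapt to each output channel's range, which reduces error when channel magnitudes differ widely. A minimal pure-Python sketch of symmetric INT8 weight round-tripping on toy data (not vLLM internals):

```python
def quantize_dequantize(row, scale):
    """Symmetric INT8 round-trip of one weight row with a given scale."""
    return [max(-127, min(127, round(w / scale))) * scale for w in row]

def max_error(rows, scales):
    """Largest absolute reconstruction error across all weights."""
    return max(
        abs(w - q)
        for row, s in zip(rows, scales)
        for w, q in zip(row, quantize_dequantize(row, s))
    )

# Two weight "channels" (rows) with very different magnitudes
weights = [
    [0.01, -0.02, 0.015],  # small-magnitude channel
    [5.0, -4.0, 3.0],      # large-magnitude channel
]

# Per-tensor: one scale shared by the whole matrix
tensor_scale = max(abs(w) for row in weights for w in row) / 127
per_tensor_err = max_error(weights, [tensor_scale] * len(weights))

# Per-channel: one scale per row
channel_scales = [max(abs(w) for w in row) / 127 for row in weights]
per_channel_err = max_error(weights, channel_scales)

print(f"per-tensor: {per_tensor_err:.4f}, per-channel: {per_channel_err:.4f}")
assert per_channel_err < per_tensor_err  # per-channel adapts to each row's range
```

The small-magnitude channel is nearly wiped out by the shared per-tensor scale, while per-channel scales preserve it; this is the intuition behind the tuning advice above.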

Troubleshooting

Poor accuracy after quantization:
  • Try a higher precision method (INT8 instead of INT4)
  • Use more calibration data
  • Check if the model architecture is well-supported
  • Try AWQ instead of GPTQ for better activation-aware quantization
Slower inference than expected:
  • Ensure you're using the right hardware for the quantization method
  • Check that optimized kernels (Marlin, etc.) are being used
  • Verify batch size is appropriate
  • Profile to identify bottlenecks
Out-of-memory errors:
  • Enable KV cache quantization
  • Reduce batch size or max sequence length
  • Try higher compression (INT4 instead of INT8)
  • Enable tensor parallelism to distribute across GPUs

Next steps

LLM Compressor

Quantize models for vLLM deployment

FP8 quantization

High-performance 8-bit floating point

GPTQ quantization

Popular 4-bit quantization method

Quantized KV cache

Extend context length with cache quantization
