
Overview

Quantization reduces model size and inference latency by using lower-precision numeric formats. This guide covers practical quantization techniques for production ML systems.
Quantization can reduce model size by 4x and improve latency by 2-3x with minimal accuracy loss.

What is Quantization?

Quantization converts high-precision weights (float32) to lower precision (int8, int4):
Float32:  [0.12345678, -0.98765432, 0.45678901]
          ↓ quantization
Int8:     [16, -127, 59]   (symmetric scale ≈ 0.0078; valid int8 values lie in -128..127)
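The mapping above can be sketched with symmetric (zero-point-free) quantization, where a single scale maps the largest absolute weight onto 127:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization: the largest |value| maps to 127."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.12345678, -0.98765432, 0.45678901], dtype=np.float32)
q, scale = quantize_int8(x)
# Worst-case rounding error after dequantization is scale / 2
err = np.abs(dequantize(q, scale) - x).max()
```

Dequantizing does not recover the originals exactly; the rounding error bounded by half a quantization step is the accuracy cost the rest of this guide talks about.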
Benefits:
  • Smaller model size (4x-8x reduction)
  • Faster inference (2x-4x speedup)
  • Lower memory usage
  • Reduced costs
Trade-offs:
  • Slight accuracy loss (typically less than 1%)
  • Hardware compatibility requirements
  • Additional complexity

Quantization Techniques

Post-Training Quantization (PTQ)

Quantize a pre-trained model without any retraining. Pros:
  • Quick and easy
  • No training data needed
  • Works with any model
Cons:
  • Larger accuracy drop
  • Less optimal

Quantization-Aware Training (QAT)

Simulate quantization effects during training so the model adapts to reduced precision. Pros:
  • Better accuracy
  • More robust
  • Optimal performance
Cons:
  • Requires training
  • More complex
  • Longer timeline
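The core trick in QAT can be illustrated with "fake quantization": the forward pass rounds weights onto the int8 grid but keeps them in float, so the model learns weights that tolerate the rounding error. A minimal numpy sketch (not a full training loop; real frameworks pair this with a straight-through gradient estimator):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Round weights onto the integer grid but keep them as floats,
    so the training loss sees quantization error during the forward pass."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = np.abs(w).max() / qmax
    return (np.round(w / scale) * scale).astype(np.float32)

w = np.array([0.51, -0.32, 0.08], dtype=np.float32)
w_q = fake_quantize(w)
# w_q is still float32, but every value now sits on the int8 grid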

Dynamic Quantization

Quantize weights statically, activations dynamically. Pros:
  • Easy to apply
  • Good for RNNs/LSTMs
  • No calibration needed
Cons:
  • Limited speedup
  • CPU only
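In PyTorch, dynamic quantization is nearly a one-liner via `torch.ao.quantization.quantize_dynamic`; the toy model below stands in for a real network:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# Linear weights are converted to int8 once; activations are
# quantized on the fly with per-batch scales at inference time
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = qmodel(torch.randn(1, 256))
```

Only the layer types named in the set (here `nn.Linear`) are replaced; everything else runs in float, which is why the speedup is limited.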

Static Quantization

Quantize both weights and activations statically. Pros:
  • Best performance
  • Smallest size
  • Hardware optimized
Cons:
  • Needs calibration data
  • More complex
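The calibration requirement can be sketched as follows (hypothetical data; real frameworks attach per-layer "observers" that do this automatically): run representative batches, record the activation range, then freeze one scale that every inference batch reuses:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for activations recorded on calibration batches
calibration_activations = [rng.standard_normal((32, 64)) for _ in range(10)]

# Observe the activation range over all calibration data...
amax = max(float(np.abs(a).max()) for a in calibration_activations)
# ...and freeze a single scale for inference (symmetric int8)
scale = amax / 127.0
```

If production inputs fall outside the calibrated range, activations clip; this is the "calibration data mismatch" failure mode covered under Troubleshooting.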

Precision Formats

| Format | Bits | Range / Type | Use Case |
|--------|------|--------------|----------|
| FP32 | 32 | Full precision | Baseline, training |
| FP16 | 16 | Half precision | GPU inference |
| BF16 | 16 | Brain float | Training, modern GPUs |
| FP8 | 8 | 8-bit float | H100 GPUs, Transformers |
| INT8 | 8 | -128 to 127 | General quantization |
| INT4 | 4 | -8 to 7 | Aggressive compression |
| NF4 | 4 | Normal float 4-bit | LLMs (QLoRA) |
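Back-of-the-envelope weight storage per format is simple arithmetic (this ignores quantization metadata such as scales and zero-points, and activation memory; 3.8B parameters roughly matches the Phi-3.5-mini model benchmarked below):

```python
def weight_size_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage at a given precision, ignoring metadata."""
    return n_params * bits / 8 / 1e9

params = 3.8e9  # ~Phi-3.5-mini-instruct
for fmt, bits in [("FP32", 32), ("FP16", 16), ("FP8/INT8", 8), ("INT4/NF4", 4)]:
    print(f"{fmt:9s} {weight_size_gb(params, bits):5.1f} GB")
```

This is where the "4x" and "8x" reduction figures come from: 32 bits down to 8 or 4 bits per weight.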

Benchmark Results

Real-world quantization benchmark using Text Generation Inference (TGI):

Test Setup

  • Hardware: AWS EC2 g5.4xlarge (1x A10 GPU, 16 vCPU, 64GB RAM)
  • Model: microsoft/Phi-3.5-mini-instruct
  • Dataset: gretelai/synthetic_text_to_sql
  • Load: 100 concurrent users
  • Duration: 5 minutes per test
  • Cost: $1.624/hour

Performance Results

| Approach | Median (ms) | p95 (ms) | p98 (ms) | Size Reduction | Speed Improvement |
|----------|-------------|----------|----------|----------------|-------------------|
| default (FP32) | 5600 | 6200 | 6300 | 1x | 1x |
| fp8 | 5000 | 5800 | 6000 | ~4x | 1.1x |
| eetq | 5000 | 5700 | 5900 | ~4x | 1.1x |
| 4-bit-nf4 | 8500 | 9200 | 9400 | ~8x | 0.7x |
| 4-bit-fp4 | 8600 | 9300 | 9400 | ~8x | 0.7x |
| 8-bit | 13000 | 14000 | 14000 | ~4x | 0.4x |

Key Findings

FP8 and EETQ:
  • Similar latency to baseline
  • ~4x smaller model size
  • ~10% faster inference
  • Recommended for production
4-bit (NF4 / FP4):
  • ~8x smaller model size
  • ~30% slower than baseline
  • Good for memory-constrained environments
  • Consider for edge deployment
8-bit (bitsandbytes):
  • ~2.3x slower than baseline
  • Not recommended for this hardware/model combination
  • Hardware-specific; results vary
Important: Results vary significantly by hardware, model architecture, and workload. Always benchmark your specific setup.

Text Generation Inference (TGI)

Supported Quantization Methods

TGI supports multiple quantization techniques:
| Method | Description | Hardware |
|--------|-------------|----------|
| bitsandbytes | 8-bit and 4-bit quantization | NVIDIA GPUs |
| bitsandbytes-nf4 | 4-bit NormalFloat | NVIDIA GPUs |
| bitsandbytes-fp4 | 4-bit Float | NVIDIA GPUs |
| gptq | Post-training quantization | NVIDIA GPUs |
| awq | Activation-aware quantization | NVIDIA GPUs |
| eetq | Easy and efficient quantization | NVIDIA GPUs |
| fp8 | 8-bit floating point | H100, A100 GPUs |

Running Benchmarks

Default (FP32)

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/default.csv --html results/default.html

FP8 Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize fp8

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/fp8.csv --html results/fp8.html

EETQ Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize eetq

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/eetq.csv --html results/eetq.html

4-bit NF4 Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize bitsandbytes-nf4

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/4-bit-nf4.csv --html results/4-bit-nf4.html

4-bit FP4 Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize bitsandbytes-fp4

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/4-bit-fp4.csv --html results/4-bit-fp4.html

8-bit Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize bitsandbytes

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/8-bit.csv --html results/8-bit.html

Load Test Script

load_test.py
from locust import HttpUser, task, between
from datasets import load_dataset
import random
import json


class LoadTestUser(HttpUser):
    wait_time = between(1, 5)

    def on_start(self):
        self.dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
        self.dataset_size = len(self.dataset)

    @task
    def generate_sql(self):
        index = random.randint(0, self.dataset_size - 1)
        sample = self.dataset[index]

        sql_context = sample.get("sql_context", "No context provided")
        sql_prompt = sample.get("sql_prompt", "No prompt provided")

        input_text = (
            f"Generate sql for this context: {sql_context} for this query: {sql_prompt}"
        )

        payload = {"inputs": input_text}

        headers = {"accept": "application/json", "Content-Type": "application/json"}

        with self.client.post(
            "/generate",
            data=json.dumps(payload),
            headers=headers,
            name="/generate",
            catch_response=True,
        ) as response:
            if response.status_code != 200:
                response.failure(f"Failed with status code {response.status_code}")
            else:
                response.success()

Hardware Compatibility

Different quantization methods work best on specific hardware:

NVIDIA GPUs

| GPU  | FP16 | INT8 | INT4 | FP8 |
|------|------|------|------|-----|
| H100 | ✅ | ✅ | ✅ | ✅ |
| A100 | ✅ | ✅ | ✅ | ⚠️ |
| A10  | ✅ | ✅ | ✅ | ❌ |
| T4   | ✅ | ✅ | ⚠️ | ❌ |
| V100 | ✅ | ⚠️ | ❌ | ❌ |

✅ Full support | ⚠️ Limited support | ❌ Not supported

See vLLM hardware support for detailed compatibility.

Cloud TPUs

Google Cloud TPUs compute natively in bfloat16 and support INT8 quantization through frameworks such as JAX and PyTorch/XLA.

AWS Inferentia

AWS custom ML chips (Inferentia Inf1/Inf2) support FP16, BF16, and INT8 inference through the AWS Neuron SDK.

Other Optimization Techniques

Model Distillation

Train smaller model to mimic larger model:
import torch.nn.functional as F

# Student learns from the teacher's softened outputs (KL divergence at temperature T)
teacher_logits = teacher_model(inputs)
student_logits = student_model(inputs)

T = 3.0  # temperature softens both distributions
loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction="batchmean") * T * T
Examples: DistilBERT, DistilGPT-2, Distil-Whisper.

Model Pruning

Remove unimportant weights:
import torch
from torch.nn.utils import prune

# Prune 30% of weights in linear layer
prune.l1_unstructured(model.linear, name='weight', amount=0.3)
Tools: torch.nn.utils.prune, SparseML, Intel Neural Compressor.
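A quick sanity check that pruning did what was asked (toy layer; `l1_unstructured` zeroes the 30% of weights with the smallest absolute value via a mask):

```python
import torch
from torch.nn.utils import prune

layer = torch.nn.Linear(64, 32)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# The pruning mask zeroes ~30% of weights; measure achieved sparsity
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")
```

Note that unstructured sparsity only shrinks or speeds up the model if the runtime exploits sparse weights; otherwise the zeros are still stored densely.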

Accelerators

Use specialized hardware accelerators:
  • NVIDIA TensorRT: Optimized inference on NVIDIA GPUs
  • ONNX Runtime: Cross-platform optimization
  • TensorRT-LLM: Optimized LLM inference
Example with TensorRT-LLM on Modal:
import modal

app = modal.App("trtllm-inference")

image = modal.Image.from_registry(
    "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
).run_commands(
    "pip install tensorrt_llm"
)

@app.function(gpu="A100", image=image)
def inference(prompt: str):
    # TensorRT-LLM inference
    pass
See Modal TensorRT-LLM example.

Decision Matrix

Choose quantization based on your constraints:

Latency Critical (< 100ms)

Recommended: FP8 or EETQ
  • Minimal latency impact
  • 4x size reduction
  • Modern GPU required

Memory Constrained

Recommended: 4-bit NF4
  • 8x size reduction
  • Acceptable latency increase
  • Fits larger models in memory

Cost Optimization

Recommended: INT8 or FP8
  • Smaller instances
  • Lower GPU requirements
  • Balance of speed and size

Edge Deployment

Recommended: 4-bit quantization
  • Smallest size
  • Runs on limited hardware
  • Good for mobile/embedded

Best Practices

Always Benchmark

Test quantization on your specific hardware, model, and workload

Measure Accuracy

Validate model accuracy after quantization on held-out test set

Start Conservative

Begin with FP8/INT8 before trying aggressive 4-bit quantization

Monitor Production

Track latency, throughput, and accuracy in production

Validation Checklist

Before deploying quantized model:
  • Benchmark latency (p50, p95, p99)
  • Measure throughput (RPS)
  • Validate accuracy on test set
  • Test under production load
  • Compare costs (quantized vs baseline)
  • Document quantization settings
  • Set up monitoring alerts
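For the latency items on the checklist, percentiles are straightforward to compute from raw per-request timings (hypothetical numbers shown):

```python
import numpy as np

# Hypothetical per-request latencies from a load test, in milliseconds
latencies_ms = np.array([48, 52, 55, 61, 70, 88, 95, 120, 180, 240])
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

Compare these against the same percentiles from the unquantized baseline before and after every quantization change.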

Troubleshooting

Accuracy Degradation

Causes:
  • Quantization too aggressive
  • Model sensitive to precision
  • Calibration data mismatch
Solutions:
  • Use higher precision (INT8 instead of INT4)
  • Try quantization-aware training
  • Use better calibration data
  • Consider mixed precision

No Latency Improvement

Causes:
  • Hardware doesn't support quantization
  • Bottleneck elsewhere (I/O, CPU)
  • Dynamic quantization overhead
Solutions:
  • Verify hardware compatibility
  • Profile to find bottleneck
  • Try static quantization
  • Use appropriate precision for hardware

Memory Usage Not Reduced

Causes:
  • Quantization not actually applied
  • Temporary memory during conversion
  • Activations not quantized
Solutions:
  • Verify quantization with model size
  • Use smaller batch size during conversion
  • Quantize activations too
  • Check framework documentation

Resources

  • TGI Quantization: HuggingFace TGI quantization guide
  • vLLM Hardware: supported quantization by hardware
  • Intel Neural Compressor: comprehensive quantization toolkit
  • Model Compression: top 23 open source projects
  • FastFormers: Microsoft's transformer optimization
  • SparseML: pruning and quantization toolkit
  • LLM Inference at Scale: real-world TGI deployment
  • Distil-Whisper: distilled speech recognition model

Next Steps

Practice Exercises

Apply what you’ve learned with hands-on practice tasks
