
Overview

Quantization reduces model size and inference latency by using lower-precision numeric formats. This guide covers practical quantization techniques for production ML systems.
Quantization can reduce model size by 4x and improve latency by 2-3x with minimal accuracy loss.

What is Quantization?

Quantization converts high-precision weights (float32) to lower precision (int8, int4):
Float32:  [0.12345678, -0.98765432, 0.45678901]
          ↓ quantization
Int8:     [16, -127, 59]   (symmetric scale ≈ 0.0078; valid int8 values lie in -128..127)
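The mapping above can be sketched with symmetric (zero-point-free) quantization, where a single scale maps the largest absolute weight onto 127:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization: the largest |value| maps to 127."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.12345678, -0.98765432, 0.45678901], dtype=np.float32)
q, scale = quantize_int8(x)
# Worst-case rounding error after dequantization is scale / 2
err = np.abs(dequantize(q, scale) - x).max()
```

Dequantizing does not recover the originals exactly; the rounding error bounded by half a quantization step is the accuracy cost the rest of this guide talks about.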
Benefits:
  • Smaller model size (4x-8x reduction)
  • Faster inference (2x-4x speedup)
  • Lower memory usage
  • Reduced costs
Trade-offs:
  • Slight accuracy loss (typically less than 1%)
  • Hardware compatibility requirements
  • Additional complexity

Quantization Techniques

Post-Training Quantization (PTQ)

Quantize a pre-trained model without any retraining. Pros:
  • Quick and easy
  • No training data needed
  • Works with any model
Cons:
  • Larger accuracy drop
  • Less optimal

Quantization-Aware Training (QAT)

Simulate quantization effects during training so the model adapts to reduced precision. Pros:
  • Better accuracy
  • More robust
  • Optimal performance
Cons:
  • Requires training
  • More complex
  • Longer timeline
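The core trick in QAT can be illustrated with "fake quantization": the forward pass rounds weights onto the int8 grid but keeps them in float, so the model learns weights that tolerate the rounding error. A minimal numpy sketch (not a full training loop; real frameworks pair this with a straight-through gradient estimator):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Round weights onto the integer grid but keep them as floats,
    so the training loss sees quantization error during the forward pass."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = np.abs(w).max() / qmax
    return (np.round(w / scale) * scale).astype(np.float32)

w = np.array([0.51, -0.32, 0.08], dtype=np.float32)
w_q = fake_quantize(w)
# w_q is still float32, but every value now sits on the int8 grid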

Dynamic Quantization

Quantize weights statically, activations dynamically. Pros:
  • Easy to apply
  • Good for RNNs/LSTMs
  • No calibration needed
Cons:
  • Limited speedup
  • CPU only
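In PyTorch, dynamic quantization is nearly a one-liner via `torch.ao.quantization.quantize_dynamic`; the toy model below stands in for a real network:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# Linear weights are converted to int8 once; activations are
# quantized on the fly with per-batch scales at inference time
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = qmodel(torch.randn(1, 256))
```

Only the layer types named in the set (here `nn.Linear`) are replaced; everything else runs in float, which is why the speedup is limited.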

Static Quantization

Quantize both weights and activations statically. Pros:
  • Best performance
  • Smallest size
  • Hardware optimized
Cons:
  • Needs calibration data
  • More complex
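The calibration requirement can be sketched as follows (hypothetical data; real frameworks attach per-layer "observers" that do this automatically): run representative batches, record the activation range, then freeze one scale that every inference batch reuses:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for activations recorded on calibration batches
calibration_activations = [rng.standard_normal((32, 64)) for _ in range(10)]

# Observe the activation range over all calibration data...
amax = max(float(np.abs(a).max()) for a in calibration_activations)
# ...and freeze a single scale for inference (symmetric int8)
scale = amax / 127.0
```

If production inputs fall outside the calibrated range, activations clip; this is the "calibration data mismatch" failure mode covered under Troubleshooting.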

Precision Formats

| Format | Bits | Range / Type | Use Case |
|--------|------|--------------|----------|
| FP32 | 32 | Full precision | Baseline, training |
| FP16 | 16 | Half precision | GPU inference |
| BF16 | 16 | Brain float | Training, modern GPUs |
| FP8 | 8 | 8-bit float | H100 GPUs, Transformers |
| INT8 | 8 | -128 to 127 | General quantization |
| INT4 | 4 | -8 to 7 | Aggressive compression |
| NF4 | 4 | Normal float 4-bit | LLMs (QLoRA) |
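Back-of-the-envelope weight storage per format is simple arithmetic (this ignores quantization metadata such as scales and zero-points, and activation memory; 3.8B parameters roughly matches the Phi-3.5-mini model benchmarked below):

```python
def weight_size_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage at a given precision, ignoring metadata."""
    return n_params * bits / 8 / 1e9

params = 3.8e9  # ~Phi-3.5-mini-instruct
for fmt, bits in [("FP32", 32), ("FP16", 16), ("FP8/INT8", 8), ("INT4/NF4", 4)]:
    print(f"{fmt:9s} {weight_size_gb(params, bits):5.1f} GB")
```

This is where the "4x" and "8x" reduction figures come from: 32 bits down to 8 or 4 bits per weight.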

Benchmark Results

Real-world quantization benchmark using Text Generation Inference (TGI):

Test Setup

  • Hardware: AWS EC2 g5.4xlarge (1x A10 GPU, 16 vCPU, 64GB RAM)
  • Model: microsoft/Phi-3.5-mini-instruct
  • Dataset: gretelai/synthetic_text_to_sql
  • Load: 100 concurrent users
  • Duration: 5 minutes per test
  • Cost: $1.624/hour

Performance Results

| Approach | Median (ms) | p95 (ms) | p98 (ms) | Size Reduction | Speed Improvement |
|----------|-------------|----------|----------|----------------|-------------------|
| default (FP32) | 5600 | 6200 | 6300 | 1x | 1x |
| fp8 | 5000 | 5800 | 6000 | ~4x | 1.1x |
| eetq | 5000 | 5700 | 5900 | ~4x | 1.1x |
| 4-bit-nf4 | 8500 | 9200 | 9400 | ~8x | 0.7x |
| 4-bit-fp4 | 8600 | 9300 | 9400 | ~8x | 0.7x |
| 8-bit | 13000 | 14000 | 14000 | ~4x | 0.4x |

Key Findings

FP8 and EETQ:
  • Similar latency to baseline
  • ~4x smaller model size
  • ~10% faster inference
  • Recommended for production
4-bit (NF4 / FP4):
  • ~8x smaller model size
  • ~30% slower than baseline
  • Good for memory-constrained environments
  • Consider for edge deployment
8-bit (bitsandbytes):
  • ~2.3x slower than baseline
  • Not recommended for this hardware/model combination
  • Hardware-specific; results vary
Important: Results vary significantly by hardware, model architecture, and workload. Always benchmark your specific setup.

Text Generation Inference (TGI)

Supported Quantization Methods

TGI supports multiple quantization techniques:
| Method | Description | Hardware |
|--------|-------------|----------|
| bitsandbytes | 8-bit and 4-bit quantization | NVIDIA GPUs |
| bitsandbytes-nf4 | 4-bit NormalFloat | NVIDIA GPUs |
| bitsandbytes-fp4 | 4-bit Float | NVIDIA GPUs |
| gptq | Post-training quantization | NVIDIA GPUs |
| awq | Activation-aware quantization | NVIDIA GPUs |
| eetq | Easy and efficient quantization | NVIDIA GPUs |
| fp8 | 8-bit floating point | H100, A100 GPUs |

Running Benchmarks

Default (FP32)

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/default.csv --html results/default.html

FP8 Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize fp8

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/fp8.csv --html results/fp8.html

EETQ Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize eetq

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/eetq.csv --html results/eetq.html

4-bit NF4 Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize bitsandbytes-nf4

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/4-bit-nf4.csv --html results/4-bit-nf4.html

4-bit FP4 Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize bitsandbytes-fp4

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/4-bit-fp4.csv --html results/4-bit-fp4.html

8-bit Quantization

docker run --gpus all --shm-size 1g -p 8005:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3.0 \
  --model-id microsoft/Phi-3.5-mini-instruct \
  --quantize bitsandbytes

locust -f load_test.py -u 100 -r 10 --headless \
  --run-time 5m --host=http://0.0.0.0:8005 \
  --csv results/8-bit.csv --html results/8-bit.html

Load Test Script

load_test.py
from locust import HttpUser, task, between
from datasets import load_dataset
import random
import json


class LoadTestUser(HttpUser):
    wait_time = between(1, 5)

    def on_start(self):
        self.dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
        self.dataset_size = len(self.dataset)

    @task
    def generate_sql(self):
        index = random.randint(0, self.dataset_size - 1)
        sample = self.dataset[index]

        sql_context = sample.get("sql_context", "No context provided")
        sql_prompt = sample.get("sql_prompt", "No prompt provided")

        input_text = (
            f"Generate sql for this context: {sql_context} for this query: {sql_prompt}"
        )

        payload = {"inputs": input_text}

        headers = {"accept": "application/json", "Content-Type": "application/json"}

        with self.client.post(
            "/generate",
            data=json.dumps(payload),
            headers=headers,
            name="/generate",
            catch_response=True,
        ) as response:
            if response.status_code != 200:
                response.failure(f"Failed with status code {response.status_code}")
            else:
                response.success()

Hardware Compatibility

Different quantization methods work best on specific hardware:

NVIDIA GPUs

| GPU  | FP16 | INT8 | INT4 | FP8 |
|------|------|------|------|-----|
| H100 | ✅ | ✅ | ✅ | ✅ |
| A100 | ✅ | ✅ | ✅ | ⚠️ |
| A10  | ✅ | ✅ | ✅ | ❌ |
| T4   | ✅ | ✅ | ⚠️ | ❌ |
| V100 | ✅ | ⚠️ | ❌ | ❌ |

✅ Full support | ⚠️ Limited support | ❌ Not supported

See vLLM hardware support for detailed compatibility.

Cloud TPUs

Google Cloud TPUs compute natively in bfloat16 and support INT8 quantization through frameworks such as JAX and PyTorch/XLA.

AWS Inferentia

AWS custom ML chips (Inferentia Inf1/Inf2) support FP16, BF16, and INT8 inference through the AWS Neuron SDK.

Other Optimization Techniques

Model Distillation

Train smaller model to mimic larger model:
import torch.nn.functional as F

# Student learns from the teacher's softened outputs (KL divergence at temperature T)
teacher_logits = teacher_model(inputs)
student_logits = student_model(inputs)

T = 3.0  # temperature softens both distributions
loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction="batchmean") * T * T
Examples: DistilBERT, DistilGPT-2, Distil-Whisper.

Model Pruning

Remove unimportant weights:
import torch
from torch.nn.utils import prune

# Prune 30% of weights in linear layer
prune.l1_unstructured(model.linear, name='weight', amount=0.3)
Tools: torch.nn.utils.prune, SparseML, Intel Neural Compressor.
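A quick sanity check that pruning did what was asked (toy layer; `l1_unstructured` zeroes the 30% of weights with the smallest absolute value via a mask):

```python
import torch
from torch.nn.utils import prune

layer = torch.nn.Linear(64, 32)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# The pruning mask zeroes ~30% of weights; measure achieved sparsity
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")
```

Note that unstructured sparsity only shrinks or speeds up the model if the runtime exploits sparse weights; otherwise the zeros are still stored densely.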

Accelerators

Use specialized hardware accelerators:
  • NVIDIA TensorRT: Optimized inference on NVIDIA GPUs
  • ONNX Runtime: Cross-platform optimization
  • TensorRT-LLM: Optimized LLM inference
Example with TensorRT-LLM on Modal:
import modal

app = modal.App("trtllm-inference")

image = modal.Image.from_registry(
    "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
).run_commands(
    "pip install tensorrt_llm"
)

@app.function(gpu="A100", image=image)
def inference(prompt: str):
    # TensorRT-LLM inference
    pass
See Modal TensorRT-LLM example.

Decision Matrix

Choose quantization based on your constraints:

Latency Critical (< 100ms)

Recommended: FP8 or EETQ
  • Minimal latency impact
  • 4x size reduction
  • Modern GPU required

Memory Constrained

Recommended: 4-bit NF4
  • 8x size reduction
  • Acceptable latency increase
  • Fits larger models in memory

Cost Optimization

Recommended: INT8 or FP8
  • Smaller instances
  • Lower GPU requirements
  • Balance of speed and size

Edge Deployment

Recommended: 4-bit quantization
  • Smallest size
  • Runs on limited hardware
  • Good for mobile/embedded

Best Practices

Always Benchmark

Test quantization on your specific hardware, model, and workload

Measure Accuracy

Validate model accuracy after quantization on held-out test set

Start Conservative

Begin with FP8/INT8 before trying aggressive 4-bit quantization

Monitor Production

Track latency, throughput, and accuracy in production

Validation Checklist

Before deploying quantized model:
  • Benchmark latency (p50, p95, p99)
  • Measure throughput (RPS)
  • Validate accuracy on test set
  • Test under production load
  • Compare costs (quantized vs baseline)
  • Document quantization settings
  • Set up monitoring alerts
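For the latency items on the checklist, percentiles are straightforward to compute from raw per-request timings (hypothetical numbers shown):

```python
import numpy as np

# Hypothetical per-request latencies from a load test, in milliseconds
latencies_ms = np.array([48, 52, 55, 61, 70, 88, 95, 120, 180, 240])
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

Compare these against the same percentiles from the unquantized baseline before and after every quantization change.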

Troubleshooting

Accuracy Degradation

Causes:
  • Quantization too aggressive
  • Model sensitive to precision
  • Calibration data mismatch
Solutions:
  • Use higher precision (INT8 instead of INT4)
  • Try quantization-aware training
  • Use better calibration data
  • Consider mixed precision

No Latency Improvement

Causes:
  • Hardware doesn't support quantization
  • Bottleneck elsewhere (I/O, CPU)
  • Dynamic quantization overhead
Solutions:
  • Verify hardware compatibility
  • Profile to find bottleneck
  • Try static quantization
  • Use appropriate precision for hardware

Memory Usage Not Reduced

Causes:
  • Quantization not actually applied
  • Temporary memory during conversion
  • Activations not quantized
Solutions:
  • Verify quantization with model size
  • Use smaller batch size during conversion
  • Quantize activations too
  • Check framework documentation

Resources

  • TGI Quantization: HuggingFace TGI quantization guide
  • vLLM Hardware: supported quantization by hardware
  • Intel Neural Compressor: comprehensive quantization toolkit
  • Model Compression: top 23 open source projects
  • FastFormers: Microsoft's transformer optimization
  • SparseML: pruning and quantization toolkit
  • LLM Inference at Scale: real-world TGI deployment
  • Distil-Whisper: distilled speech recognition model

Next Steps

Practice Exercises

Apply what you’ve learned with hands-on practice tasks
