## Overview

Quantization reduces model size and inference latency by using lower-precision numeric formats. This guide covers practical quantization techniques for production ML systems. Quantization can reduce model size by 4x and improve latency by 2-3x with minimal accuracy loss.
## What is Quantization?

Quantization converts high-precision weights (float32) to lower-precision formats (int8, int4).

Benefits:

- Smaller model size (4x-8x reduction)
- Faster inference (2x-4x speedup)
- Lower memory usage
- Reduced costs

Trade-offs:

- Slight accuracy loss (typically less than 1%)
- Hardware compatibility requirements
- Additional complexity
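The core conversion can be sketched in a few lines of NumPy: symmetric int8 quantization maps each float32 weight onto 256 integer levels via a per-tensor scale. This is a minimal illustration of the idea, not a production kernel:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: float32 -> int8 plus a scale."""
    scale = float(np.abs(w).max()) / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)  # 0.25 -> 4x smaller
print(float(np.abs(w - w_hat).max()) <= scale / 2 + 1e-6)  # rounding error is bounded
```

The 4x size reduction falls straight out of the dtypes (1 byte vs 4 bytes per weight), and the worst-case per-weight error is half a quantization step.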
## Quantization Techniques

### Post-Training Quantization (PTQ)

Quantize a pre-trained model without retraining.

Pros:

- Quick and easy
- No training data needed
- Works with any model

Cons:

- Larger accuracy drop than QAT
- Less optimal performance
### Quantization-Aware Training (QAT)

Train the model with quantization simulated during training.

Pros:

- Better accuracy
- More robust
- Optimal performance

Cons:

- Requires training
- More complex
- Longer timeline
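The core trick in QAT is "fake quantization": the forward pass rounds weights to the integer grid so training sees quantization error, while gradients flow through as if rounding were the identity (the straight-through estimator). A minimal NumPy sketch of the forward-pass part:

```python
import numpy as np

def fake_quantize(w: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Round weights to the integer grid, then return them as float32.

    The output keeps the input's dtype and shape, so it drops into a
    normal float forward pass -- but it takes at most 2**num_bits
    distinct values, exposing quantization error to the training loss.
    """
    qmax = 2 ** (num_bits - 1) - 1                    # e.g. 127 for int8
    scale = float(np.abs(w).max()) / qmax
    return (np.round(w / scale).clip(-qmax - 1, qmax) * scale).astype(np.float32)

w = np.linspace(-1.0, 1.0, 1001, dtype=np.float32)
w_q = fake_quantize(w, num_bits=8)
print(len(np.unique(w_q)))  # at most 256 distinct levels
```

During backpropagation a real QAT framework would pass gradients through `fake_quantize` unchanged; only the forward values are snapped to the grid.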
### Dynamic Quantization

Quantize weights statically and activations dynamically at runtime.

Pros:

- Easy to apply
- Good for RNNs/LSTMs
- No calibration needed

Cons:

- Limited speedup
- CPU only
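The "dynamic" part refers to the activation scale: it is recomputed from each incoming batch at runtime rather than fixed ahead of time. A small sketch, assuming a symmetric int8 scheme:

```python
import numpy as np

def dynamic_scale(activations: np.ndarray) -> float:
    """Dynamic quantization derives the int8 scale from the current
    batch at inference time -- no calibration pass is needed."""
    return float(np.abs(activations).max()) / 127.0

batch_a = np.array([[0.2, -1.5, 0.7]], dtype=np.float32)
batch_b = 10.0 * batch_a  # same inputs, 10x larger dynamic range

# The scale adapts to whatever range each batch actually has ...
print(dynamic_scale(batch_a) < dynamic_scale(batch_b))  # True

# ... but computing it (and quantizing activations) happens on every
# forward pass, which is why the speedup is limited vs. static quantization.
```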
### Static Quantization

Quantize both weights and activations statically.

Pros:

- Best performance
- Smallest size
- Hardware optimized

Cons:

- Needs calibration data
- More complex
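The calibration step that static quantization requires can be sketched as a simple observer: run representative batches through the model, track the activation range, and freeze one scale for all future inference. The class and method names below are hypothetical; real frameworks also offer histogram- and percentile-based observers:

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running max |activation| over calibration batches."""

    def __init__(self) -> None:
        self.max_abs = 0.0

    def observe(self, x: np.ndarray) -> None:
        self.max_abs = max(self.max_abs, float(np.abs(x).max()))

    def frozen_scale(self, qmax: int = 127) -> float:
        """Fixed scale reused for ALL future inputs -- no runtime cost."""
        return self.max_abs / qmax

observer = MinMaxObserver()
for batch in (np.array([0.5, -2.0]), np.array([1.0, 3.0])):
    observer.observe(batch.astype(np.float32))

scale = observer.frozen_scale()
print(scale)  # 3.0 / 127
```

This is why calibration data must resemble production traffic: an activation outside the observed range gets clipped at inference time.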
## Precision Formats
| Format | Bits | Range | Use Case |
|---|---|---|---|
| FP32 | 32 | Full precision | Baseline, training |
| FP16 | 16 | Half precision | GPU inference |
| BF16 | 16 | Brain float | Training, modern GPUs |
| FP8 | 8 | 8-bit float | H100 GPUs, Transformers |
| INT8 | 8 | -128 to 127 | General quantization |
| INT4 | 4 | -8 to 7 | Aggressive compression |
| NF4 | 4 | Normal float 4-bit | LLMs (QLoRA) |
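The trade-offs in the table can be seen directly by casting: FP16 keeps only about 3 decimal digits of precision and overflows past ~65504. A quick NumPy check (NumPy has no built-in BF16, FP8, INT4, or NF4 dtypes; those need dedicated libraries such as `ml_dtypes` or bitsandbytes):

```python
import numpy as np

x = np.float32(3.14159265)
print(np.float16(x))                 # ~3.14 -> low-order digits lost
print(np.float16(70000.0))           # inf  -> outside FP16 range (max ~65504)
print(np.int8(127), np.int8(-128))   # the full INT8 range from the table
```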
## Benchmark Results

Real-world quantization benchmark using Text Generation Inference (TGI).

### Test Setup
- Hardware: AWS EC2 g5.4xlarge (1x A10 GPU, 16 vCPU, 64GB RAM)
- Model: microsoft/Phi-3.5-mini-instruct
- Dataset: gretelai/synthetic_text_to_sql
- Load: 100 concurrent users
- Duration: 5 minutes per test
- Cost: $1.624/hour
### Performance Results
| Approach | Median (ms) | p95 (ms) | p98 (ms) | Size Reduction | Speed Improvement |
|---|---|---|---|---|---|
| default (FP32) | 5600 | 6200 | 6300 | 1x | 1x |
| fp8 | 5000 | 5800 | 6000 | ~4x | 1.1x |
| eetq | 5000 | 5700 | 5900 | ~4x | 1.1x |
| 4-bit-nf4 | 8500 | 9200 | 9400 | ~8x | 0.7x |
| 4-bit-fp4 | 8600 | 9300 | 9400 | ~8x | 0.7x |
| 8-bit | 13000 | 14000 | 14000 | ~4x | 0.4x |
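The Speed Improvement column is simply the baseline median divided by each variant's median; recomputing it from the table:

```python
# Median latencies (ms) from the performance results table
medians_ms = {
    "default (FP32)": 5600,
    "fp8": 5000,
    "eetq": 5000,
    "4-bit-nf4": 8500,
    "4-bit-fp4": 8600,
    "8-bit": 13000,
}

baseline = medians_ms["default (FP32)"]
for name, median in medians_ms.items():
    print(f"{name}: {baseline / median:.1f}x")
# fp8/eetq -> 1.1x, 4-bit variants -> 0.7x, 8-bit -> 0.4x
```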
### Key Findings

#### FP8 and EETQ: Best Balance
- Similar latency to baseline
- 4x smaller model size
- 10% faster inference
- Recommended for production
#### 4-bit: Maximum Compression
- 8x smaller model size
- 30% slower than baseline
- Good for memory-constrained environments
- Consider for edge deployment
#### 8-bit: Unexpected Slowdown
- 2.3x slower than baseline
- Not recommended for this hardware/model combination
- Hardware-specific results vary
## Text Generation Inference (TGI)

### Supported Quantization Methods

TGI supports multiple quantization techniques:

| Method | Description | Hardware |
|---|---|---|
| bitsandbytes | 8-bit and 4-bit quantization | NVIDIA GPUs |
| bitsandbytes-nf4 | 4-bit NormalFloat | NVIDIA GPUs |
| bitsandbytes-fp4 | 4-bit Float | NVIDIA GPUs |
| gptq | Post-training quantization | NVIDIA GPUs |
| awq | Activation-aware quantization | NVIDIA GPUs |
| eetq | Easy and efficient quantization | NVIDIA GPUs |
| fp8 | 8-bit floating point | H100, A100 GPUs |
### Running Benchmarks

Each configuration is selected with TGI's `--quantize` launch flag:

- Default (FP32): no `--quantize` flag
- FP8 Quantization: `--quantize fp8`
- EETQ Quantization: `--quantize eetq`
- 4-bit NF4 Quantization: `--quantize bitsandbytes-nf4`
- 4-bit FP4 Quantization: `--quantize bitsandbytes-fp4`
- 8-bit Quantization: `--quantize bitsandbytes`
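The original launch commands were not preserved; the sketch below is a representative setup using TGI's Docker image and its documented `--quantize` flag. The volume path, port mapping, and image tag are assumptions from typical TGI deployments:

```shell
MODEL=microsoft/Phi-3.5-mini-instruct
VOLUME=$PWD/data  # cache downloaded weights between runs (assumed path)

# Default (weights as shipped, no quantization)
docker run --gpus all -p 8080:80 -v $VOLUME:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $MODEL

# Quantized variant: swap in any method from the table above, e.g. FP8
docker run --gpus all -p 8080:80 -v $VOLUME:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $MODEL --quantize fp8
# likewise: --quantize eetq | bitsandbytes-nf4 | bitsandbytes-fp4 | bitsandbytes
```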
### Load Test Script

The load is driven by `load_test.py`.
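The script itself is not reproduced here; below is a minimal sketch of what a 100-user load generator against TGI's `/generate` endpoint might look like. The endpoint URL, prompt, and the stdlib threads-plus-`urllib` approach are assumptions; the original may well use a dedicated tool such as Locust or k6:

```python
import json
import threading
import time
import urllib.request

TGI_URL = "http://localhost:8080/generate"  # assumed TGI endpoint
CONCURRENT_USERS = 100
DURATION_S = 300  # 5 minutes, matching the test setup

def make_payload(prompt: str) -> bytes:
    """Build a TGI /generate request body."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": 128},
    }).encode()

def user_loop(latencies: list, stop: threading.Event) -> None:
    """One simulated user: fire requests back-to-back until told to stop."""
    while not stop.is_set():
        start = time.perf_counter()
        req = urllib.request.Request(
            TGI_URL,
            data=make_payload("Translate to SQL: list all users"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()
        latencies.append(time.perf_counter() - start)

if __name__ == "__main__":
    stop = threading.Event()
    latencies: list = []
    threads = [threading.Thread(target=user_loop, args=(latencies, stop))
               for _ in range(CONCURRENT_USERS)]
    for t in threads:
        t.start()
    time.sleep(DURATION_S)
    stop.set()
    for t in threads:
        t.join()
    latencies.sort()
    print(f"median: {latencies[len(latencies) // 2] * 1000:.0f} ms")
```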
## Hardware Compatibility

Different quantization methods work best on specific hardware.

### NVIDIA GPUs
| GPU | FP16 | INT8 | INT4 | FP8 |
|---|---|---|---|---|
| H100 | ✅ | ✅ | ✅ | ✅ |
| A100 | ✅ | ✅ | ✅ | ⚠️ |
| A10 | ✅ | ✅ | ✅ | ❌ |
| T4 | ✅ | ✅ | ⚠️ | ❌ |
| V100 | ✅ | ⚠️ | ❌ | ❌ |
### Cloud TPUs

Google Cloud TPUs support quantization:

- TPU v4: INT8 quantization
- TPU v5e: INT8 inference optimized
- TPU v5e Inference Converter
### AWS Inferentia

AWS custom ML chips:

- Inferentia2: INT8, FP16, BF16
- Inferentia: INT8
- AWS Inferentia Blog
## Other Optimization Techniques

### Model Distillation

Train a smaller model to mimic a larger one:

- DistilBERT: 40% smaller, 60% faster
- distil-whisper: 6x faster, 49% smaller
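Distillation trains the student to match the teacher's softened output distribution; the usual loss is the KL divergence between temperature-scaled softmaxes. A NumPy sketch (the temperature `T = 2.0` is a typical but arbitrary choice):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits / T)  # soft teacher targets
    q = softmax(student_logits / T)  # student predictions
    return float(np.sum(p * np.log(p / q))) * T * T  # T^2 restores gradient scale

teacher = np.array([4.0, 1.0, 0.5])
print(distillation_loss(teacher, teacher))          # 0.0 -- perfect match
print(distillation_loss(np.zeros(3), teacher) > 0)  # True -- mismatch penalized
```

In practice this term is combined with the ordinary cross-entropy loss on hard labels.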
### Model Pruning

Remove unimportant weights from a trained model to shrink it.

### Accelerators
Use specialized hardware accelerators:

- NVIDIA TensorRT: Optimized inference on NVIDIA GPUs
- ONNX Runtime: Cross-platform optimization
- TensorRT-LLM: Optimized LLM inference
## Decision Matrix

Choose quantization based on your constraints.

### Latency Critical (< 100 ms)

✅ Recommended: FP8 or EETQ

- Minimal latency impact
- 4x size reduction
- Modern GPU required
### Memory Constrained

✅ Recommended: 4-bit NF4

- 8x size reduction
- Acceptable latency increase
- Fits larger models in memory
### Cost Optimization

✅ Recommended: INT8 or FP8

- Smaller instances
- Lower GPU requirements
- Balance of speed and size
### Edge Deployment

✅ Recommended: 4-bit quantization

- Smallest size
- Runs on limited hardware
- Good for mobile/embedded
## Best Practices

- **Always benchmark**: Test quantization on your specific hardware, model, and workload.
- **Measure accuracy**: Validate model accuracy after quantization on a held-out test set.
- **Start conservative**: Begin with FP8/INT8 before trying aggressive 4-bit quantization.
- **Monitor production**: Track latency, throughput, and accuracy in production.
## Validation Checklist

Before deploying a quantized model:

- Benchmark latency (p50, p95, p99)
- Measure throughput (RPS)
- Validate accuracy on test set
- Test under production load
- Compare costs (quantized vs baseline)
- Document quantization settings
- Set up monitoring alerts
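The latency percentiles in the checklist can be computed from raw samples with the standard library alone. A small sketch with illustrative values; a real benchmark should use many more samples:

```python
import statistics

# Latency samples in milliseconds (illustrative values)
samples = [48, 52, 50, 49, 95, 51, 47, 120, 53, 50] * 10

# quantiles(n=100) returns the 99 cut points p1..p99
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```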
## Troubleshooting

### Accuracy drops significantly
Causes:

- Quantization too aggressive
- Model sensitive to precision
- Calibration data mismatch

Solutions:

- Use higher precision (INT8 instead of INT4)
- Try quantization-aware training
- Use better calibration data
- Consider mixed precision
### No speed improvement
Causes:

- Hardware doesn't support quantization
- Bottleneck elsewhere (I/O, CPU)
- Dynamic quantization overhead

Solutions:

- Verify hardware compatibility
- Profile to find the bottleneck
- Try static quantization
- Use appropriate precision for the hardware
### Out of memory errors
Causes:

- Quantization not actually applied
- Temporary memory spikes during conversion
- Activations not quantized

Solutions:

- Verify quantization by checking the model size
- Use a smaller batch size during conversion
- Quantize activations too
- Check framework documentation
## Resources

- **TGI Quantization**: HuggingFace TGI quantization guide
- **vLLM Hardware**: Supported quantization by hardware
- **Intel Neural Compressor**: Comprehensive quantization toolkit
- **Model Compression**: Top 23 open-source projects
- **FastFormers**: Microsoft's transformer optimization
- **SparseML**: Pruning and quantization toolkit
- **LLM Inference at Scale**: Real-world TGI deployment
- **Distil-Whisper**: Distilled speech recognition model
## Next Steps

### Practice Exercises

Apply what you've learned with hands-on practice tasks.