Model Quantization Guide
Quantization reduces model size and improves inference performance by converting floating-point weights and activations to lower precision formats (typically 8-bit integers). ONNX Runtime provides comprehensive quantization tools supporting both static and dynamic quantization.

Prerequisites
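The quantization tooling ships with the `onnxruntime` Python package (`pip install onnxruntime`); the examples below also assume `numpy` is installed and use placeholder model paths such as `model.onnx`.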
Quantization Methods
Dynamic Quantization
Dynamic quantization converts weights to int8 ahead of time, while activation quantization parameters are computed on the fly during inference:
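A minimal sketch using `quantize_dynamic`; the model paths are placeholders:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are converted to int8 when the quantized model is saved;
# activation scales are computed dynamically at inference time.
quantize_dynamic(
    model_input="model.onnx",        # placeholder path to the float32 model
    model_output="model_int8.onnx",  # placeholder path for the quantized model
    weight_type=QuantType.QInt8,
)
```

Dynamic quantization needs no calibration data, which makes it the easiest method to adopt.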
Static Quantization

Static quantization uses calibration data to determine optimal quantization parameters:
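A sketch of the flow: `quantize_static` pulls batches from a `CalibrationDataReader`. The reader below feeds random tensors purely as a placeholder (the input name `input` and shape `(1, 3, 224, 224)` are assumptions); in practice use real samples, as described under Calibration Data Best Practices below.

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, quantize_static

class RandomDataReader(CalibrationDataReader):
    """Placeholder calibrator that feeds random batches.
    Replace the random tensors with real, representative samples."""
    def __init__(self, input_name="input", num_batches=100):
        self._data = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        # Return one input feed per call; None tells the calibrator to stop.
        return next(self._data, None)

quantize_static(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    calibration_data_reader=RandomDataReader(),
)
```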
Configuration Options

Quantization Config
Use StaticQuantConfig for fine-grained control:
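A sketch assuming the parameter names in recent onnxruntime releases (check your installed version); it reuses the `RandomDataReader` defined above:

```python
from onnxruntime.quantization import (
    CalibrationMethod, QuantFormat, QuantType, StaticQuantConfig, quantize,
)

config = StaticQuantConfig(
    calibration_data_reader=RandomDataReader(),  # reader from the example above
    calibrate_method=CalibrationMethod.MinMax,
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
quantize("model.onnx", "model_int8.onnx", config)
```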
Calibration Methods
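ONNX Runtime supports several calibration methods: MinMax (uses observed minimum/maximum values), Entropy (minimizes KL divergence between the float and quantized distributions), and Percentile (clips outliers at a chosen percentile). Select one via `calibrate_method`; a sketch reusing the reader from above:

```python
from onnxruntime.quantization import CalibrationMethod, quantize_static

quantize_static(
    "model.onnx",
    "model_int8.onnx",
    RandomDataReader(),  # reader from the static quantization example
    calibrate_method=CalibrationMethod.Entropy,  # or MinMax, Percentile
)
```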
Advanced Quantization
Per-Channel Quantization
Quantize weights per output channel for better accuracy:
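Enabled through the `per_channel` flag; a sketch reusing the reader from above:

```python
from onnxruntime.quantization import quantize_static

# Each output channel of a weight tensor gets its own scale and zero point,
# which typically improves accuracy for convolutions at little extra cost.
quantize_static(
    "model.onnx",
    "model_int8.onnx",
    RandomDataReader(),
    per_channel=True,
)
```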
Selective Quantization

Quantize only specific operators:
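`op_types_to_quantize` restricts quantization to the listed operator types, and `nodes_to_exclude` skips individual nodes (the node name below is hypothetical):

```python
from onnxruntime.quantization import quantize_static

quantize_static(
    "model.onnx",
    "model_int8.onnx",
    RandomDataReader(),
    op_types_to_quantize=["Conv", "MatMul"],  # quantize only these op types
    nodes_to_exclude=["final_dense"],         # hypothetical node kept in float
)
```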
QDQ Format Quantization

Quantize-Dequantize (QDQ) format is recommended for best compatibility:
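QDQ inserts explicit QuantizeLinear/DequantizeLinear node pairs instead of rewriting nodes into dedicated quantized operators (`QuantFormat.QOperator`), which lets execution providers consume the quantization parameters they understand:

```python
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

quantize_static(
    "model.onnx",
    "model_int8.onnx",
    RandomDataReader(),
    quant_format=QuantFormat.QDQ,    # QuantizeLinear/DequantizeLinear pairs
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```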
Transformer Model Quantization

Specialized quantization for transformer models:
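One common recipe, sketched under the assumption of a BERT-base model (the file names, `num_heads`, and `hidden_size` are placeholders): fuse attention and LayerNorm patterns with the `onnxruntime.transformers` optimizer first, then quantize:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.transformers import optimizer

# Fuse attention/LayerNorm subgraphs before quantizing.
opt_model = optimizer.optimize_model(
    "bert.onnx", model_type="bert", num_heads=12, hidden_size=768
)
opt_model.save_model_to_file("bert_opt.onnx")

# Dynamic quantization works well for the MatMul-dominated workload.
quantize_dynamic("bert_opt.onnx", "bert_int8.onnx", weight_type=QuantType.QInt8)
```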
Calibration Data Best Practices

Representative Dataset
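Calibration data should mirror the real input distribution; 100-1000 samples covering your use cases is usually enough. A sketch of a reader that streams preprocessed `.npy` files from a folder (the folder layout and input name are assumptions):

```python
from pathlib import Path

import numpy as np
from onnxruntime.quantization import CalibrationDataReader

class FolderDataReader(CalibrationDataReader):
    """Streams preprocessed .npy samples from disk, one feed per call."""
    def __init__(self, folder, input_name="input"):
        self._files = iter(sorted(Path(folder).glob("*.npy")))
        self._input_name = input_name

    def get_next(self):
        path = next(self._files, None)
        if path is None:
            return None  # signals the calibrator that data is exhausted
        return {self._input_name: np.load(path).astype(np.float32)}
```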
Quantization Extra Options
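`extra_options` tunes quantizer behavior beyond the main keyword arguments. Two commonly used keys are shown below; the full set is version-dependent, so consult the onnxruntime.quantization documentation for your release:

```python
from onnxruntime.quantization import quantize_static

quantize_static(
    "model.onnx",
    "model_int8.onnx",
    FolderDataReader("calibration_data"),  # hypothetical folder of .npy samples
    extra_options={
        "ActivationSymmetric": True,  # symmetric activation ranges (helps TensorRT)
        "WeightSymmetric": True,      # symmetric weight ranges
    },
)
```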
Model Preprocessing
Optimize model before quantization:
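`quant_pre_process` runs symbolic shape inference and ONNX-level optimizations so the quantizer sees a cleaner graph; the same step is available on the command line as `python -m onnxruntime.quantization.preprocess --input model.onnx --output model_prep.onnx`:

```python
from onnxruntime.quantization.shape_inference import quant_pre_process

# Shape inference + graph optimization ahead of quantization.
quant_pre_process("model.onnx", "model_prep.onnx")
```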
Validating Quantized Models
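Always compare the quantized model's outputs against the original on the same inputs. A sketch assuming the input name and shape used earlier:

```python
import numpy as np
import onnxruntime as ort

x = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}  # assumed input
ref = ort.InferenceSession("model.onnx").run(None, x)[0]
quant = ort.InferenceSession("model_int8.onnx").run(None, x)[0]

# The error should be small relative to the output's dynamic range.
print("max abs diff:", np.abs(ref - quant).max())
```

Performance Comparison

A simple wall-clock benchmark sketch for comparing the two models; results vary by platform, so run it on your target hardware:

```python
import time

import numpy as np
import onnxruntime as ort

def benchmark(path, feed, runs=100):
    sess = ort.InferenceSession(path)
    sess.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1000  # ms per inference

feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
print(f"fp32: {benchmark('model.onnx', feed):.2f} ms")
print(f"int8: {benchmark('model_int8.onnx', feed):.2f} ms")
```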
Best Practices
- Use representative calibration data: 100-1000 samples covering your use cases
- Choose appropriate method: Dynamic for ease, static for best performance
- Enable per-channel quantization: Better accuracy with minimal overhead
- Use QDQ format: Better compatibility with execution providers
- Preprocess models: Run preprocessing before quantization
- Validate accuracy: Always compare quantized vs original outputs
- Test on target hardware: Performance gains vary by platform
- Consider symmetric quantization: For GPU/TensorRT deployment