Overview
The quantize_onnx.py script applies dynamic INT8 quantization to ONNX models, reducing model size and inference latency for CPU and edge deployment.
Prerequisites
Usage
No command-line arguments are required. The script reads from artifacts/model.onnx and writes to artifacts/model.int8.onnx.

How It Works
The script applies ONNX Runtime’s dynamic quantization with INT8 weight compression (deployment/quantize_onnx.py).
Quantization Strategy
- Dynamic Quantization
- Static Quantization (Not Used)
- Weights: Quantized to INT8 at model load time
- Activations: Remain FP32, dynamically quantized per-batch during inference

Best for:
- CPU inference
- Variable input distributions
- Models where compute is weight-bound (linear layers, embeddings)
Performance Benefits
Model Size
2-4x smaller
INT8 weights are 4x smaller than FP32. Typical reduction:
- 10 MB model → 3-4 MB
- Faster model loading
- Lower disk/network overhead
Inference Latency
1.5-2x faster
INT8 GEMM operations reduce compute:
- Linear layers: ~2x speedup
- Tree ensembles: ~1.2x speedup
- Overall: 30-50% latency reduction
Memory Usage
2-3x lower
Reduced weight memory:
- Lower RAM footprint
- Better CPU cache utilization
- Enables larger batch sizes
Actual speedup depends on model architecture. Linear/fully-connected layers benefit most; tree-based models see smaller gains.
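A back-of-envelope sketch of why the reduction is 2-4x rather than a full 4x: only FP32 weight tensors shrink, while graph metadata and the added scale factors do not. The 85% weight fraction below is an illustrative assumption, not a measured figure:

```python
# Rough size estimate: only FP32 weight bytes shrink 4x to INT8;
# graph structure and metadata are unchanged.
def quantized_size_mb(total_mb, weight_fraction=0.85):
    weights = total_mb * weight_fraction  # FP32 weight bytes
    other = total_mb - weights            # graph/metadata bytes
    return weights / 4 + other            # INT8 weights + untouched rest

print(quantized_size_mb(10.0))  # 10 MB model -> ~3.6 MB
```

This matches the "10 MB model → 3-4 MB" figure above: the larger the weight fraction, the closer the result gets to the theoretical 4x.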
Expected Output
Implementation Details
Quantization Algorithm
- Weight Analysis: Identify quantizable operators (MatMul, Gemm, Conv)
- Scale Computation: Calculate per-tensor scale factors: scale = max(abs(weights)) / 127
- Quantization: Convert FP32 weights to INT8: w_int8 = round(w_fp32 / scale)
- Graph Rewrite: Insert QuantizeLinear/DequantizeLinear nodes around operators
- Optimization: Fuse dequant-op-quant sequences where possible
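The scale and quantization formulas above can be exercised in plain Python (illustrative weights, not taken from a real model):

```python
# Per-tensor symmetric INT8 quantization: scale = max(abs(w)) / 127,
# w_int8 = round(w / scale), clamped to the signed INT8 range.
def quantize_per_tensor(weights):
    scale = max(abs(w) for w in weights) / 127.0
    w_int8 = [max(-128, min(127, round(w / scale))) for w in weights]
    return scale, w_int8

def dequantize(scale, w_int8):
    return [q * scale for q in w_int8]

weights = [0.51, -1.27, 0.003, 0.8]
scale, q = quantize_per_tensor(weights)
recovered = dequantize(scale, q)

# Round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
assert max_err <= scale / 2
```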
QuantType Options
QInt8 is preferred for CPU inference as most SIMD instructions expect signed integers.
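For contrast, a small sketch of the two mappings: QInt8 uses a symmetric scale with a zero point fixed at 0, while QUInt8 maps the full weight range onto [0, 255] with a nonzero zero point. These are the standard formulas, not code from the script:

```python
# Signed symmetric quantization (QInt8): zero point is always 0.
def qint8_params(w_min, w_max):
    scale = max(abs(w_min), abs(w_max)) / 127.0
    return scale, 0

# Unsigned asymmetric quantization (QUInt8): [w_min, w_max] -> [0, 255].
def quint8_params(w_min, w_max):
    scale = (w_max - w_min) / 255.0
    zero_point = round(-w_min / scale)
    return scale, zero_point

print(qint8_params(-1.27, 0.8))   # scale ~0.01, zero point 0
print(quint8_params(-1.27, 0.8))  # scale ~0.0081, zero point ~156
```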
Trade-offs and Limitations
When Quantization Works Well
- Model has large fully-connected or embedding layers
- Predictions are robust to small numerical perturbations
- Threshold is well-separated from boundary cases
- Feature distributions are stable and normalized
When to Avoid Quantization
- Pure tree ensembles: they contain no quantizable operators (see Troubleshooting)
- Accuracy-critical models: INT8 precision loss may break prediction parity
- Very small models: dequantization overhead can outweigh the compute savings
Validation Workflow
After quantization, always validate prediction parity.

Benchmark Performance
Compare quantized vs. non-quantized inference.

Troubleshooting
Error: Missing artifacts/model.onnx. Run export first.
Cause: ONNX model not exported yet.
Solution: Run the export step to produce artifacts/model.onnx, then re-run quantization.
Quantized model is same size as original
Cause: Model has no quantizable operators (e.g., pure tree ensemble).
Solution: Tree models don’t benefit from weight quantization. Skip quantization step.
Parity check fails after quantization
Cause: INT8 precision loss exceeds tolerance.
Solution:
- Widen tolerances: --abs-tol 0.06 --mean-tol 0.02
- Inspect feature distributions for outliers
- Check if model uses probability calibration (sensitive to rounding)
- Consider skipping quantization if accuracy is critical
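A minimal sketch of what such a tolerance check might look like. The semantics assumed here for --abs-tol (max per-sample deviation) and --mean-tol (mean deviation) are an interpretation; the real validator may define them differently:

```python
# Hypothetical parity check mirroring the --abs-tol / --mean-tol flags.
def parity_ok(fp32_preds, int8_preds, abs_tol=0.06, mean_tol=0.02):
    diffs = [abs(a - b) for a, b in zip(fp32_preds, int8_preds)]
    return max(diffs) <= abs_tol and sum(diffs) / len(diffs) <= mean_tol

fp32 = [0.91, 0.12, 0.50, 0.77]
int8 = [0.90, 0.13, 0.47, 0.79]
print(parity_ok(fp32, int8))  # True: max diff 0.03, mean diff 0.0175
```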
Quantized model is slower than original
Cause: Overhead from dequantization dominates savings (small model or tree ensemble).
Solution: Use non-quantized model. Quantization benefits require sufficient compute-bound operations.
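To confirm which model is actually faster, a generic timing harness can wrap each model’s predict call. The stand-in workloads below are placeholders for the real FP32 and INT8 session runs:

```python
import time

# Mean wall-clock latency of a zero-argument callable, in milliseconds.
def mean_latency_ms(fn, warmup=3, iters=50):
    for _ in range(warmup):  # discard cold-start iterations
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

# Placeholder workloads; substitute the fp32 and int8 predict calls.
light = lambda: sum(range(1_000))
heavy = lambda: sum(range(50_000))
print(mean_latency_ms(light), mean_latency_ms(heavy))
```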
Advanced Configuration
The script uses minimal configuration for simplicity. For advanced use cases, modify quantize_onnx.py directly.
References
Next Steps
- Parity Validation: Validate quantized model accuracy
- CPU Inference: Benchmark quantized model performance