Overview

The quantize_onnx.py script applies dynamic INT8 quantization to ONNX models, reducing model size and inference latency for CPU and edge deployment.

Prerequisites

1. Export ONNX model

Ensure artifacts/model.onnx exists:
python deployment/export_onnx.py

Usage

python deployment/quantize_onnx.py
No command-line arguments are required. The script reads artifacts/model.onnx and writes artifacts/model.int8.onnx.

How It Works

The script applies ONNX Runtime’s dynamic quantization with INT8 weight compression:
deployment/quantize_onnx.py
from pathlib import Path

from onnxruntime.quantization import QuantType, quantize_dynamic

source = Path("artifacts/model.onnx")
target = Path("artifacts/model.int8.onnx")

quantize_dynamic(
    model_input=str(source),
    model_output=str(target),
    weight_type=QuantType.QInt8,
)

Quantization Strategy

Weights: Quantized to INT8 offline, when quantize_dynamic runs
Activations: Remain FP32 and are dynamically quantized per batch during inference

Best for:
  • CPU inference
  • Variable input distributions
  • Models where compute is weight-bound (linear layers, embeddings)
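The "dynamic" part of the strategy above can be sketched in a few lines of NumPy. This is an illustrative model of what happens per batch, not ONNX Runtime's actual kernel:

```python
import numpy as np

def dynamic_quantize(x: np.ndarray):
    """Quantize a batch of FP32 activations to INT8 with a per-batch scale.

    Mirrors, in spirit, what dynamic quantization does at inference time:
    the scale is recomputed from each incoming batch, so no calibration
    data is needed up front.
    """
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# Round-trip: dequantized values stay within one quantization step
batch = np.random.randn(4, 8).astype(np.float32)
q, scale = dynamic_quantize(batch)
recovered = q.astype(np.float32) * scale
assert np.max(np.abs(recovered - batch)) <= scale
```

Because the scale tracks each batch, inputs with shifting magnitudes still quantize reasonably, which is why dynamic quantization suits variable input distributions.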

Performance Benefits

Model Size

2-4x smaller

INT8 weights are 4x smaller than FP32. Typical reduction:
  • 10 MB model → 3-4 MB
  • Faster model loading
  • Lower disk/network overhead

Inference Latency

1.5-2x faster

INT8 GEMM operations reduce compute:
  • Linear layers: ~2x speedup
  • Tree ensembles: ~1.2x speedup
  • Overall: 30-50% latency reduction

Memory Usage

2-3x lower

Reduced weight memory:
  • Lower RAM footprint
  • Better CPU cache utilization
  • Enables larger batch sizes
Actual speedup depends on model architecture. Linear/fully-connected layers benefit most; tree-based models see smaller gains.

Expected Output

$ python deployment/quantize_onnx.py
Wrote artifacts/model.int8.onnx
$ ls -lh artifacts/model*.onnx
-rw-r--r-- 1 user user 12M model.onnx
-rw-r--r-- 1 user user  4M model.int8.onnx

Implementation Details

Quantization Algorithm

  1. Weight Analysis: Identify quantizable operators (MatMul, Gemm, Conv)
  2. Scale Computation: Calculate per-tensor scale factors: scale = max(abs(weights)) / 127
  3. Quantization: Convert FP32 weights to INT8: w_int8 = round(w_fp32 / scale)
  4. Graph Rewrite: Insert QuantizeLinear/DequantizeLinear nodes around operators
  5. Optimization: Fuse dequant-op-quant sequences where possible
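Steps 2 and 3 above can be sketched in NumPy for a single weight tensor (illustrative only; the real implementation lives in onnxruntime.quantization):

```python
import numpy as np

def quantize_weights(w_fp32: np.ndarray):
    """Per-tensor symmetric INT8 weight quantization (steps 2-3 above)."""
    # Step 2: per-tensor scale factor
    scale = float(np.abs(w_fp32).max()) / 127.0
    # Step 3: FP32 -> INT8 conversion
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_int8, scale

w = np.array([[0.02, -0.5], [1.27, -1.0]], dtype=np.float32)
w_int8, scale = quantize_weights(w)
# The DequantizeLinear node inserted in step 4 recovers an approximation:
w_hat = w_int8.astype(np.float32) * scale
```

The dequantized weights differ from the originals by at most half a quantization step, which is the precision loss discussed under Trade-offs below.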

QuantType Options

QuantType.QInt8    # Signed 8-bit integer (used by this script)
QuantType.QUInt8   # Unsigned 8-bit integer
QInt8 is preferred for CPU inference as most SIMD instructions expect signed integers.
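The practical difference between the two types: QInt8 can use a symmetric scheme centered on zero, while QUInt8 needs a zero-point offset to represent negative values. A hedged sketch of the asymmetric (QUInt8-style) scheme:

```python
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Asymmetric UINT8 quantization with a zero point.

    Unlike the symmetric INT8 scheme, the unsigned range [0, 255] cannot
    hold negative values directly, so a zero_point offset is stored
    alongside the scale.
    """
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

# Dequantization subtracts the zero point before rescaling:
# x_hat = (q.astype(np.float32) - zero_point) * scale
```

The extra zero-point arithmetic is one reason symmetric QInt8 is the simpler default for weights.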

Trade-offs and Limitations

Accuracy Impact: INT8 quantization introduces numerical precision loss. Predictions near the decision threshold may shift.

When Quantization Works Well

Model has large fully-connected or embedding layers
Predictions are robust to small numerical perturbations
Threshold is well-separated from boundary cases
Feature distributions are stable and normalized

When to Avoid Quantization

Model relies on precise floating-point values (e.g., probability calibration)
Threshold is very close to 0.5 (high sensitivity to rounding errors)
Tree ensemble with shallow trees (quantization overhead > benefit)
Model already achieves target latency without quantization

Validation Workflow

After quantization, always validate prediction parity:
python deployment/parity_check.py --abs-tol 0.04 --mean-tol 0.01
1. Run parity check

Compare quantized model against original sklearn model:
python deployment/parity_check.py
2. Inspect parity report

Check artifacts/parity_report.json for max/mean absolute differences:
{
  "max_abs_diff": 0.0234,
  "mean_abs_diff": 0.0067,
  "passed": true
}
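If you automate this check (for example in CI), a small gate on the report file is enough. The keys below match the example report above; the default tolerances mirror the CLI flags:

```python
import json
from pathlib import Path

def check_parity(report_path: str, abs_tol: float = 0.04,
                 mean_tol: float = 0.01) -> bool:
    """Return True if the parity report is within tolerance.

    Reads the JSON report written by parity_check.py and applies the
    same max/mean absolute-difference thresholds as the CLI flags.
    """
    report = json.loads(Path(report_path).read_text())
    return (report["max_abs_diff"] <= abs_tol
            and report["mean_abs_diff"] <= mean_tol)
```

With the example report above (max 0.0234, mean 0.0067), the default tolerances pass.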
3. Tune tolerances if needed

If parity fails, widen tolerances or investigate outliers:
python deployment/parity_check.py --abs-tol 0.06 --mean-tol 0.02

Benchmark Performance

Compare quantized vs. non-quantized inference:
# Baseline ONNX (FP32)
python deployment/cpu_inference.py --backend onnx

# Quantized ONNX (INT8) - modify script to load model.int8.onnx
python deployment/cpu_inference.py --backend onnx
To benchmark the quantized model, temporarily modify cpu_inference.py line 65 to load model.int8.onnx instead of model.onnx.
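For an apples-to-apples comparison, a generic timing harness like the sketch below works with any predict callable; wrap each model's session (for example, onnxruntime.InferenceSession(...).run) and compare the returned latencies:

```python
import time

def benchmark(predict, batch, warmup: int = 5, iters: int = 50) -> float:
    """Return mean per-call latency in milliseconds for a predict callable.

    Warmup iterations absorb one-time costs (lazy initialization,
    cold caches) so they do not skew the measurement.
    """
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(iters):
        predict(batch)
    return (time.perf_counter() - start) / iters * 1000.0
```

Run it once with the FP32 session and once with the INT8 session on the same batch; the ratio of the two results is the speedup.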

Troubleshooting

Cause: ONNX model not exported yet.
Solution:
python deployment/export_onnx.py
Cause: Model has no quantizable operators (e.g., pure tree ensemble).
Solution: Tree models don’t benefit from weight quantization. Skip the quantization step.
Cause: INT8 precision loss exceeds tolerance.
Solution:
  1. Widen tolerances: --abs-tol 0.06 --mean-tol 0.02
  2. Inspect feature distributions for outliers
  3. Check if model uses probability calibration (sensitive to rounding)
  4. Consider skipping quantization if accuracy is critical
Cause: Overhead from dequantization dominates savings (small model or tree ensemble).
Solution: Use the non-quantized model. Quantization benefits require sufficient compute-bound operations.

Advanced Configuration

The script uses minimal configuration for simplicity. For advanced use cases, modify quantize_onnx.py:
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input=str(source),
    model_output=str(target),
    weight_type=QuantType.QInt8,
    per_channel=True,  # Finer-grained quantization
)
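The effect of per_channel=True can be illustrated with a hypothetical weight matrix: instead of one scale for the whole tensor, each output channel (row) gets its own. A NumPy sketch of the idea:

```python
import numpy as np

def per_channel_scales(w: np.ndarray) -> np.ndarray:
    """One scale per output channel (axis 0) instead of one per tensor.

    Channels with small weights keep finer resolution than they would
    under a single shared scale, which is why per-channel quantization
    often improves accuracy for layers with uneven weight magnitudes.
    """
    return np.abs(w).max(axis=1) / 127.0

w = np.array([[0.01, -0.02],    # small-magnitude channel
              [1.00, -2.00]])   # large-magnitude channel
scales = per_channel_scales(w)  # per-tensor would force both rows to 2.0 / 127
```

The cost is storing one scale per channel and slightly slower quantization; for most models the accuracy gain outweighs it.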

Next Steps

Parity Validation

Validate quantized model accuracy

CPU Inference

Benchmark quantized model performance
