Overview
The quantize_onnx.py script applies dynamic INT8 quantization to ONNX models, reducing model size and inference latency for CPU and edge deployment.
Prerequisites
Usage
No command-line arguments are required. The script reads from artifacts/model.onnx and writes to artifacts/model.int8.onnx.

How It Works
The script applies ONNX Runtime’s dynamic quantization with INT8 weight compression (deployment/quantize_onnx.py).
Quantization Strategy
- Dynamic Quantization
- Static Quantization (Not Used)
- Weights: Quantized to INT8 at model load time
- Activations: Remain FP32, dynamically quantized per-batch during inference

Best for:
- CPU inference
- Variable input distributions
- Models where compute is weight-bound (linear layers, embeddings)
Performance Benefits
Model Size
2-4x smaller
INT8 weights are 4x smaller than FP32. Typical reduction:
- 10 MB model → 3-4 MB
- Faster model loading
- Lower disk/network overhead
Inference Latency
1.5-2x faster
INT8 GEMM operations reduce compute:
- Linear layers: ~2x speedup
- Tree ensembles: ~1.2x speedup
- Overall: 30-50% latency reduction
Memory Usage
2-3x lower
Reduced weight memory:
- Lower RAM footprint
- Better CPU cache utilization
- Enables larger batch sizes
Actual speedup depends on model architecture. Linear/fully-connected layers benefit most; tree-based models see smaller gains.
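A back-of-envelope sketch of why the reduction is 2-4x rather than a full 4x: only FP32 weight tensors shrink, while graph metadata and the added scale factors do not. The 85% weight fraction below is an illustrative assumption, not a measured figure:

```python
# Rough size estimate: only FP32 weight bytes shrink 4x to INT8;
# graph structure and metadata are unchanged.
def quantized_size_mb(total_mb, weight_fraction=0.85):
    weights = total_mb * weight_fraction  # FP32 weight bytes
    other = total_mb - weights            # graph/metadata bytes
    return weights / 4 + other            # INT8 weights + untouched rest

print(quantized_size_mb(10.0))  # 10 MB model -> ~3.6 MB
```

This matches the "10 MB model → 3-4 MB" figure above: the larger the weight fraction, the closer the result gets to the theoretical 4x.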
Expected Output
Implementation Details
Quantization Algorithm
- Weight Analysis: Identify quantizable operators (MatMul, Gemm, Conv)
- Scale Computation: Calculate per-tensor scale factors: scale = max(abs(weights)) / 127
- Quantization: Convert FP32 weights to INT8: w_int8 = round(w_fp32 / scale)
- Graph Rewrite: Insert QuantizeLinear/DequantizeLinear nodes around operators
- Optimization: Fuse dequant-op-quant sequences where possible
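The scale and quantization formulas above can be exercised in plain Python (illustrative weights, not taken from a real model):

```python
# Per-tensor symmetric INT8 quantization: scale = max(abs(w)) / 127,
# w_int8 = round(w / scale), clamped to the signed INT8 range.
def quantize_per_tensor(weights):
    scale = max(abs(w) for w in weights) / 127.0
    w_int8 = [max(-128, min(127, round(w / scale))) for w in weights]
    return scale, w_int8

def dequantize(scale, w_int8):
    return [q * scale for q in w_int8]

weights = [0.51, -1.27, 0.003, 0.8]
scale, q = quantize_per_tensor(weights)
recovered = dequantize(scale, q)

# Round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
assert max_err <= scale / 2
```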
QuantType Options
QInt8 is preferred for CPU inference as most SIMD instructions expect signed integers.
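For contrast, a small sketch of the two mappings: QInt8 uses a symmetric scale with a zero point fixed at 0, while QUInt8 maps the full weight range onto [0, 255] with a nonzero zero point. These are the standard formulas, not code from the script:

```python
# Signed symmetric quantization (QInt8): zero point is always 0.
def qint8_params(w_min, w_max):
    scale = max(abs(w_min), abs(w_max)) / 127.0
    return scale, 0

# Unsigned asymmetric quantization (QUInt8): [w_min, w_max] -> [0, 255].
def quint8_params(w_min, w_max):
    scale = (w_max - w_min) / 255.0
    zero_point = round(-w_min / scale)
    return scale, zero_point

print(qint8_params(-1.27, 0.8))   # scale ~0.01, zero point 0
print(quint8_params(-1.27, 0.8))  # scale ~0.0081, zero point ~156
```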
Trade-offs and Limitations
When Quantization Works Well
- Model has large fully-connected or embedding layers
- Predictions are robust to small numerical perturbations
- Threshold is well-separated from boundary cases
- Feature distributions are stable and normalized
When to Avoid Quantization
- Pure tree ensembles: they contain no quantizable operators (see Troubleshooting)
- Accuracy-critical models: INT8 precision loss may break prediction parity
- Very small models: dequantization overhead can outweigh the compute savings
Validation Workflow
After quantization, always validate prediction parity.

Benchmark Performance
Compare quantized vs. non-quantized inference.

Troubleshooting
Error: Missing artifacts/model.onnx. Run export first.
Cause: ONNX model not exported yet.
Solution: Run the export step to produce artifacts/model.onnx, then re-run quantization.
Quantized model is same size as original
Cause: Model has no quantizable operators (e.g., pure tree ensemble).
Solution: Tree models don’t benefit from weight quantization. Skip quantization step.
Parity check fails after quantization
Cause: INT8 precision loss exceeds tolerance.
Solution:
- Widen tolerances: --abs-tol 0.06 --mean-tol 0.02
- Inspect feature distributions for outliers
- Check if model uses probability calibration (sensitive to rounding)
- Consider skipping quantization if accuracy is critical
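A minimal sketch of what such a tolerance check might look like. The semantics assumed here for --abs-tol (max per-sample deviation) and --mean-tol (mean deviation) are an interpretation; the real validator may define them differently:

```python
# Hypothetical parity check mirroring the --abs-tol / --mean-tol flags.
def parity_ok(fp32_preds, int8_preds, abs_tol=0.06, mean_tol=0.02):
    diffs = [abs(a - b) for a, b in zip(fp32_preds, int8_preds)]
    return max(diffs) <= abs_tol and sum(diffs) / len(diffs) <= mean_tol

fp32 = [0.91, 0.12, 0.50, 0.77]
int8 = [0.90, 0.13, 0.47, 0.79]
print(parity_ok(fp32, int8))  # True: max diff 0.03, mean diff 0.0175
```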
Quantized model is slower than original
Cause: Overhead from dequantization dominates savings (small model or tree ensemble).
Solution: Use non-quantized model. Quantization benefits require sufficient compute-bound operations.
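To confirm which model is actually faster, a generic timing harness can wrap each model’s predict call. The stand-in workloads below are placeholders for the real FP32 and INT8 session runs:

```python
import time

# Mean wall-clock latency of a zero-argument callable, in milliseconds.
def mean_latency_ms(fn, warmup=3, iters=50):
    for _ in range(warmup):  # discard cold-start iterations
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

# Placeholder workloads; substitute the fp32 and int8 predict calls.
light = lambda: sum(range(1_000))
heavy = lambda: sum(range(50_000))
print(mean_latency_ms(light), mean_latency_ms(heavy))
```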
Advanced Configuration
The script uses minimal configuration for simplicity. For advanced use cases, modify quantize_onnx.py directly.
References
Next Steps
- Parity Validation: Validate quantized model accuracy
- CPU Inference: Benchmark quantized model performance