Overview
Quantization converts high-precision floating-point numbers to lower-precision representations:
- FP32 → FP16: Reduces memory by 50%, speeds up inference on hardware with FP16 support
- FP32 → INT8: Reduces memory by 75%, provides 2-4× speedup on CPU and edge accelerators
Supported Precisions
| Precision | Bytes per Weight | Memory Ratio | Typical Speedup | Accuracy Impact |
|---|---|---|---|---|
| FP32 | 4 | 1.0× | 1.0× | Baseline |
| FP16 | 2 | 0.5× | 1.5-2× | < 0.1% |
| INT8 | 1 | 0.25× | 2-4× | 0.5-2% |
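The memory ratios in the table follow directly from the bytes per weight. A quick pure-Python check (the 1.2M parameter count is an illustrative assumption, not a model from this framework):

```python
# Estimate weight storage at each precision from the parameter count.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}

def model_size_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in MB for a given precision."""
    return num_params * BYTES_PER_WEIGHT[precision] / (1024 ** 2)

num_params = 1_200_000  # hypothetical model
for p in ("fp32", "fp16", "int8"):
    print(f"{p}: {model_size_mb(num_params, p):.2f} MB")
# fp32: 4.58 MB, fp16: 2.29 MB, int8: 1.14 MB
```

Note this only counts weights; INT8 models also store per-tensor scales and zero points, which is why real-world reductions (e.g. 73% rather than 75%) fall slightly short of the ideal ratio.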
FP16 Quantization
FP16 (half precision) quantization is the simplest form of quantization, converting all model parameters and activations from 32-bit to 16-bit floating-point format.
Function Signature
src/edge_opt/quantization.py:12-14
Usage
FP16 Implementation Details
The `to_fp16` function performs three operations:
- Deep copy: Creates an independent copy to avoid modifying the original model
- Half precision conversion: Calls `.half()` to convert all parameters to `torch.float16`
- Eval mode: Sets the model to evaluation mode with `.eval()`
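A minimal sketch of these three steps (the actual implementation lives in src/edge_opt/quantization.py; this version only mirrors the operations described above):

```python
import copy

import torch

def to_fp16_sketch(model: torch.nn.Module) -> torch.nn.Module:
    """Illustrative FP16 conversion: deep copy, cast to half, eval mode."""
    fp16_model = copy.deepcopy(model)  # leave the original model untouched
    fp16_model.half()                  # cast parameters/buffers to torch.float16
    fp16_model.eval()                  # evaluation mode (dropout off, BN frozen)
    return fp16_model

fp16 = to_fp16_sketch(torch.nn.Linear(4, 2))
print(next(fp16.parameters()).dtype)  # torch.float16
```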
FP16 models require FP16 input tensors. The framework handles this automatically when you specify `precision='fp16'` in evaluation functions.
INT8 Quantization
INT8 quantization uses PyTorch's FX Graph Mode Quantization to convert models to 8-bit integer precision. This requires a calibration process to determine optimal quantization parameters.
Function Signature
src/edge_opt/quantization.py:17-30
Calibration Process
INT8 quantization requires calibration to collect activation statistics:
Prepare Model for Quantization
Insert observer modules to track activation ranges during calibration.
Usage
INT8 Implementation Details
The quantization process in src/edge_opt/quantization.py:17-30 follows these steps:
1. Prepare Model
- Creates a copy in eval mode
- Uses “fbgemm” backend (optimized for x86 CPUs)
2. Insert Observers
- `prepare_fx`: Inserts observer modules to track activation ranges
- Requires example inputs to trace the model graph
3. Calibration Loop
- Runs forward passes to collect statistics
- Number of batches controlled by the `calibration_batches` parameter
- From configs/default.yaml: `calibration_batches: 8`
4. Convert to INT8
- Replaces FP32 ops with INT8 equivalents
- Weights and activations are now quantized
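The observe-then-convert flow above rests on affine quantization: calibration records the activation range, from which a scale and zero point map floats onto the integer grid. A pure-Python sketch of that math, independent of the PyTorch API (all names and values here are illustrative):

```python
# Sketch of calibration-based affine quantization (asymmetric, 0..255 range).
def calibrate(batches):
    """Observer step: track min/max activation values across batches."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    return lo, hi

def qparams(lo, hi, qmin=0, qmax=255):
    """Convert step: derive scale and zero point from the observed range."""
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zp, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale) + zp))

def dequantize(q, scale, zp):
    return (q - zp) * scale

batches = [[-1.0, 0.5], [0.2, 3.0]]       # stand-in activation batches
scale, zp = qparams(*calibrate(batches))
x_hat = dequantize(quantize(1.0, scale, zp), scale, zp)
print(round(x_hat, 3))  # 1.004 — recovered within one quantization step
```

This makes the calibration requirement concrete: if the calibration batches miss the true activation range, values outside it get clipped, which is the main source of INT8 accuracy loss.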
The “fbgemm” backend is optimized for x86 CPUs. For ARM devices (Raspberry Pi, mobile), PyTorch will automatically select appropriate kernels, but performance varies by device.
Configuration Options
From configs/default.yaml:
List of precisions to evaluate during optimization sweeps.
Number of batches to use for INT8 calibration. More batches improve quantization quality but increase calibration time.
Recommended values:
- Fast experimentation: 4-8 batches
- Production: 16-32 batches
- High accuracy needs: 50-100 batches
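A fragment of configs/default.yaml with these options might look like the following sketch; only `calibration_batches: 8` is confirmed by this guide, and the other key names are illustrative assumptions:

```yaml
# Hypothetical layout of the quantization-related options.
quantization:
  precisions: [fp32, fp16, int8]  # precisions evaluated during sweeps
  calibration_batches: 8          # batches used for INT8 calibration
```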
Performance Comparison
Memory Usage
SmallCNN (16/32 channels, Fashion-MNIST)
- FP32: 2.85 MB
- FP16: 1.43 MB (50% reduction)
- INT8: 0.78 MB (73% reduction)
With pruning applied:
- FP32: 0.91 MB
- FP16: 0.46 MB
- INT8: 0.24 MB (92% total reduction)
Inference Latency
Raspberry Pi 4 (CPU inference)
- FP32: 12.5 ms
- FP16: 8.3 ms (1.5× speedup)
- INT8: 4.2 ms (3× speedup)
With pruning applied:
- FP32: 4.8 ms
- FP16: 3.1 ms
- INT8: 1.6 ms (7.8× total speedup)
Accuracy Impact
Fashion-MNIST SmallCNN (baseline: 89.0%)
Key insight: Quantization and pruning effects are roughly additive.
| Configuration | Accuracy | Δ from FP32 |
|---|---|---|
| FP32, no pruning | 89.0% | - |
| FP16, no pruning | 88.9% | -0.1% |
| INT8, no pruning | 88.4% | -0.6% |
| FP32, 0.5 pruning | 87.2% | -1.8% |
| FP16, 0.5 pruning | 87.1% | -1.9% |
| INT8, 0.5 pruning | 86.3% | -2.7% |
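The "roughly additive" claim can be checked directly against the table: the measured drop for INT8 + 0.5 pruning is close to the sum of the two individual drops.

```python
# Accuracy deltas from the table above (percentage points vs. FP32, no pruning).
delta_int8 = -0.6      # INT8 alone
delta_prune = -1.8     # 0.5 pruning alone
delta_combined = -2.7  # INT8 + 0.5 pruning, measured

predicted = round(delta_int8 + delta_prune, 1)  # additivity prediction
print(predicted, delta_combined)  # -2.4 -2.7: additive to within ~0.3 points
```

The extra ~0.3 points suggests a small interaction effect: quantization noise hurts slightly more once pruning has removed redundant capacity.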
Calibration Best Practices
Choosing Calibration Data
Calibration Batch Count
Diminishing returns after ~32 batches for most models. More batches primarily helps with very diverse input distributions or models sensitive to initialization.
Combining Pruning and Quantization
Troubleshooting
RuntimeError: Only FBGEMM is supported
This error occurs when INT8 quantization is attempted with an unsupported backend.
Solution: The framework uses the “fbgemm” backend by default, which works on x86 CPUs. For ARM devices, ensure you have a compatible PyTorch build.
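One way to check what your build supports, and to fall back to an available engine before quantizing (a sketch; which engines appear depends entirely on your PyTorch build):

```python
import torch

# 'fbgemm'/'x86' target x86 servers; 'qnnpack' targets ARM/mobile.
supported = torch.backends.quantized.supported_engines
preferred = [e for e in ("qnnpack", "fbgemm") if e in supported]
if preferred:
    torch.backends.quantized.engine = preferred[0]  # select before convert_fx
print(supported, torch.backends.quantized.engine)
```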
INT8 Model Slower Than FP32
INT8 quantization speedup depends on hardware support and model size.
Reasons for no speedup:
- Model is too small (overhead dominates)
- CPU doesn’t have AVX-512 VNNI (x86) or NEON dotprod (ARM)
- Missing optimized kernels for your operations
Large Accuracy Drop After INT8
Accuracy loss > 3% usually indicates calibration issues.
Solutions:
- Increase `calibration_batches` (try 32-64)
- Ensure calibration data is representative
- Try quantization-aware training (not currently supported)
- Use FP16 as a lower-impact alternative
Next Steps
- Benchmark your quantized models: See the Benchmarking guide
- Deploy to edge devices: Export optimized models for production
- Tune calibration: Experiment with different calibration strategies
- Combine techniques: Stack pruning and quantization for maximum efficiency