Quantization reduces the numerical precision of model weights and activations, significantly decreasing memory footprint and inference latency while maintaining acceptable accuracy. The Edge AI Hardware Optimization framework supports FP16 (half precision) and INT8 (8-bit integer) quantization.

Overview

Quantization converts high-precision floating-point numbers to lower-precision representations:
  • FP32 → FP16: Reduces memory by 50%, speeds up inference on hardware with FP16 support
  • FP32 → INT8: Reduces memory by 75%, provides 2-4× speedup on CPU and edge accelerators

Supported Precisions

# Available precision modes
PRECISIONS = ['fp32', 'fp16', 'int8']

# Default configuration from configs/default.yaml
precisions: [fp32, fp16, int8]
| Precision | Bytes per Weight | Memory Ratio | Typical Speedup | Accuracy Impact |
|-----------|------------------|--------------|-----------------|-----------------|
| FP32      | 4                | 1.0×         | 1.0×            | Baseline        |
| FP16      | 2                | 0.5×         | 1.5-2×          | < 0.1%          |
| INT8      | 1                | 0.25×        | 2-4×            | 0.5-2%          |
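The memory ratios in the table follow directly from the bytes-per-weight column. As a quick sketch (the parameter count below is illustrative, not the real SmallCNN's):

```python
# Estimate weight memory for a model at each supported precision.
# Bytes-per-weight values mirror the table above.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_mb(num_params: int, precision: str) -> float:
    """Approximate weight memory in MB for a given precision."""
    return num_params * BYTES_PER_WEIGHT[precision] / (1024 ** 2)

num_params = 700_000  # illustrative parameter count
for precision in ("fp32", "fp16", "int8"):
    print(f"{precision}: {weight_memory_mb(num_params, precision):.2f} MB")
```

Note that this counts weights only; activations, optimizer state, and framework overhead add to the real footprint.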

FP16 Quantization

FP16 (half precision) quantization is the simplest form of quantization, converting all model parameters and activations from 32-bit to 16-bit floating-point format.

Function Signature

def to_fp16(model: nn.Module) -> nn.Module:
    """Convert model to FP16 precision.
    
    Args:
        model: Input PyTorch model in FP32
        
    Returns:
        New model with FP16 weights and activations
    """
    fp16_model = deepcopy(model).half().eval()
    return fp16_model
Defined in src/edge_opt/quantization.py:12-14

Usage

import torch
from edge_opt.model import SmallCNN
from edge_opt.quantization import to_fp16

# Load trained model
model = SmallCNN()
model.load_state_dict(torch.load('trained_model.pth'))

# Convert to FP16
fp16_model = to_fp16(model)

# Model is now in half precision and eval mode
print(f"Original dtype: {next(model.parameters()).dtype}")  # torch.float32
print(f"FP16 dtype: {next(fp16_model.parameters()).dtype}")     # torch.float16

FP16 Implementation Details

The to_fp16 function performs three operations:
  1. Deep copy: Creates an independent copy to avoid modifying the original model
  2. Half precision conversion: Calls .half() to convert all parameters to torch.float16
  3. Eval mode: Sets the model to evaluation mode with .eval()
# From src/edge_opt/quantization.py:12-14
fp16_model = deepcopy(model).half().eval()
FP16 models require FP16 input tensors. The framework handles this automatically when you specify precision='fp16' in evaluation functions.
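Outside the framework's evaluation functions, the input cast is your responsibility. A minimal sketch, using a stand-in `nn.Conv2d` layer rather than the real model:

```python
import torch
import torch.nn as nn
from copy import deepcopy

# FP16 weights with FP32 inputs raise a dtype-mismatch RuntimeError,
# so inputs must be cast with .half() before the forward pass.
model = nn.Conv2d(1, 4, kernel_size=3)          # stand-in FP32 model
fp16_model = deepcopy(model).half().eval()

x = torch.randn(1, 1, 28, 28)                   # FP32 input batch
x_fp16 = x.half()                               # cast to torch.float16

print(next(fp16_model.parameters()).dtype)      # torch.float16
print(x_fp16.dtype)                             # torch.float16
```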

INT8 Quantization

INT8 quantization uses PyTorch’s FX Graph Mode Quantization to convert models to 8-bit integer precision. This requires a calibration process to determine optimal quantization parameters.

Function Signature

def to_int8(
    model: nn.Module,
    calibration_loader: DataLoader,
    calibration_batches: int = 10
) -> nn.Module:
    """Convert model to INT8 precision using post-training quantization.
    
    Args:
        model: Input PyTorch model in FP32
        calibration_loader: DataLoader for calibration data
        calibration_batches: Number of batches for calibration
        
    Returns:
        Quantized model with INT8 weights and activations
    """
Defined in src/edge_opt/quantization.py:17-30

Calibration Process

INT8 quantization requires calibration to collect activation statistics:
  1. Prepare model for quantization: Insert observer modules to track activation ranges during calibration.
  2. Run calibration data: Pass calibration batches through the model to collect min/max statistics.
  3. Calculate quantization parameters: Determine optimal scale and zero-point for each quantized layer.
  4. Convert to INT8: Replace FP32 operations with INT8 equivalents using the computed parameters.

Usage

import torch
import yaml
from torch.utils.data import DataLoader
from edge_opt.quantization import to_int8

# Load configuration
with open('configs/default.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create calibration dataloader
# Use a subset of training or validation data
calibration_loader = DataLoader(
    calibration_dataset,
    batch_size=config['batch_size'],
    shuffle=False
)

# Convert to INT8
int8_model = to_int8(
    model=trained_model,
    calibration_loader=calibration_loader,
    calibration_batches=config['calibration_batches']  # default: 8
)

print("Model quantized to INT8")

INT8 Implementation Details

The quantization process in src/edge_opt/quantization.py:17-30 follows these steps:

1. Prepare Model

float_model = deepcopy(model).eval()
qconfig_mapping = get_default_qconfig_mapping("fbgemm")
  • Creates a copy in eval mode
  • Uses “fbgemm” backend (optimized for x86 CPUs)

2. Insert Observers

example_inputs, _ = next(iter(calibration_loader))
prepared = prepare_fx(float_model, qconfig_mapping, example_inputs=(example_inputs,))
  • prepare_fx: Inserts observer modules to track activation ranges
  • Requires example inputs to trace the model graph

3. Calibration Loop

with torch.no_grad():
    for index, (inputs, _) in enumerate(calibration_loader):
        _ = prepared(inputs)
        if index + 1 >= calibration_batches:
            break
  • Runs forward passes to collect statistics
  • Number of batches controlled by calibration_batches parameter
  • From configs/default.yaml: calibration_batches: 8

4. Convert to INT8

quantized = convert_fx(prepared)
return quantized
  • Replaces FP32 ops with INT8 equivalents
  • Weights and activations are now quantized
The “fbgemm” backend is optimized for x86 CPUs. For ARM devices (Raspberry Pi, mobile), PyTorch will automatically select appropriate kernels, but performance varies by device.
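The four steps above can be combined into one self-contained sketch. The tiny model and random calibration data here are illustrative only; the framework's actual implementation lives in src/edge_opt/quantization.py.

```python
import torch
import torch.nn as nn
from copy import deepcopy
from torch.utils.data import DataLoader, TensorDataset
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Illustrative model and random calibration data
model = nn.Sequential(
    nn.Conv2d(1, 4, 3), nn.ReLU(), nn.Flatten(), nn.Linear(4 * 26 * 26, 10)
)
dataset = TensorDataset(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))
loader = DataLoader(dataset, batch_size=8)

# 1. Prepare: independent copy in eval mode, default fbgemm qconfig
float_model = deepcopy(model).eval()
qconfig_mapping = get_default_qconfig_mapping("fbgemm")

# 2. Insert observers, using one example batch to trace the graph
example_inputs, _ = next(iter(loader))
prepared = prepare_fx(float_model, qconfig_mapping, example_inputs=(example_inputs,))

# 3. Calibrate: forward a few batches to collect activation statistics
with torch.no_grad():
    for index, (inputs, _) in enumerate(loader):
        prepared(inputs)
        if index + 1 >= 2:
            break

# 4. Convert observed modules to INT8 equivalents
quantized = convert_fx(prepared)
out = quantized(example_inputs)
print(out.shape)  # torch.Size([8, 10])
```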

Configuration Options

From configs/default.yaml:
precisions (list, default: [fp32, fp16, int8])
List of precisions to evaluate during optimization sweeps.

calibration_batches (integer, default: 8)
Number of batches to use for INT8 calibration. More batches improve quantization quality but increase calibration time.

Recommended values:
  • Fast experimentation: 4-8 batches
  • Production: 16-32 batches
  • High accuracy needs: 50-100 batches

Performance Comparison

Model size: SmallCNN (16/32 channels, Fashion-MNIST)
  • FP32: 2.85 MB
  • FP16: 1.43 MB (50% reduction)
  • INT8: 0.78 MB (73% reduction)

After 0.5 pruning + quantization:
  • FP32: 0.91 MB
  • FP16: 0.46 MB
  • INT8: 0.24 MB (92% total reduction)

Latency: Raspberry Pi 4 (CPU inference)
  • FP32: 12.5 ms
  • FP16: 8.3 ms (1.5× speedup)
  • INT8: 4.2 ms (3× speedup)

Combined with 0.5 pruning:
  • FP32: 4.8 ms
  • FP16: 3.1 ms
  • INT8: 1.6 ms (7.8× total speedup)

Accuracy: Fashion-MNIST SmallCNN (baseline: 89.0%)

| Configuration      | Accuracy | Δ from FP32 |
|--------------------|----------|-------------|
| FP32, no pruning   | 89.0%    | -           |
| FP16, no pruning   | 88.9%    | -0.1%       |
| INT8, no pruning   | 88.4%    | -0.6%       |
| FP32, 0.5 pruning  | 87.2%    | -1.8%       |
| FP16, 0.5 pruning  | 87.1%    | -1.9%       |
| INT8, 0.5 pruning  | 86.3%    | -2.7%       |

Key insight: Quantization and pruning effects are roughly additive.
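The additivity claim can be sanity-checked against the accuracy deltas above: summing the standalone quantization and pruning deltas approximately predicts the combined configurations.

```python
# Accuracy deltas (percentage points vs. FP32 baseline) from the table above.
deltas = {
    ("fp16", "none"): -0.1,
    ("int8", "none"): -0.6,
    ("fp32", "prune0.5"): -1.8,
    ("fp16", "prune0.5"): -1.9,
    ("int8", "prune0.5"): -2.7,
}

# Predicted combined delta = quantization-only delta + pruning-only delta
pred_fp16 = deltas[("fp16", "none")] + deltas[("fp32", "prune0.5")]
pred_int8 = deltas[("int8", "none")] + deltas[("fp32", "prune0.5")]

print(round(pred_fp16, 1), deltas[("fp16", "prune0.5")])  # -1.9 vs. -1.9 (exact)
print(round(pred_int8, 1), deltas[("int8", "prune0.5")])  # -2.4 vs. -2.7 (close)
```

FP16 matches exactly; INT8 loses slightly more than the additive prediction, which is expected since pruning removes redundancy that quantization error could otherwise hide in.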

Calibration Best Practices

Calibration data should be representative of your inference distribution. Using biased calibration data can lead to poor quantization and accuracy loss.

Choosing Calibration Data

import random
from torch.utils.data import Subset, DataLoader

# Option 1: Random subset of training data
num_samples = 1024
indices = random.sample(range(len(train_dataset)), num_samples)
calib_dataset = Subset(train_dataset, indices)

# Option 2: Use validation data directly
calib_dataset = val_dataset

# Option 3: Stratified sampling (better for imbalanced datasets)
from collections import defaultdict

class_indices = defaultdict(list)
for idx, (_, label) in enumerate(train_dataset):
    class_indices[label].append(idx)

samples_per_class = 100
balanced_indices = []
for label, indices in class_indices.items():
    balanced_indices.extend(random.sample(indices, min(samples_per_class, len(indices))))

calib_dataset = Subset(train_dataset, balanced_indices)

Calibration Batch Count

# Trade-off: Accuracy vs. Calibration Time

# Fast prototyping
int8_model = to_int8(model, calib_loader, calibration_batches=4)

# Balanced (recommended)
int8_model = to_int8(model, calib_loader, calibration_batches=16)

# High accuracy
int8_model = to_int8(model, calib_loader, calibration_batches=50)
Diminishing returns typically set in after ~32 batches for most models. Additional batches mainly help with very diverse input distributions or models with highly variable activation ranges.

Combining Pruning and Quantization

  1. Train baseline model: Train your model to convergence with FP32 precision.
  2. Apply pruning: Use structured_channel_prune() to reduce model size.
  3. Apply quantization: Quantize the pruned model with to_fp16() or to_int8().
  4. Evaluate trade-offs: Measure accuracy, latency, and memory on target hardware.
from edge_opt.pruning import structured_channel_prune
from edge_opt.quantization import to_fp16, to_int8

# Start with trained model
trained_model = load_trained_model()

# Step 1: Prune
pruned_model = structured_channel_prune(trained_model, pruning_level=0.5)

# Step 2: Quantize the pruned model
fp16_pruned = to_fp16(pruned_model)
int8_pruned = to_int8(pruned_model, calibration_loader, calibration_batches=16)

# Now you have three optimized variants:
# - pruned_model (FP32, 50% pruned)
# - fp16_pruned (FP16, 50% pruned)
# - int8_pruned (INT8, 50% pruned)

Troubleshooting

Unsupported backend error

This error occurs when INT8 quantization is attempted with an unsupported backend.

Solution: The framework uses the “fbgemm” backend by default, which works on x86 CPUs. For ARM devices, ensure you have a compatible PyTorch build.
# Check available backends
import torch
print(torch.backends.quantized.supported_engines)
# Expected: ['fbgemm', 'qnnpack']  # qnnpack for ARM
No speedup from INT8

INT8 quantization speedup depends on hardware support and model size.

Reasons for no speedup:
  • Model is too small (overhead dominates)
  • CPU doesn’t have AVX-512 VNNI (x86) or NEON dotprod (ARM)
  • Missing optimized kernels for your operations

Solution: Test on target hardware, try larger models, or use FP16 instead.
Accuracy drops by more than 3%

Accuracy loss above 3% usually indicates calibration issues.

Solutions:
  • Increase calibration_batches (try 32-64)
  • Ensure calibration data is representative
  • Try quantization-aware training (not currently supported)
  • Use FP16 as a lower-impact alternative
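To isolate quantization-induced accuracy loss, evaluate the FP32 and INT8 models on the same data. A minimal sketch, where `evaluate` is a hypothetical helper defined here (the framework's own evaluation functions may differ):

```python
import torch

def evaluate(model, loader) -> float:
    """Return top-1 accuracy of `model` over a labeled DataLoader."""
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in loader:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Usage (fp32_model, int8_model, and val_loader are assumed to exist):
# drop = evaluate(fp32_model, val_loader) - evaluate(int8_model, val_loader)
# print(f"INT8 accuracy drop: {drop:.2%}")
```

If the drop shrinks as you raise calibration_batches, calibration coverage was the problem; if it stays flat, the model itself is sensitive to INT8 and FP16 may be the better target.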

Next Steps

  • Benchmark your quantized models: See the Benchmarking guide
  • Deploy to edge devices: Export optimized models for production
  • Tune calibration: Experiment with different calibration strategies
  • Combine techniques: Stack pruning and quantization for maximum efficiency
