Overview
Quantization converts high-precision floating-point numbers to lower-precision representations:
- FP32 → FP16: Reduces memory by 50%, speeds up inference on hardware with FP16 support
- FP32 → INT8: Reduces memory by 75%, provides 2-4× speedup on CPU and edge accelerators
Supported Precisions
| Precision | Bytes per Weight | Memory Ratio | Typical Speedup | Accuracy Impact |
|---|---|---|---|---|
| FP32 | 4 | 1.0× | 1.0× | Baseline |
| FP16 | 2 | 0.5× | 1.5-2× | < 0.1% |
| INT8 | 1 | 0.25× | 2-4× | 0.5-2% |
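The memory ratios in the table follow directly from the bytes per weight. A quick pure-Python check (the 1.2M parameter count is an illustrative assumption, not a model from this framework):

```python
# Estimate weight storage at each precision from the parameter count.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}

def model_size_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in MB for a given precision."""
    return num_params * BYTES_PER_WEIGHT[precision] / (1024 ** 2)

num_params = 1_200_000  # hypothetical model
for p in ("fp32", "fp16", "int8"):
    print(f"{p}: {model_size_mb(num_params, p):.2f} MB")
# fp32: 4.58 MB, fp16: 2.29 MB, int8: 1.14 MB
```

Note this only counts weights; INT8 models also store per-tensor scales and zero points, which is why real-world reductions (e.g. 73% rather than 75%) fall slightly short of the ideal ratio.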
FP16 Quantization
FP16 (half precision) quantization is the simplest form of quantization, converting all model parameters and activations from 32-bit to 16-bit floating-point format.
Function Signature
src/edge_opt/quantization.py:12-14
Usage
FP16 Implementation Details
The `to_fp16` function performs three operations:
- Deep copy: Creates an independent copy to avoid modifying the original model
- Half precision conversion: Calls `.half()` to convert all parameters to `torch.float16`
- Eval mode: Sets the model to evaluation mode with `.eval()`
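A minimal sketch of these three steps (the actual implementation lives in src/edge_opt/quantization.py; this version only mirrors the operations described above):

```python
import copy

import torch

def to_fp16_sketch(model: torch.nn.Module) -> torch.nn.Module:
    """Illustrative FP16 conversion: deep copy, cast to half, eval mode."""
    fp16_model = copy.deepcopy(model)  # leave the original model untouched
    fp16_model.half()                  # cast parameters/buffers to torch.float16
    fp16_model.eval()                  # evaluation mode (dropout off, BN frozen)
    return fp16_model

fp16 = to_fp16_sketch(torch.nn.Linear(4, 2))
print(next(fp16.parameters()).dtype)  # torch.float16
```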
FP16 models require FP16 input tensors. The framework handles this automatically when you specify `precision='fp16'` in evaluation functions.
INT8 Quantization
INT8 quantization uses PyTorch's FX Graph Mode Quantization to convert models to 8-bit integer precision. This requires a calibration process to determine optimal quantization parameters.
Function Signature
src/edge_opt/quantization.py:17-30
Calibration Process
INT8 quantization requires calibration to collect activation statistics:
Prepare Model for Quantization
Insert observer modules to track activation ranges during calibration.
Usage
INT8 Implementation Details
The quantization process in src/edge_opt/quantization.py:17-30 follows these steps:
1. Prepare Model
- Creates a copy in eval mode
- Uses “fbgemm” backend (optimized for x86 CPUs)
2. Insert Observers
- `prepare_fx`: Inserts observer modules to track activation ranges
- Requires example inputs to trace the model graph
3. Calibration Loop
- Runs forward passes to collect statistics
- Number of batches controlled by the `calibration_batches` parameter
- From configs/default.yaml: `calibration_batches: 8`
4. Convert to INT8
- Replaces FP32 ops with INT8 equivalents
- Weights and activations are now quantized
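The observe-then-convert flow above rests on affine quantization: calibration records the activation range, from which a scale and zero point map floats onto the integer grid. A pure-Python sketch of that math, independent of the PyTorch API (all names and values here are illustrative):

```python
# Sketch of calibration-based affine quantization (asymmetric, 0..255 range).
def calibrate(batches):
    """Observer step: track min/max activation values across batches."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    return lo, hi

def qparams(lo, hi, qmin=0, qmax=255):
    """Convert step: derive scale and zero point from the observed range."""
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zp, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale) + zp))

def dequantize(q, scale, zp):
    return (q - zp) * scale

batches = [[-1.0, 0.5], [0.2, 3.0]]       # stand-in activation batches
scale, zp = qparams(*calibrate(batches))
x_hat = dequantize(quantize(1.0, scale, zp), scale, zp)
print(round(x_hat, 3))  # 1.004 — recovered within one quantization step
```

This makes the calibration requirement concrete: if the calibration batches miss the true activation range, values outside it get clipped, which is the main source of INT8 accuracy loss.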
The “fbgemm” backend is optimized for x86 CPUs. For ARM devices (Raspberry Pi, mobile), PyTorch will automatically select appropriate kernels, but performance varies by device.
Configuration Options
From configs/default.yaml:
List of precisions to evaluate during optimization sweeps.
Number of batches to use for INT8 calibration. More batches improve quantization quality but increase calibration time.
Recommended values:
- Fast experimentation: 4-8 batches
- Production: 16-32 batches
- High accuracy needs: 50-100 batches
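A fragment of configs/default.yaml with these options might look like the following sketch; only `calibration_batches: 8` is confirmed by this guide, and the other key names are illustrative assumptions:

```yaml
# Hypothetical layout of the quantization-related options.
quantization:
  precisions: [fp32, fp16, int8]  # precisions evaluated during sweeps
  calibration_batches: 8          # batches used for INT8 calibration
```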
Performance Comparison
Memory Usage
SmallCNN (16/32 channels, Fashion-MNIST)
- FP32: 2.85 MB
- FP16: 1.43 MB (50% reduction)
- INT8: 0.78 MB (73% reduction)
With pruning applied:
- FP32: 0.91 MB
- FP16: 0.46 MB
- INT8: 0.24 MB (92% total reduction)
Inference Latency
Raspberry Pi 4 (CPU inference)
- FP32: 12.5 ms
- FP16: 8.3 ms (1.5× speedup)
- INT8: 4.2 ms (3× speedup)
With pruning applied:
- FP32: 4.8 ms
- FP16: 3.1 ms
- INT8: 1.6 ms (7.8× total speedup)
Accuracy Impact
Fashion-MNIST SmallCNN (baseline: 89.0%)
Key insight: Quantization and pruning effects are roughly additive.
| Configuration | Accuracy | Δ from FP32 |
|---|---|---|
| FP32, no pruning | 89.0% | - |
| FP16, no pruning | 88.9% | -0.1% |
| INT8, no pruning | 88.4% | -0.6% |
| FP32, 0.5 pruning | 87.2% | -1.8% |
| FP16, 0.5 pruning | 87.1% | -1.9% |
| INT8, 0.5 pruning | 86.3% | -2.7% |
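The "roughly additive" claim can be checked directly against the table: the measured drop for INT8 + 0.5 pruning is close to the sum of the two individual drops.

```python
# Accuracy deltas from the table above (percentage points vs. FP32, no pruning).
delta_int8 = -0.6      # INT8 alone
delta_prune = -1.8     # 0.5 pruning alone
delta_combined = -2.7  # INT8 + 0.5 pruning, measured

predicted = round(delta_int8 + delta_prune, 1)  # additivity prediction
print(predicted, delta_combined)  # -2.4 -2.7: additive to within ~0.3 points
```

The extra ~0.3 points suggests a small interaction effect: quantization noise hurts slightly more once pruning has removed redundant capacity.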
Calibration Best Practices
Choosing Calibration Data
Calibration Batch Count
Diminishing returns after ~32 batches for most models. More batches primarily helps with very diverse input distributions or models sensitive to initialization.
Combining Pruning and Quantization
Troubleshooting
RuntimeError: Only FBGEMM is supported
This error occurs when INT8 quantization is attempted with an unsupported backend.
Solution: The framework uses the “fbgemm” backend by default, which works on x86 CPUs. For ARM devices, ensure you have a compatible PyTorch build.
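One way to check what your build supports, and to fall back to an available engine before quantizing (a sketch; which engines appear depends entirely on your PyTorch build):

```python
import torch

# 'fbgemm'/'x86' target x86 servers; 'qnnpack' targets ARM/mobile.
supported = torch.backends.quantized.supported_engines
preferred = [e for e in ("qnnpack", "fbgemm") if e in supported]
if preferred:
    torch.backends.quantized.engine = preferred[0]  # select before convert_fx
print(supported, torch.backends.quantized.engine)
```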
INT8 Model Slower Than FP32
INT8 quantization speedup depends on hardware support and model size.
Reasons for no speedup:
- Model is too small (overhead dominates)
- CPU doesn’t have AVX-512 VNNI (x86) or NEON dotprod (ARM)
- Missing optimized kernels for your operations
Large Accuracy Drop After INT8
Accuracy loss > 3% usually indicates calibration issues.
Solutions:
- Increase `calibration_batches` (try 32-64)
- Ensure calibration data is representative
- Try quantization-aware training (not currently supported)
- Use FP16 as a lower-impact alternative
Next Steps
- Benchmark your quantized models: See the Benchmarking guide
- Deploy to edge devices: Export optimized models for production
- Tune calibration: Experiment with different calibration strategies
- Combine techniques: Stack pruning and quantization for maximum efficiency