Overview
The Edge AI Hardware Optimization framework implements two complementary optimization techniques:

- Structured Channel Pruning: Removes entire convolutional channels to reduce parameters and compute while maintaining dense tensor operations
- Precision Quantization: Converts models to lower-precision formats (FP16, INT8) to reduce memory footprint and accelerate inference
These techniques are implemented in src/edge_opt/pruning.py and src/edge_opt/quantization.py respectively.
Structured Channel Pruning
Algorithm Overview
Structured channel pruning removes whole output channels from convolutional layers based on L1-norm importance scores. Unlike unstructured weight pruning, this approach:

- Produces dense tensors compatible with standard hardware accelerators
- Reduces actual runtime memory and FLOPs (not just parameter count)
- Requires no specialized sparse kernel support
- Maintains straightforward deployment compatibility
Implementation: structured_channel_prune
The pruning implementation in src/edge_opt/pruning.py:14 removes channels from both convolutional layers while preserving connectivity:
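The function body isn't reproduced here; the following is a minimal sketch of how such a two-conv prune might look, assuming TinyCNN-style `conv1`/`conv2` attribute names, biased convolutions, and rounding when computing the keep count:

```python
import copy
import torch
import torch.nn as nn

def structured_channel_prune(model: nn.Module, pruning_level: float) -> nn.Module:
    """Sketch: prune output channels of model.conv1 and model.conv2 by L1 norm.

    pruning_level is the fraction of channels to REMOVE (0.3 keeps 70%).
    """
    pruned = copy.deepcopy(model)
    conv1, conv2 = pruned.conv1, pruned.conv2

    def keep_indices(conv):
        # L1-norm importance per output channel, summed over (C_in, H, W)
        scores = conv.weight.data.abs().sum(dim=(1, 2, 3))
        keep = max(1, int(round(conv.out_channels * (1.0 - pruning_level))))
        return torch.topk(scores, keep).indices.sort().values

    idx1, idx2 = keep_indices(conv1), keep_indices(conv2)

    new_conv1 = nn.Conv2d(conv1.in_channels, len(idx1), conv1.kernel_size,
                          stride=conv1.stride, padding=conv1.padding)
    new_conv1.weight.data = conv1.weight.data[idx1].clone()
    new_conv1.bias.data = conv1.bias.data[idx1].clone()

    # conv2's input channels must match conv1's surviving outputs
    new_conv2 = nn.Conv2d(len(idx1), len(idx2), conv2.kernel_size,
                          stride=conv2.stride, padding=conv2.padding)
    new_conv2.weight.data = conv2.weight.data[idx2][:, idx1].clone()
    new_conv2.bias.data = conv2.bias.data[idx2].clone()

    pruned.conv1, pruned.conv2 = new_conv1, new_conv2
    return pruned
```

Any layer consuming conv2's output would also need its input dimension adjusted; the sketch stops at the two convolutions the section describes.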
The pruning level is a fraction of channels to remove, not keep: pruning_level=0.3 removes 30% of channels, keeping 70%.

Channel Selection Strategy
The _topk_indices helper function selects channels to keep based on importance scores:
- L1-norm scoring: Sum of absolute weights across spatial dimensions (H, W) and input channels
- Top-k selection: Keep channels with highest L1-norm scores
- Minimum retention: Always keep at least 1 channel even with high pruning levels
- Sorted indices: Maintain channel order for deterministic behavior
Weight Transfer Process
The pruned model is initialized with weights from selected channels.

Conv1 Layer Transfer
Copy weights and biases for selected conv1 output channels. Shape transformation:

[C_out_original, C_in, H, W] → [C_out_pruned, C_in, H, W]

Conv2 Layer Transfer

Copy weights for selected conv2 output channels AND input channels (must match conv1 outputs). Shape transformation:

[C2_out_original, C1_out_original, H, W] → [C2_out_pruned, C1_out_pruned, H, W]

Usage Example
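The original example isn't shown; a hypothetical call might look like `pruned = structured_channel_prune(model, pruning_level=0.25)`. The self-contained snippet below reproduces the underlying weight-transfer slicing on raw tensors (the 8-channel conv1 and keep count of 6 are illustrative):

```python
import torch

# conv1 has 8 output channels; conv2 consumes those 8 as input channels
conv1_w = torch.randn(8, 3, 3, 3)    # [C1_out, C_in, H, W]
conv2_w = torch.randn(16, 8, 3, 3)   # [C2_out, C1_out, H, W]

# pruning_level=0.25: keep the 6 conv1 channels with the highest L1 norm
scores = conv1_w.abs().sum(dim=(1, 2, 3))
keep = torch.topk(scores, 6).indices.sort().values

pruned_conv1_w = conv1_w[keep]       # [6, 3, 3, 3]  (output channels pruned)
pruned_conv2_w = conv2_w[:, keep]    # [16, 6, 3, 3] (input channels matched)
```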
Precision Quantization
Supported Precision Modes
The framework supports three precision modes for model evaluation:

FP32
32-bit Floating Point

Baseline precision with no conversion overhead. Used as reference for accuracy and performance comparisons.

FP16
16-bit Floating Point

Half-precision reduces memory by 50% with minimal accuracy impact. Requires hardware FP16 support for performance gains.

INT8
8-bit Integer

Quantized precision reduces memory by 75% with potential accuracy loss. Requires calibration data for activation range estimation.
FP16 Conversion: to_fp16
The FP16 conversion is straightforward using PyTorch’s .half() method:
- Uses deepcopy to avoid modifying the original model
- Automatically converts all parameters and buffers to FP16
- Sets model to evaluation mode (disables dropout, batch norm training mode)
- Input tensors must also be converted to FP16 during inference
INT8 Quantization: to_int8
INT8 quantization uses PyTorch’s FX graph mode quantization with post-training static quantization (PTQ):
Quantization Workflow
Model Preparation
Create a deep copy in evaluation mode and configure the fbgemm quantization backend for x86 CPU targets. For ARM targets, use the "qnnpack" backend instead of "fbgemm".

Observer Insertion

Insert activation observers using FX graph mode preparation. Observers track min/max ranges during calibration to determine quantization scale and zero-point.
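The full workflow can be sketched end to end with PyTorch's FX quantization APIs; the `to_int8` signature and the calibration-loop details here are assumptions:

```python
import copy
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

def to_int8(model, calibration_batches, backend="fbgemm"):
    """Sketch: post-training static INT8 quantization via FX graph mode."""
    m = copy.deepcopy(model).eval()
    torch.backends.quantized.engine = backend                  # "qnnpack" for ARM
    qconfig_mapping = get_default_qconfig_mapping(backend)
    example_inputs = (calibration_batches[0],)
    prepared = prepare_fx(m, qconfig_mapping, example_inputs)  # insert observers
    with torch.no_grad():
        for batch in calibration_batches:                      # record min/max ranges
            prepared(batch)
    return convert_fx(prepared)  # fold observed ranges into scale / zero-point
```

The converted model still accepts and returns FP32 tensors; quantization and dequantization happen inside the graph.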
Precision Selection in Sweep
The experiments.precision_variant function handles precision conversion during sweeps:
INT8 models use metric_precision="fp32" because input tensors remain in FP32 format. The quantized model internally converts inputs to INT8.

Trade-off Analysis
The hardware.precision_tradeoff_table function aggregates sweep results by precision mode, writing precision_tradeoffs.csv with mean performance across all pruning levels for each precision mode.
Expected Trade-offs
Accuracy vs Compression
- FP32: Baseline accuracy (no degradation)
- FP16: Minimal accuracy loss (<0.1% typical for CNNs)
- INT8: 0.5-2% accuracy degradation depending on calibration quality
Memory Footprint
- FP32: 4 bytes per parameter
- FP16: 2 bytes per parameter (50% reduction)
- INT8: 1 byte per parameter (75% reduction)
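As a quick sanity check, the per-parameter sizes translate directly into weight-memory estimates; the 1.2 M parameter count below is illustrative, not from the framework:

```python
# Weight memory for a hypothetical 1.2 M-parameter model
params = 1_200_000
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}
footprint_mb = {p: params * b / 1e6 for p, b in bytes_per_param.items()}
# fp32 → 4.8 MB, fp16 → 2.4 MB (50% smaller), int8 → 1.2 MB (75% smaller)
```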
Latency and Throughput
- FP32: Baseline latency
- FP16: 1.5-2x speedup on hardware with FP16 SIMD support
- INT8: 2-4x speedup on CPUs with AVX-512 VNNI or equivalent
Acceptance Ratio
The accepted_ratio metric shows the fraction of configurations passing the active memory budget:

- Tighter budgets favor INT8 and aggressive pruning
- FP32 with low pruning typically fails strict memory constraints
- FP16 provides middle ground between accuracy and memory
Best Practices
Pruning First, Then Quantize
Apply pruning before quantization to reduce calibration overhead and improve quantization quality on the reduced parameter space.
Calibration Data Quality
Use representative calibration data matching the deployment distribution. Poor calibration leads to clipped activations and accuracy degradation.
Validate Acceptance Ratio
Check precision_tradeoffs.csv acceptance ratios to ensure sufficient candidates pass memory budgets. Low ratios indicate budget constraints are too strict.

Monitor P95 Latency
Evaluate P95 latency distributions, not just mean latency. Tail latencies often dominate real-time deployment constraints.
Next Steps
Hardware Constraints
Learn about memory budgets and constraint filtering
System Architecture
Understand the full optimization pipeline