Overview
The Edge AI Hardware Optimization framework implements structured channel pruning, which removes entire convolutional channels based on their importance scores. This approach is more hardware-friendly than unstructured pruning because it preserves dense tensor operations that are well supported by edge device accelerators.

How It Works
Structured channel pruning works in three steps:

Calculate Channel Importance
Compute importance scores for each channel by summing the absolute values of its weights across the channel dimensions.

Select Top Channels
Keep only the top-k most important channels based on the pruning level (e.g., keep 75% of channels for a pruning level of 0.25).

Create Pruned Model
Instantiate a new model with reduced channel dimensions and copy weights from the selected channels.
Function Signature
The main pruning function is defined in src/edge_opt/pruning.py:
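The signature itself is not reproduced in this page. A plausible sketch is shown below; only the pruning_level parameter and its range are documented here, so the function name and the other parameter names are assumptions:

```python
# Hypothetical sketch of the pruning entry point in src/edge_opt/pruning.py.
# Only `pruning_level` (and its [0.0, 1.0) range) is documented; the
# function name and `model` parameter are assumptions.
def prune_model(model, pruning_level: float = 0.0):
    """Return a copy of `model` with `pruning_level` of the channels
    removed from each convolutional layer, keeping the highest-importance
    channels. `pruning_level` must be in the range [0.0, 1.0)."""
    ...
```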
Pruning Levels
The pruning_level parameter controls the aggressiveness of pruning:
Fraction of channels to remove from each convolutional layer. Must be in the range [0.0, 1.0).
- 0.0: No pruning (baseline model)
- 0.25: Remove 25% of channels (mild pruning)
- 0.5: Remove 50% of channels (moderate pruning)
- 0.7: Remove 70% of channels (aggressive pruning)
- 0.9: Remove 90% of channels (extreme pruning)
Basic Usage
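The repository's original usage snippet is not included in this page. As a stand-in, here is a minimal illustration of what each documented pruning level means for a 16-channel convolutional layer (the keep-count rule, channels × (1 − pruning_level) with a minimum of 1, follows the Implementation Details section below):

```python
# How many of a 16-channel layer's channels survive at each documented
# pruning level: keep = channels * (1 - pruning_level), minimum 1.
channels = 16
for level in (0.0, 0.25, 0.5, 0.7, 0.9):
    kept = max(1, int(channels * (1 - level)))
    print(f"pruning_level={level}: keep {kept} of {channels} channels")
```

At 0.9 pruning the floor kicks in and a single channel survives, which is why the valid range excludes 1.0.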
Implementation Details
The pruning algorithm in src/edge_opt/pruning.py:14-45 works as follows:
Channel Importance Scoring
For each output channel, the absolute values of its weights are summed across:
- Input channels (dim=1)
- Kernel height (dim=2)
- Kernel width (dim=3)
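The scoring step can be illustrated without any framework. For a conv weight of shape (out, in, kh, kw), the score of an output channel is the sum of absolute weight values over the other three dimensions:

```python
def channel_importance(weight):
    """weight: nested list of shape (out, in, kh, kw).
    Returns one L1 importance score per output channel."""
    return [
        sum(abs(v) for in_ch in out_ch for row in in_ch for v in row)
        for out_ch in weight
    ]

# Two output channels, one input channel, 2x2 kernels:
w = [
    [[[1.0, -2.0], [0.5, 0.0]]],      # channel 0: 1 + 2 + 0.5 + 0 = 3.5
    [[[0.25, 0.25], [-0.25, 0.25]]],  # channel 1: 1.0
]
scores = channel_importance(w)  # [3.5, 1.0]
```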
Top-K Channel Selection
- Calculates how many channels to keep: keep = total × (1 - pruning_level)
- Selects the top-k highest-scoring channels
- Sorts indices to maintain channel order
- Ensures at least 1 channel is kept
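The selection rules above can be sketched in a few lines of plain Python:

```python
def select_channels(scores, pruning_level):
    # keep = total * (1 - pruning_level), but never fewer than 1 channel
    keep = max(1, int(len(scores) * (1 - pruning_level)))
    # indices of the top-k highest-scoring channels
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    # sort the kept indices so the original channel order is preserved
    return sorted(idx)

scores = [3.5, 0.2, 2.0, 1.1]
select_channels(scores, 0.5)  # keeps channels 0 and 2
select_channels(scores, 0.9)  # at least one channel survives: [0]
```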
Weight Transfer
- Conv1 output channels: Only kept channels are copied
- Conv2 input channels: Must match conv1 output (uses keep1)
- Conv2 output channels: Only kept channels are copied (uses keep2)
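The copy step can be illustrated framework-agnostically with nested lists standing in for tensors; in PyTorch the equivalent is the slice w[keep2][:, keep1]. The keep1/keep2 names follow the description above:

```python
# Toy conv2 weight of shape (out=4, in=3, kh=1, kw=1), as nested lists.
w2 = [[[[float(o * 10 + i)]] for i in range(3)] for o in range(4)]

keep1 = [0, 2]  # channels kept from conv1's output (conv2's inputs)
keep2 = [1, 3]  # channels kept from conv2's output

# Copy only the kept output channels (rows) and input channels (columns):
new_w2 = [[w2[o][i] for i in keep1] for o in keep2]
```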
Fully Connected Layer Adjustment
The first fully connected layer's input features are reduced to match the number of channels kept in conv2 (keep2), so only the corresponding columns of its weight matrix are copied.
Performance Impact
Memory Reduction
Pruning reduces model memory quadratically in many cases because it affects both the pruned layer and subsequent layers.

Example: SmallCNN with 0.5 pruning
- Conv1: 16 → 8 channels
- Conv2 weights: (32, 16, 3, 3) → (16, 8, 3, 3)
- Reduction: ~4× fewer parameters in conv2
Typical overall model-size reductions:
- 0.25 pruning: ~40-50% reduction
- 0.5 pruning: ~60-70% reduction
- 0.7 pruning: ~80-85% reduction
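The SmallCNN conv2 numbers above can be checked directly: both conv2's output channels and its input channels (conv1's outputs) are halved, so its parameter count drops by a factor of four:

```python
def conv_params(out_ch, in_ch, kh=3, kw=3):
    # Weight count of a conv layer with shape (out, in, kh, kw)
    return out_ch * in_ch * kh * kw

before = conv_params(32, 16)  # conv2 before pruning: 4608 weights
after = conv_params(16, 8)    # conv2 after 0.5 pruning: 1152 weights
ratio = before / after        # 4.0
```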
Latency Improvement
Latency improvements are roughly proportional to the compute reduction.

MACs (Multiply-Accumulate Operations)
- 0.25 pruning: ~44% fewer MACs
- 0.5 pruning: ~75% fewer MACs
- 0.7 pruning: ~91% fewer MACs
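The MAC figures follow from the same quadratic effect: for layers whose input and output channel counts both scale by (1 − p), compute shrinks by roughly (1 − p)². A quick check against the percentages above (the formula is the standard approximation for such layers, not taken from the repository):

```python
def mac_reduction(p):
    # Fraction of MACs removed when both the input and output channel
    # counts of a layer are scaled by (1 - p).
    return 1 - (1 - p) ** 2

[round(mac_reduction(p), 2) for p in (0.25, 0.5, 0.7)]
# -> [0.44, 0.75, 0.91]
```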
Typical speedups by platform:
- Raspberry Pi: 30-70% faster
- Mobile CPU: 25-60% faster
- GPU (less benefit): 10-40% faster
Accuracy Trade-offs
Accuracy degradation depends on model capacity and training.

Fashion-MNIST SmallCNN baseline: ~89% accuracy
- 0.0 pruning: 89.0% (baseline)
- 0.25 pruning: 88.5% (-0.5%)
- 0.5 pruning: 87.2% (-1.8%)
- 0.7 pruning: 84.1% (-4.9%)
- 0.9 pruning: 76.3% (-12.7%)
Validation and Error Handling
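The checks themselves are not reproduced in this page. A minimal sketch of the kind of validation the [0.0, 1.0) constraint implies (the function name and error messages are assumptions):

```python
def validate_pruning_level(pruning_level):
    """Hypothetical validation helper: reject non-numeric input and
    values outside [0.0, 1.0). 1.0 is excluded because at least one
    channel must always remain."""
    if not isinstance(pruning_level, (int, float)) or isinstance(pruning_level, bool):
        raise TypeError("pruning_level must be a number")
    if not 0.0 <= pruning_level < 1.0:
        raise ValueError("pruning_level must be in [0.0, 1.0)")
    return float(pruning_level)
```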
Best Practices
Always prune after training: Prune a fully trained model rather than training a pruned architecture from scratch. This preserves the learned feature representations.
Start conservative: Begin with low pruning levels (0.25-0.5) and gradually increase while monitoring accuracy on your validation set.
Combine with quantization: Pruning and quantization are complementary techniques. Apply pruning first, then quantize the pruned model for maximum compression.
Complete Example
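The repository's own example is not reproduced in this page. As a stand-in, the whole pipeline described above can be sketched end to end in PyTorch; the SmallCNN definition and the helper names are assumptions modeled on the channel sizes (16 → 32) quoted in this page:

```python
# End-to-end sketch of structured channel pruning, as described above.
# SmallCNN and the helper names are assumptions, not the repository's API.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, c1=16, c2=32):
        super().__init__()
        self.conv1 = nn.Conv2d(1, c1, 3, padding=1)
        self.conv2 = nn.Conv2d(c1, c2, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # one feature per channel
        self.fc = nn.Linear(c2, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return self.fc(self.pool(x).flatten(1))

def top_channels(conv, pruning_level):
    # L1 importance per output channel, summed over dims 1-3
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep = max(1, int(scores.numel() * (1 - pruning_level)))
    idx = torch.topk(scores, keep).indices
    return torch.sort(idx).values  # preserve original channel order

def prune_small_cnn(model, pruning_level):
    keep1 = top_channels(model.conv1, pruning_level)
    keep2 = top_channels(model.conv2, pruning_level)
    pruned = SmallCNN(c1=len(keep1), c2=len(keep2))
    with torch.no_grad():
        # Conv1: copy kept output channels
        pruned.conv1.weight.copy_(model.conv1.weight[keep1])
        pruned.conv1.bias.copy_(model.conv1.bias[keep1])
        # Conv2: kept output channels x kept input channels
        pruned.conv2.weight.copy_(model.conv2.weight[keep2][:, keep1])
        pruned.conv2.bias.copy_(model.conv2.bias[keep2])
        # FC: input features reduced to match kept conv2 channels
        pruned.fc.weight.copy_(model.fc.weight[:, keep2])
        pruned.fc.bias.copy_(model.fc.bias)
    return pruned

model = SmallCNN()                                # normally a trained model
pruned = prune_small_cnn(model, pruning_level=0.5)
out = pruned(torch.randn(2, 1, 28, 28))           # conv1: 16 -> 8, conv2: 32 -> 16
```

In a real workflow the model would be fully trained first (see Best Practices above), and the pruned model briefly fine-tuned afterwards.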
Next Steps
After pruning your model:
- Apply quantization - Further reduce memory and latency with FP16 or INT8 precision
- Benchmark performance - Measure real-world latency on target hardware
- Fine-tune if needed - Retrain the pruned model briefly to recover any accuracy loss
- Deploy to edge - Export and deploy the optimized model to your target device