Overview
The `summarize_hardware` function calculates achieved memory bandwidth by dividing total data movement (parameters + activations) by inference latency. Comparing this against your device's theoretical bandwidth reveals whether your model is bottlenecked by memory transfers or by compute.
Bandwidth utilization above 70% typically indicates memory-bound execution; values below 30% suggest compute-bound execution or inefficient memory access patterns.
Function Signature
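A sketch of the signature implied by the parameter and return descriptions below; the actual parameter names in `src/edge_opt/hardware.py` may differ.

```python
def summarize_hardware(
    layer_stats,                    # DataFrame from estimate_layerwise_stats
    latency_ms: float,              # measured inference latency in milliseconds
    memory_bandwidth_gbps: float,   # theoretical peak bandwidth in GB/s
) -> dict:
    """Summarize achieved bandwidth and compute throughput. (Sketch only.)"""
    ...
```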
Parameters
- DataFrame from `estimate_layerwise_stats` containing per-layer memory and compute estimates. Must have `parameter_bytes`, `activation_bytes`, and `macs` columns.
- Measured inference latency in milliseconds from actual benchmarking. Used to calculate achieved bandwidth.
- Theoretical peak memory bandwidth of the target device in gigabytes per second. Reference values:
- Raspberry Pi 4: ~4.3 GB/s (LPDDR4)
- NVIDIA Jetson Nano: ~25.6 GB/s
- Desktop DDR4: ~40-60 GB/s
Returns
Type: `dict[str, float]`
A dictionary containing hardware performance metrics:
- Total bytes: Sum of all parameter bytes and activation bytes across layers. Represents total data movement.
- Total MACs: Sum of multiply-accumulate operations across all layers.
- `achieved_bandwidth_gbps`: Actual bandwidth achieved during inference: `(total_bytes / latency_s) / 1e9`.
- `configured_memory_bandwidth_gbps`: The theoretical bandwidth value you provided (echoed for reference).
- Utilization: Ratio of achieved to theoretical bandwidth: `achieved_bandwidth_gbps / configured_memory_bandwidth_gbps`. Values range from 0.0 to 1.0+.
- GMACs per second: Billions of MAC operations per second: `(total_macs / latency_s) / 1e9`.

Example Usage
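A minimal, self-contained sketch of the bandwidth math, using toy per-layer numbers in place of the real `estimate_layerwise_stats` output; the returned dict's exact key names may differ in the actual implementation.

```python
# Toy per-layer estimates with the columns the function expects:
# parameter_bytes, activation_bytes, macs.
layers = [
    {"parameter_bytes": 4_000_000, "activation_bytes": 1_000_000, "macs": 50_000_000},
    {"parameter_bytes": 8_000_000, "activation_bytes": 2_000_000, "macs": 120_000_000},
]
latency_ms = 25.0   # measured inference latency
peak_gbps = 4.3     # Raspberry Pi 4 LPDDR4 reference value

total_bytes = sum(l["parameter_bytes"] + l["activation_bytes"] for l in layers)
total_macs = sum(l["macs"] for l in layers)
latency_s = latency_ms / 1e3

achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9
utilization = achieved_bandwidth_gbps / peak_gbps
gmacs_per_second = (total_macs / latency_s) / 1e9

print(f"achieved: {achieved_bandwidth_gbps:.2f} GB/s "
      f"({utilization:.0%} of peak), {gmacs_per_second:.1f} GMAC/s")
```

Here 15 MB moved in 25 ms gives 0.60 GB/s achieved, about 14% of the Pi 4's theoretical peak — a low-utilization result in the terms of the section below.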
Interpreting Results
Bandwidth Utilization Patterns
High utilization (>70%)
Memory-bound execution. Your model spends most time moving data rather than computing. Optimization strategies:
- Reduce activation memory through channel pruning
- Increase compute intensity (deeper layers, fewer memory transfers)
- Use lower precision formats (FP16, INT8) to reduce data movement
- Enable memory access optimizations (tiling, prefetching)
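The lower-precision strategy above can be sanity-checked with back-of-envelope arithmetic: halving bytes moved roughly halves memory-transfer time for a memory-bound model. The numbers here are illustrative, not measured.

```python
# Illustrative data-movement figures (not from a real profile).
fp32_bytes = 60_000_000   # total data movement at FP32
peak_gbps = 4.3           # Raspberry Pi 4 LPDDR4 reference value

fp16_bytes = fp32_bytes // 2   # FP16 halves every tensor
transfer_time_fp32_ms = fp32_bytes / (peak_gbps * 1e9) * 1e3
transfer_time_fp16_ms = fp16_bytes / (peak_gbps * 1e9) * 1e3
print(f"FP32: {transfer_time_fp32_ms:.1f} ms, FP16: {transfer_time_fp16_ms:.1f} ms")
```

This only bounds the memory-transfer component; if the model is not fully memory-bound, the end-to-end speedup will be smaller.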
Moderate utilization (30-70%)
Balanced workload. Both compute and memory contribute to latency. Consider:
- Profile individual layers to find specific bottlenecks
- Experiment with batch size adjustments
- Test different precision modes for accuracy vs speed tradeoffs
Low utilization (<30%)
Compute-bound or inefficient access. Possible causes:
- Small model with low memory requirements
- CPU compute limitations (low frequency, few cores)
- Cache thrashing or poor memory access patterns
- Framework overhead dominating execution time
Implementation Details
The function sums `parameter_bytes` and `activation_bytes` into total data movement, converts latency to seconds, and then derives achieved bandwidth, utilization, and GMACs per second using the formulas listed under Returns.

Common Scenarios
Comparing Device Candidates
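One way to compare candidates is to invert the bandwidth formula: given the model's data movement and a latency budget, compute the bandwidth each device must sustain and compare it to the theoretical peaks listed above. A sketch with illustrative numbers:

```python
# Illustrative figures: 15 MB of data movement per inference and a
# 10 ms latency budget. Device peaks are the reference values above.
total_bytes = 15_000_000
target_latency_ms = 10.0

required_gbps = (total_bytes / (target_latency_ms / 1e3)) / 1e9
devices = {"Raspberry Pi 4": 4.3, "Jetson Nano": 25.6, "Desktop DDR4": 50.0}
for name, peak in devices.items():
    headroom = peak / required_gbps
    print(f"{name}: needs {required_gbps:.2f} GB/s of {peak} GB/s peak "
          f"({headroom:.1f}x headroom)")
```

Devices with little headroom are likely to be memory-bound at that latency target before other overheads are even considered.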
Tracking Optimization Impact
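Running the summary before and after an optimization (for example, channel pruning) shows whether the bottleneck shifted. A sketch using the same formulas with hypothetical before/after numbers:

```python
# Hypothetical before/after figures for a pruning experiment.
def achieved_gbps(total_bytes: int, latency_ms: float) -> float:
    """Achieved bandwidth in GB/s from bytes moved and latency."""
    return (total_bytes / (latency_ms / 1e3)) / 1e9

baseline = achieved_gbps(15_000_000, 25.0)  # before pruning
pruned = achieved_gbps(9_000_000, 18.0)     # after channel pruning

peak = 4.3  # Raspberry Pi 4 LPDDR4 reference value
print(f"utilization: {baseline / peak:.0%} -> {pruned / peak:.0%}")
```

Falling utilization after pruning suggests the model is drifting toward compute-bound or overhead-dominated execution, so further memory-focused optimizations will yield diminishing returns.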
Pipeline Integration
In `scripts/run_pipeline.py:83-87`, hardware summaries are computed for the baseline model:
- Saved to `outputs/hardware_summary.csv`
- Merged into the deployment summary dictionary
- Included in the final JSON output
Limitations
Simplifying assumptions:
- Treats all memory accesses as sequential (ignores caching effects)
- Assumes single-threaded execution
- Does not account for framework overhead or kernel launch costs
- Parameter bytes are counted once (doesn’t model reuse across batches)
Related Functions
Layer-wise Analysis
Generate the input DataFrame for bandwidth calculations
Performance Metrics
Collect latency measurements for bandwidth estimation
Source Reference
Implementation: `src/edge_opt/hardware.py:73-91`