Overview
The Edge AI Hardware Optimization framework implements hardware-aware constraint modeling to simulate edge device limitations. The system tracks:
- Memory Budgets: SRAM-style memory constraints for model footprint validation
- Bandwidth Utilization: Estimated memory bandwidth consumption from parameter and activation transfers
- CPU Frequency Scaling: Latency adjustments to model lower-frequency edge processors
- Layer-wise Resource Analysis: Per-layer activation memory and compute (MACs) breakdown
The implementation lives in src/edge_opt/hardware.py and src/edge_opt/metrics.py.
Memory Budget Constraints
Configuration
Memory budgets are specified in the experiment configuration YAML:
- memory_budgets_mb: List of budget thresholds for violation reporting (soft constraints)
- active_memory_budget_mb: Single threshold for accept/reject classification (hard constraint)
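A minimal configuration fragment illustrating these two keys (the values here are illustrative, not the repository's defaults):

```yaml
memory_budgets_mb: [0.5, 1.0, 2.0, 5.0]   # soft constraints: violations reported per budget
active_memory_budget_mb: 1.0              # hard constraint: accept/reject filter
```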
The active memory budget acts as the primary constraint filter. Candidates exceeding this limit are marked as rejected and excluded from Pareto frontier generation.
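A sketch of how the hard-constraint filter might be applied during the sweep; the function and field names here are assumptions for illustration, not the repository's actual API:

```python
def is_accepted(candidate_memory_mb: float, active_memory_budget_mb: float) -> bool:
    """Hard constraint: candidates over the active budget are rejected."""
    return candidate_memory_mb <= active_memory_budget_mb

def filter_for_pareto(candidates: list, active_memory_budget_mb: float) -> list:
    """Keep only accepted candidates for Pareto frontier generation."""
    return [c for c in candidates
            if is_accepted(c["memory_mb"], active_memory_budget_mb)]
```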
Memory Footprint Calculation
The metrics.model_memory_mb function computes model size from the state dictionary:
- tensor.numel(): Number of elements in the tensor
- tensor.element_size(): Bytes per element (4 for FP32, 2 for FP16, 1 for INT8)
- Includes all parameters (weights, biases) and buffers (running stats, etc.)
- Does not include activation memory (see layer-wise analysis below)
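A minimal sketch of the calculation, duck-typed so it works with any state dict whose values expose numel() and element_size(), as torch tensors do (the real metrics.model_memory_mb may differ in detail):

```python
def model_memory_mb(state_dict) -> float:
    """Sum bytes over all parameters and buffers in a state dict.

    Each value must expose numel() and element_size(), as torch
    tensors do; the result is reported in MiB.
    """
    total_bytes = sum(t.numel() * t.element_size() for t in state_dict.values())
    return total_bytes / (1024 ** 2)
```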
Violation Detection
The metrics.memory_violations function checks model size against all configured budgets:
For example, with budgets_mb=[0.5, 1.0, 2.0, 5.0], the model footprint is checked against each of the four thresholds and every exceeded budget is reported as a violation.
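A sketch of the soft-constraint check (the signature is an assumption based on the description above):

```python
def memory_violations(model_mb: float, budgets_mb: list) -> dict:
    """Soft constraints: flag every budget the model footprint exceeds."""
    return {budget: model_mb > budget for budget in budgets_mb}
```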
Constraint Filtering in Sweep
During the optimization sweep, each candidate is evaluated against the active budget.
Layer-wise Resource Analysis
Estimation Framework
The hardware.estimate_layerwise_stats function computes per-layer resource consumption:
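A simplified sketch of the per-layer estimation. The layer-description schema here is hypothetical, and rows are returned as plain dicts rather than the DataFrame the real function builds:

```python
def estimate_layerwise_stats(layers, batch: int = 64, bytes_per_elem: int = 4):
    """Build one resource row per layer.

    Each entry in `layers` is a dict (hypothetical schema) giving the
    per-sample output element count, parameter count, and MACs needed
    per output element.
    """
    rows = []
    for layer in layers:
        output_elements = batch * layer["out_elements_per_sample"]
        rows.append({
            "layer": layer["name"],
            "output_elements": output_elements,
            "parameter_bytes": layer["param_count"] * bytes_per_elem,
            "activation_bytes": output_elements * bytes_per_elem,
            "macs": output_elements * layer["macs_per_output"],
        })
    return rows
```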
Output Shape Calculation
The _conv2d_output_shape helper computes spatial dimensions after convolution:
- Conv1 (3×3, pad=1): 28×28 → 28×28 → MaxPool → 14×14
- Conv2 (3×3, pad=1): 14×14 → 14×14 → MaxPool → 7×7
- Classifier: 32×7×7 = 1568 → 10
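The helper presumably applies the standard convolution output-shape formula; a sketch:

```python
def conv2d_output_shape(h: int, w: int, kernel: int = 3,
                        stride: int = 1, padding: int = 1):
    """Spatial output size of a 2D convolution (standard formula)."""
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    return out_h, out_w
```

With 3×3 kernels and padding 1, spatial size is preserved (28×28 → 28×28), and each 2×2 max-pool then halves it.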
Layer-wise Metrics
The function returns a DataFrame with per-layer breakdown:
| layer | output_elements | parameter_bytes | activation_bytes | macs |
|---|---|---|---|---|
| conv1 | batch × 16 × 28 × 28 | (1×16×3×3 + 16) × 4 | batch × 16 × 28 × 28 × 4 | batch × 16 × 28 × 28 × 1 × 9 |
| conv2 | batch × 32 × 14 × 14 | (16×32×3×3 + 32) × 4 | batch × 32 × 14 × 14 × 4 | batch × 32 × 14 × 14 × 16 × 9 |
| classifier | batch × 10 | (1568×10 + 10) × 4 | batch × 10 × 4 | batch × 1568 × 10 |
Example: Baseline Model (batch_size=64)
With SmallCNN(conv1_channels=16, conv2_channels=32) and batch_size=64:
- Conv1:
- Output elements: 64 × 16 × 28 × 28 = 802,816
- Parameter bytes: (16×3×3 + 16) × 4 = 640 bytes
- Activation bytes: 802,816 × 4 = 3,211,264 bytes ≈ 3.06 MB
- MACs: 64 × 16 × 28 × 28 × 1 × 9 = 7,225,344
- Conv2:
- Output elements: 64 × 32 × 14 × 14 = 401,408
- Parameter bytes: (16×32×3×3 + 32) × 4 = 18,560 bytes
- Activation bytes: 401,408 × 4 = 1.53 MB
- MACs: 64 × 32 × 14 × 14 × 16 × 9 = 57,802,752
- Classifier:
- Output elements: 64 × 10 = 640
- Parameter bytes: (1568×10 + 10) × 4 = 62,760 bytes
- Activation bytes: 640 × 4 = 2,560 bytes
- MACs: 64 × 1568 × 10 = 1,003,520
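The Conv2 arithmetic above can be checked in a few lines, following the per-layer formulas from the table (values taken from the baseline example):

```python
batch, in_ch, out_ch, h, w = 64, 16, 32, 14, 14
kernel = 3

output_elements = batch * out_ch * h * w                            # batch x C_out x H x W
parameter_bytes = (in_ch * out_ch * kernel * kernel + out_ch) * 4   # FP32 weights + biases
activation_bytes = output_elements * 4                              # FP32 activations
macs = output_elements * in_ch * kernel * kernel                    # per the table formula
```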
Activation memory is reported per-layer and assumes no in-place operations or memory reuse. Actual runtime peak memory depends on framework optimizations and graph execution order.
Bandwidth Utilization
Hardware Summary Metrics
The hardware.summarize_hardware function estimates memory bandwidth consumption:
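A sketch of the quantities such a summary might derive; the name, signature, and output keys are assumptions based on the metric descriptions below:

```python
def summarize_hardware(total_bytes: int, total_macs: int,
                       latency_s: float, configured_bandwidth_gbps: float) -> dict:
    """Derive bandwidth and throughput figures from layer-wise totals."""
    achieved_gbps = total_bytes / latency_s / 1e9
    return {
        "estimated_total_bytes": total_bytes,
        "achieved_bandwidth_gbps": achieved_gbps,
        "bandwidth_utilization": achieved_gbps / configured_bandwidth_gbps,
        "achieved_gmacs": total_macs / latency_s / 1e9,
    }
```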
Bandwidth Metrics Explained
Estimated Total Bytes
Sum of parameter bytes and activation bytes across all layers. Approximates total memory traffic per inference.
Achieved Bandwidth
total_bytes / latency_seconds converted to GB/s. Represents effective memory bandwidth consumed during inference.
Bandwidth Utilization
achieved_bandwidth / configured_bandwidth. Values near 1.0 indicate memory-bound operations; low values suggest compute-bound or cache-resident workloads.
Achieved GMAC/s
total_macs / latency_seconds, in billions. Indicates computational throughput. Compare with theoretical peak GFLOP/s to assess hardware efficiency.
CPU Frequency Scaling
Latency Multiplier
The cpu_frequency_scale configuration parameter models lower-frequency edge processors:
Frequency Scaling Assumptions:
- Linear latency scaling: latency_scaled = latency_base × scale
- Assumes memory-bound operations scale proportionally
- Does not model voltage scaling effects on power consumption
- Ideal for comparing relative performance at different frequencies
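The linear model above reduces to a single multiplication; a sketch (function name assumed):

```python
def scale_latency(latency_base_ms: float, cpu_frequency_scale: float) -> float:
    """Linear model: a scale of 2.0 emulates a CPU at half the host
    frequency, doubling the measured latency."""
    return latency_base_ms * cpu_frequency_scale
```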
Latency Distribution Measurement
The metrics.measure_latency_distribution function captures statistical latency behavior:
- Run warmup iterations to stabilize CPU cache and frequency scaling
- Measure num_runs iterations and compute the average
- Repeat the measurement repeats times to capture variability
- Report mean, standard deviation, and 95th percentile
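The steps above can be sketched as follows; the signature and percentile handling are assumptions, and the real metrics.measure_latency_distribution may differ:

```python
import statistics
import time

def measure_latency_distribution(fn, num_runs: int = 100,
                                 warmup: int = 10, repeats: int = 5) -> dict:
    """Timed windows: warm up, average num_runs timed calls, repeat."""
    window_means_ms = []
    for _ in range(repeats):
        for _ in range(warmup):          # stabilize caches / frequency scaling
            fn()
        start = time.perf_counter()
        for _ in range(num_runs):
            fn()
        elapsed = time.perf_counter() - start
        window_means_ms.append(elapsed / num_runs * 1e3)
    ordered = sorted(window_means_ms)
    p95_index = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return {
        "mean_ms": statistics.mean(window_means_ms),
        "std_ms": statistics.stdev(window_means_ms) if repeats > 1 else 0.0,
        "p95_ms": ordered[p95_index],
    }
```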
Default Measurement Settings
From the defaults in configs/default.yaml:
- num_runs: 100 iterations per measurement window
- warmup: 10 iterations before each window
- repeats: 5 windows
Total forward passes per measurement: (100 + 10) × 5 = 550.
Hardware Artifacts
Output Files
The hardware.save_hardware_artifacts function generates analysis reports:
layerwise_breakdown.csv
Per-layer resource consumption:
| layer | output_elements | parameter_bytes | activation_bytes | macs |
|---|---|---|---|---|
| conv1 | 802816 | 640 | 3211264 | 7225344 |
| conv2 | 401408 | 18560 | 1605632 | 57802752 |
| classifier | 640 | 62760 | 2560 | 1003520 |
precision_tradeoffs.csv
Aggregated metrics by precision mode:
| precision | accuracy_mean | latency_ms_mean | memory_mb_mean | energy_proxy_j_mean | accepted_ratio |
|---|---|---|---|---|---|
| int8 | 0.972 | 2.45 | 0.21 | 0.0049 | 1.00 |
| fp16 | 0.985 | 3.12 | 0.42 | 0.0062 | 0.75 |
| fp32 | 0.987 | 4.87 | 0.84 | 0.0097 | 0.25 |
Visualization Plots
The artifact generation includes the following plots:
Layer-wise Activation Memory
Bar chart showing activation memory (MB) per layer. Highlights memory-intensive layers for optimization targeting.
File: layerwise_activation_memory.png
Layer-wise Compute (MACs)
Bar chart showing multiply-accumulate operations (millions) per layer. Identifies compute bottlenecks.
File: layerwise_macs.png
Best Practices
Set Realistic Budgets
Configure active_memory_budget_mb to 60-80% of the true device SRAM to account for runtime overhead and activation buffers.
Monitor Acceptance Ratios
Check precision_tradeoffs.csv to ensure sufficient candidates pass budgets. Ratios below 0.5 indicate overly strict constraints.
Validate Bandwidth Estimates
Low bandwidth utilization (<0.1) suggests compute-bound or cache-resident workloads. High utilization (>0.8) indicates memory-bound operations.
Compare Relative Performance
Use hardware metrics for relative comparison across configurations. Validate absolute numbers with hardware counters on target devices.
Limitations and Caveats
What is NOT modeled
- Cache effects: L1/L2/L3 cache hits, misses, and prefetching
- Memory reuse: In-place operations and activation buffer recycling
- Kernel overhead: Operator dispatch, context switching, and synchronization
- Hardware counters: Actual memory transactions, instruction counts, and stall cycles
- Thermal throttling: Dynamic frequency scaling due to thermal limits
- Co-scheduled workloads: Host-level contention from concurrent processes
INT8 Backend Variability
The default fbgemm backend targets x86 CPUs with AVX-512 VNNI support. Performance on other platforms:
- ARM CPUs: Use the qnnpack backend (requires a config change)
- Older x86: May fall back to slower INT8 emulation
- Accelerators: GPU/NPU quantization requires different quantization APIs
Activation Memory vs Peak Memory
Layer-wise activation bytes report output footprint only, not peak runtime memory:
- Frameworks may reuse buffers across layers
- Gradient storage is not applicable (inference only)
- Temporary buffers for operator fusion are not tracked
For peak memory, use torch.cuda.max_memory_allocated() on GPU or OS-level profiling tools on CPU.
Next Steps
Model Optimization
Learn how pruning and quantization reduce memory footprint
System Architecture
Understand constraint filtering in the pipeline