
Overview

The summarize_hardware function calculates achieved memory bandwidth by dividing total data movement (parameters + activations) by inference latency. Comparing this against your device’s theoretical bandwidth reveals whether your model is bottlenecked by memory transfers or compute operations.
Bandwidth utilization above 80% typically indicates memory-bound execution. Values below 20% suggest compute-bound or inefficient memory access patterns.
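The core arithmetic can be sketched in a few lines (the numbers below are illustrative, not from a real benchmark):

```python
# Hedged sketch of the bandwidth calculation, using made-up numbers.
total_bytes = 12_500_000           # parameters + activations moved per inference
latency_ms = 5.0                   # measured inference latency
theoretical_gbps = 4.3             # e.g. Raspberry Pi 4 LPDDR4 peak

achieved_gbps = (total_bytes / (latency_ms / 1000.0)) / 1e9
utilization = achieved_gbps / theoretical_gbps

print(f"{achieved_gbps:.2f} GB/s achieved, {utilization * 100:.1f}% of peak")
```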

Function Signature

from edge_opt.hardware import summarize_hardware

hardware_summary = summarize_hardware(
    layerwise_df=layerwise_df,
    latency_ms=baseline_metrics.latency_ms,
    memory_bandwidth_gbps=cfg.memory_bandwidth_gbps
)

Parameters

layerwise_df
pd.DataFrame
required
DataFrame from estimate_layerwise_stats containing per-layer memory and compute estimates. Must have parameter_bytes, activation_bytes, and macs columns.
latency_ms
float
required
Measured inference latency in milliseconds from actual benchmarking. Used to calculate achieved bandwidth.
memory_bandwidth_gbps
float
required
Theoretical peak memory bandwidth of the target device in gigabytes per second. Reference values:
  • Raspberry Pi 4: ~4.3 GB/s (LPDDR4)
  • NVIDIA Jetson Nano: ~25.6 GB/s
  • Desktop DDR4: ~40-60 GB/s

Returns

Type: dict[str, float]

A dictionary containing hardware performance metrics:
estimated_total_bytes
float
Sum of all parameter bytes and activation bytes across layers. Represents total data movement.
estimated_total_macs
float
Sum of multiply-accumulate operations across all layers.
achieved_bandwidth_gbps
float
Actual bandwidth achieved during inference: (total_bytes / latency_s) / 1e9
configured_memory_bandwidth_gbps
float
The theoretical bandwidth value you provided (echoed for reference).
bandwidth_utilization
float
Ratio of achieved to theoretical bandwidth: achieved_bandwidth_gbps / configured_memory_bandwidth_gbps. Values above 1.0 mean the estimated data movement exceeds what the configured peak could deliver in the measured time, usually a sign that the bandwidth figure or the byte estimate is inaccurate.
achieved_gmacs
float
Billions of MAC operations per second: (total_macs / latency_s) / 1e9
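An illustrative return value, with the keys from the table above (the numbers are invented to match the sample output later on this page):

```python
# Hypothetical return value of summarize_hardware; values are made up.
summary = {
    "estimated_total_bytes": 9.1e6,
    "estimated_total_macs": 1.7e7,
    "achieved_bandwidth_gbps": 1.82,
    "configured_memory_bandwidth_gbps": 4.3,
    "bandwidth_utilization": 1.82 / 4.3,   # ~0.423
    "achieved_gmacs": 3.47,
}
```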

Example Usage

import torch
from edge_opt.model import SmallCNN
from edge_opt.hardware import estimate_layerwise_stats, summarize_hardware
from edge_opt.metrics import collect_metrics
from torch.utils.data import DataLoader, TensorDataset

# Setup
model = SmallCNN()
device = torch.device("cpu")
dummy_data = TensorDataset(
    torch.randn(100, 1, 28, 28),
    torch.randint(0, 10, (100,))
)
val_loader = DataLoader(dummy_data, batch_size=32)

# Collect metrics
metrics = collect_metrics(
    model, val_loader, device,
    power_watts=2.0,
    precision="fp32",
    latency_multiplier=1.0,
    benchmark_repeats=5
)

# Analyze hardware characteristics
layerwise_df = estimate_layerwise_stats(model, batch_size=32)
summary = summarize_hardware(
    layerwise_df,
    latency_ms=metrics.latency_ms,
    memory_bandwidth_gbps=4.3  # Raspberry Pi 4
)

print(f"Achieved bandwidth: {summary['achieved_bandwidth_gbps']:.2f} GB/s")
print(f"Utilization: {summary['bandwidth_utilization']*100:.1f}%")
print(f"Compute throughput: {summary['achieved_gmacs']:.2f} GMAC/s")
Sample output:
Achieved bandwidth: 1.82 GB/s
Utilization: 42.3%
Compute throughput: 3.47 GMAC/s

Interpreting Results

Bandwidth Utilization Patterns

Above 80%: memory-bound execution. Your model spends most of its time moving data rather than computing. Optimization strategies:
  • Reduce activation memory through channel pruning
  • Increase compute intensity (deeper layers, fewer memory transfers)
  • Use lower precision formats (FP16, INT8) to reduce data movement
  • Enable memory access optimizations (tiling, prefetching)
Between 20% and 80%: balanced workload. Both compute and memory contribute to latency. Consider:
  • Profile individual layers to find specific bottlenecks
  • Experiment with batch size adjustments
  • Test different precision modes for accuracy vs speed tradeoffs
Below 20%: compute-bound or inefficient access. Possible causes:
  • Small model with low memory requirements
  • CPU compute limitations (low frequency, few cores)
  • Cache thrashing or poor memory access patterns
  • Framework overhead dominating execution time
Consider model quantization or hardware acceleration.
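These rules of thumb can be captured in a small helper (hypothetical; not part of edge_opt):

```python
def classify_utilization(utilization: float) -> str:
    """Map bandwidth utilization to the rough regimes described above."""
    if utilization > 0.8:
        return "memory-bound"
    if utilization < 0.2:
        return "compute-bound or inefficient access"
    return "balanced"

# The sample run earlier on this page (42.3%) falls in the balanced regime.
print(classify_utilization(0.423))
```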

Implementation Details

The function performs these calculations:
# From src/edge_opt/hardware.py:78-91
total_bytes = float(
    layerwise_df["parameter_bytes"].sum() + 
    layerwise_df["activation_bytes"].sum()
)
total_macs = float(layerwise_df["macs"].sum())
latency_s = max(latency_ms / 1000.0, 1e-9)  # Prevent division by zero

achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9
bandwidth_utilization = achieved_bandwidth_gbps / max(memory_bandwidth_gbps, 1e-9)
achieved_gmacs = (total_macs / latency_s) / 1e9
The 1e-9 clamps prevent division by zero but mean extremely fast inference (<1 microsecond) or zero-bandwidth configurations will produce artificially clamped results.
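The clamping behavior is easy to demonstrate with a standalone sketch mirroring the snippet above:

```python
total_bytes = 1_000_000
latency_ms = 0.0                              # degenerate input
latency_s = max(latency_ms / 1000.0, 1e-9)    # clamped to 1 nanosecond
achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9

# 1 MB "moved" in 1 ns reads as a million GB/s: a clear sign the
# latency input was invalid, not that the model is that fast.
print(achieved_bandwidth_gbps)
```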

Common Scenarios

Comparing Device Candidates

devices = [
    {"name": "Raspberry Pi 4", "bandwidth_gbps": 4.3},
    {"name": "Jetson Nano", "bandwidth_gbps": 25.6},
    {"name": "Desktop CPU", "bandwidth_gbps": 50.0},
]

for device in devices:
    summary = summarize_hardware(
        layerwise_df,
        latency_ms=metrics.latency_ms,
        memory_bandwidth_gbps=device["bandwidth_gbps"]
    )
    print(f"{device['name']}: {summary['bandwidth_utilization']*100:.1f}% utilization")

Tracking Optimization Impact

baseline_summary = summarize_hardware(baseline_df, latency_ms=100, memory_bandwidth_gbps=4.3)
pruned_summary = summarize_hardware(pruned_df, latency_ms=75, memory_bandwidth_gbps=4.3)

print(f"Baseline: {baseline_summary['achieved_bandwidth_gbps']:.2f} GB/s")
print(f"Pruned: {pruned_summary['achieved_bandwidth_gbps']:.2f} GB/s")
print(f"Reduction: {(1 - pruned_summary['estimated_total_bytes']/baseline_summary['estimated_total_bytes'])*100:.1f}%")

Pipeline Integration

In scripts/run_pipeline.py:83-87, hardware summaries are computed for the baseline model:
hardware_summary = summarize_hardware(
    layerwise_df,
    latency_ms=baseline_metrics.latency_ms,
    memory_bandwidth_gbps=cfg.memory_bandwidth_gbps,
)
Results are:
  1. Saved to outputs/hardware_summary.csv
  2. Merged into the deployment summary dictionary
  3. Included in the final JSON output
The summary provides context for whether latency improvements should focus on memory optimization (pruning, quantization) or compute acceleration (faster CPU, hardware accelerators).
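A minimal sketch of that persistence step (assuming pandas; the CSV file name and the `SmallCNN` model name follow the pipeline description above, and the summary values are invented):

```python
import json
import pandas as pd

summary = {"achieved_bandwidth_gbps": 1.82, "bandwidth_utilization": 0.423}

# One-row CSV, mirroring outputs/hardware_summary.csv
csv_text = pd.DataFrame([summary]).to_csv(index=False)

# Merged into a larger deployment summary before JSON export
deployment = {"model": "SmallCNN", **summary}
json_text = json.dumps(deployment, indent=2)
```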

Limitations

Simplifying assumptions:
  • Treats all memory accesses as sequential (ignores caching effects)
  • Assumes single-threaded execution
  • Does not account for framework overhead or kernel launch costs
  • Parameter bytes are counted once (doesn’t model reuse across batches)
For production analysis, consider detailed profiling tools like PyTorch Profiler, NVIDIA Nsight, or ARM Streamline.
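For instance, a quick PyTorch Profiler pass gives per-operator timings that a single bandwidth number cannot (sketch with a stand-in module in place of your model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU())
x = torch.randn(1, 1, 28, 28)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Per-operator breakdown: far more detail than an aggregate estimate
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```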

Layer-wise Analysis

Generate the input DataFrame for bandwidth calculations

Performance Metrics

Collect latency measurements for bandwidth estimation

Source Reference

Implementation: src/edge_opt/hardware.py:73-91
