
Overview

The summarize_hardware function calculates achieved memory bandwidth by dividing total data movement (parameters + activations) by inference latency. Comparing this against your device’s theoretical bandwidth reveals whether your model is bottlenecked by memory transfers or compute operations.
Bandwidth utilization above 80% typically indicates memory-bound execution. Values below 20% suggest compute-bound or inefficient memory access patterns.
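The core arithmetic can be sketched in a few lines (the numbers below are illustrative, not from a real benchmark):

```python
# Hedged sketch of the bandwidth calculation, using made-up numbers.
total_bytes = 12_500_000           # parameters + activations moved per inference
latency_ms = 5.0                   # measured inference latency
theoretical_gbps = 4.3             # e.g. Raspberry Pi 4 LPDDR4 peak

achieved_gbps = (total_bytes / (latency_ms / 1000.0)) / 1e9
utilization = achieved_gbps / theoretical_gbps

print(f"{achieved_gbps:.2f} GB/s achieved, {utilization * 100:.1f}% of peak")
```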

Function Signature

from edge_opt.hardware import summarize_hardware

hardware_summary = summarize_hardware(
    layerwise_df=layerwise_df,
    latency_ms=baseline_metrics.latency_ms,
    memory_bandwidth_gbps=cfg.memory_bandwidth_gbps
)

Parameters

layerwise_df
pd.DataFrame
required
DataFrame from estimate_layerwise_stats containing per-layer memory and compute estimates. Must have parameter_bytes, activation_bytes, and macs columns.
latency_ms
float
required
Measured inference latency in milliseconds from actual benchmarking. Used to calculate achieved bandwidth.
memory_bandwidth_gbps
float
required
Theoretical peak memory bandwidth of the target device in gigabytes per second. Reference values:
  • Raspberry Pi 4: ~4.3 GB/s (LPDDR4)
  • NVIDIA Jetson Nano: ~25.6 GB/s
  • Desktop DDR4: ~40-60 GB/s

Returns

Type: dict[str, float]

A dictionary containing hardware performance metrics:
estimated_total_bytes
float
Sum of all parameter bytes and activation bytes across layers. Represents total data movement.
estimated_total_macs
float
Sum of multiply-accumulate operations across all layers.
achieved_bandwidth_gbps
float
Actual bandwidth achieved during inference: (total_bytes / latency_s) / 1e9
configured_memory_bandwidth_gbps
float
The theoretical bandwidth value you provided (echoed for reference).
bandwidth_utilization
float
Ratio of achieved to theoretical bandwidth: achieved_bandwidth_gbps / configured_memory_bandwidth_gbps. Values above 1.0 mean the estimated data movement exceeds what the configured peak could deliver in the measured time, usually a sign that the bandwidth figure or the byte estimate is inaccurate.
achieved_gmacs
float
Billions of MAC operations per second: (total_macs / latency_s) / 1e9
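An illustrative return value, with the keys from the table above (the numbers are invented to match the sample output later on this page):

```python
# Hypothetical return value of summarize_hardware; values are made up.
summary = {
    "estimated_total_bytes": 9.1e6,
    "estimated_total_macs": 1.7e7,
    "achieved_bandwidth_gbps": 1.82,
    "configured_memory_bandwidth_gbps": 4.3,
    "bandwidth_utilization": 1.82 / 4.3,   # ~0.423
    "achieved_gmacs": 3.47,
}
```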

Example Usage

import torch
from edge_opt.model import SmallCNN
from edge_opt.hardware import estimate_layerwise_stats, summarize_hardware
from edge_opt.metrics import collect_metrics
from torch.utils.data import DataLoader, TensorDataset

# Setup
model = SmallCNN()
device = torch.device("cpu")
dummy_data = TensorDataset(
    torch.randn(100, 1, 28, 28),
    torch.randint(0, 10, (100,))
)
val_loader = DataLoader(dummy_data, batch_size=32)

# Collect metrics
metrics = collect_metrics(
    model, val_loader, device,
    power_watts=2.0,
    precision="fp32",
    latency_multiplier=1.0,
    benchmark_repeats=5
)

# Analyze hardware characteristics
layerwise_df = estimate_layerwise_stats(model, batch_size=32)
summary = summarize_hardware(
    layerwise_df,
    latency_ms=metrics.latency_ms,
    memory_bandwidth_gbps=4.3  # Raspberry Pi 4
)

print(f"Achieved bandwidth: {summary['achieved_bandwidth_gbps']:.2f} GB/s")
print(f"Utilization: {summary['bandwidth_utilization']*100:.1f}%")
print(f"Compute throughput: {summary['achieved_gmacs']:.2f} GMAC/s")
Sample output:
Achieved bandwidth: 1.82 GB/s
Utilization: 42.3%
Compute throughput: 3.47 GMAC/s

Interpreting Results

Bandwidth Utilization Patterns

Above 80%: memory-bound execution. Your model spends most of its time moving data rather than computing. Optimization strategies:
  • Reduce activation memory through channel pruning
  • Increase compute intensity (deeper layers, fewer memory transfers)
  • Use lower precision formats (FP16, INT8) to reduce data movement
  • Enable memory access optimizations (tiling, prefetching)
Between 20% and 80%: balanced workload. Both compute and memory contribute to latency. Consider:
  • Profile individual layers to find specific bottlenecks
  • Experiment with batch size adjustments
  • Test different precision modes for accuracy vs speed tradeoffs
Below 20%: compute-bound or inefficient access. Possible causes:
  • Small model with low memory requirements
  • CPU compute limitations (low frequency, few cores)
  • Cache thrashing or poor memory access patterns
  • Framework overhead dominating execution time
Consider model quantization or hardware acceleration.
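These rules of thumb can be captured in a small helper (hypothetical; not part of edge_opt):

```python
def classify_utilization(utilization: float) -> str:
    """Map bandwidth utilization to the rough regimes described above."""
    if utilization > 0.8:
        return "memory-bound"
    if utilization < 0.2:
        return "compute-bound or inefficient access"
    return "balanced"

# The sample run earlier on this page (42.3%) falls in the balanced regime.
print(classify_utilization(0.423))
```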

Implementation Details

The function performs these calculations:
# From src/edge_opt/hardware.py:78-91
total_bytes = float(
    layerwise_df["parameter_bytes"].sum() + 
    layerwise_df["activation_bytes"].sum()
)
total_macs = float(layerwise_df["macs"].sum())
latency_s = max(latency_ms / 1000.0, 1e-9)  # Prevent division by zero

achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9
bandwidth_utilization = achieved_bandwidth_gbps / max(memory_bandwidth_gbps, 1e-9)
achieved_gmacs = (total_macs / latency_s) / 1e9
The 1e-9 clamps prevent division by zero but mean extremely fast inference (<1 microsecond) or zero-bandwidth configurations will produce artificially clamped results.
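The clamping behavior is easy to demonstrate with a standalone sketch mirroring the snippet above:

```python
total_bytes = 1_000_000
latency_ms = 0.0                              # degenerate input
latency_s = max(latency_ms / 1000.0, 1e-9)    # clamped to 1 nanosecond
achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9

# 1 MB "moved" in 1 ns reads as a million GB/s: a clear sign the
# latency input was invalid, not that the model is that fast.
print(achieved_bandwidth_gbps)
```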

Common Scenarios

Comparing Device Candidates

devices = [
    {"name": "Raspberry Pi 4", "bandwidth_gbps": 4.3},
    {"name": "Jetson Nano", "bandwidth_gbps": 25.6},
    {"name": "Desktop CPU", "bandwidth_gbps": 50.0},
]

for device in devices:
    summary = summarize_hardware(
        layerwise_df,
        latency_ms=metrics.latency_ms,
        memory_bandwidth_gbps=device["bandwidth_gbps"]
    )
    print(f"{device['name']}: {summary['bandwidth_utilization']*100:.1f}% utilization")

Tracking Optimization Impact

baseline_summary = summarize_hardware(baseline_df, latency_ms=100, memory_bandwidth_gbps=4.3)
pruned_summary = summarize_hardware(pruned_df, latency_ms=75, memory_bandwidth_gbps=4.3)

print(f"Baseline: {baseline_summary['achieved_bandwidth_gbps']:.2f} GB/s")
print(f"Pruned: {pruned_summary['achieved_bandwidth_gbps']:.2f} GB/s")
print(f"Reduction: {(1 - pruned_summary['estimated_total_bytes']/baseline_summary['estimated_total_bytes'])*100:.1f}%")

Pipeline Integration

In scripts/run_pipeline.py:83-87, hardware summaries are computed for the baseline model:
hardware_summary = summarize_hardware(
    layerwise_df,
    latency_ms=baseline_metrics.latency_ms,
    memory_bandwidth_gbps=cfg.memory_bandwidth_gbps,
)
Results are:
  1. Saved to outputs/hardware_summary.csv
  2. Merged into the deployment summary dictionary
  3. Included in the final JSON output
The summary provides context for whether latency improvements should focus on memory optimization (pruning, quantization) or compute acceleration (faster CPU, hardware accelerators).
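A minimal sketch of that persistence step (assuming pandas; the CSV file name and the `SmallCNN` model name follow the pipeline description above, and the summary values are invented):

```python
import json
import pandas as pd

summary = {"achieved_bandwidth_gbps": 1.82, "bandwidth_utilization": 0.423}

# One-row CSV, mirroring outputs/hardware_summary.csv
csv_text = pd.DataFrame([summary]).to_csv(index=False)

# Merged into a larger deployment summary before JSON export
deployment = {"model": "SmallCNN", **summary}
json_text = json.dumps(deployment, indent=2)
```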

Limitations

Simplifying assumptions:
  • Treats all memory accesses as sequential (ignores caching effects)
  • Assumes single-threaded execution
  • Does not account for framework overhead or kernel launch costs
  • Parameter bytes are counted once (doesn’t model reuse across batches)
For production analysis, consider detailed profiling tools like PyTorch Profiler, NVIDIA Nsight, or ARM Streamline.
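For instance, a quick PyTorch Profiler pass gives per-operator timings that a single bandwidth number cannot (sketch with a stand-in module in place of your model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU())
x = torch.randn(1, 1, 28, 28)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Per-operator breakdown: far more detail than an aggregate estimate
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```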

Layer-wise Analysis

Generate the input DataFrame for bandwidth calculations

Performance Metrics

Collect latency measurements for bandwidth estimation

Source Reference

Implementation: src/edge_opt/hardware.py:73-91
