
Overview

The Edge AI Hardware Optimization framework implements hardware-aware constraint modeling to simulate edge device limitations. The system tracks:
  1. Memory Budgets: SRAM-style memory constraints for model footprint validation
  2. Bandwidth Utilization: Estimated memory bandwidth consumption from parameter and activation transfers
  3. CPU Frequency Scaling: Latency adjustments to model lower-frequency edge processors
  4. Layer-wise Resource Analysis: Per-layer activation memory and compute (MACs) breakdown
These constraints are implemented in src/edge_opt/hardware.py and src/edge_opt/metrics.py.

Memory Budget Constraints

Configuration

Memory budgets are specified in the experiment configuration YAML:
memory_budgets_mb: [0.5, 1.0, 2.0, 5.0]  # Reporting thresholds
active_memory_budget_mb: 1.0              # Hard acceptance threshold
  • memory_budgets_mb: List of budget thresholds for violation reporting (soft constraints)
  • active_memory_budget_mb: Single threshold for accept/reject classification (hard constraint)
The active memory budget acts as the primary constraint filter. Candidates exceeding this limit are marked as rejected and excluded from Pareto frontier generation.

Memory Footprint Calculation

The metrics.model_memory_mb function computes model size from the state dictionary:
def model_memory_mb(model: nn.Module) -> float:
    """
    Calculate total model memory footprint in megabytes.
    
    Args:
        model: PyTorch model instance
    
    Returns:
        Total memory in MB (parameters + buffers)
    """
    total_bytes = 0
    for tensor in model.state_dict().values():
        if isinstance(tensor, torch.Tensor):
            total_bytes += tensor.numel() * tensor.element_size()
    return total_bytes / (1024 ** 2)
Implementation details:
  • tensor.numel(): Number of elements in the tensor
  • tensor.element_size(): Bytes per element (4 for FP32, 2 for FP16, 1 for INT8)
  • Includes all parameters (weights, biases) and buffers (running stats, etc.)
  • Does not include activation memory (see layer-wise analysis below)
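The same arithmetic can be checked without instantiating a model. The sketch below mirrors model_memory_mb over hand-listed (numel, element_size) pairs for an FP32 SmallCNN-style layout (layer shapes taken from the layer-wise analysis later on this page); memory_mb_from_tensors is an illustrative helper, not part of the codebase:

```python
def memory_mb_from_tensors(tensors):
    """Mirror of model_memory_mb: sum numel * element_size over
    state-dict tensors, converted to MiB."""
    total_bytes = sum(numel * elem_size for numel, elem_size in tensors)
    return total_bytes / (1024 ** 2)

# FP32 (4 bytes/element) SmallCNN-style layout: weight and bias per layer
fp32 = [
    (1 * 16 * 3 * 3, 4), (16, 4),    # conv1 weight, bias
    (16 * 32 * 3 * 3, 4), (32, 4),   # conv2 weight, bias
    (1568 * 10, 4), (10, 4),         # classifier weight, bias
]
print(f"{memory_mb_from_tensors(fp32):.4f} MB")  # ~0.0782 MB
```

Swapping the element sizes to 2 (FP16) or 1 (INT8) halves or quarters the footprint, which is the mechanism behind the precision trade-offs reported later.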

Violation Detection

The metrics.memory_violations function checks model size against all configured budgets:
def memory_violations(memory_mb: float, budgets_mb: list[float]) -> dict[str, bool]:
    """
    Generate per-budget violation flags.
    
    Args:
        memory_mb: Model memory footprint
        budgets_mb: List of budget thresholds
    
    Returns:
        Dictionary mapping "violates_{budget}mb" -> bool
    """
    return {f"violates_{budget}mb": memory_mb > budget for budget in budgets_mb}
Example output for a 1.2 MB model with budgets_mb=[0.5, 1.0, 2.0, 5.0]:
{
    "violates_0.5mb": True,   # 1.2 > 0.5
    "violates_1.0mb": True,   # 1.2 > 1.0
    "violates_2.0mb": False,  # 1.2 < 2.0
    "violates_5.0mb": False   # 1.2 < 5.0
}
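Because memory_violations is pure Python, the example above runs as-is; this snippet pairs the soft-constraint flags with the hard active-budget check used in the sweep:

```python
def memory_violations(memory_mb, budgets_mb):
    """Per-budget soft-constraint flags, as in metrics.memory_violations."""
    return {f"violates_{budget}mb": memory_mb > budget for budget in budgets_mb}

memory_mb = 1.2
flags = memory_violations(memory_mb, [0.5, 1.0, 2.0, 5.0])

# Hard constraint: reject candidates above active_memory_budget_mb = 1.0
accepted = not (memory_mb > 1.0)
print(flags["violates_1.0mb"], accepted)  # True False
```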

Constraint Filtering in Sweep

During the optimization sweep, each candidate is evaluated against the active budget:
# From experiments.run_sweep (src/edge_opt/experiments.py:76)
metrics: PerfMetrics = collect_metrics(
    variant, val_loader, device,
    power_watts=power_watts,
    precision=metric_precision,
    latency_multiplier=latency_multiplier,
    benchmark_repeats=benchmark_repeats,
)

# Check violations for all budgets (soft constraints)
violations = memory_violations(metrics.memory_mb, memory_budgets_mb)

# Check active budget (hard constraint)
rejected = metrics.memory_mb > active_memory_budget_mb

row = {
    "pruning_level": pruning,
    "precision": precision,
    "accepted": not rejected,        # Pass/fail flag
    "active_budget_mb": active_memory_budget_mb,
    **asdict(metrics),               # All performance metrics
    **violations,                    # Per-budget violation flags
}
Memory Headroom: Model size from state dict is a first-order approximation. Production deployments require additional margins for:
  • Activation buffers (see layer-wise analysis)
  • Runtime framework overhead
  • OS memory pressure
  • Concurrent workload contention
Recommended practice: Set active_memory_budget_mb to 60-80% of true device SRAM capacity.
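That derating rule can be captured in a small helper (derated_budget_mb is hypothetical, not part of the codebase):

```python
def derated_budget_mb(sram_capacity_mb, headroom_fraction=0.7):
    """Derate device SRAM capacity to a conservative active_memory_budget_mb.

    A headroom_fraction of 0.6-0.8 follows the 60-80% recommendation above;
    the remainder is left for activations, runtime overhead, and the OS.
    """
    if not 0.0 < headroom_fraction <= 1.0:
        raise ValueError("headroom_fraction must be in (0, 1]")
    return sram_capacity_mb * headroom_fraction

# A device with 2 MB of SRAM gets a 1.4 MB active budget at 70% headroom
print(derated_budget_mb(2.0))
```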

Layer-wise Resource Analysis

Estimation Framework

The hardware.estimate_layerwise_stats function computes per-layer resource consumption:
def estimate_layerwise_stats(
    model: nn.Module, 
    batch_size: int, 
    input_shape: tuple[int, int, int] = (1, 28, 28)
) -> pd.DataFrame:
    """
    Estimate layer-wise activation memory and compute (MACs).
    
    Args:
        model: SmallCNN instance
        batch_size: Batch size for activation calculations
        input_shape: (C, H, W) input tensor shape
    
    Returns:
        DataFrame with columns:
        - layer: Layer name
        - output_elements: Number of output activation elements
        - parameter_bytes: Layer parameter size (weights + biases)
        - activation_bytes: Output activation memory
        - macs: Multiply-accumulate operations
    """
    channels, height, width = input_shape
    bytes_per_value = 4  # FP32 assumption
    
    # Conv1 analysis
    conv1 = model.conv1
    h1, w1 = _conv2d_output_shape(height, width, kernel=3, padding=1)
    h1_pool, w1_pool = h1 // 2, w1 // 2  # MaxPool2d(2)
    
    conv1_elements = batch_size * conv1.out_channels * h1 * w1
    conv1_macs = batch_size * conv1.out_channels * h1 * w1 * conv1.in_channels * 3 * 3
    
    # Similar calculations for conv2 and classifier...

Output Shape Calculation

The _conv2d_output_shape helper computes spatial dimensions after convolution:
def _conv2d_output_shape(
    height: int, width: int, 
    kernel: int, padding: int, stride: int = 1
) -> tuple[int, int]:
    """
    Calculate Conv2d output spatial dimensions.
    
    Formula: out = (in + 2*padding - kernel) // stride + 1
    """
    out_h = (height + (2 * padding) - kernel) // stride + 1
    out_w = (width + (2 * padding) - kernel) // stride + 1
    return out_h, out_w
For SmallCNN with 28×28 input:
  • Conv1 (3×3, pad=1): 28×28 → 28×28 → MaxPool → 14×14
  • Conv2 (3×3, pad=1): 14×14 → 14×14 → MaxPool → 7×7
  • Classifier: 32×7×7 = 1568 → 10
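Since _conv2d_output_shape is plain integer arithmetic, the SmallCNN pipeline above can be traced end to end:

```python
def _conv2d_output_shape(height, width, kernel, padding, stride=1):
    """out = (in + 2*padding - kernel) // stride + 1, per spatial dim."""
    out_h = (height + (2 * padding) - kernel) // stride + 1
    out_w = (width + (2 * padding) - kernel) // stride + 1
    return out_h, out_w

h, w = _conv2d_output_shape(28, 28, kernel=3, padding=1)  # conv1 -> 28x28
h, w = h // 2, w // 2                                     # MaxPool2d(2) -> 14x14
h, w = _conv2d_output_shape(h, w, kernel=3, padding=1)    # conv2 -> 14x14
h, w = h // 2, w // 2                                     # MaxPool2d(2) -> 7x7
print(32 * h * w)  # classifier input features: 1568
```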

Layer-wise Metrics

The function returns a DataFrame with per-layer breakdown:
| layer | output_elements | parameter_bytes | activation_bytes | macs |
| --- | --- | --- | --- | --- |
| conv1 | batch × 16 × 28 × 28 | (1×16×3×3 + 16) × 4 | batch × 16 × 28 × 28 × 4 | batch × 16 × 28 × 28 × 1 × 9 |
| conv2 | batch × 32 × 14 × 14 | (16×32×3×3 + 32) × 4 | batch × 32 × 14 × 14 × 4 | batch × 32 × 14 × 14 × 16 × 9 |
| classifier | batch × 10 | (1568×10 + 10) × 4 | batch × 10 × 4 | batch × 1568 × 10 |
With SmallCNN(conv1_channels=16, conv2_channels=32) and batch_size=64:
  • Conv1:
    • Output elements: 64 × 16 × 28 × 28 = 802,816
    • Parameter bytes: (16×3×3 + 16) × 4 = 640 bytes
    • Activation bytes: 802,816 × 4 = 3,211,264 bytes ≈ 3.06 MB
    • MACs: 64 × 16 × 28 × 28 × 1 × 9 = 7,225,344
  • Conv2:
    • Output elements: 64 × 32 × 14 × 14 = 401,408
    • Parameter bytes: (16×32×3×3 + 32) × 4 = 18,560 bytes
    • Activation bytes: 401,408 × 4 = 1.53 MB
    • MACs: 64 × 32 × 14 × 14 × 16 × 9 = 57,802,752
  • Classifier:
    • Output elements: 64 × 10 = 640
    • Parameter bytes: (1568×10 + 10) × 4 = 62,760 bytes
    • Activation bytes: 640 × 4 = 2,560 bytes
    • MACs: 64 × 1568 × 10 = 1,003,520
Activation memory is reported per-layer and assumes no in-place operations or memory reuse. Actual runtime peak memory depends on framework optimizations and graph execution order.
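The worked numbers above reduce to two small formulas; the sketch below reproduces the conv entries for FP32 and batch_size=64 (conv_macs and activation_bytes are illustrative helpers, not codebase functions):

```python
def conv_macs(batch, out_ch, out_h, out_w, in_ch, kernel):
    """MACs for a dense Conv2d: one in_ch * k * k dot product
    per output element."""
    return batch * out_ch * out_h * out_w * in_ch * kernel * kernel

def activation_bytes(batch, channels, h, w, bytes_per_value=4):
    """Output activation footprint, FP32 by default."""
    return batch * channels * h * w * bytes_per_value

print(conv_macs(64, 16, 28, 28, in_ch=1, kernel=3))    # 7,225,344
print(conv_macs(64, 32, 14, 14, in_ch=16, kernel=3))   # 57,802,752
print(activation_bytes(64, 16, 28, 28) / (1024 ** 2))  # ~3.06 MB
```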

Bandwidth Utilization

Hardware Summary Metrics

The hardware.summarize_hardware function estimates memory bandwidth consumption:
def summarize_hardware(
    layerwise_df: pd.DataFrame,
    latency_ms: float,
    memory_bandwidth_gbps: float,
) -> dict[str, float]:
    """
    Estimate bandwidth utilization from layer-wise statistics.
    
    Args:
        layerwise_df: Output from estimate_layerwise_stats
        latency_ms: Measured inference latency
        memory_bandwidth_gbps: Configured memory bandwidth (GB/s)
    
    Returns:
        Dictionary with bandwidth and compute metrics
    """
    total_bytes = float(
        layerwise_df["parameter_bytes"].sum() + 
        layerwise_df["activation_bytes"].sum()
    )
    total_macs = float(layerwise_df["macs"].sum())
    
    latency_s = max(latency_ms / 1000.0, 1e-9)  # Avoid division by zero
    
    # Achieved bandwidth: bytes moved per second
    achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9
    
    # Utilization: fraction of configured bandwidth
    bandwidth_utilization = achieved_bandwidth_gbps / max(memory_bandwidth_gbps, 1e-9)
    
    # Achieved throughput: MACs per second
    achieved_gmacs = (total_macs / latency_s) / 1e9
    
    return {
        "estimated_total_bytes": total_bytes,
        "estimated_total_macs": total_macs,
        "achieved_bandwidth_gbps": achieved_bandwidth_gbps,
        "configured_memory_bandwidth_gbps": memory_bandwidth_gbps,
        "bandwidth_utilization": bandwidth_utilization,
        "achieved_gmacs": achieved_gmacs,
    }

Bandwidth Metrics Explained

Estimated Total Bytes

Sum of parameter bytes and activation bytes across all layers. Approximates total memory traffic per inference.

Achieved Bandwidth

total_bytes / latency_seconds converted to GB/s. Represents effective memory bandwidth consumed during inference.

Bandwidth Utilization

achieved_bandwidth / configured_bandwidth. Values near 1.0 indicate memory-bound operations; low values suggest compute-bound or cache-resident workloads.

Achieved GMAC/s

total_macs / latency_seconds in billions. Indicates computational throughput. Compare with theoretical peak GFLOP/s to assess hardware efficiency.
Estimation Limitations:
  • Does not account for cache-miss penalties or prefetching
  • Assumes no memory reuse or in-place operations
  • Kernel launch overhead and context switching are not modeled
  • L1/L2/L3 cache effects are not captured
Use these metrics for relative comparison across configurations, not absolute hardware validation.
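The metrics above reduce to a few divisions. This sketch reproduces the bandwidth figures using the example estimated_total_bytes, an assumed measured latency of 4.87 ms, and the configured 12.8 GB/s bandwidth (the latency value is an assumption for illustration, not a codebase constant):

```python
total_bytes = 4_878_656.0   # estimated_total_bytes from the example summary
latency_ms = 4.87           # assumed measured inference latency
memory_bandwidth_gbps = 12.8

latency_s = max(latency_ms / 1000.0, 1e-9)  # guard against division by zero
achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9
bandwidth_utilization = achieved_bandwidth_gbps / max(memory_bandwidth_gbps, 1e-9)

# ~1.002 GB/s achieved, ~0.078 of configured bandwidth: compute-bound
# or cache-resident, not memory-bound
print(f"{achieved_bandwidth_gbps:.3f} GB/s, utilization {bandwidth_utilization:.3f}")
```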

CPU Frequency Scaling

Latency Multiplier

The cpu_frequency_scale configuration parameter models lower-frequency edge processors:
cpu_frequency_scale: 2.0  # Simulate 50% frequency (2x latency)
This multiplier is applied during metric collection:
# From metrics.collect_metrics (src/edge_opt/metrics.py:86)
latency_mean, latency_std, latency_p95 = measure_latency_distribution(
    model, sample_input, repeats=benchmark_repeats
)

# Apply frequency scaling
latency = latency_mean * latency_multiplier  # latency_multiplier = cpu_frequency_scale
throughput = sample_input.shape[0] / (latency / 1000.0)
energy_proxy = (latency / 1000.0) * power_watts

return PerfMetrics(
    accuracy=accuracy,
    latency_ms=latency,
    latency_std_ms=latency_std * latency_multiplier,
    latency_p95_ms=latency_p95 * latency_multiplier,
    throughput_sps=throughput,  # Inversely affected by latency
    memory_mb=memory,
    energy_proxy_j=energy_proxy,  # Linearly affected by latency
)
Frequency Scaling Assumptions:
  • Linear latency scaling: latency_scaled = latency_base × scale
  • Assumes memory-bound operations scale proportionally
  • Does not model voltage scaling effects on power consumption
  • Ideal for comparing relative performance at different frequencies
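The linear scaling model is straightforward to reproduce; this sketch (scale_metrics is a hypothetical helper, not in the codebase) applies it to latency, throughput, and the energy proxy exactly as in the collect_metrics excerpt above:

```python
def scale_metrics(latency_ms, batch_size, power_watts, cpu_frequency_scale):
    """Apply the linear frequency-scaling model: latency stretches by the
    scale factor, throughput shrinks inversely, energy grows linearly."""
    latency = latency_ms * cpu_frequency_scale
    throughput_sps = batch_size / (latency / 1000.0)
    energy_proxy_j = (latency / 1000.0) * power_watts
    return latency, throughput_sps, energy_proxy_j

# cpu_frequency_scale = 2.0 simulates a 50% clock: 2x latency, 0.5x throughput
lat, tput, energy = scale_metrics(2.45, batch_size=64, power_watts=2.0,
                                  cpu_frequency_scale=2.0)
print(lat, round(tput, 1), energy)
```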

Latency Distribution Measurement

The metrics.measure_latency_distribution function captures statistical latency behavior:
def measure_latency_distribution(
    model: nn.Module, 
    sample_input: torch.Tensor, 
    repeats: int = 5, 
    num_runs: int = 100, 
    warmup: int = 10
) -> tuple[float, float, float]:
    """
    Measure latency statistics across multiple measurement windows.
    
    Args:
        model: Model in evaluation mode
        sample_input: Example input tensor
        repeats: Number of measurement windows
        num_runs: Iterations per measurement window
        warmup: Warmup iterations before each window
    
    Returns:
        (mean_ms, std_ms, p95_ms)
    """
    latencies = [
        measure_latency(model, sample_input, num_runs=num_runs, warmup=warmup) 
        for _ in range(repeats)
    ]
    latency_tensor = torch.tensor(latencies, dtype=torch.float32)
    return (
        float(latency_tensor.mean()), 
        float(latency_tensor.std(unbiased=False)), 
        float(torch.quantile(latency_tensor, 0.95))
    )
Measurement protocol:
  1. Run warmup iterations to stabilize CPU cache and frequency scaling
  2. Measure num_runs iterations and compute average
  3. Repeat measurement repeats times to capture variability
  4. Report mean, standard deviation, and 95th percentile
From configs/default.yaml defaults:
benchmark_repeats: 5  # Number of measurement windows
Defaults in measure_latency_distribution (not exposed via the config):
  • num_runs: 100 iterations per window
  • warmup: 10 iterations before each window
Total inference calls per candidate: (100 + 10) × 5 = 550
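The mean/std/p95 summary can be sketched with the standard library, assuming torch.quantile's default linear interpolation and population std (matching unbiased=False); summarize_latencies is an illustrative stand-in, not a codebase function:

```python
import statistics

def summarize_latencies(latencies_ms):
    """Mean, population std, and linearly interpolated p95 over
    per-window latency averages."""
    mean = statistics.fmean(latencies_ms)
    std = statistics.pstdev(latencies_ms)  # population std (unbiased=False)
    xs = sorted(latencies_ms)
    pos = 0.95 * (len(xs) - 1)             # fractional rank of the p95
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    p95 = xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)
    return mean, std, p95

# Five measurement windows, matching benchmark_repeats = 5
mean, std, p95 = summarize_latencies([2.41, 2.45, 2.52, 2.47, 2.44])
```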

Hardware Artifacts

Output Files

The hardware.save_hardware_artifacts function generates analysis reports:
def save_hardware_artifacts(
    output_dir: Path,
    layerwise_df: pd.DataFrame,
    precision_df: pd.DataFrame,
    summary: dict[str, float],
) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # CSV reports
    layerwise_df.to_csv(output_dir / "layerwise_breakdown.csv", index=False)
    precision_df.to_csv(output_dir / "precision_tradeoffs.csv", index=False)
    pd.DataFrame([summary]).to_csv(output_dir / "hardware_summary.csv", index=False)
    
    # Visualization plots (see below)
    ...

layerwise_breakdown.csv

Per-layer resource consumption:
| layer | output_elements | parameter_bytes | activation_bytes | macs |
| --- | --- | --- | --- | --- |
| conv1 | 802816 | 640 | 3211264 | 7225344 |
| conv2 | 401408 | 18560 | 1605632 | 57802752 |
| classifier | 640 | 62760 | 2560 | 1003520 |

precision_tradeoffs.csv

Aggregated metrics by precision mode:
| precision | accuracy_mean | latency_ms_mean | memory_mb_mean | energy_proxy_j_mean | accepted_ratio |
| --- | --- | --- | --- | --- | --- |
| int8 | 0.972 | 2.45 | 0.21 | 0.0049 | 1.00 |
| fp16 | 0.985 | 3.12 | 0.42 | 0.0062 | 0.75 |
| fp32 | 0.987 | 4.87 | 0.84 | 0.0097 | 0.25 |

hardware_summary.csv

Bandwidth and compute utilization:
| estimated_total_bytes | estimated_total_macs | achieved_bandwidth_gbps | configured_memory_bandwidth_gbps | bandwidth_utilization | achieved_gmacs |
| --- | --- | --- | --- | --- | --- |
| 4878656 | 66031616 | 1.002 | 12.8 | 0.078 | 13.56 |

Visualization Plots

The artifact generation includes the following plots:

Layer-wise Activation Memory

Bar chart showing activation memory (MB) per layer. Highlights memory-intensive layers for optimization targeting. File: layerwise_activation_memory.png

Layer-wise Compute (MACs)

Bar chart showing multiply-accumulate operations (millions) per layer. Identifies compute bottlenecks. File: layerwise_macs.png

Best Practices

Set Realistic Budgets

Configure active_memory_budget_mb to 60-80% of true device SRAM to account for runtime overhead and activation buffers.

Monitor Acceptance Ratios

Check precision_tradeoffs.csv to ensure sufficient candidates pass budgets. Ratios below 0.5 indicate overly strict constraints.

Validate Bandwidth Estimates

Low bandwidth utilization (<0.1) suggests compute-bound or cache-resident workloads. High utilization (>0.8) indicates memory-bound operations.

Compare Relative Performance

Use hardware metrics for relative comparison across configurations. Validate absolute numbers with hardware counters on target devices.

Limitations and Caveats

Estimation-Based Analysis: Hardware metrics are software-level estimates derived from shape calculations and measured latency. They do not replace performance monitoring unit (PMU) profiling or silicon validation. In particular, the estimates do not capture:
  • Cache effects: L1/L2/L3 cache hits, misses, and prefetching
  • Memory reuse: In-place operations and activation buffer recycling
  • Kernel overhead: Operator dispatch, context switching, and synchronization
  • Hardware counters: Actual memory transactions, instruction counts, and stall cycles
  • Thermal throttling: Dynamic frequency scaling due to thermal limits
  • Co-scheduled workloads: Host-level contention from concurrent processes
Quantization Backend: The default fbgemm backend targets x86 CPUs with AVX-512 VNNI support. Performance on other platforms:
  • ARM CPUs: Use qnnpack backend (requires config change)
  • Older x86: May fall back to slower INT8 emulation
  • Accelerators: GPU/NPU quantization requires different quantization APIs
Backend selection significantly affects INT8 latency and energy estimates.
Peak Memory: Layer-wise activation bytes report output footprint only, not peak runtime memory:
  • Frameworks may reuse buffers across layers
  • Gradient storage is not applicable (inference only)
  • Temporary buffers for operator fusion are not tracked
Actual peak memory can be measured with torch.cuda.max_memory_allocated() on GPU or OS-level profiling tools on CPU.

Next Steps

Model Optimization

Learn how pruning and quantization reduce memory footprint

System Architecture

Understand constraint filtering in the pipeline
