
Overview

The Edge AI Hardware Optimization framework implements hardware-aware constraint modeling to simulate edge device limitations. The system tracks:
  1. Memory Budgets: SRAM-style memory constraints for model footprint validation
  2. Bandwidth Utilization: Estimated memory bandwidth consumption from parameter and activation transfers
  3. CPU Frequency Scaling: Latency adjustments to model lower-frequency edge processors
  4. Layer-wise Resource Analysis: Per-layer activation memory and compute (MACs) breakdown
These constraints are implemented in src/edge_opt/hardware.py and src/edge_opt/metrics.py.

Memory Budget Constraints

Configuration

Memory budgets are specified in the experiment configuration YAML:
memory_budgets_mb: [0.5, 1.0, 2.0, 5.0]  # Reporting thresholds
active_memory_budget_mb: 1.0              # Hard acceptance threshold
  • memory_budgets_mb: List of budget thresholds for violation reporting (soft constraints)
  • active_memory_budget_mb: Single threshold for accept/reject classification (hard constraint)
The active memory budget acts as the primary constraint filter. Candidates exceeding this limit are marked as rejected and excluded from Pareto frontier generation.

Memory Footprint Calculation

The metrics.model_memory_mb function computes model size from the state dictionary:
def model_memory_mb(model: nn.Module) -> float:
    """
    Calculate total model memory footprint in megabytes.
    
    Args:
        model: PyTorch model instance
    
    Returns:
        Total memory in MB (parameters + buffers)
    """
    total_bytes = 0
    for tensor in model.state_dict().values():
        if isinstance(tensor, torch.Tensor):
            total_bytes += tensor.numel() * tensor.element_size()
    return total_bytes / (1024 ** 2)
Implementation details:
  • tensor.numel(): Number of elements in the tensor
  • tensor.element_size(): Bytes per element (4 for FP32, 2 for FP16, 1 for INT8)
  • Includes all parameters (weights, biases) and buffers (running stats, etc.)
  • Does not include activation memory (see layer-wise analysis below)
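The same arithmetic can be checked without instantiating a model. The sketch below mirrors model_memory_mb over hand-listed (numel, element_size) pairs for an FP32 SmallCNN-style layout (layer shapes taken from the layer-wise analysis later on this page); memory_mb_from_tensors is an illustrative helper, not part of the codebase:

```python
def memory_mb_from_tensors(tensors):
    """Mirror of model_memory_mb: sum numel * element_size over
    state-dict tensors, converted to MiB."""
    total_bytes = sum(numel * elem_size for numel, elem_size in tensors)
    return total_bytes / (1024 ** 2)

# FP32 (4 bytes/element) SmallCNN-style layout: weight and bias per layer
fp32 = [
    (1 * 16 * 3 * 3, 4), (16, 4),    # conv1 weight, bias
    (16 * 32 * 3 * 3, 4), (32, 4),   # conv2 weight, bias
    (1568 * 10, 4), (10, 4),         # classifier weight, bias
]
print(f"{memory_mb_from_tensors(fp32):.4f} MB")  # ~0.0782 MB
```

Swapping the element sizes to 2 (FP16) or 1 (INT8) halves or quarters the footprint, which is the mechanism behind the precision trade-offs reported later.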

Violation Detection

The metrics.memory_violations function checks model size against all configured budgets:
def memory_violations(memory_mb: float, budgets_mb: list[float]) -> dict[str, bool]:
    """
    Generate per-budget violation flags.
    
    Args:
        memory_mb: Model memory footprint
        budgets_mb: List of budget thresholds
    
    Returns:
        Dictionary mapping "violates_{budget}mb" -> bool
    """
    return {f"violates_{budget}mb": memory_mb > budget for budget in budgets_mb}
Example output for a 1.2 MB model with budgets_mb=[0.5, 1.0, 2.0, 5.0]:
{
    "violates_0.5mb": True,   # 1.2 > 0.5
    "violates_1.0mb": True,   # 1.2 > 1.0
    "violates_2.0mb": False,  # 1.2 < 2.0
    "violates_5.0mb": False   # 1.2 < 5.0
}
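Because memory_violations is pure Python, the example above runs as-is; this snippet pairs the soft-constraint flags with the hard active-budget check used in the sweep:

```python
def memory_violations(memory_mb, budgets_mb):
    """Per-budget soft-constraint flags, as in metrics.memory_violations."""
    return {f"violates_{budget}mb": memory_mb > budget for budget in budgets_mb}

memory_mb = 1.2
flags = memory_violations(memory_mb, [0.5, 1.0, 2.0, 5.0])

# Hard constraint: reject candidates above active_memory_budget_mb = 1.0
accepted = not (memory_mb > 1.0)
print(flags["violates_1.0mb"], accepted)  # True False
```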

Constraint Filtering in Sweep

During the optimization sweep, each candidate is evaluated against the active budget:
# From experiments.run_sweep (src/edge_opt/experiments.py:76)
metrics: PerfMetrics = collect_metrics(
    variant, val_loader, device,
    power_watts=power_watts,
    precision=metric_precision,
    latency_multiplier=latency_multiplier,
    benchmark_repeats=benchmark_repeats,
)

# Check violations for all budgets (soft constraints)
violations = memory_violations(metrics.memory_mb, memory_budgets_mb)

# Check active budget (hard constraint)
rejected = metrics.memory_mb > active_memory_budget_mb

row = {
    "pruning_level": pruning,
    "precision": precision,
    "accepted": not rejected,        # Pass/fail flag
    "active_budget_mb": active_memory_budget_mb,
    **asdict(metrics),               # All performance metrics
    **violations,                    # Per-budget violation flags
}
Memory Headroom: Model size from state dict is a first-order approximation. Production deployments require additional margins for:
  • Activation buffers (see layer-wise analysis)
  • Runtime framework overhead
  • OS memory pressure
  • Concurrent workload contention
Recommended practice: Set active_memory_budget_mb to 60-80% of true device SRAM capacity.
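That derating rule can be captured in a small helper (derated_budget_mb is hypothetical, not part of the codebase):

```python
def derated_budget_mb(sram_capacity_mb, headroom_fraction=0.7):
    """Derate device SRAM capacity to a conservative active_memory_budget_mb.

    A headroom_fraction of 0.6-0.8 follows the 60-80% recommendation above;
    the remainder is left for activations, runtime overhead, and the OS.
    """
    if not 0.0 < headroom_fraction <= 1.0:
        raise ValueError("headroom_fraction must be in (0, 1]")
    return sram_capacity_mb * headroom_fraction

# A device with 2 MB of SRAM gets a 1.4 MB active budget at 70% headroom
print(derated_budget_mb(2.0))
```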

Layer-wise Resource Analysis

Estimation Framework

The hardware.estimate_layerwise_stats function computes per-layer resource consumption:
def estimate_layerwise_stats(
    model: nn.Module, 
    batch_size: int, 
    input_shape: tuple[int, int, int] = (1, 28, 28)
) -> pd.DataFrame:
    """
    Estimate layer-wise activation memory and compute (MACs).
    
    Args:
        model: SmallCNN instance
        batch_size: Batch size for activation calculations
        input_shape: (C, H, W) input tensor shape
    
    Returns:
        DataFrame with columns:
        - layer: Layer name
        - output_elements: Number of output activation elements
        - parameter_bytes: Layer parameter size (weights + biases)
        - activation_bytes: Output activation memory
        - macs: Multiply-accumulate operations
    """
    channels, height, width = input_shape
    bytes_per_value = 4  # FP32 assumption
    
    # Conv1 analysis
    conv1 = model.conv1
    h1, w1 = _conv2d_output_shape(height, width, kernel=3, padding=1)
    h1_pool, w1_pool = h1 // 2, w1 // 2  # MaxPool2d(2)
    
    conv1_elements = batch_size * conv1.out_channels * h1 * w1
    conv1_macs = batch_size * conv1.out_channels * h1 * w1 * conv1.in_channels * 3 * 3
    
    # Similar calculations for conv2 and classifier...

Output Shape Calculation

The _conv2d_output_shape helper computes spatial dimensions after convolution:
def _conv2d_output_shape(
    height: int, width: int, 
    kernel: int, padding: int, stride: int = 1
) -> tuple[int, int]:
    """
    Calculate Conv2d output spatial dimensions.
    
    Formula: out = (in + 2*padding - kernel) // stride + 1
    """
    out_h = (height + (2 * padding) - kernel) // stride + 1
    out_w = (width + (2 * padding) - kernel) // stride + 1
    return out_h, out_w
For SmallCNN with 28×28 input:
  • Conv1 (3×3, pad=1): 28×28 → 28×28 → MaxPool → 14×14
  • Conv2 (3×3, pad=1): 14×14 → 14×14 → MaxPool → 7×7
  • Classifier: 32×7×7 = 1568 → 10
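Since _conv2d_output_shape is plain integer arithmetic, the SmallCNN pipeline above can be traced end to end:

```python
def _conv2d_output_shape(height, width, kernel, padding, stride=1):
    """out = (in + 2*padding - kernel) // stride + 1, per spatial dim."""
    out_h = (height + (2 * padding) - kernel) // stride + 1
    out_w = (width + (2 * padding) - kernel) // stride + 1
    return out_h, out_w

h, w = _conv2d_output_shape(28, 28, kernel=3, padding=1)  # conv1 -> 28x28
h, w = h // 2, w // 2                                     # MaxPool2d(2) -> 14x14
h, w = _conv2d_output_shape(h, w, kernel=3, padding=1)    # conv2 -> 14x14
h, w = h // 2, w // 2                                     # MaxPool2d(2) -> 7x7
print(32 * h * w)  # classifier input features: 1568
```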

Layer-wise Metrics

The function returns a DataFrame with per-layer breakdown:
| layer | output_elements | parameter_bytes | activation_bytes | macs |
| --- | --- | --- | --- | --- |
| conv1 | batch × 16 × 28 × 28 | (1×16×3×3 + 16) × 4 | batch × 16 × 28 × 28 × 4 | batch × 16 × 28 × 28 × 1 × 9 |
| conv2 | batch × 32 × 14 × 14 | (16×32×3×3 + 32) × 4 | batch × 32 × 14 × 14 × 4 | batch × 32 × 14 × 14 × 16 × 9 |
| classifier | batch × 10 | (1568×10 + 10) × 4 | batch × 10 × 4 | batch × 1568 × 10 |
With SmallCNN(conv1_channels=16, conv2_channels=32) and batch_size=64:
  • Conv1:
    • Output elements: 64 × 16 × 28 × 28 = 802,816
    • Parameter bytes: (16×3×3 + 16) × 4 = 640 bytes
    • Activation bytes: 802,816 × 4 = 3,211,264 bytes ≈ 3.06 MB
    • MACs: 64 × 16 × 28 × 28 × 1 × 9 = 7,225,344
  • Conv2:
    • Output elements: 64 × 32 × 14 × 14 = 401,408
    • Parameter bytes: (16×32×3×3 + 32) × 4 = 18,560 bytes
    • Activation bytes: 401,408 × 4 = 1.53 MB
    • MACs: 64 × 32 × 14 × 14 × 16 × 9 = 57,802,752
  • Classifier:
    • Output elements: 64 × 10 = 640
    • Parameter bytes: (1568×10 + 10) × 4 = 62,760 bytes
    • Activation bytes: 640 × 4 = 2,560 bytes
    • MACs: 64 × 1568 × 10 = 1,003,520
Activation memory is reported per-layer and assumes no in-place operations or memory reuse. Actual runtime peak memory depends on framework optimizations and graph execution order.
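The worked numbers above reduce to two small formulas; the sketch below reproduces the conv entries for FP32 and batch_size=64 (conv_macs and activation_bytes are illustrative helpers, not codebase functions):

```python
def conv_macs(batch, out_ch, out_h, out_w, in_ch, kernel):
    """MACs for a dense Conv2d: one in_ch * k * k dot product
    per output element."""
    return batch * out_ch * out_h * out_w * in_ch * kernel * kernel

def activation_bytes(batch, channels, h, w, bytes_per_value=4):
    """Output activation footprint, FP32 by default."""
    return batch * channels * h * w * bytes_per_value

print(conv_macs(64, 16, 28, 28, in_ch=1, kernel=3))    # 7,225,344
print(conv_macs(64, 32, 14, 14, in_ch=16, kernel=3))   # 57,802,752
print(activation_bytes(64, 16, 28, 28) / (1024 ** 2))  # ~3.06 MB
```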

Bandwidth Utilization

Hardware Summary Metrics

The hardware.summarize_hardware function estimates memory bandwidth consumption:
def summarize_hardware(
    layerwise_df: pd.DataFrame,
    latency_ms: float,
    memory_bandwidth_gbps: float,
) -> dict[str, float]:
    """
    Estimate bandwidth utilization from layer-wise statistics.
    
    Args:
        layerwise_df: Output from estimate_layerwise_stats
        latency_ms: Measured inference latency
        memory_bandwidth_gbps: Configured memory bandwidth (GB/s)
    
    Returns:
        Dictionary with bandwidth and compute metrics
    """
    total_bytes = float(
        layerwise_df["parameter_bytes"].sum() + 
        layerwise_df["activation_bytes"].sum()
    )
    total_macs = float(layerwise_df["macs"].sum())
    
    latency_s = max(latency_ms / 1000.0, 1e-9)  # Avoid division by zero
    
    # Achieved bandwidth: bytes moved per second
    achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9
    
    # Utilization: fraction of configured bandwidth
    bandwidth_utilization = achieved_bandwidth_gbps / max(memory_bandwidth_gbps, 1e-9)
    
    # Achieved throughput: MACs per second
    achieved_gmacs = (total_macs / latency_s) / 1e9
    
    return {
        "estimated_total_bytes": total_bytes,
        "estimated_total_macs": total_macs,
        "achieved_bandwidth_gbps": achieved_bandwidth_gbps,
        "configured_memory_bandwidth_gbps": memory_bandwidth_gbps,
        "bandwidth_utilization": bandwidth_utilization,
        "achieved_gmacs": achieved_gmacs,
    }

Bandwidth Metrics Explained

Estimated Total Bytes

Sum of parameter bytes and activation bytes across all layers. Approximates total memory traffic per inference.

Achieved Bandwidth

total_bytes / latency_seconds converted to GB/s. Represents effective memory bandwidth consumed during inference.

Bandwidth Utilization

achieved_bandwidth / configured_bandwidth. Values near 1.0 indicate memory-bound operations; low values suggest compute-bound or cache-resident workloads.

Achieved GMAC/s

total_macs / latency_seconds in billions. Indicates computational throughput. Compare with theoretical peak GFLOP/s to assess hardware efficiency.
Estimation Limitations:
  • Does not account for cache-miss penalties or prefetching
  • Assumes no memory reuse or in-place operations
  • Kernel launch overhead and context switching are not modeled
  • L1/L2/L3 cache effects are not captured
Use these metrics for relative comparison across configurations, not absolute hardware validation.
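The metrics above reduce to a few divisions. This sketch reproduces the bandwidth figures using the example estimated_total_bytes, an assumed measured latency of 4.87 ms, and the configured 12.8 GB/s bandwidth (the latency value is an assumption for illustration, not a codebase constant):

```python
total_bytes = 4_878_656.0   # estimated_total_bytes from the example summary
latency_ms = 4.87           # assumed measured inference latency
memory_bandwidth_gbps = 12.8

latency_s = max(latency_ms / 1000.0, 1e-9)  # guard against division by zero
achieved_bandwidth_gbps = (total_bytes / latency_s) / 1e9
bandwidth_utilization = achieved_bandwidth_gbps / max(memory_bandwidth_gbps, 1e-9)

# ~1.002 GB/s achieved, ~0.078 of configured bandwidth: compute-bound
# or cache-resident, not memory-bound
print(f"{achieved_bandwidth_gbps:.3f} GB/s, utilization {bandwidth_utilization:.3f}")
```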

CPU Frequency Scaling

Latency Multiplier

The cpu_frequency_scale configuration parameter models lower-frequency edge processors:
cpu_frequency_scale: 2.0  # Simulate 50% frequency (2x latency)
This multiplier is applied during metric collection:
# From metrics.collect_metrics (src/edge_opt/metrics.py:86)
latency_mean, latency_std, latency_p95 = measure_latency_distribution(
    model, sample_input, repeats=benchmark_repeats
)

# Apply frequency scaling
latency = latency_mean * latency_multiplier  # latency_multiplier = cpu_frequency_scale
throughput = sample_input.shape[0] / (latency / 1000.0)
energy_proxy = (latency / 1000.0) * power_watts

return PerfMetrics(
    accuracy=accuracy,
    latency_ms=latency,
    latency_std_ms=latency_std * latency_multiplier,
    latency_p95_ms=latency_p95 * latency_multiplier,
    throughput_sps=throughput,  # Inversely affected by latency
    memory_mb=memory,
    energy_proxy_j=energy_proxy,  # Linearly affected by latency
)
Frequency Scaling Assumptions:
  • Linear latency scaling: latency_scaled = latency_base × scale
  • Assumes memory-bound operations scale proportionally
  • Does not model voltage scaling effects on power consumption
  • Ideal for comparing relative performance at different frequencies
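The linear scaling model is straightforward to reproduce; this sketch (scale_metrics is a hypothetical helper, not in the codebase) applies it to latency, throughput, and the energy proxy exactly as in the collect_metrics excerpt above:

```python
def scale_metrics(latency_ms, batch_size, power_watts, cpu_frequency_scale):
    """Apply the linear frequency-scaling model: latency stretches by the
    scale factor, throughput shrinks inversely, energy grows linearly."""
    latency = latency_ms * cpu_frequency_scale
    throughput_sps = batch_size / (latency / 1000.0)
    energy_proxy_j = (latency / 1000.0) * power_watts
    return latency, throughput_sps, energy_proxy_j

# cpu_frequency_scale = 2.0 simulates a 50% clock: 2x latency, 0.5x throughput
lat, tput, energy = scale_metrics(2.45, batch_size=64, power_watts=2.0,
                                  cpu_frequency_scale=2.0)
print(lat, round(tput, 1), energy)
```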

Latency Distribution Measurement

The metrics.measure_latency_distribution function captures statistical latency behavior:
def measure_latency_distribution(
    model: nn.Module, 
    sample_input: torch.Tensor, 
    repeats: int = 5, 
    num_runs: int = 100, 
    warmup: int = 10
) -> tuple[float, float, float]:
    """
    Measure latency statistics across multiple measurement windows.
    
    Args:
        model: Model in evaluation mode
        sample_input: Example input tensor
        repeats: Number of measurement windows
        num_runs: Iterations per measurement window
        warmup: Warmup iterations before each window
    
    Returns:
        (mean_ms, std_ms, p95_ms)
    """
    latencies = [
        measure_latency(model, sample_input, num_runs=num_runs, warmup=warmup) 
        for _ in range(repeats)
    ]
    latency_tensor = torch.tensor(latencies, dtype=torch.float32)
    return (
        float(latency_tensor.mean()), 
        float(latency_tensor.std(unbiased=False)), 
        float(torch.quantile(latency_tensor, 0.95))
    )
Measurement protocol:
  1. Run warmup iterations to stabilize CPU cache and frequency scaling
  2. Measure num_runs iterations and compute average
  3. Repeat measurement repeats times to capture variability
  4. Report mean, standard deviation, and 95th percentile
From configs/default.yaml defaults:
benchmark_repeats: 5  # Number of measurement windows
Defaults in measure_latency_distribution (not exposed via the config):
  • num_runs: 100 iterations per window
  • warmup: 10 iterations before each window
Total inference calls per candidate: (100 + 10) × 5 = 550
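The mean/std/p95 summary can be sketched with the standard library, assuming torch.quantile's default linear interpolation and population std (matching unbiased=False); summarize_latencies is an illustrative stand-in, not a codebase function:

```python
import statistics

def summarize_latencies(latencies_ms):
    """Mean, population std, and linearly interpolated p95 over
    per-window latency averages."""
    mean = statistics.fmean(latencies_ms)
    std = statistics.pstdev(latencies_ms)  # population std (unbiased=False)
    xs = sorted(latencies_ms)
    pos = 0.95 * (len(xs) - 1)             # fractional rank of the p95
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    p95 = xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)
    return mean, std, p95

# Five measurement windows, matching benchmark_repeats = 5
mean, std, p95 = summarize_latencies([2.41, 2.45, 2.52, 2.47, 2.44])
```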

Hardware Artifacts

Output Files

The hardware.save_hardware_artifacts function generates analysis reports:
def save_hardware_artifacts(
    output_dir: Path,
    layerwise_df: pd.DataFrame,
    precision_df: pd.DataFrame,
    summary: dict[str, float],
) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # CSV reports
    layerwise_df.to_csv(output_dir / "layerwise_breakdown.csv", index=False)
    precision_df.to_csv(output_dir / "precision_tradeoffs.csv", index=False)
    pd.DataFrame([summary]).to_csv(output_dir / "hardware_summary.csv", index=False)
    
    # Visualization plots (see below)
    ...

layerwise_breakdown.csv

Per-layer resource consumption:
| layer | output_elements | parameter_bytes | activation_bytes | macs |
| --- | --- | --- | --- | --- |
| conv1 | 802816 | 640 | 3211264 | 7225344 |
| conv2 | 401408 | 18560 | 1605632 | 57802752 |
| classifier | 640 | 62760 | 2560 | 1003520 |

precision_tradeoffs.csv

Aggregated metrics by precision mode:
| precision | accuracy_mean | latency_ms_mean | memory_mb_mean | energy_proxy_j_mean | accepted_ratio |
| --- | --- | --- | --- | --- | --- |
| int8 | 0.972 | 2.45 | 0.21 | 0.0049 | 1.00 |
| fp16 | 0.985 | 3.12 | 0.42 | 0.0062 | 0.75 |
| fp32 | 0.987 | 4.87 | 0.84 | 0.0097 | 0.25 |

hardware_summary.csv

Bandwidth and compute utilization:
| estimated_total_bytes | estimated_total_macs | achieved_bandwidth_gbps | configured_memory_bandwidth_gbps | bandwidth_utilization | achieved_gmacs |
| --- | --- | --- | --- | --- | --- |
| 4878656 | 66031616 | 1.002 | 12.8 | 0.078 | 13.56 |

Visualization Plots

The artifact generation includes the following plots:

Layer-wise Activation Memory

Bar chart showing activation memory (MB) per layer. Highlights memory-intensive layers for optimization targeting. File: layerwise_activation_memory.png

Layer-wise Compute (MACs)

Bar chart showing multiply-accumulate operations (millions) per layer. Identifies compute bottlenecks. File: layerwise_macs.png

Best Practices

Set Realistic Budgets

Configure active_memory_budget_mb to 60-80% of true device SRAM to account for runtime overhead and activation buffers.

Monitor Acceptance Ratios

Check precision_tradeoffs.csv to ensure sufficient candidates pass budgets. Ratios below 0.5 indicate overly strict constraints.

Validate Bandwidth Estimates

Low bandwidth utilization (<0.1) suggests compute-bound or cache-resident workloads. High utilization (>0.8) indicates memory-bound operations.

Compare Relative Performance

Use hardware metrics for relative comparison across configurations. Validate absolute numbers with hardware counters on target devices.

Limitations and Caveats

Estimation-Based Analysis: Hardware metrics are software-level estimates derived from shape calculations and measured latency. They do not replace performance monitoring unit (PMU) profiling or silicon validation. In particular, the estimates do not capture:
  • Cache effects: L1/L2/L3 cache hits, misses, and prefetching
  • Memory reuse: In-place operations and activation buffer recycling
  • Kernel overhead: Operator dispatch, context switching, and synchronization
  • Hardware counters: Actual memory transactions, instruction counts, and stall cycles
  • Thermal throttling: Dynamic frequency scaling due to thermal limits
  • Co-scheduled workloads: Host-level contention from concurrent processes
Quantization Backend: The default fbgemm backend targets x86 CPUs with AVX-512 VNNI support. Performance on other platforms:
  • ARM CPUs: Use qnnpack backend (requires config change)
  • Older x86: May fall back to slower INT8 emulation
  • Accelerators: GPU/NPU quantization requires different quantization APIs
Backend selection significantly affects INT8 latency and energy estimates.
Peak Memory: Layer-wise activation bytes report output footprint only, not peak runtime memory:
  • Frameworks may reuse buffers across layers
  • Gradient storage is not applicable (inference only)
  • Temporary buffers for operator fusion are not tracked
Actual peak memory can be measured with torch.cuda.max_memory_allocated() on GPU or OS-level profiling tools on CPU.

Next Steps

Model Optimization

Learn how pruning and quantization reduce memory footprint

System Architecture

Understand constraint filtering in the pipeline
