
Overview

The Edge AI Hardware Optimization framework implements a constraint-first optimization pipeline designed for evaluating compact CNN deployments under edge-device constraints. The architecture prioritizes deterministic execution, measurable trade-offs, and low-complexity implementation suitable for iterative experimentation.

Pipeline Stages

The optimization pipeline consists of seven sequential stages that transform a baseline model into deployment-ready candidates:
1. Configuration Load

The edge_opt.config module parses YAML configuration into a typed ExperimentConfig dataclass.
from edge_opt.config import load_config

config = load_config("configs/default.yaml")
# Returns ExperimentConfig with validated parameters:
# - seed, dataset, batch_size, epochs
# - pruning_levels, precisions
# - memory_budgets_mb, active_memory_budget_mb
# - cpu_frequency_scale, memory_bandwidth_gbps
The configuration loader supports scalar parsing with automatic type inference for integers, floats, booleans, lists, and None values.
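The type-inference behavior described above can be illustrated with a standalone sketch. The helper name `infer_scalar` is hypothetical and not part of the edge_opt API; it only demonstrates the scalar-to-type mapping (int, float, bool, list, None, string fallback):

```python
def infer_scalar(raw: str):
    """Infer a Python value from a YAML-style scalar string (illustrative sketch)."""
    text = raw.strip()
    lowered = text.lower()
    if lowered in ("null", "none", "~", ""):
        return None
    if lowered in ("true", "false"):
        return lowered == "true"
    # Lists: comma-separated values inside brackets, parsed recursively
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1].strip()
        return [infer_scalar(part) for part in inner.split(",")] if inner else []
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text  # fall back to a plain string
```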
2. Dataset and Loader Setup

The edge_opt.data module builds deterministic train/validation loaders with reproducible shuffling.
from edge_opt.data import build_loaders

train_loader, val_loader = build_loaders(
    dataset_name="mnist",
    batch_size=config.batch_size,
    train_subset=config.train_subset,
    val_subset=config.val_subset,
    seed=config.dataloader_seed,
    num_workers=config.num_workers
)
Supported datasets: mnist and fashion-mnist. Loaders use fixed Generator seeds to ensure reproducible batch ordering across runs.
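The reproducibility guarantee rests on seeding the shuffle. The principle can be shown without PyTorch (a minimal sketch using Python's `random`; the actual loaders seed a `torch.Generator` instead):

```python
import random

def seeded_batch_order(num_samples: int, batch_size: int, seed: int):
    """Return batches of sample indices in a seed-determined order."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed -> fixed permutation
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]
```

Two runs with the same seed yield identical batch ordering, which is what makes sweep results comparable across invocations.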
3. Baseline Training

The edge_opt.experiments.train_model function trains the compact CNN using Adam optimizer and cross-entropy loss.
from edge_opt.model import SmallCNN
from edge_opt.experiments import train_model

base_model = SmallCNN(conv1_channels=16, conv2_channels=32)
trained_model = train_model(
    model=base_model,
    train_loader=train_loader,
    epochs=config.epochs,
    learning_rate=config.learning_rate,
    device=device
)
The SmallCNN architecture consists of:
  • Conv1: 1 → 16 channels (3x3 kernel, padding=1)
  • MaxPool + ReLU
  • Conv2: 16 → 32 channels (3x3 kernel, padding=1)
  • MaxPool + ReLU
  • Flatten + Linear classifier: 32×7×7 → 10 classes
4. Optimization Sweep

The edge_opt.experiments.run_sweep function applies pruning and precision variants across the configuration space.
from edge_opt.experiments import run_sweep

sweep_df = run_sweep(
    base_model=trained_model,
    val_loader=val_loader,
    calibration_loader=train_loader,
    device=device,
    pruning_levels=config.pruning_levels,      # e.g., [0.0, 0.3, 0.5, 0.7]
    precisions=config.precisions,              # e.g., ["fp32", "fp16", "int8"]
    power_watts=config.power_watts,
    calibration_batches=config.calibration_batches,
    memory_budgets_mb=config.memory_budgets_mb,
    active_memory_budget_mb=config.active_memory_budget_mb,
    latency_multiplier=config.cpu_frequency_scale,
    benchmark_repeats=config.benchmark_repeats
)
Sweep cardinality scales as len(pruning_levels) × len(precisions). Each candidate is evaluated independently.
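The candidate grid is the Cartesian product of the two axes. A minimal sketch of how the sweep might enumerate it (hypothetical; the actual iteration order is internal to edge_opt):

```python
from itertools import product

pruning_levels = [0.0, 0.3, 0.5, 0.7]
precisions = ["fp32", "fp16", "int8"]

# One candidate per (pruning level, precision) pair
candidates = [
    {"pruning": p, "precision": q}
    for p, q in product(pruning_levels, precisions)
]
# Cardinality: len(pruning_levels) * len(precisions) = 4 * 3 = 12
```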
5. Metric Collection

The edge_opt.metrics module computes comprehensive performance metrics for each candidate. Collected metrics:
  • Accuracy: Validation set classification accuracy
  • Latency: Mean, standard deviation, and P95 inference time (ms)
  • Throughput: Samples per second
  • Memory: Model footprint from state dict (MB)
  • Energy Proxy: latency_seconds × power_watts (J)
Latency measurements include warmup iterations to stabilize CPU cache state. Throughput is sensitive to batch size and host load.
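The latency statistics and energy proxy reduce to simple arithmetic over timed samples. A sketch under the definitions above (function and field names are illustrative, not the edge_opt API; the percentile here is a crude index-based estimate):

```python
import statistics

def summarize_latency(samples_ms, power_watts: float, batch_size: int):
    """Aggregate per-batch latency samples (ms) into summary metrics."""
    samples = sorted(samples_ms)
    mean_ms = statistics.mean(samples)
    std_ms = statistics.stdev(samples) if len(samples) > 1 else 0.0
    p95_ms = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {
        "latency_mean_ms": mean_ms,
        "latency_std_ms": std_ms,
        "latency_p95_ms": p95_ms,
        "throughput_sps": batch_size / (mean_ms / 1000.0),   # samples per second
        "energy_proxy_j": (mean_ms / 1000.0) * power_watts,  # latency_seconds * power_watts
    }
```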
6. Constraint Filtering

Candidates are classified by the active memory budget before Pareto frontier generation.
# In run_sweep, each candidate is evaluated:
rejected = metrics.memory_mb > active_memory_budget_mb
row = {
    "accepted": not rejected,
    "active_budget_mb": active_memory_budget_mb,
    **asdict(metrics),
    **violations  # Per-budget violation flags
}
This constraint-first filtering ensures infeasible candidates do not distort operating-point selection.
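The acceptance logic can be reproduced in isolation. This sketch mirrors the hard accept/reject check plus the per-budget violation flags; the function and key names are illustrative, not the edge_opt API:

```python
def classify_candidate(memory_mb: float, active_budget_mb: float, budgets_mb):
    """Hard accept/reject against the active budget, plus per-budget violation flags."""
    violations = {f"violates_{b}mb": memory_mb > b for b in budgets_mb}
    return {
        "accepted": memory_mb <= active_budget_mb,  # hard acceptance threshold
        "active_budget_mb": active_budget_mb,
        "memory_mb": memory_mb,
        **violations,  # reporting-only flags, one per configured budget
    }
```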
7. Reporting

The pipeline generates:
  • Sweep tables: sweep_results.csv
  • Pareto frontiers: pareto_frontier_latency.csv, pareto_frontier_energy.csv
  • Summary JSON: summary.json
  • Hardware analysis: layerwise_breakdown.csv, precision_tradeoffs.csv, hardware_summary.csv
  • Visualizations: accuracy vs latency/energy/memory plots, layer-wise activation memory and MACs

Design Decisions

A fixed network topology isolates pruning and precision effects from architecture-search noise. The compact CNN keeps iteration cycle times short while retaining realistic convolutional operator behavior. Trade-off: simplicity vs. representational capacity. The small architecture enables rapid experimentation but may not capture all real-world deployment complexities.
Structured pruning removes whole channels to preserve dense kernels and straightforward deployment compatibility. Unlike unstructured pruning, this approach:
  • Maintains dense tensor operations (no sparse kernel support needed)
  • Reduces actual runtime memory and compute (not just parameter count)
  • Simplifies hardware deployment (no specialized sparse accelerators required)
See edge_opt.pruning.structured_channel_prune in Model Optimization for implementation details.
Precision modes (fp32, fp16, int8) are explicit to keep evaluation paths auditable:
  • FP32: Baseline floating-point (no conversion)
  • FP16: Half-precision using .half() conversion
  • INT8: Post-training static quantization with fbgemm backend
Each precision path is independently verifiable and produces deterministic results given fixed calibration data.
Memory budget checks run before Pareto frontier analysis. This design ensures:
  • Infeasible candidates are explicitly marked as rejected
  • Pareto frontiers only include deployable configurations
  • Operating-point selection respects hard constraints
The active_memory_budget_mb parameter acts as the hard acceptance threshold, while memory_budgets_mb provides additional violation flags for reporting.
Pareto frontiers are computed after constraint filtering to avoid infeasible configurations:
def pareto_frontier(df: pd.DataFrame, x_col: str) -> pd.DataFrame:
    # Only consider accepted candidates
    ranked = df[df["accepted"]].sort_values(
        [x_col, "accuracy"], 
        ascending=[True, False]
    )
    frontier = []
    best_accuracy = -1.0
    for _, row in ranked.iterrows():
        if row["accuracy"] > best_accuracy:
            frontier.append(row)
            best_accuracy = row["accuracy"]
    return pd.DataFrame(frontier)
Separate frontiers are generated for latency-accuracy and energy-accuracy trade-offs.
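Applied to a toy table, the frontier keeps only points where accuracy strictly improves as the cost axis grows. This self-contained example repeats the function with synthetic data (column names are illustrative):

```python
import pandas as pd

def pareto_frontier(df: pd.DataFrame, x_col: str) -> pd.DataFrame:
    # Only consider accepted candidates, cheapest first, best accuracy first on ties
    ranked = df[df["accepted"]].sort_values([x_col, "accuracy"], ascending=[True, False])
    frontier, best_accuracy = [], -1.0
    for _, row in ranked.iterrows():
        if row["accuracy"] > best_accuracy:  # keep only strict accuracy improvements
            frontier.append(row)
            best_accuracy = row["accuracy"]
    return pd.DataFrame(frontier)

df = pd.DataFrame({
    "latency_ms": [1.0, 2.0, 3.0],
    "accuracy":   [0.90, 0.85, 0.95],
    "accepted":   [True, True, True],
})
frontier = pareto_frontier(df, "latency_ms")
# The 2.0 ms candidate is dominated: slower and less accurate than the 1.0 ms point
```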

Operational Constraints

CPU Execution Only

The default pipeline executes on CPU only, reflecting common edge integration constraints where accelerator access may be limited. Rationale: Many edge devices lack GPUs or specialized accelerators. CPU-focused benchmarking provides realistic baseline estimates.

No Distributed Training

Single-node training only. No multi-GPU or distributed training support. Rationale: Edge deployment targets are typically single-device inference scenarios. Training infrastructure is simplified to match.

FBGEMM Quantization Backend

INT8 quantization defaults to PyTorch's fbgemm backend for x86 CPU targets. Rationale: FBGEMM provides optimized INT8 kernels for server and edge x86 processors. Alternative backends (e.g., QNNPACK for ARM) require configuration changes.

Config-Driven Workflow

Most experiment knobs are externalized in YAML to support repeatable benchmark sweeps. Rationale: Configuration files enable version control, reproducibility, and systematic hyperparameter exploration without code changes.
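A configuration file in this style might look like the following. Field names are taken from the parameters shown earlier in this page; the values are purely illustrative:

```yaml
seed: 42
dataset: mnist
batch_size: 64
epochs: 3
pruning_levels: [0.0, 0.3, 0.5, 0.7]
precisions: [fp32, fp16, int8]
memory_budgets_mb: [0.5, 1.0, 2.0]
active_memory_budget_mb: 1.0
cpu_frequency_scale: 1.0
memory_bandwidth_gbps: 12.8
```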

Deployment Challenges

Batch-Size Sensitivity: Throughput estimates are sensitive to selected batch size and host load. Single-request latency behavior may differ significantly from batch inference measurements.
Memory Headroom: Production deployments typically require tighter limits than nominal model-size estimates. Memory headroom margins should account for activation buffers, runtime overhead, and OS memory pressure.
Host-Level Contention: Co-scheduled workloads, thermal throttling, and CPU frequency scaling can significantly alter latency distributions. Hardware analysis estimates do not include cache-miss penalties or kernel launch overhead.

Module Reference

The pipeline architecture is implemented across these core modules:
Module          Location                        Responsibility
Config          src/edge_opt/config.py          YAML parsing, configuration validation
Data            src/edge_opt/data.py            Dataset loading, deterministic loaders
Model           src/edge_opt/model.py           SmallCNN architecture, deterministic seeding
Pruning         src/edge_opt/pruning.py         Structured channel pruning
Quantization    src/edge_opt/quantization.py    FP16 and INT8 conversion
Metrics         src/edge_opt/metrics.py         Performance measurement, constraint checking
Experiments     src/edge_opt/experiments.py     Training, sweep orchestration, Pareto frontiers
Hardware        src/edge_opt/hardware.py        Layer-wise analysis, bandwidth utilization
Future Work

Multi-Seed Orchestration: Add multi-seed experiment aggregation and confidence intervals to improve statistical rigor.
Hardware Counters: Integrate performance monitoring unit (PMU) counters for cache, bandwidth, and instruction-level profiling to replace software-level estimates.
Artifact Manifests: Introduce model checksum and dataset version metadata to ensure full reproducibility and artifact traceability.

Next Steps

Model Optimization

Learn about pruning and quantization techniques

Hardware Constraints

Understand memory budgets and performance modeling
