
Overview

The Edge AI Hardware Optimization framework implements a constraint-first optimization pipeline designed for evaluating compact CNN deployments under edge-device constraints. The architecture prioritizes deterministic execution, measurable trade-offs, and low-complexity implementation suitable for iterative experimentation.

Pipeline Stages

The optimization pipeline consists of seven sequential stages that transform a baseline model into deployment-ready candidates:
1. Configuration Load

The edge_opt.config module parses YAML configuration into a typed ExperimentConfig dataclass.
from edge_opt.config import load_config

config = load_config("configs/default.yaml")
# Returns ExperimentConfig with validated parameters:
# - seed, dataset, batch_size, epochs
# - pruning_levels, precisions
# - memory_budgets_mb, active_memory_budget_mb
# - cpu_frequency_scale, memory_bandwidth_gbps
The configuration loader supports scalar parsing with automatic type inference for integers, floats, booleans, lists, and None values.
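The type-inference behavior described above can be illustrated with a standalone sketch. The helper name `infer_scalar` is hypothetical and not part of the edge_opt API; it only demonstrates the scalar-to-type mapping (int, float, bool, list, None, string fallback):

```python
def infer_scalar(raw: str):
    """Infer a Python value from a YAML-style scalar string (illustrative sketch)."""
    text = raw.strip()
    lowered = text.lower()
    if lowered in ("null", "none", "~", ""):
        return None
    if lowered in ("true", "false"):
        return lowered == "true"
    # Lists: comma-separated values inside brackets, parsed recursively
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1].strip()
        return [infer_scalar(part) for part in inner.split(",")] if inner else []
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text  # fall back to a plain string
```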
2. Dataset and Loader Setup

The edge_opt.data module builds deterministic train/validation loaders with reproducible shuffling.
from edge_opt.data import build_loaders

train_loader, val_loader = build_loaders(
    dataset_name="mnist",
    batch_size=config.batch_size,
    train_subset=config.train_subset,
    val_subset=config.val_subset,
    seed=config.dataloader_seed,
    num_workers=config.num_workers
)
Supported datasets: mnist and fashion-mnist. Loaders use fixed Generator seeds to ensure reproducible batch ordering across runs.
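The reproducibility guarantee rests on seeding the shuffle. The principle can be shown without PyTorch (a minimal sketch using Python's `random`; the actual loaders seed a `torch.Generator` instead):

```python
import random

def seeded_batch_order(num_samples: int, batch_size: int, seed: int):
    """Return batches of sample indices in a seed-determined order."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed -> fixed permutation
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]
```

Two runs with the same seed yield identical batch ordering, which is what makes sweep results comparable across invocations.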
3. Baseline Training

The edge_opt.experiments.train_model function trains the compact CNN using Adam optimizer and cross-entropy loss.
from edge_opt.model import SmallCNN
from edge_opt.experiments import train_model

base_model = SmallCNN(conv1_channels=16, conv2_channels=32)
trained_model = train_model(
    model=base_model,
    train_loader=train_loader,
    epochs=config.epochs,
    learning_rate=config.learning_rate,
    device=device
)
The SmallCNN architecture consists of:
  • Conv1: 1 → 16 channels (3x3 kernel, padding=1)
  • MaxPool + ReLU
  • Conv2: 16 → 32 channels (3x3 kernel, padding=1)
  • MaxPool + ReLU
  • Flatten + Linear classifier: 32×7×7 → 10 classes
4. Optimization Sweep

The edge_opt.experiments.run_sweep function applies pruning and precision variants across the configuration space.
from edge_opt.experiments import run_sweep

sweep_df = run_sweep(
    base_model=trained_model,
    val_loader=val_loader,
    calibration_loader=train_loader,
    device=device,
    pruning_levels=config.pruning_levels,      # e.g., [0.0, 0.3, 0.5, 0.7]
    precisions=config.precisions,              # e.g., ["fp32", "fp16", "int8"]
    power_watts=config.power_watts,
    calibration_batches=config.calibration_batches,
    memory_budgets_mb=config.memory_budgets_mb,
    active_memory_budget_mb=config.active_memory_budget_mb,
    latency_multiplier=config.cpu_frequency_scale,
    benchmark_repeats=config.benchmark_repeats
)
Sweep cardinality scales as len(pruning_levels) × len(precisions). Each candidate is evaluated independently.
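The candidate grid is the Cartesian product of the two axes. A minimal sketch of how the sweep might enumerate it (hypothetical; the actual iteration order is internal to edge_opt):

```python
from itertools import product

pruning_levels = [0.0, 0.3, 0.5, 0.7]
precisions = ["fp32", "fp16", "int8"]

# One candidate per (pruning level, precision) pair
candidates = [
    {"pruning": p, "precision": q}
    for p, q in product(pruning_levels, precisions)
]
# Cardinality: len(pruning_levels) * len(precisions) = 4 * 3 = 12
```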
5. Metric Collection

The edge_opt.metrics module computes comprehensive performance metrics for each candidate. Collected metrics:
  • Accuracy: Validation set classification accuracy
  • Latency: Mean, standard deviation, and P95 inference time (ms)
  • Throughput: Samples per second
  • Memory: Model footprint from state dict (MB)
  • Energy Proxy: latency_seconds × power_watts (J)
Latency measurements include warmup iterations to stabilize CPU cache state. Throughput is sensitive to batch size and host load.
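The latency statistics and energy proxy reduce to simple arithmetic over timed samples. A sketch under the definitions above (function and field names are illustrative, not the edge_opt API; the percentile here is a crude index-based estimate):

```python
import statistics

def summarize_latency(samples_ms, power_watts: float, batch_size: int):
    """Aggregate per-batch latency samples (ms) into summary metrics."""
    samples = sorted(samples_ms)
    mean_ms = statistics.mean(samples)
    std_ms = statistics.stdev(samples) if len(samples) > 1 else 0.0
    p95_ms = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {
        "latency_mean_ms": mean_ms,
        "latency_std_ms": std_ms,
        "latency_p95_ms": p95_ms,
        "throughput_sps": batch_size / (mean_ms / 1000.0),   # samples per second
        "energy_proxy_j": (mean_ms / 1000.0) * power_watts,  # latency_seconds * power_watts
    }
```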
6. Constraint Filtering

Candidates are classified by the active memory budget before Pareto frontier generation.
# In run_sweep, each candidate is evaluated:
rejected = metrics.memory_mb > active_memory_budget_mb
row = {
    "accepted": not rejected,
    "active_budget_mb": active_memory_budget_mb,
    **asdict(metrics),
    **violations  # Per-budget violation flags
}
This constraint-first filtering ensures infeasible candidates do not distort operating-point selection.
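The acceptance logic can be reproduced in isolation. This sketch mirrors the hard accept/reject check plus the per-budget violation flags; the function and key names are illustrative, not the edge_opt API:

```python
def classify_candidate(memory_mb: float, active_budget_mb: float, budgets_mb):
    """Hard accept/reject against the active budget, plus per-budget violation flags."""
    violations = {f"violates_{b}mb": memory_mb > b for b in budgets_mb}
    return {
        "accepted": memory_mb <= active_budget_mb,  # hard acceptance threshold
        "active_budget_mb": active_budget_mb,
        "memory_mb": memory_mb,
        **violations,  # reporting-only flags, one per configured budget
    }
```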
7. Reporting

The pipeline generates:
  • Sweep tables: sweep_results.csv
  • Pareto frontiers: pareto_frontier_latency.csv, pareto_frontier_energy.csv
  • Summary JSON: summary.json
  • Hardware analysis: layerwise_breakdown.csv, precision_tradeoffs.csv, hardware_summary.csv
  • Visualizations: accuracy vs latency/energy/memory plots, layer-wise activation memory and MACs

Design Decisions

A fixed network topology isolates pruning and precision effects from architecture-search noise. The compact CNN keeps iteration cycle times short while retaining realistic convolutional operator behavior. Trade-off: simplicity vs. representational capacity. The small architecture enables rapid experimentation but may not capture all real-world deployment complexities.
Structured pruning removes whole channels to preserve dense kernels and straightforward deployment compatibility. Unlike unstructured pruning, this approach:
  • Maintains dense tensor operations (no sparse kernel support needed)
  • Reduces actual runtime memory and compute (not just parameter count)
  • Simplifies hardware deployment (no specialized sparse accelerators required)
See edge_opt.pruning.structured_channel_prune in Model Optimization for implementation details.
Precision modes (fp32, fp16, int8) are explicit to keep evaluation paths auditable:
  • FP32: Baseline floating-point (no conversion)
  • FP16: Half-precision using .half() conversion
  • INT8: Post-training static quantization with fbgemm backend
Each precision path is independently verifiable and produces deterministic results given fixed calibration data.
Memory budget checks run before Pareto frontier analysis. This design ensures:
  • Infeasible candidates are explicitly marked as rejected
  • Pareto frontiers only include deployable configurations
  • Operating-point selection respects hard constraints
The active_memory_budget_mb parameter acts as the hard acceptance threshold, while memory_budgets_mb provides additional violation flags for reporting.
Pareto frontiers are computed after constraint filtering to avoid infeasible configurations:
def pareto_frontier(df: pd.DataFrame, x_col: str) -> pd.DataFrame:
    # Only consider accepted candidates
    ranked = df[df["accepted"]].sort_values(
        [x_col, "accuracy"], 
        ascending=[True, False]
    )
    frontier = []
    best_accuracy = -1.0
    for _, row in ranked.iterrows():
        if row["accuracy"] > best_accuracy:
            frontier.append(row)
            best_accuracy = row["accuracy"]
    return pd.DataFrame(frontier)
Separate frontiers are generated for latency-accuracy and energy-accuracy trade-offs.
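Applied to a toy table, the frontier keeps only points where accuracy strictly improves as the cost axis grows. This self-contained example repeats the function with synthetic data (column names are illustrative):

```python
import pandas as pd

def pareto_frontier(df: pd.DataFrame, x_col: str) -> pd.DataFrame:
    # Only consider accepted candidates, cheapest first, best accuracy first on ties
    ranked = df[df["accepted"]].sort_values([x_col, "accuracy"], ascending=[True, False])
    frontier, best_accuracy = [], -1.0
    for _, row in ranked.iterrows():
        if row["accuracy"] > best_accuracy:  # keep only strict accuracy improvements
            frontier.append(row)
            best_accuracy = row["accuracy"]
    return pd.DataFrame(frontier)

df = pd.DataFrame({
    "latency_ms": [1.0, 2.0, 3.0],
    "accuracy":   [0.90, 0.85, 0.95],
    "accepted":   [True, True, True],
})
frontier = pareto_frontier(df, "latency_ms")
# The 2.0 ms candidate is dominated: slower and less accurate than the 1.0 ms point
```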

Operational Constraints

CPU Execution Only

The default pipeline executes on CPU only, reflecting common edge integration constraints where accelerator access may be limited. Rationale: Many edge devices lack GPUs or specialized accelerators. CPU-focused benchmarking provides realistic baseline estimates.

No Distributed Training

Single-node training only. No multi-GPU or distributed training support. Rationale: Edge deployment targets are typically single-device inference scenarios. Training infrastructure is simplified to match.

FBGEMM Quantization Backend

INT8 quantization defaults to PyTorch's fbgemm backend for x86 CPU targets. Rationale: FBGEMM provides optimized INT8 kernels for server and edge x86 processors. Alternative backends (e.g., QNNPACK for ARM) require configuration changes.

Config-Driven Workflow

Most experiment knobs are externalized in YAML to support repeatable benchmark sweeps. Rationale: Configuration files enable version control, reproducibility, and systematic hyperparameter exploration without code changes.
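A configuration file in this style might look like the following. Field names are taken from the parameters shown earlier in this page; the values are purely illustrative:

```yaml
seed: 42
dataset: mnist
batch_size: 64
epochs: 3
pruning_levels: [0.0, 0.3, 0.5, 0.7]
precisions: [fp32, fp16, int8]
memory_budgets_mb: [0.5, 1.0, 2.0]
active_memory_budget_mb: 1.0
cpu_frequency_scale: 1.0
memory_bandwidth_gbps: 12.8
```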

Deployment Challenges

Batch-Size Sensitivity: Throughput estimates are sensitive to selected batch size and host load. Single-request latency behavior may differ significantly from batch inference measurements.
Memory Headroom: Production deployments typically require tighter limits than nominal model-size estimates. Memory headroom margins should account for activation buffers, runtime overhead, and OS memory pressure.
Host-Level Contention: Co-scheduled workloads, thermal throttling, and CPU frequency scaling can significantly alter latency distributions. Hardware analysis estimates do not include cache-miss penalties or kernel launch overhead.

Module Reference

The pipeline architecture is implemented across these core modules:
Module          Location                        Responsibility
Config          src/edge_opt/config.py          YAML parsing, configuration validation
Data            src/edge_opt/data.py            Dataset loading, deterministic loaders
Model           src/edge_opt/model.py           SmallCNN architecture, deterministic seeding
Pruning         src/edge_opt/pruning.py         Structured channel pruning
Quantization    src/edge_opt/quantization.py    FP16 and INT8 conversion
Metrics         src/edge_opt/metrics.py         Performance measurement, constraint checking
Experiments     src/edge_opt/experiments.py     Training, sweep orchestration, Pareto frontiers
Hardware        src/edge_opt/hardware.py        Layer-wise analysis, bandwidth utilization
Future Work

Multi-Seed Orchestration: Add multi-seed experiment aggregation and confidence intervals to improve statistical rigor.
Hardware Counters: Integrate performance monitoring unit (PMU) counters for cache, bandwidth, and instruction-level profiling to replace software-level estimates.
Artifact Manifests: Introduce model checksum and dataset version metadata to ensure full reproducibility and artifact traceability.

Next Steps

Model Optimization

Learn about pruning and quantization techniques

Hardware Constraints

Understand memory budgets and performance modeling
