This guide walks you through installing dependencies, running the baseline pipeline, and interpreting the hardware optimization results. You’ll train a compact CNN, sweep pruning and precision variants, and generate Pareto-optimal configurations under memory constraints.

Prerequisites

Before starting, ensure you have:
  • Python 3.8 or higher
  • 4GB+ available RAM
  • Basic familiarity with PyTorch and model optimization concepts
The pipeline runs on CPU by default to reflect edge deployment constraints. GPU acceleration is not required.

Installation

Step 1: Clone the repository

Clone the Edge AI Hardware Optimization repository to your local machine:
git clone <repository-url>
cd edge-ai-hardware-optimization

Step 2: Create a virtual environment

Set up an isolated Python environment to avoid dependency conflicts:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Step 3: Install dependencies

Install the required packages from requirements.txt:
pip install -r requirements.txt
The key dependencies are:
  • torch and torchvision for model training and datasets
  • pandas for sweep result aggregation
  • matplotlib for Pareto frontier visualization
  • pyyaml for configuration parsing
  • onnx and onnxruntime for export and deployment simulation

Step 4: Set the Python path

Add the src directory to your Python path so modules can be imported:
export PYTHONPATH=src
Add this to your .bashrc or .zshrc to make it persistent across sessions.

Running the pipeline

Basic execution

Run the optimization pipeline with the default configuration:
python scripts/run_pipeline.py --config configs/default.yaml
The pipeline will:
  1. Load the Fashion-MNIST dataset with deterministic train/validation splits
  2. Train a baseline SmallCNN model (2 conv layers, 1 classifier)
  3. Sweep structured pruning levels: [0.0, 0.25, 0.5, 0.7]
  4. Evaluate precision variants: fp32, fp16, int8
  5. Measure latency, throughput, memory footprint, and energy proxy
  6. Filter candidates by memory budget constraints
  7. Generate Pareto frontiers for latency-accuracy and energy-accuracy
  8. Save results, plots, and hardware analysis to outputs/
The default configuration uses small dataset subsets (12,000 train / 3,000 validation samples) and 2 training epochs for fast iteration. See Configuration Reference for production settings.
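The deterministic split in step 1 comes down to shuffling indices with a fixed seed before slicing. A pure-Python analogue of what dataloader_seed controls (illustrative; not the repository's actual data-loading code):

```python
import random

def deterministic_split(n_samples, n_train, n_val, seed=7):
    """Shuffle all indices with a fixed seed, then slice off the splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:n_train + n_val]

train_idx, val_idx = deterministic_split(60_000, 12_000, 3_000)
# Same seed -> identical split on every run.
assert deterministic_split(60_000, 12_000, 3_000) == (train_idx, val_idx)
print(len(train_idx), len(val_idx))  # 12000 3000
```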

Pipeline output

Expected runtime: 2-5 minutes on a modern CPU. The terminal displays a JSON summary at the end:
{
  "baseline": {
    "accuracy": 0.876,
    "latency_ms": 12.3,
    "latency_std_ms": 0.4,
    "latency_p95_ms": 12.9,
    "throughput_sps": 10406.5,
    "memory_mb": 0.31,
    "energy_proxy_j": 0.0615,
    "violates_1.0mb": false,
    "violates_2.0mb": false,
    "violates_4.0mb": false,
    "accepted_under_active_budget": true
  },
  "sweep_summary": {
    "total_candidates": 12,
    "accepted_candidates": 10,
    "best_latency_config": {
      "pruning_level": 0.7,
      "precision": "int8",
      "accuracy": 0.823,
      "latency_ms": 6.8
    },
    "best_energy_config": {
      "pruning_level": 0.5,
      "precision": "int8",
      "accuracy": 0.854,
      "energy_proxy_j": 0.034
    }
  },
  "pareto_frontier_latency": 4,
  "pareto_frontier_energy": 4
}

Understanding the configuration

The pipeline is controlled by configs/default.yaml. Here are the key parameters:
seed: 7
dataloader_seed: 7
num_workers: 2
benchmark_repeats: 5

# Hardware constraints
memory_bandwidth_gbps: 12.8
power_watts: 5.0
memory_budgets_mb: [1.0, 2.0, 4.0]
active_memory_budget_mb: 2.0
cpu_frequency_scale: 0.7

# Dataset and training
dataset: fashion-mnist
batch_size: 128
epochs: 2
learning_rate: 0.001
train_subset: 12000
val_subset: 3000

# Optimization sweep
pruning_levels: [0.0, 0.25, 0.5, 0.7]
precisions: [fp32, fp16, int8]
calibration_batches: 8

output_dir: outputs

Key configuration parameters

  • seed: Global random seed for model initialization and training
  • dataloader_seed: Separate seed for dataset shuffling
  • num_workers: DataLoader worker count (set to 2 for reproducibility)
  • benchmark_repeats: Number of latency measurement windows for variance reporting
Deterministic mode disables some PyTorch optimizations. For production benchmarking, consider relaxing these constraints after validating correctness.
  • memory_bandwidth_gbps: Target device memory bandwidth (used for bandwidth utilization estimates)
  • power_watts: Fixed power draw assumption for energy proxy calculation
  • memory_budgets_mb: List of SRAM-style memory limits to check violations against
  • active_memory_budget_mb: Hard threshold for candidate acceptance/rejection
  • cpu_frequency_scale: Simulates lower clock frequency (scales latency by 1.0 / scale)
These parameters model edge device constraints like ARM Cortex-M7 or low-power Cortex-A cores.
  • pruning_levels: Channel pruning ratios to evaluate (0.0 = no pruning, 0.7 = 70% channels removed)
  • precisions: Numeric formats to test (fp32, fp16, int8)
  • calibration_batches: Number of batches for INT8 quantization calibration
Total sweep cardinality: len(pruning_levels) * len(precisions) = 12 candidates by default.
  • dataset: Currently supports fashion-mnist and mnist
  • batch_size: Training and inference batch size
  • epochs: Training epochs for baseline model (kept low for fast iteration)
  • train_subset / val_subset: Dataset size limits for controlled experiments
For production-grade accuracy, increase epochs to 10-20 and remove subset limits.
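The sweep cardinality noted above is easy to sanity-check, and itertools.product is the natural way to enumerate the candidate grid (illustrative):

```python
from itertools import product

pruning_levels = [0.0, 0.25, 0.5, 0.7]
precisions = ["fp32", "fp16", "int8"]

# Each (pruning_level, precision) pair is one sweep candidate.
candidates = list(product(pruning_levels, precisions))
print(len(candidates))  # 12
```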

Output artifacts

All results are saved to the directory specified by output_dir (default: outputs/):

sweep_results.csv

Complete sweep table with all 12 candidates:
  • pruning_level, precision
  • accuracy, latency_ms, latency_std_ms, latency_p95_ms
  • throughput_sps, memory_mb, energy_proxy_j
  • accepted (boolean under active budget)
  • violates_1.0mb, violates_2.0mb, violates_4.0mb

pareto_frontier_latency.csv

Subset of accepted candidates on the latency-accuracy Pareto frontier. Each point represents a configuration where no other accepted candidate has both lower latency and higher accuracy.
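The dominance test described above can be sketched in a few lines (a hypothetical helper, not the repository's implementation):

```python
def pareto_frontier_latency(points):
    """Keep (latency_ms, accuracy) points that are not strictly dominated.

    A point is dominated if some other point has both lower latency
    and higher accuracy.
    """
    return sorted(
        (lat, acc)
        for lat, acc in points
        if not any(o_lat < lat and o_acc > acc for o_lat, o_acc in points)
    )

points = [(6.8, 0.823), (7.2, 0.854), (12.3, 0.876), (9.0, 0.830)]
print(pareto_frontier_latency(points))
# [(6.8, 0.823), (7.2, 0.854), (12.3, 0.876)] -- (9.0, 0.830) is dominated
```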

pareto_frontier_energy.csv

Subset of accepted candidates on the energy-accuracy Pareto frontier. Useful for battery-constrained deployments.

summary.json

High-level summary with:
  • Baseline metrics
  • Best latency and energy configurations
  • Pareto frontier counts
  • Deployment simulation statistics

layerwise_breakdown.csv

Per-layer analysis of the baseline model:
  • Output shape and activation memory
  • MACs (multiply-accumulate operations)
  • Parameter count and memory
Identifies bottleneck layers for targeted optimization.
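The MAC counts follow the standard convolution formula out_channels × H_out × W_out × in_channels × K × K. A back-of-envelope check against the baseline SmallCNN shapes (3×3 kernels, padding 1, 2×2 pooling); the repository's exact accounting may differ:

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    # One multiply-accumulate per kernel element per output activation.
    return c_out * h_out * w_out * c_in * k * k

macs_conv1 = conv_macs(1, 16, 3, 28, 28)   # 112,896
macs_conv2 = conv_macs(16, 32, 3, 14, 14)  # 903,168
macs_fc = 32 * 7 * 7 * 10                  # 15,680 (dense layer: in x out)
print(macs_conv1 + macs_conv2 + macs_fc)   # 1,031,744 total MACs
```

Note that conv2 dominates the compute despite conv1 seeing the larger spatial resolution, which is why it is the usual pruning target.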

precision_tradeoffs.csv

Aggregated statistics by precision:
  • Mean accuracy, latency, memory
  • Acceptance ratio under active budget
  • Standard deviations

hardware_summary.csv

Hardware-level estimates:
  • Total MACs
  • Arithmetic intensity (ops per byte)
  • Bandwidth utilization
  • Roofline model positioning
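Arithmetic intensity divides operations by bytes moved. One common definition, counting each MAC as two ops (multiply + add); the repository's exact byte accounting may differ, and the numbers below are hypothetical:

```python
def arithmetic_intensity(macs, bytes_moved):
    """Ops per byte, counting each MAC as two ops (multiply + add)."""
    return 2 * macs / bytes_moved

# Hypothetical figures: ~1.03M MACs against ~0.4 MB of weights and
# activations traversing memory per inference.
ai = arithmetic_intensity(1_031_744, 400_000)
print(round(ai, 2))  # 5.16 ops/byte
```

Low intensity places a layer on the bandwidth-bound side of the roofline, meaning memory traffic, not compute, limits throughput.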

Plots (PNG)

  • accuracy_vs_latency.png: Scatter plot with Pareto frontier
  • accuracy_vs_energy.png: Energy-accuracy tradeoff
  • accuracy_vs_memory.png: Memory footprint distribution
  • layerwise_activation_memory.png: Per-layer activation sizes
  • layerwise_macs.png: Computational cost breakdown

Model architecture

The pipeline uses SmallCNN, a compact convolutional network defined in src/edge_opt/model.py:
model.py
import torch
from torch import nn


class SmallCNN(nn.Module):
    def __init__(self, conv1_channels: int = 16, conv2_channels: int = 32, num_classes: int = 10) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(1, conv1_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(conv1_channels, conv2_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(conv2_channels * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.relu(self.conv1(x)))  # 28x28 -> 14x14
        x = self.pool(self.relu(self.conv2(x)))  # 14x14 -> 7x7
        x = x.flatten(start_dim=1)
        return self.classifier(x)
Architecture details:
  • Input: 28×28 grayscale images (Fashion-MNIST / MNIST)
  • Conv1: 1→16 channels, 3×3 kernel, followed by ReLU + MaxPool
  • Conv2: 16→32 channels, 3×3 kernel, followed by ReLU + MaxPool
  • Classifier: Fully connected layer (1568 → 10 classes)
  • Total parameters (baseline): ~51,000
  • Model size (FP32): ~0.31 MB
Why this architecture? SmallCNN balances realistic convolutional operator behavior with fast iteration cycles. It’s large enough to exhibit meaningful pruning/quantization tradeoffs but small enough to sweep configurations in minutes.

Interpreting results

Accuracy vs latency tradeoff

The accuracy_vs_latency.png plot shows three categories:
  1. Accepted candidates (blue): Models that fit within active_memory_budget_mb
  2. Rejected candidates (gray X): Models exceeding memory budget
  3. Pareto frontier (red line): Optimal configurations where no other point has both lower latency and higher accuracy
Key insights:
  • Heavier pruning (0.5, 0.7) reduces latency but may sacrifice accuracy
  • INT8 quantization often provides 2-3× latency improvement with <2% accuracy loss
  • FP16 offers a middle ground between FP32 accuracy and INT8 speed

Memory budget violations

The sweep table includes boolean flags for each budget threshold:
pruning_level,precision,memory_mb,violates_1.0mb,violates_2.0mb,violates_4.0mb,accepted
0.0,fp32,0.31,false,false,false,true
0.0,fp16,0.16,false,false,false,true
0.0,int8,0.08,false,false,false,true
0.25,fp32,0.23,false,false,false,true
...
Production deployment caveat: These memory estimates only account for model parameters. Real edge systems must also budget for:
  • Intermediate activation tensors
  • Input/output buffers
  • Framework overhead
  • OS and application memory
Rule of thumb: Reserve 2-4× the model size for total SRAM budget.
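The violation flags and acceptance decision reduce to simple threshold checks; an illustrative sketch of the bookkeeping:

```python
def budget_flags(memory_mb, budgets_mb=(1.0, 2.0, 4.0), active_mb=2.0):
    # One boolean column per budget, plus the hard accept/reject decision.
    flags = {f"violates_{b}mb": memory_mb > b for b in budgets_mb}
    flags["accepted"] = memory_mb <= active_mb
    return flags

print(budget_flags(0.31))
# {'violates_1.0mb': False, 'violates_2.0mb': False, 'violates_4.0mb': False, 'accepted': True}
```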

Latency statistics

Each candidate reports three latency metrics:
  • latency_ms: Mean latency across benchmark_repeats windows
  • latency_std_ms: Standard deviation (indicates measurement stability)
  • latency_p95_ms: 95th percentile (important for tail latency SLAs)
Example from sweep_results.csv:
pruning_level,precision,latency_ms,latency_std_ms,latency_p95_ms
0.5,int8,7.2,0.3,7.6
CPU frequency scaling: The configuration parameter cpu_frequency_scale: 0.7 simulates running at 70% clock frequency. Latency is multiplied by 1.0 / 0.7 ≈ 1.43 to model this constraint. Adjust this based on your target device’s DVFS (dynamic voltage/frequency scaling) settings.
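The three statistics and the frequency scaling can be reproduced from raw timing samples with the standard library (a sketch; the pipeline's estimator may use a different p95 interpolation):

```python
import statistics

def latency_stats(samples_ms, cpu_frequency_scale=0.7):
    # Model the slower clock: each measured time grows by 1 / scale.
    scaled = sorted(t / cpu_frequency_scale for t in samples_ms)
    p95_index = max(0, round(0.95 * len(scaled)) - 1)
    return {
        "latency_ms": statistics.mean(scaled),
        "latency_std_ms": statistics.stdev(scaled),
        "latency_p95_ms": scaled[p95_index],
    }

stats = latency_stats([5.0, 5.1, 4.9, 5.3, 5.0])
print({k: round(v, 2) for k, v in stats.items()})
```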

Energy proxy

The energy estimate is computed as:
energy_proxy_j = (latency_ms / 1000.0) * power_watts
With default power_watts: 5.0:
  • Latency 10ms → Energy 0.05 J
  • Latency 5ms → Energy 0.025 J
This is a first-order approximation. Real energy consumption depends on:
  • Actual CPU/accelerator power draw during inference
  • Memory access patterns and bandwidth
  • Idle vs active power states
For production energy budgeting, measure power draw using hardware tools like Joulescope, PowerMonitor, or on-device PMICs (Power Management ICs).
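The proxy itself is one line of arithmetic:

```python
def energy_proxy_j(latency_ms, power_watts=5.0):
    # Energy (J) = time (s) x power (W), assuming constant draw.
    return (latency_ms / 1000.0) * power_watts

print(energy_proxy_j(10.0))  # 0.05
print(energy_proxy_j(5.0))   # 0.025
```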

Next steps

Configuration Guide

Customize sweep parameters, hardware constraints, and training hyperparameters

Model Optimization

Deep dive into pruning strategies, quantization calibration, and precision modes

Hardware Analysis

Analyze layer-wise bottlenecks, memory bandwidth, and arithmetic intensity

Deployment Guide

Optimize deployment with Pareto frontiers and memory budget constraints

Troubleshooting

Module import errors

Ensure you’ve set the Python path:
export PYTHONPATH=src
Or run from the repository root with absolute imports.
Dataset download failures

Fashion-MNIST/MNIST downloads from torchvision.datasets may fail due to network issues or mirror downtime. Solution:
  1. Check your internet connection
  2. Clear the torchvision cache: rm -rf ~/.cache/torch/datasets
  3. Retry the pipeline
For air-gapped environments, manually download datasets and point dataset_root in data.py to the local path.
Unstable latency measurements

Possible causes:
  • Host CPU load from other processes
  • Thermal throttling
  • Insufficient benchmark_repeats (increase to 10-20)
Solution:
  • Close background applications
  • Run on a dedicated benchmarking host
  • Increase benchmark_repeats in configs/default.yaml
All candidates rejected

If active_memory_budget_mb is too aggressive, all configurations may be rejected. Solution:
  • Increase active_memory_budget_mb (e.g., from 1.0 to 2.0 MB)
  • Check memory_mb in the baseline output to understand actual model sizes
  • Adjust pruning_levels to include more aggressive pruning (e.g., [0.5, 0.7, 0.85])
Poor INT8 accuracy

Quantization calibration may be insufficient. Solution:
  • Increase calibration_batches from 8 to 16-32
  • Use a larger train_subset for better activation range estimation
  • Inspect layer-wise quantization sensitivity (see Quantization Guide)
For additional support, check the GitHub Issues or consult the Contributing Guide.
