This guide walks you through installing dependencies, running the baseline pipeline, and interpreting the hardware optimization results. You’ll train a compact CNN, sweep pruning and precision variants, and generate Pareto-optimal configurations under memory constraints.

Prerequisites

Before starting, ensure you have:
  • Python 3.8 or higher
  • 4GB+ available RAM
  • Basic familiarity with PyTorch and model optimization concepts
The pipeline runs on CPU by default to reflect edge deployment constraints. GPU acceleration is not required.

Installation

Step 1: Clone the repository

Clone the Edge AI Hardware Optimization repository to your local machine:
git clone <repository-url>
cd edge-ai-hardware-optimization

Step 2: Create a virtual environment

Set up an isolated Python environment to avoid dependency conflicts:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Step 3: Install dependencies

Install the required packages from requirements.txt:
pip install -r requirements.txt
The key dependencies are:
  • torch and torchvision for model training and datasets
  • pandas for sweep result aggregation
  • matplotlib for Pareto frontier visualization
  • pyyaml for configuration parsing
  • onnx and onnxruntime for export and deployment simulation

Step 4: Set the Python path

Add the src directory to your Python path so modules can be imported:
export PYTHONPATH=src
Add this to your .bashrc or .zshrc to make it persistent across sessions.

Running the pipeline

Basic execution

Run the optimization pipeline with the default configuration:
python scripts/run_pipeline.py --config configs/default.yaml
The pipeline will:
  1. Load the Fashion-MNIST dataset with deterministic train/validation splits
  2. Train a baseline SmallCNN model (2 conv layers, 1 classifier)
  3. Sweep structured pruning levels: [0.0, 0.25, 0.5, 0.7]
  4. Evaluate precision variants: fp32, fp16, int8
  5. Measure latency, throughput, memory footprint, and energy proxy
  6. Filter candidates by memory budget constraints
  7. Generate Pareto frontiers for latency-accuracy and energy-accuracy
  8. Save results, plots, and hardware analysis to outputs/
The default configuration uses small dataset subsets (12,000 train / 3,000 validation samples) and 2 training epochs for fast iteration. See Configuration Reference for production settings.
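The deterministic split in step 1 comes down to shuffling indices with a fixed seed before slicing. A pure-Python analogue of what dataloader_seed controls (illustrative; not the repository's actual data-loading code):

```python
import random

def deterministic_split(n_samples, n_train, n_val, seed=7):
    """Shuffle all indices with a fixed seed, then slice off the splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:n_train + n_val]

train_idx, val_idx = deterministic_split(60_000, 12_000, 3_000)
# Same seed -> identical split on every run.
assert deterministic_split(60_000, 12_000, 3_000) == (train_idx, val_idx)
print(len(train_idx), len(val_idx))  # 12000 3000
```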

Pipeline output

Expected runtime: 2-5 minutes on a modern CPU. The terminal displays a JSON summary at the end:
{
  "baseline": {
    "accuracy": 0.876,
    "latency_ms": 12.3,
    "latency_std_ms": 0.4,
    "latency_p95_ms": 12.9,
    "throughput_sps": 10406.5,
    "memory_mb": 0.31,
    "energy_proxy_j": 0.0615,
    "violates_1.0mb": false,
    "violates_2.0mb": false,
    "violates_4.0mb": false,
    "accepted_under_active_budget": true
  },
  "sweep_summary": {
    "total_candidates": 12,
    "accepted_candidates": 10,
    "best_latency_config": {
      "pruning_level": 0.7,
      "precision": "int8",
      "accuracy": 0.823,
      "latency_ms": 6.8
    },
    "best_energy_config": {
      "pruning_level": 0.5,
      "precision": "int8",
      "accuracy": 0.854,
      "energy_proxy_j": 0.034
    }
  },
  "pareto_frontier_latency": 4,
  "pareto_frontier_energy": 4
}

Understanding the configuration

The pipeline is controlled by configs/default.yaml. Here are the key parameters:
seed: 7
dataloader_seed: 7
num_workers: 2
benchmark_repeats: 5

# Hardware constraints
memory_bandwidth_gbps: 12.8
power_watts: 5.0
memory_budgets_mb: [1.0, 2.0, 4.0]
active_memory_budget_mb: 2.0
cpu_frequency_scale: 0.7

# Dataset and training
dataset: fashion-mnist
batch_size: 128
epochs: 2
learning_rate: 0.001
train_subset: 12000
val_subset: 3000

# Optimization sweep
pruning_levels: [0.0, 0.25, 0.5, 0.7]
precisions: [fp32, fp16, int8]
calibration_batches: 8

output_dir: outputs

Key configuration parameters

  • seed: Global random seed for model initialization and training
  • dataloader_seed: Separate seed for dataset shuffling
  • num_workers: DataLoader worker count (set to 2 for reproducibility)
  • benchmark_repeats: Number of latency measurement windows for variance reporting
Deterministic mode disables some PyTorch optimizations. For production benchmarking, consider relaxing these constraints after validating correctness.
  • memory_bandwidth_gbps: Target device memory bandwidth (used for bandwidth utilization estimates)
  • power_watts: Fixed power draw assumption for energy proxy calculation
  • memory_budgets_mb: List of SRAM-style memory limits to check violations against
  • active_memory_budget_mb: Hard threshold for candidate acceptance/rejection
  • cpu_frequency_scale: Simulates lower clock frequency (scales latency by 1.0 / scale)
These parameters model edge device constraints like ARM Cortex-M7 or low-power Cortex-A cores.
  • pruning_levels: Channel pruning ratios to evaluate (0.0 = no pruning, 0.7 = 70% channels removed)
  • precisions: Numeric formats to test (fp32, fp16, int8)
  • calibration_batches: Number of batches for INT8 quantization calibration
Total sweep cardinality: len(pruning_levels) * len(precisions) = 12 candidates by default.
  • dataset: Currently supports fashion-mnist and mnist
  • batch_size: Training and inference batch size
  • epochs: Training epochs for baseline model (kept low for fast iteration)
  • train_subset / val_subset: Dataset size limits for controlled experiments
For production-grade accuracy, increase epochs to 10-20 and remove subset limits.
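The sweep cardinality noted above is easy to sanity-check, and itertools.product is the natural way to enumerate the candidate grid (illustrative):

```python
from itertools import product

pruning_levels = [0.0, 0.25, 0.5, 0.7]
precisions = ["fp32", "fp16", "int8"]

# Each (pruning_level, precision) pair is one sweep candidate.
candidates = list(product(pruning_levels, precisions))
print(len(candidates))  # 12
```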

Output artifacts

All results are saved to the directory specified by output_dir (default: outputs/):

sweep_results.csv

Complete sweep table with all 12 candidates:
  • pruning_level, precision
  • accuracy, latency_ms, latency_std_ms, latency_p95_ms
  • throughput_sps, memory_mb, energy_proxy_j
  • accepted (boolean under active budget)
  • violates_1.0mb, violates_2.0mb, violates_4.0mb

pareto_frontier_latency.csv

Subset of accepted candidates on the latency-accuracy Pareto frontier. Each point represents a configuration where no other accepted candidate has both lower latency and higher accuracy.
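The dominance test described above can be sketched in a few lines (a hypothetical helper, not the repository's implementation):

```python
def pareto_frontier_latency(points):
    """Keep (latency_ms, accuracy) points that are not strictly dominated.

    A point is dominated if some other point has both lower latency
    and higher accuracy.
    """
    return sorted(
        (lat, acc)
        for lat, acc in points
        if not any(o_lat < lat and o_acc > acc for o_lat, o_acc in points)
    )

points = [(6.8, 0.823), (7.2, 0.854), (12.3, 0.876), (9.0, 0.830)]
print(pareto_frontier_latency(points))
# [(6.8, 0.823), (7.2, 0.854), (12.3, 0.876)] -- (9.0, 0.830) is dominated
```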

pareto_frontier_energy.csv

Subset of accepted candidates on the energy-accuracy Pareto frontier. Useful for battery-constrained deployments.

summary.json

High-level summary with:
  • Baseline metrics
  • Best latency and energy configurations
  • Pareto frontier counts
  • Deployment simulation statistics

layerwise_breakdown.csv

Per-layer analysis of the baseline model:
  • Output shape and activation memory
  • MACs (multiply-accumulate operations)
  • Parameter count and memory
Identifies bottleneck layers for targeted optimization.
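The MAC counts follow the standard convolution formula out_channels × H_out × W_out × in_channels × K × K. A back-of-envelope check against the baseline SmallCNN shapes (3×3 kernels, padding 1, 2×2 pooling); the repository's exact accounting may differ:

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    # One multiply-accumulate per kernel element per output activation.
    return c_out * h_out * w_out * c_in * k * k

macs_conv1 = conv_macs(1, 16, 3, 28, 28)   # 112,896
macs_conv2 = conv_macs(16, 32, 3, 14, 14)  # 903,168
macs_fc = 32 * 7 * 7 * 10                  # 15,680 (dense layer: in x out)
print(macs_conv1 + macs_conv2 + macs_fc)   # 1,031,744 total MACs
```

Note that conv2 dominates the compute despite conv1 seeing the larger spatial resolution, which is why it is the usual pruning target.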

precision_tradeoffs.csv

Aggregated statistics by precision:
  • Mean accuracy, latency, memory
  • Acceptance ratio under active budget
  • Standard deviations

hardware_summary.csv

Hardware-level estimates:
  • Total MACs
  • Arithmetic intensity (ops per byte)
  • Bandwidth utilization
  • Roofline model positioning
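Arithmetic intensity divides operations by bytes moved. One common definition, counting each MAC as two ops (multiply + add); the repository's exact byte accounting may differ, and the numbers below are hypothetical:

```python
def arithmetic_intensity(macs, bytes_moved):
    """Ops per byte, counting each MAC as two ops (multiply + add)."""
    return 2 * macs / bytes_moved

# Hypothetical figures: ~1.03M MACs against ~0.4 MB of weights and
# activations traversing memory per inference.
ai = arithmetic_intensity(1_031_744, 400_000)
print(round(ai, 2))  # 5.16 ops/byte
```

Low intensity places a layer on the bandwidth-bound side of the roofline, meaning memory traffic, not compute, limits throughput.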

Plots (PNG)

  • accuracy_vs_latency.png: Scatter plot with Pareto frontier
  • accuracy_vs_energy.png: Energy-accuracy tradeoff
  • accuracy_vs_memory.png: Memory footprint distribution
  • layerwise_activation_memory.png: Per-layer activation sizes
  • layerwise_macs.png: Computational cost breakdown

Model architecture

The pipeline uses SmallCNN, a compact convolutional network defined in src/edge_opt/model.py:
model.py
import torch
from torch import nn


class SmallCNN(nn.Module):
    def __init__(self, conv1_channels: int = 16, conv2_channels: int = 32, num_classes: int = 10) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(1, conv1_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(conv1_channels, conv2_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(conv2_channels * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.relu(self.conv1(x)))  # 28x28 -> 14x14
        x = self.pool(self.relu(self.conv2(x)))  # 14x14 -> 7x7
        x = x.flatten(start_dim=1)
        return self.classifier(x)
Architecture details:
  • Input: 28×28 grayscale images (Fashion-MNIST / MNIST)
  • Conv1: 1→16 channels, 3×3 kernel, followed by ReLU + MaxPool
  • Conv2: 16→32 channels, 3×3 kernel, followed by ReLU + MaxPool
  • Classifier: Fully connected layer (1568 → 10 classes)
  • Total parameters (baseline): ~51,000
  • Model size (FP32): ~0.31 MB
Why this architecture? SmallCNN balances realistic convolutional operator behavior with fast iteration cycles. It’s large enough to exhibit meaningful pruning/quantization tradeoffs but small enough to sweep configurations in minutes.

Interpreting results

Accuracy vs latency tradeoff

The accuracy_vs_latency.png plot shows three categories:
  1. Accepted candidates (blue): Models that fit within active_memory_budget_mb
  2. Rejected candidates (gray X): Models exceeding memory budget
  3. Pareto frontier (red line): Optimal configurations where no other point has both lower latency and higher accuracy
Key insights:
  • Heavier pruning (0.5, 0.7) reduces latency but may sacrifice accuracy
  • INT8 quantization often provides 2-3× latency improvement with <2% accuracy loss
  • FP16 offers a middle ground between FP32 accuracy and INT8 speed

Memory budget violations

The sweep table includes boolean flags for each budget threshold:
pruning_level,precision,memory_mb,violates_1.0mb,violates_2.0mb,violates_4.0mb,accepted
0.0,fp32,0.31,false,false,false,true
0.0,fp16,0.16,false,false,false,true
0.0,int8,0.08,false,false,false,true
0.25,fp32,0.23,false,false,false,true
...
Production deployment caveat: These memory estimates only account for model parameters. Real edge systems must also budget for:
  • Intermediate activation tensors
  • Input/output buffers
  • Framework overhead
  • OS and application memory
Rule of thumb: Reserve 2-4× the model size for total SRAM budget.
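The violation flags and acceptance decision reduce to simple threshold checks; an illustrative sketch of the bookkeeping:

```python
def budget_flags(memory_mb, budgets_mb=(1.0, 2.0, 4.0), active_mb=2.0):
    # One boolean column per budget, plus the hard accept/reject decision.
    flags = {f"violates_{b}mb": memory_mb > b for b in budgets_mb}
    flags["accepted"] = memory_mb <= active_mb
    return flags

print(budget_flags(0.31))
# {'violates_1.0mb': False, 'violates_2.0mb': False, 'violates_4.0mb': False, 'accepted': True}
```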

Latency statistics

Each candidate reports three latency metrics:
  • latency_ms: Mean latency across benchmark_repeats windows
  • latency_std_ms: Standard deviation (indicates measurement stability)
  • latency_p95_ms: 95th percentile (important for tail latency SLAs)
Example from sweep_results.csv:
pruning_level,precision,latency_ms,latency_std_ms,latency_p95_ms
0.5,int8,7.2,0.3,7.6
CPU frequency scaling: The configuration parameter cpu_frequency_scale: 0.7 simulates running at 70% clock frequency. Latency is multiplied by 1.0 / 0.7 ≈ 1.43 to model this constraint. Adjust this based on your target device’s DVFS (dynamic voltage/frequency scaling) settings.
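The three statistics and the frequency scaling can be reproduced from raw timing samples with the standard library (a sketch; the pipeline's estimator may use a different p95 interpolation):

```python
import statistics

def latency_stats(samples_ms, cpu_frequency_scale=0.7):
    # Model the slower clock: each measured time grows by 1 / scale.
    scaled = sorted(t / cpu_frequency_scale for t in samples_ms)
    p95_index = max(0, round(0.95 * len(scaled)) - 1)
    return {
        "latency_ms": statistics.mean(scaled),
        "latency_std_ms": statistics.stdev(scaled),
        "latency_p95_ms": scaled[p95_index],
    }

stats = latency_stats([5.0, 5.1, 4.9, 5.3, 5.0])
print({k: round(v, 2) for k, v in stats.items()})
```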

Energy proxy

The energy estimate is computed as:
energy_proxy_j = (latency_ms / 1000.0) * power_watts
With default power_watts: 5.0:
  • Latency 10ms → Energy 0.05 J
  • Latency 5ms → Energy 0.025 J
This is a first-order approximation. Real energy consumption depends on:
  • Actual CPU/accelerator power draw during inference
  • Memory access patterns and bandwidth
  • Idle vs active power states
For production energy budgeting, measure power draw using hardware tools like Joulescope, PowerMonitor, or on-device PMICs (Power Management ICs).
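The proxy itself is one line of arithmetic:

```python
def energy_proxy_j(latency_ms, power_watts=5.0):
    # Energy (J) = time (s) x power (W), assuming constant draw.
    return (latency_ms / 1000.0) * power_watts

print(energy_proxy_j(10.0))  # 0.05
print(energy_proxy_j(5.0))   # 0.025
```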

Next steps

Configuration Guide

Customize sweep parameters, hardware constraints, and training hyperparameters

Model Optimization

Deep dive into pruning strategies, quantization calibration, and precision modes

Hardware Analysis

Analyze layer-wise bottlenecks, memory bandwidth, and arithmetic intensity

Deployment Guide

Optimize deployment with Pareto frontiers and memory budget constraints

Troubleshooting

Module import errors

Ensure you’ve set the Python path:
export PYTHONPATH=src
Or run from the repository root with absolute imports.
Dataset download failures

Fashion-MNIST/MNIST downloads from torchvision.datasets may fail due to network issues or mirror downtime. Solution:
  1. Check your internet connection
  2. Clear the torchvision cache: rm -rf ~/.cache/torch/datasets
  3. Retry the pipeline
For air-gapped environments, manually download datasets and point dataset_root in data.py to the local path.
Unstable latency measurements

Possible causes:
  • Host CPU load from other processes
  • Thermal throttling
  • Insufficient benchmark_repeats (increase to 10-20)
Solution:
  • Close background applications
  • Run on a dedicated benchmarking host
  • Increase benchmark_repeats in configs/default.yaml
All candidates rejected

If active_memory_budget_mb is too aggressive, all configurations may be rejected. Solution:
  • Increase active_memory_budget_mb (e.g., from 1.0 to 2.0 MB)
  • Check memory_mb in the baseline output to understand actual model sizes
  • Adjust pruning_levels to include more aggressive pruning (e.g., [0.5, 0.7, 0.85])
Poor INT8 accuracy

Quantization calibration may be insufficient. Solution:
  • Increase calibration_batches from 8 to 16-32
  • Use a larger train_subset for better activation range estimation
  • Inspect layer-wise quantization sensitivity (see Quantization Guide)
For additional support, check the GitHub Issues or consult the Contributing Guide.
