Prerequisites
Before starting, ensure you have:
- Python 3.8 or higher
- 4GB+ available RAM
- Basic familiarity with PyTorch and model optimization concepts
The pipeline runs on CPU by default to reflect edge deployment constraints. GPU acceleration is not required.
Installation
Install dependencies
Install the required packages from requirements.txt. The key dependencies are:
- torch and torchvision for model training and datasets
- pandas for sweep result aggregation
- matplotlib for Pareto frontier visualization
- pyyaml for configuration parsing
- onnx and onnxruntime for export and deployment simulation
Running the pipeline
Basic execution
Run the optimization pipeline with the default configuration. The pipeline will:
- Load the Fashion-MNIST dataset with deterministic train/validation splits
- Train a baseline SmallCNN model (2 conv layers, 1 classifier)
- Sweep structured pruning levels: [0.0, 0.25, 0.5, 0.7]
- Evaluate precision variants: fp32, fp16, int8
- Measure latency, throughput, memory footprint, and energy proxy
- Filter candidates by memory budget constraints
- Generate Pareto frontiers for latency-accuracy and energy-accuracy
- Save results, plots, and hardware analysis to outputs/
The default configuration uses small dataset subsets (12,000 train / 3,000 validation samples) and 2 training epochs for fast iteration. See Configuration Reference for production settings.
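The sweep enumerates every combination of pruning level and precision; with the defaults this is a 12-candidate grid. A minimal sketch of that enumeration (illustrative only, not the pipeline's actual code):

```python
from itertools import product

pruning_levels = [0.0, 0.25, 0.5, 0.7]
precisions = ["fp32", "fp16", "int8"]

# Every (pruning_level, precision) pair becomes one sweep candidate.
candidates = [
    {"pruning_level": p, "precision": q}
    for p, q in product(pruning_levels, precisions)
]
print(len(candidates))  # 12
```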
Pipeline output
Expected runtime: 2-5 minutes on a modern CPU. The terminal displays a JSON summary at the end.
Understanding the configuration
The pipeline is controlled by configs/default.yaml. Here are the key parameters:
Key configuration parameters
Determinism controls
- seed: Global random seed for model initialization and training
- dataloader_seed: Separate seed for dataset shuffling
- num_workers: DataLoader worker count (set to 2 for reproducibility)
- benchmark_repeats: Number of latency measurement windows for variance reporting
Hardware constraints
- memory_bandwidth_gbps: Target device memory bandwidth (used for bandwidth utilization estimates)
- power_watts: Fixed power draw assumption for energy proxy calculation
- memory_budgets_mb: List of SRAM-style memory limits to check violations against
- active_memory_budget_mb: Hard threshold for candidate acceptance/rejection
- cpu_frequency_scale: Simulates a lower clock frequency (scales latency by 1.0 / scale)
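For instance, cpu_frequency_scale acts as a simple latency multiplier; a sketch of the stated relationship (illustrative helper name, not the pipeline's API):

```python
def scaled_latency_ms(latency_ms, cpu_frequency_scale):
    # Latency scales by 1.0 / scale: half the simulated clock, twice the latency.
    return latency_ms * (1.0 / cpu_frequency_scale)

print(scaled_latency_ms(4.0, 0.5))  # 8.0
print(scaled_latency_ms(4.0, 1.0))  # 4.0
```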
Optimization sweep
- pruning_levels: Channel pruning ratios to evaluate (0.0 = no pruning, 0.7 = 70% of channels removed)
- precisions: Numeric formats to test (fp32, fp16, int8)
- calibration_batches: Number of batches for INT8 quantization calibration
The sweep evaluates len(pruning_levels) * len(precisions) = 12 candidates by default.
Dataset and training
- dataset: Currently supports fashion-mnist and mnist
- batch_size: Training and inference batch size
- epochs: Training epochs for the baseline model (kept low for fast iteration)
- train_subset / val_subset: Dataset size limits for controlled experiments
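Putting the four groups together, configs/default.yaml might look like the sketch below. Values marked as placeholders are illustrative assumptions; the rest use defaults stated elsewhere in this guide.

```yaml
# Sketch of configs/default.yaml; placeholder values are assumptions.
seed: 42                     # placeholder
dataloader_seed: 123         # placeholder
num_workers: 2
benchmark_repeats: 5         # placeholder

memory_bandwidth_gbps: 8.0   # placeholder
power_watts: 5.0
memory_budgets_mb: [1.0, 2.0, 4.0]
active_memory_budget_mb: 1.0
cpu_frequency_scale: 1.0     # placeholder

pruning_levels: [0.0, 0.25, 0.5, 0.7]
precisions: [fp32, fp16, int8]
calibration_batches: 8

dataset: fashion-mnist
batch_size: 64               # placeholder
epochs: 2
train_subset: 12000
val_subset: 3000

output_dir: outputs/
```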
Output artifacts
All results are saved to the directory specified by output_dir (default: outputs/):
sweep_results.csv
Complete sweep table with all 12 candidates:
- pruning_level, precision
- accuracy, latency_ms, latency_std_ms, latency_p95_ms
- throughput_sps, memory_mb, energy_proxy_j
- accepted (boolean under the active budget)
- violates_1.0mb, violates_2.0mb, violates_4.0mb
pareto_frontier_latency.csv
Subset of accepted candidates on the latency-accuracy Pareto frontier. Each point represents a configuration where no other accepted candidate has both lower latency and higher accuracy.
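The frontier computation itself is straightforward to sketch. Assuming candidate records shaped like sweep_results.csv rows (the helper name is illustrative, not the pipeline's API):

```python
def pareto_frontier(candidates):
    """Keep candidates not dominated by any other candidate.

    A candidate is dominated if some other candidate has both
    lower latency and higher accuracy.
    """
    frontier = []
    for c in candidates:
        dominated = any(
            other["latency_ms"] < c["latency_ms"]
            and other["accuracy"] > c["accuracy"]
            for other in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c["latency_ms"])

candidates = [
    {"name": "fp32/0.0", "latency_ms": 10.0, "accuracy": 0.91},
    {"name": "int8/0.0", "latency_ms": 4.0, "accuracy": 0.90},
    {"name": "int8/0.5", "latency_ms": 3.0, "accuracy": 0.87},
    {"name": "fp16/0.25", "latency_ms": 6.0, "accuracy": 0.89},  # dominated by int8/0.0
]

frontier = pareto_frontier(candidates)
print([c["name"] for c in frontier])  # → ['int8/0.5', 'int8/0.0', 'fp32/0.0']
```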
pareto_frontier_energy.csv
Subset of accepted candidates on the energy-accuracy Pareto frontier. Useful for battery-constrained deployments.
summary.json
High-level summary with:
- Baseline metrics
- Best latency and energy configurations
- Pareto frontier counts
- Deployment simulation statistics
layerwise_breakdown.csv
Per-layer analysis of the baseline model:
- Output shape and activation memory
- MACs (multiply-accumulate operations)
- Parameter count and memory
precision_tradeoffs.csv
Aggregated statistics by precision:
- Mean accuracy, latency, memory
- Acceptance ratio under active budget
- Standard deviations
hardware_summary.csv
Hardware-level estimates:
- Total MACs
- Arithmetic intensity (ops per byte)
- Bandwidth utilization
- Roofline model positioning
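Arithmetic intensity is operations per byte of memory traffic, and the roofline "ridge point" (peak compute divided by peak bandwidth) separates bandwidth-bound from compute-bound operation. A simplified sketch of both, assuming 2 ops per MAC and illustrative helper names:

```python
def arithmetic_intensity(macs, bytes_moved):
    # Two ops per MAC (multiply + add) over total memory traffic in bytes.
    return 2 * macs / bytes_moved

def is_compute_bound(intensity_ops_per_byte, peak_gops, bandwidth_gbps):
    # Roofline ridge point: peak compute / peak bandwidth (ops per byte).
    return intensity_ops_per_byte > peak_gops / bandwidth_gbps

# Example: 1.0 MMACs against 0.5 MB of weight + activation traffic.
ai = arithmetic_intensity(1_000_000, 500_000)
print(ai)                               # 4.0 ops/byte
print(is_compute_bound(ai, 10.0, 5.0))  # True (ridge point at 2 ops/byte)
```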
Plots (PNG)
- accuracy_vs_latency.png: Scatter plot with Pareto frontier
- accuracy_vs_energy.png: Energy-accuracy tradeoff
- accuracy_vs_memory.png: Memory footprint distribution
- layerwise_activation_memory.png: Per-layer activation sizes
- layerwise_macs.png: Computational cost breakdown
Model architecture
The pipeline uses SmallCNN, a compact convolutional network defined in src/edge_opt/model.py:
- Input: 28×28 grayscale images (Fashion-MNIST / MNIST)
- Conv1: 1→16 channels, 3×3 kernel, followed by ReLU + MaxPool
- Conv2: 16→32 channels, 3×3 kernel, followed by ReLU + MaxPool
- Classifier: Fully connected layer (1568 → 10 classes)
- Total parameters (baseline): ~51,000
- Model size (FP32): ~0.31 MB
Why this architecture? SmallCNN balances realistic convolutional operator behavior with fast iteration cycles. It’s large enough to exhibit meaningful pruning/quantization tradeoffs but small enough to sweep configurations in minutes.
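The layer shapes above pin down the arithmetic behind layerwise_breakdown.csv-style numbers. A back-of-envelope sketch (helper names are illustrative, and "same" padding is assumed so conv layers preserve spatial size):

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k):
    # One multiply-accumulate per kernel tap, per input channel, per output element.
    return h_out * w_out * c_out * c_in * k * k

def conv2d_params(c_in, c_out, k, bias=True):
    return c_out * c_in * k * k + (c_out if bias else 0)

# Conv1: 1 -> 16 channels, 3x3 kernel, on a 28x28 map.
print(conv2d_macs(28, 28, 1, 16, 3))  # 112896 MACs
print(conv2d_params(1, 16, 3))        # 160 parameters

# After two 2x2 max-pools, the 28x28 map is 7x7; flattened with
# 32 channels this gives the classifier's 1568 input features.
print(7 * 7 * 32)  # 1568
```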
Interpreting results
Accuracy vs latency tradeoff
The accuracy_vs_latency.png plot shows three categories:
- Accepted candidates (blue): Models that fit within active_memory_budget_mb
- Rejected candidates (gray X): Models exceeding the memory budget
- Pareto frontier (red line): Optimal configurations where no other point has both lower latency and higher accuracy
Typical observations:
- Heavier pruning (0.5, 0.7) reduces latency but may sacrifice accuracy
- INT8 quantization often provides 2-3× latency improvement with <2% accuracy loss
- FP16 offers a middle ground between FP32 accuracy and INT8 speed
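One common way to act on these observations is to pick the fastest accepted candidate within an accuracy-loss tolerance. A minimal sketch, with a hypothetical helper and records shaped like sweep_results.csv rows:

```python
def fastest_within_tolerance(candidates, baseline_accuracy, max_drop=0.02):
    """Fastest accepted candidate losing at most max_drop accuracy."""
    eligible = [
        c for c in candidates
        if c["accepted"] and baseline_accuracy - c["accuracy"] <= max_drop
    ]
    return min(eligible, key=lambda c: c["latency_ms"]) if eligible else None

rows = [
    {"precision": "fp32", "accuracy": 0.910, "latency_ms": 10.0, "accepted": True},
    {"precision": "int8", "accuracy": 0.895, "latency_ms": 4.0, "accepted": True},
    {"precision": "int8", "accuracy": 0.860, "latency_ms": 3.0, "accepted": True},
]
best = fastest_within_tolerance(rows, baseline_accuracy=0.910)
print(best["precision"], best["latency_ms"])  # int8 4.0
```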
Memory budget violations
The sweep table includes a boolean flag for each budget threshold in memory_budgets_mb (violates_1.0mb, violates_2.0mb, violates_4.0mb).
Latency statistics
Each candidate reports three latency metrics:
- latency_ms: Mean latency across benchmark_repeats windows
- latency_std_ms: Standard deviation (indicates measurement stability)
- latency_p95_ms: 95th percentile (important for tail latency SLAs)
All three appear as columns in sweep_results.csv.
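These statistics can be computed from the raw timing windows as in the sketch below (the pipeline's own implementation may differ; a nearest-rank p95 is assumed here):

```python
import statistics

def latency_stats(samples_ms):
    """Mean, standard deviation, and p95 over benchmark windows."""
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile; adequate for small sample counts.
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return {
        "latency_ms": statistics.mean(samples_ms),
        "latency_std_ms": statistics.stdev(samples_ms),
        "latency_p95_ms": p95,
    }

# One slow outlier window barely moves the mean but dominates p95.
windows = [9.8, 10.1, 10.0, 10.4, 9.9, 10.2, 10.0, 9.7, 10.3, 12.5]
stats = latency_stats(windows)
print(round(stats["latency_ms"], 2))   # 10.29
print(stats["latency_p95_ms"])         # 12.5
```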
Energy proxy
The energy estimate is computed as energy_proxy_j = power_watts × latency in seconds. With the default power_watts: 5.0:
- Latency 10 ms → Energy 0.05 J
- Latency 5 ms → Energy 0.025 J
This proxy does not account for:
- Actual CPU/accelerator power draw during inference
- Memory access patterns and bandwidth
- Idle vs active power states
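As a sanity check, the proxy is a one-line multiplication (illustrative helper name):

```python
def energy_proxy_j(latency_ms, power_watts=5.0):
    # Energy (J) = power (W) x time (s); latency is given in milliseconds.
    return power_watts * latency_ms / 1000.0

print(energy_proxy_j(10.0))  # 0.05
print(energy_proxy_j(5.0))   # 0.025
```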
For production energy budgeting, measure power draw using hardware tools like Joulescope, PowerMonitor, or on-device PMICs (Power Management ICs).
Next steps
Configuration Guide
Customize sweep parameters, hardware constraints, and training hyperparameters
Model Optimization
Deep dive into pruning strategies, quantization calibration, and precision modes
Hardware Analysis
Analyze layer-wise bottlenecks, memory bandwidth, and arithmetic intensity
Deployment Guide
Optimize deployment with Pareto frontiers and memory budget constraints
Troubleshooting
ImportError: No module named 'edge_opt'
Ensure you’ve set the Python path so that src/ is importable (for this layout, e.g. export PYTHONPATH=src), or run from the repository root with absolute imports.
Dataset download hangs or fails
Fashion-MNIST/MNIST downloads from torchvision.datasets may fail due to network issues or mirror downtime.
Solution:
- Check your internet connection
- Clear the torchvision cache: rm -rf ~/.cache/torch/datasets
- Retry the pipeline
- If downloads keep failing, obtain the dataset locally and set dataset_root in data.py to the local path
High latency variance (latency_std_ms > 2ms)
Possible causes:
- Host CPU load from other processes
- Thermal throttling
- Insufficient benchmark_repeats (increase to 10-20)
Solution:
- Close background applications
- Run on a dedicated benchmarking host
- Increase benchmark_repeats in configs/default.yaml
All candidates rejected (accepted=false)
If active_memory_budget_mb is too aggressive, all configurations may be rejected.
Solution:
- Increase active_memory_budget_mb (e.g., from 1.0 to 2.0 MB)
- Check memory_mb in the baseline output to understand actual model sizes
- Adjust pruning_levels to include more aggressive pruning (e.g., [0.5, 0.7, 0.85])
INT8 accuracy drops >5%
Quantization calibration may be insufficient.
Solution:
- Increase calibration_batches from 8 to 16-32
- Use a larger train_subset for better activation range estimation
- Inspect layer-wise quantization sensitivity (see Quantization Guide)
For additional support, check the GitHub Issues or consult the Contributing Guide.