Scenario
Digit recognition on memory-constrained edge hardware demonstrates the extreme end of resource-limited deployment. This case study addresses:

- Strict memory budgets (often < 1MB for model + runtime)
- Limited compute capacity (low-power CPUs without acceleration)
- Deterministic latency requirements for real-time processing
- Power consumption constraints for battery-operated devices
This scenario represents embedded systems like IoT devices, microcontrollers, or edge sensors where every byte and CPU cycle matters.
System Design Decisions
Architecture Overview
The embedded digit classifier implements aggressive optimization for a minimal resource footprint:

ONNX Export with CPU-Only Path

Model exported to ONNX format with CPU-only operators, ensuring compatibility with resource-constrained devices without a GPU or NPU.
Small Batch-Size Inference
Single-sample or micro-batch inference (batch size ≤ 4) for deterministic tail latency and minimal memory allocation.
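A minimal sketch of the micro-batch pattern described above: requests are copied into a statically preallocated buffer and run in batches of at most four. `run_model` is a hypothetical stand-in for an ONNX Runtime session call, and the buffer sizes are illustrative.

```python
# Sketch: deterministic micro-batch inference with a preallocated static buffer.
MAX_BATCH = 4
IMG_SIZE = 28 * 28

# Allocated once at startup; reused for every inference call (no per-request allocation).
batch_buffer = [[0.0] * IMG_SIZE for _ in range(MAX_BATCH)]

def run_model(batch):
    # Placeholder for a session.run(...) call; returns one dummy logit vector per sample.
    return [[0.0] * 10 for _ in batch]

def infer(samples):
    """Run inference on up to MAX_BATCH samples without new allocations."""
    assert 1 <= len(samples) <= MAX_BATCH, "micro-batch only"
    for i, sample in enumerate(samples):
        batch_buffer[i][:] = sample  # copy in place into the static buffer
    return run_model(batch_buffer[: len(samples)])

preds = infer([[0.5] * IMG_SIZE, [0.1] * IMG_SIZE])
```

Capping the batch size keeps activation memory bounded and tail latency predictable, at the cost of per-sample throughput.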
Optimization Strategy
Model Architecture
Memory Budget
Static Allocation
- Model Weights: 200KB - 800KB (quantized)
- Runtime Overhead: 100KB - 300KB
- Activation Memory: 50KB - 200KB per inference
- Total Budget: ~1MB peak memory usage
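The budget check itself is simple arithmetic: worst-case resident memory is the sum of the components that are live simultaneously. The figures below are illustrative mid-range estimates within the stated ranges, not measurements.

```python
# Sketch: validating a static memory budget against component estimates (in KB).
BUDGET_KB = 1024  # ~1MB peak budget

def peak_memory_kb(weights_kb, runtime_kb, activation_kb):
    """Worst-case resident memory: all components live at inference time."""
    return weights_kb + runtime_kb + activation_kb

# Illustrative mid-range figures for a quantized deployment.
peak = peak_memory_kb(weights_kb=500, runtime_kb=200, activation_kb=100)
fits_budget = peak <= BUDGET_KB
```

Note that the upper bounds of the stated ranges sum to more than 1MB, so components must be profiled jointly rather than budgeted at their individual maxima.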
Optimization Techniques
- Quantization: FP32 → INT8 (4x reduction)
- Operator Fusion: Reduce intermediate allocations
- In-place Operations: Minimize memory copies
- Static Buffers: Preallocated, reused across inferences
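The in-place and static-buffer techniques can be sketched together: a scratch buffer is allocated once, and activation functions mutate it rather than producing new arrays. Buffer size and the ReLU example are illustrative.

```python
# Sketch: a static, preallocated activation buffer reused across inferences.
ACT_BUFFER_FLOATS = 256

scratch = [0.0] * ACT_BUFFER_FLOATS  # allocated once at startup

def relu_inplace(buf, n):
    """In-place ReLU over the first n elements: no copies, no new allocations."""
    for i in range(n):
        if buf[i] < 0.0:
            buf[i] = 0.0
    return buf

scratch[:4] = [-1.0, 2.0, -3.0, 4.0]
relu_inplace(scratch, 4)
```

Because the buffer is fixed at startup, peak memory is known at deploy time and there is no fragmentation from repeated allocation.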
Quantization Benefits
INT8 quantization provides critical advantages for embedded deployment:
- 4x smaller model weights (FP32 → INT8)
- Reduced activation memory requirements
- Lower bandwidth between memory hierarchy levels
- Enables on-chip caching of entire model
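The 4x weight reduction comes directly from storing one byte per value instead of four. The sketch below applies the common min-max affine quantization scheme to a toy weight tensor; the values and scheme details are illustrative.

```python
import struct

# Sketch: affine (min-max) INT8 quantization of an FP32 weight tensor.
weights = [-0.8, -0.2, 0.0, 0.3, 0.9]

lo, hi = min(weights), max(weights)
scale = (hi - lo) / 255.0
zero_point = round(-lo / scale)

# Quantize to unsigned 8-bit, then dequantize to inspect the round-trip error.
q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
deq = [(v - zero_point) * scale for v in q]

fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes per value
int8_bytes = len(struct.pack(f"{len(q)}B", *q))              # 1 byte per value
```

The byte counts differ by exactly 4x, and the round-trip error stays within one quantization step, which is the mechanism behind the "minor accuracy degradation" noted below.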
Trade-offs and Bottlenecks
Static Memory Budgets
Advantages:
- Simplified deployment with predictable resource usage
- No dynamic allocation overhead or fragmentation
- Deterministic latency without garbage collection pauses

Limitations:
- Constrains model capacity and complexity
- Cannot handle variable input sizes efficiently
- Requires careful profiling to set correct buffer sizes
Cache Locality Impact
Performance Characteristics:
- Cache misses dominate inference time on small cores
- Quantized models fit in L1/L2 cache → 3-5x speedup
- Sequential memory access patterns critical for throughput

Optimizations:
- Operator fusion reduces intermediate writes
- Weight layout optimized for access patterns
- Activation reuse minimizes memory traffic
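Operator fusion is the most mechanical of these optimizations to illustrate: an unfused pipeline materializes a full intermediate tensor per operator, while the fused version keeps each element in a register between operations. The scale/bias/ReLU chain below is an illustrative example, not taken from the actual model graph.

```python
# Sketch: why operator fusion cuts memory traffic on small cores.
x = [1.5, -2.0, 0.5, -0.25]
scale, bias = 2.0, 0.1

def unfused(xs):
    scaled = [v * scale for v in xs]       # intermediate write #1
    shifted = [v + bias for v in scaled]   # intermediate write #2
    return [max(0.0, v) for v in shifted]  # final write

def fused(xs):
    # One pass, one output write per element: scale + bias + ReLU fused.
    return [max(0.0, v * scale + bias) for v in xs]
```

Both paths compute identical results; the fused one simply touches memory once per element, which matters when cache misses dominate inference time.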
Quantization Artifacts
Bandwidth Reduction:
- 4x less data movement between memory levels
- Critical for bandwidth-constrained embedded systems
- Enables streaming inference with smaller buffers

Accuracy Trade-offs:
- Minor accuracy degradation (<1% for digit classification)
- Calibration required to minimize distribution shift
- Per-layer quantization sensitivity analysis needed
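A crude version of the calibration step can be sketched as follows: the quantization scale is derived from the activation range observed on calibration data, and the round-trip error then quantifies how well that range covers the distribution. The calibration samples are hypothetical.

```python
# Sketch: per-tensor min-max calibration and a coarse sensitivity check.
calib = [0.05, 0.2, 0.4, 0.6, 0.95]  # hypothetical activation samples

lo, hi = min(calib), max(calib)
scale = (hi - lo) / 255.0

def quantize(v):
    return max(0, min(255, round((v - lo) / scale)))

def dequantize(q):
    return q * scale + lo

# Maximum round-trip error over the calibration set: a crude proxy for
# the distribution shift introduced by quantization.
max_err = max(abs(dequantize(quantize(v)) - v) for v in calib)
```

Real calibration pipelines typically use percentile or entropy-based range selection rather than raw min-max, and repeat this analysis per layer to find quantization-sensitive operators.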
Performance Benchmarking
Accuracy/Latency Frontier
The system tracks multiple model configurations on the Pareto frontier:

| Configuration | Accuracy | Latency (ms) | Memory (KB) | Notes |
|---|---|---|---|---|
| Baseline FP32 | 98.5% | 45 | 3200 | Too large for target |
| Pruned FP32 | 97.8% | 32 | 1600 | Still memory-constrained |
| INT8 Quantized | 97.3% | 12 | 800 | Deployed config |
| Aggressive INT8 | 95.1% | 8 | 400 | Accuracy unacceptable |
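The frontier membership implied by the table can be checked mechanically: a configuration is on the accuracy/latency Pareto frontier if no other configuration is at least as accurate and at least as fast. This sketch uses the table's figures directly.

```python
# Sketch: recovering the accuracy/latency Pareto frontier from the table above.
configs = {
    "Baseline FP32":   (98.5, 45),
    "Pruned FP32":     (97.8, 32),
    "INT8 Quantized":  (97.3, 12),
    "Aggressive INT8": (95.1, 8),
}

def pareto_frontier(cfgs):
    frontier = []
    for name, (acc, lat) in cfgs.items():
        dominated = any(
            a >= acc and l <= lat and (a, l) != (acc, lat)
            for a, l in cfgs.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

frontier = pareto_frontier(configs)
```

All four configurations survive this check: each trades accuracy for latency monotonically, so none dominates another, and deployment becomes a choice of where on the frontier to sit given the memory budget.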
Assumptions and Limitations
Workload Characteristics
Request Pattern:
- Short-lived requests with limited concurrent sessions
- No batch processing required (single-sample inference)
- Predictable inter-arrival times or request queuing upstream
Resource Availability:
- Target device has ≥1MB RAM available
- CPU supports required ONNX operators natively
- Sufficient compute headroom for peak load
Operator Support
ONNX Runtime Version Dependencies
Critical Limitation: Operator support is bounded by the ONNX Runtime version on target devices.

Considerations:
- Embedded ONNX Runtime may not support all operators
- Custom operators require target-specific implementation
- Operator version compatibility must be validated
- Fallback to CPU reference implementation may be slow

Mitigations:
- Export model with target ONNX opset version
- Validate operator support during deployment planning
- Test on actual target hardware early in development
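The operator-support validation step amounts to a set difference between the operators a model uses and the operators the target runtime implements. Both operator lists below are illustrative placeholders, not the actual ONNX Runtime support matrix; a real check would read the op types from the exported model graph.

```python
# Sketch: flagging model operators the target runtime cannot execute natively.
# TARGET_SUPPORTED_OPS is a hypothetical support list for illustration.
TARGET_SUPPORTED_OPS = {"Conv", "Relu", "MaxPool", "Gemm", "Reshape", "Softmax"}

def unsupported_ops(model_ops):
    """Return the operators missing from the target runtime, sorted for reporting."""
    return sorted(set(model_ops) - TARGET_SUPPORTED_OPS)

# Op types as they would appear in an exported digit-classifier graph.
model_ops = ["Conv", "Relu", "MaxPool", "Conv", "Relu", "Gemm", "Softmax"]
missing = unsupported_ops(model_ops)
assert not missing, f"plan fallbacks or re-export for: {missing}"
```

Running this check at export time, against the opset actually shipped on the device, catches incompatibilities before deployment rather than at first inference.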
Model Limitations
- Input Resolution: Fixed 28x28 grayscale images (MNIST format)
- Digit Range: 0-9 only, no multi-digit sequences
- Background Assumptions: Clean, centered digits on plain background
- No Adversarial Robustness: Minimal validation against perturbations
Implementation References
This case study leverages repository components from:

- ONNX Export: `workflows/onnx_deployment.py` for model conversion and quantization
- Benchmark Pipeline: `benchmark_pipeline.py` for accuracy/latency frontier tracking
- Digit Classifier: `models/digit_classifier.py` baseline model architecture
- Quantization Validation: `workflows/onnx_deployment.py` calibration and accuracy validation
Key Takeaways
Quantization is Essential
INT8 quantization reduces memory by 4x and improves latency by 2-4x with minimal accuracy loss, making it mandatory for embedded deployment.
Cache Locality Matters More Than FLOPS
On small cores, memory access patterns and cache utilization dominate performance more than raw compute capacity.
Static Budgets Simplify Deployment
Accepting model scale limitations in exchange for predictable resource usage and deterministic latency is a worthwhile trade-off.