## Overview
The hardware-aware ML optimization module analyzes deployment scenarios across multiple dimensions: latency, throughput, accuracy, memory footprint, CPU utilization, and energy consumption. It enriches benchmark data with operator-level profiling and hardware efficiency metrics.

## Prerequisites
Run statistical benchmarks first to generate raw measurement data.

## Quick Start
Analyze hardware trade-offs by running the trade-off analysis. It reads `artifacts/stat_benchmark_runs.csv` and generates the enriched hardware analysis.
## Deployment Scenarios
The framework compares three production-ready deployment paths.

### FP32 sklearn
**Native Python.** Standard scikit-learn model with 32-bit floats. Easiest to deploy but typically slowest.
- Simple serialization with `joblib`
- No additional dependencies
- Baseline for comparison
### FP32 ONNX
**Optimized Runtime.** ONNX Runtime with 32-bit precision. Optimized graph execution with vectorized CPU kernels.
- 2-3x faster than sklearn
- Same accuracy as training
- Cross-platform compatibility
### INT8 ONNX
**Quantized Inference.** ONNX Runtime with 8-bit integer quantization. Reduced memory and compute at a slight accuracy cost.
- 3-5x faster than sklearn
- ~75% smaller memory footprint
- Possible accuracy degradation (< 1% typical)
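The memory saving follows directly from the storage format: each INT8 weight occupies 1 byte instead of 4. A minimal pure-Python sketch of symmetric INT8 quantization, illustrative only (ONNX Runtime's quantizer handles this internally, along with zero points and calibration):

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats onto the signed range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the INT8 values."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.08, 0.90]
q, scale = quantize_int8(weights)   # each value now fits in 1 byte
restored = dequantize(q, scale)     # small per-weight rounding error
# 1 byte vs 4 bytes per weight -> ~75% smaller footprint
```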
## Metrics Tracked
### Core Performance Metrics
- **Latency**: Mean inference time per sample in milliseconds. Lower is better. Critical for real-time applications.
- **Throughput**: Number of samples processed per second. Higher is better. Important for batch processing.
- **Accuracy**: Prediction accuracy on test data. Quantization may reduce accuracy slightly (monitor for < 1% degradation).
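For single-stream inference, latency and throughput are reciprocals, which makes a handy sanity check on benchmark output. A sketch, assuming one sample per request and no batching:

```python
def throughput_from_latency(latency_ms):
    """Samples/second implied by a per-sample latency (single-stream case)."""
    return 1000.0 / latency_ms

throughput_from_latency(2.0)  # 2 ms/sample -> 500 samples/s
```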
### System Resource Metrics
- **Memory footprint**: Process resident set size (RSS) in megabytes, measured via `psutil` as the memory delta before/after inference.
- **CPU utilization**: Average CPU utilization percentage during inference. Helps identify CPU-bound workloads.
- **Energy (RAPL)**: Energy consumed in microjoules, measured via Intel RAPL counters (when available on Intel CPUs).
- **Energy (proxy)**: Proxy energy estimate (latency × CPU utilization), used when RAPL counters are unavailable.
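The framework measures the RSS delta with `psutil`; when `psutil` is not available, the standard library gives a rough equivalent. A sketch using `resource`, which is Unix-only and reports the *peak* rather than the current RSS:

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

before = peak_rss_mb()
workload = [0.0] * 1_000_000   # stand-in for an inference batch
after = peak_rss_mb()
delta_mb = after - before      # only registers new peaks, unlike psutil's current RSS
```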
## Energy Profiling
### RAPL Counters (Intel CPUs)
When running on Intel systems with RAPL support, the framework reads hardware energy counters.

**RAPL availability**: Requires an Intel CPU with RAPL support and read permissions on `/sys/class/powercap/`. Not available on AMD CPUs or non-Linux systems.
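A minimal sketch of reading a RAPL counter from sysfs. The `intel-rapl:0` package-domain path is typical but varies by system, and the framework's actual reader may differ:

```python
from pathlib import Path

RAPL_ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0

def read_energy_uj(path=RAPL_ENERGY_FILE):
    """Cumulative energy in microjoules, or None if RAPL is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None
```

Sample the counter before and after inference and take the difference; production code must also handle counter wraparound at `max_energy_range_uj`.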
### Proxy Energy Estimate

When RAPL counters are unavailable, the framework uses a proxy metric.

## Operator-Level Profiling
The framework enriches metrics with estimated operator-level latency breakdowns.

### Latency Decomposition
- **Preprocessing**: Input validation, type conversion, and feature scaling. Typically 18-22% of total latency.
- **Linear operator**: Core inference computation (matrix multiplication, activation). Dominant cost at 55-63% of latency.
## Output Artifacts
### `hardware_tradeoffs.csv`
Comprehensive hardware analysis with enriched metrics.

#### Enriched Metrics
- Estimated time spent in the preprocessing phase.
- Estimated time spent in core model computation.
- Estimated time spent in output processing.
- Estimated memory bandwidth utilization in GB/s.
- Memory usage per unit throughput (`memory_pressure_index`); lower indicates better memory efficiency.
- A quantization flag indicating whether the scenario uses INT8 quantization or FP32 precision.
### `hardware_tradeoffs_summary.json`
Best-performing scenarios by optimization dimension.

## Interpretation Guide
### Scenario Selection Matrix
The matrix covers four deployment profiles:

- Low-Latency Applications
- Accuracy-Critical Systems
- Resource-Constrained Deployment
- Simple Deployment

**Low-Latency Applications** (recommended: INT8 ONNX). When real-time response is critical:

- ✓ Highest throughput
- ⚠ Slight accuracy degradation (< 1%)

Typical use cases:

- User-facing APIs with < 100ms SLA
- High-frequency trading
- Online recommendation systems
## Trade-off Analysis
### Accuracy vs Latency

**Question:** How much accuracy can I sacrifice for speed?

Check the accuracy difference between the FP32 and INT8 scenarios.

**Rule of thumb:** If the accuracy loss is < 1% and the latency gain is > 20%, quantization is worthwhile.
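The rule of thumb can be applied mechanically to two benchmark rows. A sketch; the field names here are illustrative, not the actual CSV columns:

```python
def quantization_worthwhile(fp32, int8, max_acc_loss=0.01, min_latency_gain=0.20):
    """Rule of thumb: accuracy loss < 1% and latency gain > 20%."""
    acc_loss = fp32["accuracy"] - int8["accuracy"]
    latency_gain = (fp32["latency_ms"] - int8["latency_ms"]) / fp32["latency_ms"]
    return acc_loss < max_acc_loss and latency_gain > min_latency_gain

fp32 = {"accuracy": 0.952, "latency_ms": 1.80}
int8 = {"accuracy": 0.947, "latency_ms": 0.60}
quantization_worthwhile(fp32, int8)  # True: 0.5% loss, ~67% latency gain
```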
### Memory vs Throughput

**Question:** Will reduced memory improve throughput?

A lower memory footprint enables:
- More concurrent inference threads
- Larger batch sizes without OOM
- Better CPU cache utilization
Compare `memory_pressure_index` across scenarios:

- Lower values indicate better memory efficiency
- INT8 typically has 2-3x better memory pressure than FP32
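The index itself is just memory per unit throughput. A sketch consistent with the definition above; the exact formula in the source may normalize differently:

```python
def memory_pressure_index(memory_mb, throughput_sps):
    """Memory footprint per unit of throughput; lower is better."""
    return memory_mb / throughput_sps

fp32_mpi = memory_pressure_index(120.0, 400.0)   # hypothetical FP32 row
int8_mpi = memory_pressure_index(30.0, 1200.0)   # hypothetical INT8 row
# int8_mpi is lower -> better memory efficiency
```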
### Energy vs Performance

**Question:** What's the energy cost of higher throughput?

Compare `energy_mj_proxy` across scenarios. INT8 quantization typically delivers:

- 40-50% energy reduction vs FP32 sklearn
- 30-40% energy reduction vs FP32 ONNX

Energy efficiency is critical for battery-powered or large-scale deployments.
## Implementation Reference
Key implementation from `hardware_aware_ml/tradeoff_experiments.py`:
### Operator Latency Enrichment
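A sketch of the enrichment step: distribute each scenario's measured latency across phases using fixed fractions. The fractions and column names here are illustrative; see `tradeoff_experiments.py` for the actual values:

```python
PHASE_FRACTIONS = {
    "preprocessing_ms": 0.20,   # typically 18-22% of total latency
    "linear_op_ms": 0.60,       # dominant cost, 55-63%
    "postprocessing_ms": 0.20,  # remainder
}

def enrich_with_operator_latency(row):
    """Add estimated per-phase latency columns to a benchmark row."""
    for column, fraction in PHASE_FRACTIONS.items():
        row[column] = row["latency_ms"] * fraction
    return row

enrich_with_operator_latency({"latency_ms": 2.0})
```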
### Energy Proxy Calculation
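The proxy combines the two quantities already collected per scenario. A sketch; the actual scaling used to produce `energy_mj_proxy` may differ:

```python
def energy_proxy(latency_ms, cpu_util_pct):
    """Proxy energy estimate: latency x CPU utilization.
    Arbitrary units -- only meaningful for comparing scenarios."""
    return latency_ms * (cpu_util_pct / 100.0)

energy_proxy(2.0, 50.0)  # 1.0
```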
## Assumptions and Limitations
### When to Re-benchmark

Re-run the hardware analysis when:
- Model architecture changes
- Deploying to different hardware (e.g., cloud to edge)
- Upgrading ONNX Runtime or Python versions
- Input data distribution shifts
## Best Practices
### Establish Baseline
Always run FP32 ONNX as your reference baseline. It provides the best balance of accuracy and performance.
### Validate Quantization
Before deploying INT8:
- Verify accuracy loss < 1% on test set
- Test edge cases and class imbalance
- Monitor for distribution shift over time
### Profile in Production
Hardware analysis on dev machines may not reflect production:
- Test on actual deployment hardware
- Measure under realistic load patterns
- Account for concurrent requests and queueing
## Production Deployment Checklist
### Pre-deployment validation
- Run 10+ benchmark iterations on production-like hardware
- Verify accuracy within acceptable tolerance
- Test 95th percentile latency under load
- Measure memory footprint with concurrent requests
- Validate energy consumption if battery-powered
- Test cold-start latency (first inference)
- Verify ONNX Runtime version compatibility
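For the 95th-percentile latency check, the standard library is enough. A sketch using the nearest-rank method:

```python
import math

def p95_latency(latencies_ms):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

p95_latency([float(i) for i in range(1, 101)])  # 95.0
```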
## Next Steps
- **Statistical Benchmarking**: Learn about rigorous statistical testing and confidence intervals.
- **Performance Tuning**: Deep dive into optimization strategies and production best practices.