
Overview

The hardware-aware ML optimization module analyzes deployment scenarios across multiple dimensions: latency, throughput, accuracy, memory footprint, CPU utilization, and energy consumption. It enriches benchmark data with operator-level profiling and hardware efficiency metrics.

Prerequisites

Run statistical benchmarks first to generate raw measurement data:
python benchmarking/statistical_benchmark.py --runs 10 --batch-size 256

Quick Start

Analyze hardware trade-offs:
python hardware_aware_ml/tradeoff_experiments.py
This reads artifacts/stat_benchmark_runs.csv and generates enriched hardware analysis.

Deployment Scenarios

The framework compares three production-ready deployment paths:

FP32 sklearn

Native Python
Standard scikit-learn model with 32-bit floats. Easiest to deploy but typically slowest.
  • Simple serialization with joblib
  • No additional dependencies
  • Baseline for comparison

FP32 ONNX

Optimized Runtime
ONNX Runtime with 32-bit precision. Optimized graph execution with vectorized CPU kernels.
  • 2-3x faster than sklearn
  • Same accuracy as training
  • Cross-platform compatibility

INT8 ONNX

Quantized Inference
ONNX Runtime with 8-bit integer quantization. Reduced memory and compute at a slight accuracy cost.
  • 3-5x faster than sklearn
  • ~75% smaller memory footprint
  • Possible accuracy degradation (< 1% typical)
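The memory and accuracy trade-off behind INT8 can be illustrated with a minimal symmetric weight-quantization sketch. This is pure NumPy, not the actual ONNX Runtime quantizer, and the helper names are illustrative:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map floats to int8 via a single scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from its int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 64)).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than fp32 (~75% reduction), matching the bullet above
assert q.nbytes * 4 == w.nbytes
# Round-trip error is bounded by the quantization step
err = np.abs(dequantize(q, scale) - w).max()
```

The small but nonzero round-trip error is the mechanism behind the "< 1% typical" accuracy degradation noted above.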

Metrics Tracked

Core Performance Metrics

latency_ms_per_sample
float
Mean inference time per sample in milliseconds. Lower is better. Critical for real-time applications.
throughput_samples_per_sec
float
Number of samples processed per second. Higher is better. Important for batch processing.
accuracy
float
Prediction accuracy on test data. Quantization may reduce accuracy slightly (monitor < 1% degradation).
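For single-threaded inference, latency and throughput are reciprocal views of the same measurement, which makes a quick sanity check on benchmark rows possible (the helper name and values here are illustrative):

```python
def throughput_from_latency(latency_ms_per_sample: float) -> float:
    """Samples/sec implied by per-sample latency, assuming serial execution."""
    return 1000.0 / latency_ms_per_sample

# A model at 0.125 ms/sample should report roughly 8000 samples/sec
assert throughput_from_latency(0.125) == 8000.0
```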

System Resource Metrics

memory_mb
float
Process resident set size (RSS) in megabytes. Measured via psutil as memory delta before/after inference.
cpu_percent_avg
float
Average CPU utilization percentage during inference. Helps identify CPU-bound workloads.
energy_uj_measured
float
Energy consumed in microjoules, measured via Intel RAPL counters (when available on Intel CPUs).
energy_mj_proxy
float
Proxy energy estimate (latency × CPU utilization). Used when RAPL counters unavailable.

Energy Profiling

RAPL Counters (Intel CPUs)

When running on Intel systems with RAPL support, the framework reads hardware energy counters:
from pathlib import Path

def _rapl_uj() -> float | None:
    """Read Intel RAPL energy counter in microjoules."""
    rapl = Path("/sys/class/powercap/intel-rapl:0/energy_uj")
    if rapl.exists():
        try:
            return float(rapl.read_text(encoding="utf-8").strip())
        except OSError:
            return None
    return None
RAPL Availability: Requires Intel CPU with RAPL support and read permissions on /sys/class/powercap/. Not available on AMD CPUs or non-Linux systems.
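The RAPL counter is cumulative and wraps around at a hardware-defined maximum (exposed in sysfs as max_energy_range_uj), so computing energy over an interval needs wraparound handling. A sketch under that assumption; the helper name is illustrative:

```python
def rapl_delta_uj(before: float, after: float, max_range_uj: float) -> float:
    """Energy consumed between two RAPL readings, handling counter wraparound."""
    if after >= before:
        return after - before
    # Counter wrapped past max_energy_range_uj: add the span before the wrap
    return (max_range_uj - before) + after

# Normal case: counter simply increased
assert rapl_delta_uj(1_000.0, 4_500.0, 262_143_328_850.0) == 3_500.0
# Wraparound case: counter reset mid-measurement
assert rapl_delta_uj(262_143_328_000.0, 2_000.0, 262_143_328_850.0) == 2_850.0
```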

Proxy Energy Estimate

When RAPL counters are unavailable, the framework falls back to a proxy metric:
energy_mj_proxy = latency_ms_per_sample * cpu_percent_avg
This proxy correlates with measured energy on a given machine, but values should not be compared across different hardware.

Operator-Level Profiling

The framework enriches metrics with estimated operator-level latency breakdowns:
OPERATOR_SPLIT = {
    "sklearn_fp32": {"preprocess": 0.22, "linear": 0.58, "postprocess": 0.20},
    "onnx_fp32":    {"preprocess": 0.18, "linear": 0.63, "postprocess": 0.19},
    "onnx_int8":    {"preprocess": 0.20, "linear": 0.55, "postprocess": 0.25},
}

Latency Decomposition

1

Preprocessing

Input validation, type conversion, and feature scaling. Typically 18-22% of total latency.
2

Linear Operator

Core inference computation (matrix multiplication, activation). Dominant cost at 55-63% of latency.
3

Postprocessing

Output conversion, probability calculation, and thresholding. Typically 19-25% of latency.
Use case: Identify optimization opportunities. If preprocessing dominates, optimize data loading. If linear operator is slow, consider quantization or model compression.
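The decomposition above can be turned into a small helper that flags the dominant operator for a scenario, using the OPERATOR_SPLIT table shown earlier (the helper itself is a sketch, not part of the framework):

```python
OPERATOR_SPLIT = {
    "sklearn_fp32": {"preprocess": 0.22, "linear": 0.58, "postprocess": 0.20},
    "onnx_fp32":    {"preprocess": 0.18, "linear": 0.63, "postprocess": 0.19},
    "onnx_int8":    {"preprocess": 0.20, "linear": 0.55, "postprocess": 0.25},
}

def dominant_operator(scenario: str) -> str:
    """Return the operator with the largest share of per-sample latency."""
    split = OPERATOR_SPLIT[scenario]
    return max(split, key=split.get)

# The linear operator dominates in every scenario in this table
assert all(dominant_operator(s) == "linear" for s in OPERATOR_SPLIT)
```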

Output Artifacts

hardware_tradeoffs.csv

Comprehensive hardware analysis with enriched metrics:
scenario,accuracy,latency_ms_per_sample,throughput_samples_per_sec,memory_mb,cpu_percent_avg,energy_mj_proxy,preprocess_latency_ms,linear_operator_latency_ms,postprocess_latency_ms,estimated_effective_bandwidth_gbps,memory_pressure_index,quantization_note
sklearn_fp32,0.856,0.121,8266.5,4.15,12.4,1.50,0.027,0.070,0.024,34.3,0.0005,fp32_path
onnx_fp32,0.856,0.095,10526.3,3.22,10.8,1.03,0.017,0.060,0.018,33.9,0.0003,fp32_path
onnx_int8,0.854,0.068,14705.9,2.41,8.2,0.56,0.014,0.037,0.017,35.4,0.0002,int8_path
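A quick way to inspect hardware_tradeoffs.csv is to load it and rank scenarios along one dimension. A standard-library sketch, using a subset of the sample rows above in place of the real file:

```python
import csv
import io

SAMPLE = """scenario,accuracy,latency_ms_per_sample,throughput_samples_per_sec
sklearn_fp32,0.856,0.121,8266.5
onnx_fp32,0.856,0.095,10526.3
onnx_int8,0.854,0.068,14705.9
"""

# In practice, replace io.StringIO(SAMPLE) with open("artifacts/hardware_tradeoffs.csv")
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
fastest = min(rows, key=lambda r: float(r["latency_ms_per_sample"]))
assert fastest["scenario"] == "onnx_int8"
```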

Enriched Metrics

preprocess_latency_ms
float
Estimated time spent in preprocessing phase.
linear_operator_latency_ms
float
Estimated time spent in core model computation.
postprocess_latency_ms
float
Estimated time spent in output processing.
estimated_effective_bandwidth_gbps
float
Estimated memory bandwidth utilization in GB/s. Calculated as:
bandwidth = (memory_mb * 1024 * 1024) / (latency_ms / 1000) / 1e9
memory_pressure_index
float
Memory usage per unit throughput. Lower indicates better memory efficiency:
memory_pressure = memory_mb / throughput_samples_per_sec
quantization_note
string
Indicates whether scenario uses INT8 quantization or FP32 precision.

hardware_tradeoffs_summary.json

Best-performing scenarios by optimization dimension:
{
  "best_accuracy": {
    "scenario": "onnx_fp32",
    "accuracy": 0.856,
    "latency_ms_per_sample": 0.095,
    "memory_mb": 3.22
  },
  "best_energy_proxy": {
    "scenario": "onnx_int8",
    "energy_mj_proxy": 0.56,
    "accuracy": 0.854,
    "latency_ms_per_sample": 0.068
  },
  "best_latency": {
    "scenario": "onnx_int8",
    "latency_ms_per_sample": 0.068,
    "throughput_samples_per_sec": 14705.9
  }
}
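The summary structure above can be reproduced by selecting the best row per dimension and serializing to JSON. A minimal sketch (tie-breaking by lower latency is an assumption, chosen to match the sample output):

```python
import json

rows = [
    {"scenario": "sklearn_fp32", "accuracy": 0.856, "latency_ms_per_sample": 0.121},
    {"scenario": "onnx_fp32", "accuracy": 0.856, "latency_ms_per_sample": 0.095},
    {"scenario": "onnx_int8", "accuracy": 0.854, "latency_ms_per_sample": 0.068},
]

summary = {
    # Highest accuracy wins; ties broken by lower latency
    "best_accuracy": max(rows, key=lambda r: (r["accuracy"], -r["latency_ms_per_sample"])),
    # Lowest per-sample latency wins
    "best_latency": min(rows, key=lambda r: r["latency_ms_per_sample"]),
}
payload = json.dumps(summary, indent=2)
```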

Interpretation Guide

Scenario Selection Matrix

Recommended: INT8 ONNX
When real-time response is critical:
  • User-facing APIs with < 100ms SLA
  • High-frequency trading
  • Online recommendation systems
✓ Lowest latency per sample
✓ Highest throughput
⚠ Slight accuracy degradation (< 1%)

Trade-off Analysis

Question: How much accuracy can I sacrifice for speed?
Check the accuracy difference between FP32 and INT8:
accuracy_loss = fp32_accuracy - int8_accuracy
latency_gain = (fp32_latency - int8_latency) / fp32_latency * 100

# Typical results:
# accuracy_loss: 0.002 (0.2%)
# latency_gain: 28% faster
Rule of thumb: If accuracy loss < 1% and latency gain > 20%, quantization is worthwhile.
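The rule of thumb can be encoded as a small gate. The function name and default thresholds below are illustrative, not part of the framework:

```python
def should_quantize(fp32_accuracy: float, int8_accuracy: float,
                    fp32_latency: float, int8_latency: float,
                    max_accuracy_loss: float = 0.01,
                    min_latency_gain: float = 0.20) -> bool:
    """Rule of thumb: accept INT8 if accuracy loss < 1% and latency gain > 20%."""
    accuracy_loss = fp32_accuracy - int8_accuracy
    latency_gain = (fp32_latency - int8_latency) / fp32_latency
    return accuracy_loss < max_accuracy_loss and latency_gain > min_latency_gain

# Values from the typical results above: 0.2% accuracy loss, ~28% latency gain
assert should_quantize(0.856, 0.854, 0.095, 0.068)
```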
Question: Will reduced memory improve throughput?
Lower memory footprint enables:
  • More concurrent inference threads
  • Larger batch sizes without OOM
  • Better CPU cache utilization
Check memory_pressure_index:
  • Lower values indicate better memory efficiency
  • INT8 typically shows a 2-3x lower index than FP32
Question: What's the energy cost of higher throughput?
Compare energy_mj_proxy across scenarios:
energy_per_1000_samples = energy_mj_proxy * 1000
INT8 quantization typically:
  • 40-50% energy reduction vs FP32 sklearn
  • 30-40% energy reduction vs FP32 ONNX
  • Critical for battery-powered or large-scale deployments

Implementation Reference

Key implementation from hardware_aware_ml/tradeoff_experiments.py:

Operator Latency Enrichment

import pandas as pd

def _enrich_hardware_estimates(grouped: pd.DataFrame) -> pd.DataFrame:
    records = []
    for _, row in grouped.iterrows():
        scenario = row["scenario"]
        split = OPERATOR_SPLIT.get(scenario, OPERATOR_SPLIT["onnx_fp32"])
        latency = float(row["latency_ms_per_sample"])
        memory_mb = max(float(row["memory_mb"]), 1e-6)

        # Decompose latency by operator
        preprocess_latency = latency * split["preprocess"]
        linear_latency = latency * split["linear"]
        postprocess_latency = latency * split["postprocess"]

        # Estimate effective bandwidth
        inferred_bytes = memory_mb * 1024 * 1024
        effective_bandwidth_gbps = (inferred_bytes / max(latency / 1000.0, 1e-9)) / 1e9

        records.append({
            "scenario": scenario,
            "preprocess_latency_ms": preprocess_latency,
            "linear_operator_latency_ms": linear_latency,
            "postprocess_latency_ms": postprocess_latency,
            "estimated_effective_bandwidth_gbps": effective_bandwidth_gbps,
            "memory_pressure_index": memory_mb / max(float(row["throughput_samples_per_sec"]), 1e-9),
            "quantization_note": "int8_path" if "int8" in scenario else "fp32_path",
        })

    return grouped.merge(pd.DataFrame(records), on="scenario", how="left")

Energy Proxy Calculation

import numpy as np

# Calculate proxy energy when RAPL unavailable
grouped["energy_mj_proxy"] = grouped["latency_ms_per_sample"] * grouped["cpu_percent_avg"]

# Compare proxy vs measured
grouped["energy_delta_vs_measured"] = np.where(
    grouped["energy_uj_measured"] > 0,
    grouped["energy_mj_proxy"] - (grouped["energy_uj_measured"] / 1000),
    np.nan,
)

Assumptions and Limitations

Hardware Dependencies
  • Energy counters are host-dependent; RAPL is only available on Intel CPUs running Linux.
  • Proxy estimates are used when RAPL is unavailable and cannot be compared across machines.
  • Memory numbers are process-level approximations, not full system attribution.
  • Operator-level splits are estimated from profiling, not exact measurements.
Cross-Machine Comparisons
Require normalized conditions:
  • Same batch size and input data
  • Same ONNX Runtime and Python versions
  • Similar CPU generation and clock speed
  • Consistent power management settings
When to Re-benchmark
Re-run hardware analysis when:
  • Model architecture changes
  • Deploying to different hardware (e.g., cloud to edge)
  • Upgrading ONNX Runtime or Python versions
  • Input data distribution shifts

Best Practices

1

Establish Baseline

Always run FP32 ONNX as your reference baseline. It provides the best balance of accuracy and performance.
2

Validate Quantization

Before deploying INT8:
  • Verify accuracy loss < 1% on test set
  • Test edge cases and class imbalance
  • Monitor for distribution shift over time
3

Profile in Production

Hardware analysis on dev machines may not reflect production:
  • Test on actual deployment hardware
  • Measure under realistic load patterns
  • Account for concurrent requests and queueing
4

Monitor Energy Over Time

Energy consumption changes with:
  • Model updates and retraining
  • Batch size and traffic patterns
  • Hardware aging and thermal throttling

Production Deployment Checklist

Pre-deployment validation
  • Run 10+ benchmark iterations on production-like hardware
  • Verify accuracy within acceptable tolerance
  • Test 95th percentile latency under load
  • Measure memory footprint with concurrent requests
  • Validate energy consumption if battery-powered
  • Test cold-start latency (first inference)
  • Verify ONNX Runtime version compatibility
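For the 95th-percentile latency check in the list above, per-iteration latencies can be summarized with the standard library. A sketch with synthetic sample values:

```python
import statistics

# Per-iteration latencies in ms, including a couple of tail outliers
latencies_ms = [0.068, 0.070, 0.069, 0.071, 0.072, 0.068, 0.090,
                0.069, 0.070, 0.071, 0.073, 0.069, 0.068, 0.070,
                0.071, 0.069, 0.072, 0.070, 0.068, 0.110]

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points
p95 = statistics.quantiles(latencies_ms, n=100)[94]

# Tail latency sits well above the median because of the outliers
assert p95 > statistics.median(latencies_ms)
```

Comparing p95 rather than the mean is what surfaces queueing and throttling effects under load.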

Next Steps

Statistical Benchmarking

Learn about rigorous statistical testing and confidence intervals

Performance Tuning

Deep dive into optimization strategies and production best practices
