Overview
The statistical benchmarking framework provides repeatable, statistically rigorous performance measurements across multiple deployment scenarios. It uses bootstrap confidence intervals and paired statistical tests to compare model performance with quantified uncertainty.
Quick Start
Run the statistical benchmark with default parameters.
Command-Line Arguments
Required Parameters
Number of benchmark runs to execute. Must be >= 2 for meaningful variance calculation and paired comparisons. More runs provide better statistical power but increase execution time.
Number of samples per benchmark run. Use a consistent batch size across experiments for valid comparisons.
Optional Parameters
Number of warmup iterations before timed runs. Warmup runs allow JIT compilation and cache warming without affecting measurements.
Confidence level for bootstrap confidence intervals (typically 0.95 or 0.99).
Number of bootstrap resamples for confidence interval estimation. Higher values provide more stable intervals.
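Taken together, these parameters map naturally onto a command-line interface. Below is a minimal `argparse` sketch; the flag names (`--runs`, `--batch-size`, `--warmup`, `--confidence`, `--bootstrap-samples`) are illustrative assumptions, not the script's confirmed interface:

```python
import argparse

# Hypothetical CLI mirroring the documented parameters; the actual flag
# names in benchmarking/statistical_benchmark.py may differ.
parser = argparse.ArgumentParser(description="Statistical benchmark")
parser.add_argument("--runs", type=int, required=True,
                    help="Number of benchmark runs (>= 2)")
parser.add_argument("--batch-size", type=int, required=True,
                    help="Samples per benchmark run")
parser.add_argument("--warmup", type=int, default=3,
                    help="Warmup iterations before timed runs")
parser.add_argument("--confidence", type=float, default=0.95,
                    help="Confidence level for bootstrap CIs")
parser.add_argument("--bootstrap-samples", type=int, default=10_000,
                    help="Number of bootstrap resamples")

args = parser.parse_args(["--runs", "10", "--batch-size", "256"])
```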
Benchmark Scenarios
The framework automatically benchmarks three deployment scenarios:
sklearn_fp32
Native scikit-learn model with 32-bit floating point precision. Baseline Python implementation.
onnx_fp32
ONNX Runtime with 32-bit floating point. Optimized graph execution with CPU kernels.
onnx_int8
ONNX Runtime with 8-bit integer quantization. Reduced precision for faster inference and lower memory footprint.
Metrics Tracked
Performance Metrics
- Latency: Mean milliseconds per sample
- Throughput: Samples processed per second
- Accuracy: Prediction accuracy on test data
System Metrics
- Memory: Process RSS memory delta (requires psutil)
- CPU Utilization: Average CPU percentage during inference
- Energy: Microjoules consumed (RAPL counters when available on Intel systems)
Output Artifacts
The benchmark generates four output files in the artifacts/ directory:
1. stat_benchmark_runs.csv
Raw measurements from all runs.
2. stat_benchmark_summary.csv
Bootstrap confidence intervals for each scenario:
- *_mean: Average value across runs
- *_ci_low: Lower bound of bootstrap confidence interval
- *_ci_high: Upper bound of bootstrap confidence interval
3. statistical_comparisons.csv
Paired statistical tests comparing each scenario to the onnx_fp32 baseline:
- mean_diff: Average difference (scenario - baseline). Negative means scenario is faster/better.
- p_value: Statistical significance. Values < 0.05 indicate significant differences.
- cohens_d_paired: Effect size. |d| > 0.8 is large, 0.5-0.8 medium, < 0.5 small.
4. stat_benchmark_report.json
Experiment metadata and configuration.
Statistical Methods
Bootstrap Confidence Intervals
The framework uses bootstrap resampling to estimate confidence intervals:
- Non-parametric (no distribution assumptions)
- Quantifies uncertainty in mean estimates
- Valid for small sample sizes
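The percentile bootstrap described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the framework's exact code; it uses the same `np.random.default_rng(42)` seeding mentioned under Best Practices:

```python
import numpy as np

def bootstrap_ci(values, confidence=0.95, n_resamples=10_000, seed=42):
    """Percentile bootstrap confidence interval for the mean (non-parametric)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    resample_means = rng.choice(values, size=(n_resamples, values.size),
                                replace=True).mean(axis=1)
    alpha = 1.0 - confidence
    lo, hi = np.quantile(resample_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lo, hi

# Illustrative per-run latency measurements (ms).
latencies_ms = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
mean, ci_low, ci_high = bootstrap_ci(latencies_ms)
```

Because the interval is built from resampled means rather than a normal approximation, it remains valid for the small run counts typical of benchmarking.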
Paired t-tests
For scenario comparisons, the framework uses paired t-tests:
- Same test data and hardware for all runs
- Higher statistical power by controlling for run-to-run variation
- Detects smaller performance differences
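A paired comparison of this kind can be reproduced with `scipy.stats.ttest_rel`. The data below is synthetic and stands in for per-run latencies of a scenario versus the onnx_fp32 baseline; this is a sketch of the method, not the framework's code:

```python
import numpy as np
from scipy import stats

# Paired latencies (ms): measurement i of each array comes from the same run,
# so ttest_rel controls for run-to-run variation.
baseline = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3])  # e.g. onnx_fp32
scenario = np.array([9.4, 9.0, 9.9, 9.2, 9.3, 9.5])        # e.g. onnx_int8

t_stat, p_value = stats.ttest_rel(scenario, baseline)
mean_diff = float((scenario - baseline).mean())  # negative => scenario is faster
```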
Effect Size (Cohen’s d)
Effect size quantifies the practical magnitude of differences:
- |d| < 0.5: Small effect (may not be practically significant)
- 0.5 ≤ |d| < 0.8: Medium effect
- |d| ≥ 0.8: Large effect (likely meaningful in production)
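For paired data, Cohen's d is the mean of the per-run differences divided by their sample standard deviation. A minimal sketch (reusing the synthetic latencies from the t-test discussion; not the framework's exact implementation):

```python
import numpy as np

def cohens_d_paired(scenario, baseline):
    """Paired Cohen's d: mean of differences over std of differences."""
    diff = np.asarray(scenario, dtype=float) - np.asarray(baseline, dtype=float)
    return diff.mean() / diff.std(ddof=1)

baseline = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
scenario = [9.4, 9.0, 9.9, 9.2, 9.3, 9.5]
d = cohens_d_paired(scenario, baseline)  # consistent speedup => large |d|
```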
Visualization Dashboard
Generate visual summaries and composite scores.
Outputs
benchmark_summary.json
Best-performing scenarios by dimension:
- best_composite: Highest overall score
- lowest_latency: Fastest inference
- lowest_memory: Smallest footprint
- lowest_energy: Most energy-efficient
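Selecting the best scenario along each dimension is a simple min/max over the per-scenario summary. The metric values and field names below are illustrative assumptions, not real benchmark output:

```python
# Illustrative per-scenario metrics; real values come from stat_benchmark_summary.csv.
summary = {
    "sklearn_fp32": {"latency_ms": 15.2, "memory_mb": 310.0, "energy_uj": 5.1e6, "composite": 0.61},
    "onnx_fp32":    {"latency_ms": 12.0, "memory_mb": 250.0, "energy_uj": 4.2e6, "composite": 0.78},
    "onnx_int8":    {"latency_ms":  9.3, "memory_mb": 190.0, "energy_uj": 3.1e6, "composite": 0.85},
}

best = {
    "best_composite": max(summary, key=lambda s: summary[s]["composite"]),
    "lowest_latency": min(summary, key=lambda s: summary[s]["latency_ms"]),
    "lowest_memory":  min(summary, key=lambda s: summary[s]["memory_mb"]),
    "lowest_energy":  min(summary, key=lambda s: summary[s]["energy_uj"]),
}
```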
Best Practices
Fixed seeds for reproducibility
The framework uses np.random.default_rng(42) for bootstrap sampling. Use consistent random seeds in your model training pipeline for end-to-end reproducibility.
Minimum run requirements
- Require at least 5 runs for preliminary analysis
- Use 10+ runs for statistical comparisons
- Run 30+ iterations for publication-quality benchmarks
Interpreting results
Look at both central tendency (mean) and dispersion (confidence intervals):
- Wide confidence intervals suggest high variability
- Check if confidence intervals overlap between scenarios
- Don’t rank scenarios solely on mean values
Hardware consistency
- Run all scenarios on the same hardware
- Minimize background processes during benchmarking
- Use consistent power settings (disable CPU frequency scaling if possible)
Quality Gates
Recommended thresholds for release decisions:
- p-value < 0.05: Difference is statistically significant
- |Cohen’s d| > 0.5: Difference is practically meaningful
- Non-overlapping CIs: Strong evidence of performance gap
- Throughput delta > 10%: Likely worth productionizing
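These gates combine naturally into a single release check. A minimal sketch; the function and its signature are illustrative, not part of the framework:

```python
def passes_quality_gates(p_value, cohens_d, ci_overlap, throughput_delta_pct):
    """Apply the recommended release thresholds; all gates must pass."""
    return (p_value < 0.05                    # statistically significant
            and abs(cohens_d) > 0.5           # practically meaningful
            and not ci_overlap                # non-overlapping CIs
            and throughput_delta_pct > 10.0)  # worth productionizing

ok = passes_quality_gates(p_value=0.003, cohens_d=1.4,
                          ci_overlap=False, throughput_delta_pct=18.0)
```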
Assumptions and Limitations
Implementation Reference
Key implementation details from benchmarking/statistical_benchmark.py:
Single Run Execution
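The shape of a single run follows directly from the metrics above: warm up, then time one inference pass and derive per-sample latency and throughput. This is a hedged sketch of the idea, not the actual code in statistical_benchmark.py:

```python
import time

def run_once(predict, batch, warmup=3):
    """Time one benchmark run: warmup iterations, then a timed inference pass."""
    for _ in range(warmup):            # JIT compilation / cache warming
        predict(batch)
    start = time.perf_counter()
    predict(batch)
    elapsed_s = time.perf_counter() - start
    n = len(batch)
    return {
        "latency_ms_per_sample": 1000.0 * elapsed_s / n,
        "throughput_samples_per_s": n / elapsed_s,
    }

# Dummy model standing in for sklearn/ONNX inference.
result = run_once(lambda xs: [x * 2 for x in xs], batch=list(range(256)))
```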
RAPL Energy Profiling
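On Intel Linux systems the RAPL package energy counter is exposed through the powercap sysfs interface. A sketch of reading it, degrading gracefully when RAPL is unavailable; the exact path and fallback behavior in the framework may differ:

```python
from pathlib import Path

# Package-domain energy counter under the Linux powercap interface.
RAPL_ENERGY = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

def read_energy_uj():
    """Return the cumulative energy counter in microjoules, or None when RAPL
    is unavailable (non-Intel hardware, containers, insufficient permissions)."""
    try:
        return int(RAPL_ENERGY.read_text().strip())
    except (OSError, ValueError):
        return None

before = read_energy_uj()
# ... run inference here ...
after = read_energy_uj()
# Note: the counter wraps at max_energy_range_uj; production code should
# handle wraparound rather than trusting a raw difference.
energy_uj = (after - before) if (before is not None and after is not None) else None
```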
Next Steps
Hardware-Aware Optimization
Analyze hardware-level trade-offs with detailed telemetry
Performance Tuning
Learn strategies for interpreting and acting on benchmark results