## Quick Start

### Command Line Usage

Run repeated benchmarks with statistical analysis.

### Programmatic Usage
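As a sketch of programmatic use, repeated runs can be aggregated per metric. The `run_benchmark` helper below is a hypothetical stand-in for the project's real entry point; only the mean/std aggregation is the point here:

```python
import statistics

def run_benchmark(precision, seed):
    # Hypothetical stand-in for the project's real benchmark call.
    # Returns one run's metrics; values here are synthetic.
    return {"train_time": 10.0 + 0.1 * seed, "accuracy": 0.9}

def repeat_benchmark(precision, repeats=5):
    """Run the benchmark `repeats` times with distinct seeds and
    aggregate each metric's mean and sample standard deviation."""
    runs = [run_benchmark(precision, seed) for seed in range(repeats)]
    summary = {}
    for metric in runs[0]:
        values = [r[metric] for r in runs]
        summary[f"{metric}_mean"] = statistics.mean(values)
        summary[f"{metric}_std"] = statistics.stdev(values)
    return summary

stats = repeat_benchmark("fp16", repeats=5)
print(stats["train_time_mean"], stats["train_time_std"])
```

Swapping in the tool's actual runner gives the same `*_mean` / `*_std` keys described under Aggregated Metrics below.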
## Statistical Metrics

### Aggregated Metrics

Each precision mode reports the mean, standard deviation, and 95% confidence interval for every metric:

- `train_time_mean`: Mean training time per epoch
- `train_time_std`: Standard deviation of training time
- `train_time_ci95`: 95% confidence interval half-width (±)
- `latency_mean`: Mean inference latency per sample
- `latency_std`: Standard deviation of latency
- `latency_ci95`: 95% confidence interval half-width (±)
- `memory_mean`: Mean peak memory usage
- `memory_std`: Standard deviation of peak memory
- `memory_ci95`: 95% confidence interval half-width (±)
- `accuracy_mean`: Mean final training accuracy
- `accuracy_std`: Standard deviation of accuracy
- `accuracy_ci95`: 95% confidence interval half-width (±)
- `energy_mean`: Mean energy per epoch
- `energy_std`: Standard deviation of energy
- `energy_ci95`: 95% confidence interval half-width (±)
### Confidence Intervals

The system computes 95% confidence intervals using the t-distribution:

- Narrow CI: low variance, consistent results
- Wide CI: high variance, inconsistent performance
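A minimal sketch of this computation, using hard-coded two-sided t critical values from standard tables rather than the tool's actual implementation:

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for small samples
# (keyed by df = n - 1), taken from standard tables.
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def ci95(values):
    """Half-width of the 95% confidence interval for the mean,
    via the t-distribution: t * s / sqrt(n)."""
    n = len(values)
    s = statistics.stdev(values)  # sample standard deviation
    return T95[n - 1] * s / math.sqrt(n)

latencies = [12.1, 11.8, 12.4]  # e.g. ms/sample over 3 repeats
print(statistics.mean(latencies), "±", ci95(latencies))
```

With only 3 repeats the critical value (4.303) is large, so the interval is wide even for modest variance, which is why more repeats tighten the CI quickly.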
### Example Output
## Output Files

Results are saved to `benchmarks/statistical/`:

### raw_runs.csv

Individual benchmark runs.

### summary_stats.csv

Aggregated statistics.

### summary_stats.json

JSON format for programmatic access.

## Visualization
### Accuracy vs Latency

Plot trade-offs between accuracy and inference latency:
### Accuracy vs Energy

Plot trade-offs between accuracy and energy consumption:

### Accuracy vs Memory

Plot trade-offs between accuracy and memory usage:
## Pareto Frontier Analysis

Identify optimal configurations on the Pareto frontier:

### Pareto Optimality

A configuration is Pareto-optimal if no other configuration is at least as good in every objective and strictly better in at least one.

### Interpreting the Pareto Frontier

- On the frontier: optimal trade-off (cannot improve one metric without degrading another)
- Below the frontier: suboptimal (at least one other configuration is strictly better)
- Above the frontier: theoretically ideal, but unachievable with the current configurations
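The dominance check above can be sketched in a few lines. Metric names and values are illustrative; here higher accuracy and lower latency are better:

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every objective and
    strictly better in at least one (higher accuracy, lower latency)."""
    acc_a, lat_a = a
    acc_b, lat_b = b
    return (acc_a >= acc_b and lat_a <= lat_b) and (acc_a > acc_b or lat_a < lat_b)

def pareto_frontier(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# (accuracy, latency in ms) -- illustrative configurations
configs = [(0.95, 2.1), (0.94, 1.3), (0.93, 0.8), (0.92, 1.5)]
print(pareto_frontier(configs))  # (0.92, 1.5) is dominated by (0.93, 0.8)
```

Extending `dominates` to more objectives (energy, memory) is a matter of adding the corresponding comparisons.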
## Advanced Usage

### Custom Output Directory

### Analyzing Results

### Comparing Multiple Runs
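One way to compare two campaigns is to diff their `summary_stats.csv` files and flag differences whose confidence intervals overlap. The file contents below are illustrative, with column names mirroring the aggregated metrics above:

```python
import csv
import io

# Hypothetical summary_stats.csv contents from two separate campaigns.
run_a = """precision,latency_mean,latency_ci95
fp16,1.30,0.05
int8,0.80,0.04
"""
run_b = """precision,latency_mean,latency_ci95
fp16,1.25,0.06
int8,0.82,0.05
"""

def load(text):
    # Index rows by precision mode for easy lookup.
    return {row["precision"]: row for row in csv.DictReader(io.StringIO(text))}

a, b = load(run_a), load(run_b)
for precision in a:
    delta = float(b[precision]["latency_mean"]) - float(a[precision]["latency_mean"])
    # If the two intervals overlap, the difference may not be meaningful.
    overlap = abs(delta) < (float(a[precision]["latency_ci95"])
                            + float(b[precision]["latency_ci95"]))
    print(precision, f"delta={delta:+.2f} ms", "overlapping CIs" if overlap else "distinct")
```

In a real comparison, replace the inline strings with `open(...)` on the two output directories.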
## Reproducibility

Each repeat uses a different seed.

### Choosing Number of Repeats
| Repeats | Use Case | CI Accuracy |
|---|---|---|
| 3-5 | Quick validation | Low |
| 5-10 | Development testing | Medium |
| 10-20 | Production analysis | High |
| 20+ | Research/publication | Very high |
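The table reflects how the CI half-width `t * s / sqrt(n)` shrinks as repeats increase; a quick illustration with a fixed sample standard deviation of 1.0 (t values from standard tables):

```python
import math

# Two-sided 95% t critical values, keyed by df = n - 1.
T95 = {2: 4.303, 4: 2.776, 9: 2.262, 19: 2.093}

for n in (3, 5, 10, 20):
    half_width = T95[n - 1] * 1.0 / math.sqrt(n)
    print(f"{n:>2} repeats -> CI half-width ~ {half_width:.2f} x std")
```

Going from 3 to 10 repeats cuts the half-width by roughly a factor of 3, while 10 to 20 gains much less, which is the trade-off the table summarizes.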
## Best Practices
1. Use Sufficient Repeats
2. Check Confidence Intervals
3. Validate Pareto Frontier
4. Report Statistics Properly
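For practice 4, always report a mean together with its interval, e.g. `mean ± ci95`. A tiny formatting helper (hypothetical, not part of the tool):

```python
def format_metric(mean, ci95, unit=""):
    """Render a metric as 'mean ± 95% CI', e.g. for tables in a report."""
    return f"{mean:.2f} ± {ci95:.2f}{unit}"

print(format_metric(1.30, 0.05, " ms"))  # -> 1.30 ± 0.05 ms
```

Reporting the bare mean hides run-to-run variance; pairing it with the CI makes comparisons between configurations honest.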
## Integration with Other Tools

### With Benchmarking

### With Reproducibility
## Next Steps

- Benchmarking: run single benchmark configurations
- Hardware Simulation: test under hardware constraints