Overview
The statistical analysis module provides tools for running repeated benchmarks, computing statistical aggregates with confidence intervals, and visualizing tradeoffs through Pareto frontier plots.

run_repeated_benchmarks
Runs multiple benchmark iterations across different precision modes and computes statistical aggregates.

Parameters
- List of layer dimensions for the neural network.
- List of activation functions for each layer.
- List of precision modes to benchmark (e.g., ["float32", "float16", "int8"]).
- Batch size for training and inference.
- Number of samples in the dataset.
- Number of training epochs per benchmark run.
- Number of times to repeat each benchmark configuration.
- Base random seed. Each repeat uses seed + repeat_index.
- Directory for output files. Defaults to repository root + benchmarks/statistical.

Returns
List of summary statistics dictionaries, one per precision mode. Each dictionary contains:
- precision_mode (str): Precision mode
- repeats (int): Number of repeats
- train_time_mean (float): Mean training time per epoch (seconds)
- train_time_std (float): Standard deviation of training time
- train_time_ci95 (float): 95% confidence interval for training time
- latency_mean (float): Mean inference latency per sample (seconds)
- latency_std (float): Standard deviation of latency
- latency_ci95 (float): 95% confidence interval for latency
- memory_mean (float): Mean peak memory usage (MB)
- memory_std (float): Standard deviation of memory
- memory_ci95 (float): 95% confidence interval for memory
- accuracy_mean (float): Mean final training accuracy
- accuracy_std (float): Standard deviation of accuracy
- accuracy_ci95 (float): 95% confidence interval for accuracy
- energy_mean (float): Mean energy per epoch (Joules)
- energy_std (float): Standard deviation of energy
- energy_ci95 (float): 95% confidence interval for energy
Output Files
The function creates the following files in output_dir:
- raw_runs.csv - Raw data from all benchmark runs
- summary_stats.csv - Aggregated statistics with confidence intervals
- summary_stats.json - Same summary data in JSON format
- accuracy_vs_latency.png - Scatter plot of accuracy vs latency
- accuracy_vs_energy.png - Scatter plot of accuracy vs energy
- accuracy_vs_memory.png - Scatter plot of accuracy vs memory
- pareto_frontier.png - Pareto frontier plot showing optimal tradeoffs
Example
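A minimal sketch of consuming a returned summary row. The row below uses the documented keys, with invented values for illustration; a real row comes from run_repeated_benchmarks.

```python
# Illustrative summary row: keys follow the documented format,
# the numeric values are invented for demonstration only.
row = {
    "precision_mode": "float16",
    "repeats": 5,
    "latency_mean": 0.0021,
    "latency_ci95": 0.00018,
    "accuracy_mean": 0.971,
    "accuracy_ci95": 0.0035,
}
print(
    f"{row['precision_mode']}: "
    f"latency {row['latency_mean'] * 1e3:.2f} ms "
    f"(95% CI ± {row['latency_ci95'] * 1e3:.2f} ms), "
    f"accuracy {row['accuracy_mean']:.3f}"
)
```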
Statistical Functions
These helper functions are used internally but can be useful for custom analysis.

_ci95
Computes the 95% confidence interval for a list of values.

Parameters:
- values (list[float]): Sample values

Returns: 1.96 × (std_dev / sqrt(n))
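A minimal reimplementation sketch of this formula, assuming the sample standard deviation (n − 1 denominator) and a zero-width interval for fewer than two samples; the module's exact choices are not stated here.

```python
import math

def ci95(values: list[float]) -> float:
    """Half-width of the normal-approximation 95% CI: 1.96 * std / sqrt(n)."""
    n = len(values)
    if n < 2:
        return 0.0  # assumption: degenerate samples yield a zero-width CI
    mean = sum(values) / n
    # Sample standard deviation (n - 1 denominator) -- an assumption here.
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return 1.96 * std / math.sqrt(n)

print(ci95([1.0, 2.0, 3.0, 4.0, 5.0]))
```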
_save_csv
Saves a list of dictionaries to CSV format.

Parameters:
- rows (list[dict]): Data rows
- path (Path): Output file path
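A sketch of how such a helper can be written with the standard library's csv.DictWriter; it assumes all rows share the same keys, which is true for the summary rows above but is not confirmed as the module's actual implementation.

```python
import csv
import tempfile
from pathlib import Path

def save_csv(rows: list[dict], path: Path) -> None:
    """Write dicts to CSV, using the first row's keys as the header.

    Sketch only: assumes every row has the same keys as the first."""
    if not rows:
        path.write_text("")  # assumption: empty input yields an empty file
        return
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# Demo with a throwaway temp directory.
out = Path(tempfile.mkdtemp()) / "summary_stats.csv"
save_csv([{"precision_mode": "float32", "latency_mean": 0.004}], out)
print(out.read_text())
```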
Visualization Functions
_save_tradeoff_plots
Generates scatter plots showing tradeoffs between accuracy and other metrics.

Parameters:
- summary_rows (list[dict]): Summary statistics from benchmarks
- out_dir (Path): Output directory

Plots generated:
- Accuracy vs Latency
- Accuracy vs Energy
- Accuracy vs Memory
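Plots like these can be sketched with matplotlib along the following lines. The data points are invented and this is not the module's actual plotting code; it only illustrates the accuracy-vs-latency scatter and the 140 DPI output mentioned in the notes below.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

# Invented example points: (latency_mean, accuracy_mean) per precision mode.
points = {
    "float32": (0.0040, 0.98),
    "float16": (0.0021, 0.97),
    "int8": (0.0012, 0.93),
}

fig, ax = plt.subplots()
for mode, (lat, acc) in points.items():
    ax.scatter(lat, acc, label=mode)
ax.set_xlabel("Latency (s)")
ax.set_ylabel("Accuracy")
ax.legend()

out = Path(tempfile.mkdtemp()) / "accuracy_vs_latency.png"
fig.savefig(out, dpi=140)
plt.close(fig)
print(out.exists())
```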
_save_pareto_plot
Generates a Pareto frontier plot for accuracy vs latency tradeoffs.

Parameters:
- summary_rows (list[dict]): Summary statistics from benchmarks
- out_dir (Path): Output directory
Understanding the Results
Confidence Intervals
The 95% confidence interval (CI) indicates that if you repeated the experiment many times, approximately 95% of the computed intervals mean ± CI would contain the true mean. Smaller CIs indicate more reliable estimates.
Pareto Frontier
The Pareto frontier shows configurations that represent optimal tradeoffs. A configuration is on the frontier if:
- No other configuration has both higher accuracy AND lower latency
- It represents a meaningful choice depending on your priorities
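The non-domination check above can be sketched as follows, using the documented latency_mean (minimize) and accuracy_mean (maximize) keys; this is an illustrative implementation, not the module's own.

```python
def pareto_frontier(rows: list[dict]) -> list[dict]:
    """Keep rows not dominated on (latency minimized, accuracy maximized)."""
    frontier = []
    for a in rows:
        # a is dominated if some b is at least as good on both axes
        # and strictly better on at least one.
        dominated = any(
            b["latency_mean"] <= a["latency_mean"]
            and b["accuracy_mean"] >= a["accuracy_mean"]
            and (b["latency_mean"] < a["latency_mean"]
                 or b["accuracy_mean"] > a["accuracy_mean"])
            for b in rows
        )
        if not dominated:
            frontier.append(a)
    return frontier

# Invented example rows: int8 is dominated by float16 (slower AND less accurate).
rows = [
    {"precision_mode": "float32", "latency_mean": 0.0040, "accuracy_mean": 0.98},
    {"precision_mode": "float16", "latency_mean": 0.0021, "accuracy_mean": 0.97},
    {"precision_mode": "int8",    "latency_mean": 0.0025, "accuracy_mean": 0.93},
]
print([r["precision_mode"] for r in pareto_frontier(rows)])  # → ['float32', 'float16']
```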
Example Interpretation
CLI Usage
Run statistical benchmarks from the command line. Options:
- --repeats: Number of repetitions (default: 5)
- --seed: Random seed (default: 42)
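A sketch of how a CLI front-end with these two options could look using argparse; the real entry point's name and any additional flags are not shown on this page.

```python
import argparse

# Hypothetical CLI front-end matching the documented options.
parser = argparse.ArgumentParser(description="Run statistical benchmarks.")
parser.add_argument("--repeats", type=int, default=5, help="Number of repetitions")
parser.add_argument("--seed", type=int, default=42, help="Random seed")

# Parse an empty argv here to show the defaults; a real CLI would
# call parser.parse_args() to read sys.argv.
args = parser.parse_args([])
print(args.repeats, args.seed)  # → 5 42
```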
Complete Workflow Example
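A hedged post-processing sketch: a hand-written summary_stats.json stands in for real benchmark output (in a real workflow, run_repeated_benchmarks would have written it to output_dir), and the code then loads it and picks the most accurate precision mode.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for real benchmark output: invented rows in the documented
# summary_stats.json format.
out_dir = Path(tempfile.mkdtemp())
dummy = [
    {"precision_mode": "float32", "accuracy_mean": 0.98, "latency_mean": 0.0040},
    {"precision_mode": "int8",    "accuracy_mean": 0.93, "latency_mean": 0.0012},
]
(out_dir / "summary_stats.json").write_text(json.dumps(dummy))

# Post-processing: load the summary and select the most accurate mode.
summary = json.loads((out_dir / "summary_stats.json").read_text())
best = max(summary, key=lambda row: row["accuracy_mean"])
print(best["precision_mode"])  # → float32
```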
Notes
- Uses the benchmark_one_setup function from the benchmark module for individual runs
- All plots are saved at 140 DPI for high quality
- Raw data is preserved in CSV format for custom post-processing
- Confidence intervals use the normal approximation (1.96 × SEM)
- Pareto frontier calculation uses the non-domination criterion on latency (minimize) and accuracy (maximize)