Overview

The statistical analysis module provides tools for running repeated benchmarks, computing statistical aggregates with confidence intervals, and visualizing tradeoffs through Pareto frontier plots.

run_repeated_benchmarks

Runs multiple benchmark iterations across different precision modes and computes statistical aggregates.

Parameters

layer_sizes
list[int]
required
List of layer dimensions for the neural network.
activations
list[str]
required
List of activation functions for each layer.
precision_modes
list[str]
required
List of precision modes to benchmark (e.g., ["float32", "float16", "int8"]).
batch_size
int
required
Batch size for training and inference.
n_samples
int
required
Number of samples in the dataset.
epochs
int
required
Number of training epochs per benchmark run.
repeats
int
required
Number of times to repeat each benchmark configuration.
seed
int
required
Base random seed. Each repeat uses seed + repeat_index.
output_dir
Path
Directory for output files. Defaults to repository root + benchmarks/statistical.
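The per-repeat seeding described for seed can be sketched as follows (only the documented seed + repeat_index rule is assumed):

```python
base_seed = 42
repeats = 5
# Each repeat i runs with base seed + repeat index, per the seed parameter above
per_repeat_seeds = [base_seed + i for i in range(repeats)]
print(per_repeat_seeds)
```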

Returns

List of summary statistics dictionaries, one per precision mode. Each dictionary contains:
  • precision_mode (str): Precision mode
  • repeats (int): Number of repeats
  • train_time_mean (float): Mean training time per epoch (seconds)
  • train_time_std (float): Standard deviation of training time
  • train_time_ci95 (float): 95% confidence interval for training time
  • latency_mean (float): Mean inference latency per sample (seconds)
  • latency_std (float): Standard deviation of latency
  • latency_ci95 (float): 95% confidence interval for latency
  • memory_mean (float): Mean peak memory usage (MB)
  • memory_std (float): Standard deviation of memory
  • memory_ci95 (float): 95% confidence interval for memory
  • accuracy_mean (float): Mean final training accuracy
  • accuracy_std (float): Standard deviation of accuracy
  • accuracy_ci95 (float): 95% confidence interval for accuracy
  • energy_mean (float): Mean energy per epoch (Joules)
  • energy_std (float): Standard deviation of energy
  • energy_ci95 (float): 95% confidence interval for energy

Output Files

The function creates the following files in output_dir:
  1. raw_runs.csv - Raw data from all benchmark runs
  2. summary_stats.csv - Aggregated statistics with confidence intervals
  3. summary_stats.json - Same summary data in JSON format
  4. accuracy_vs_latency.png - Scatter plot of accuracy vs latency
  5. accuracy_vs_energy.png - Scatter plot of accuracy vs energy
  6. accuracy_vs_memory.png - Scatter plot of accuracy vs memory
  7. pareto_frontier.png - Pareto frontier plot showing optimal tradeoffs

Example

from statistical_analysis import run_repeated_benchmarks
from pathlib import Path

summary = run_repeated_benchmarks(
    layer_sizes=[784, 128, 64, 10],
    activations=["relu", "relu", "softmax"],
    precision_modes=["float32", "float16", "int8"],
    batch_size=32,
    n_samples=1000,
    epochs=10,
    repeats=5,
    seed=42,
    output_dir=Path("benchmarks/my_analysis")
)

for stat in summary:
    mode = stat["precision_mode"]
    acc = stat["accuracy_mean"]
    acc_ci = stat["accuracy_ci95"]
    lat = stat["latency_mean"]
    print(f"{mode}: accuracy={acc:.4f}±{acc_ci:.4f}, latency={lat:.6f}s")

Statistical Functions

These helper functions are used internally but can be useful for custom analysis.

_ci95

Computes the 95% confidence interval for a list of values. Parameters:
  • values (list[float]): Sample values
Returns: float - Half-width of the 95% confidence interval, using the normal approximation (1.96 × SEM).
Formula: 1.96 × (std_dev / sqrt(n))
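The formula can be sketched as below. This is a hypothetical implementation, not the module's actual code: it assumes the sample standard deviation (ddof=1) and returns 0.0 for fewer than two values, details the real _ci95 may handle differently.

```python
import math

def ci95(values: list[float]) -> float:
    """Half-width of the 95% CI: 1.96 x (std_dev / sqrt(n))."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    # Sample standard deviation (ddof = 1)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 1.96 * std / math.sqrt(n)
```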

_save_csv

Saves a list of dictionaries to CSV format. Parameters:
  • rows (list[dict]): Data rows
  • path (Path): Output file path
Returns: None
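The behavior described above can be approximated with the standard library's csv.DictWriter — a sketch under the stated signature, not the module's actual code:

```python
import csv
import tempfile
from pathlib import Path

def save_csv(rows: list[dict], path: Path) -> None:
    """Write a list of dicts to CSV; the header comes from the first row's keys."""
    if not rows:
        return
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

# Round-trip demo with hypothetical summary data
rows = [{"precision_mode": "float32", "accuracy_mean": 0.95}]
out = Path(tempfile.mkdtemp()) / "summary_stats.csv"
save_csv(rows, out)
loaded = list(csv.DictReader(out.open()))
```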

Visualization Functions

_save_tradeoff_plots

Generates scatter plots showing tradeoffs between accuracy and other metrics. Parameters:
  • summary_rows (list[dict]): Summary statistics from benchmarks
  • out_dir (Path): Output directory
Returns: None
Plots generated:
  • Accuracy vs Latency
  • Accuracy vs Energy
  • Accuracy vs Memory

_save_pareto_plot

Generates a Pareto frontier plot for accuracy vs latency tradeoffs. Parameters:
  • summary_rows (list[dict]): Summary statistics from benchmarks
  • out_dir (Path): Output directory
Returns: None
Description: Identifies and plots the Pareto frontier - configurations that are not dominated by any other configuration (i.e., no other config has both better accuracy and lower latency).

Understanding the Results

Confidence Intervals

The 95% confidence interval (CI) half-width means that if you repeated the experiment many times, approximately 95% of the resulting intervals (mean ± CI) would contain the true value. Smaller CIs indicate more reliable estimates.
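A quick illustration of how the interval narrows as repeats increase, using synthetic accuracy samples (not real benchmark output) and the same 1.96 × SEM rule noted elsewhere in this page:

```python
import math
import random
import statistics

random.seed(0)

def half_width(values):
    # 1.96 x SEM, with the sample standard deviation
    return 1.96 * statistics.stdev(values) / math.sqrt(len(values))

small = [random.gauss(0.90, 0.01) for _ in range(5)]   # 5 repeats
large = [random.gauss(0.90, 0.01) for _ in range(50)]  # 50 repeats
print(f"5 repeats:  ±{half_width(small):.5f}")
print(f"50 repeats: ±{half_width(large):.5f}")
```

With the same underlying variance, the half-width shrinks roughly as 1/sqrt(repeats).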

Pareto Frontier

The Pareto frontier shows configurations that represent optimal tradeoffs. A configuration is on the frontier if no other configuration has both higher accuracy AND lower latency. Each frontier point is a meaningful choice; which one is best depends on your priorities. Configurations not on the frontier are strictly dominated and should generally be avoided.
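The non-domination rule can be written directly from the definition above. This is a hypothetical helper operating on summary rows with invented example values; the module's _save_pareto_plot applies the same criterion before plotting:

```python
def pareto_frontier(summary_rows: list[dict]) -> list[dict]:
    """Keep rows not strictly dominated on (accuracy: maximize, latency: minimize)."""
    return [
        a for a in summary_rows
        if not any(
            b["accuracy_mean"] > a["accuracy_mean"]
            and b["latency_mean"] < a["latency_mean"]
            for b in summary_rows
        )
    ]

# Hypothetical summary rows (not real benchmark output)
rows = [
    {"precision_mode": "float32", "accuracy_mean": 0.95, "latency_mean": 2.0e-4},
    {"precision_mode": "float16", "accuracy_mean": 0.94, "latency_mean": 1.0e-4},
    {"precision_mode": "int8",    "accuracy_mean": 0.90, "latency_mean": 1.5e-4},
]
frontier = pareto_frontier(rows)
```

Here int8 is dominated by float16 (lower accuracy and higher latency), so only float32 and float16 land on the frontier.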

Example Interpretation

summary = run_repeated_benchmarks(...)
float32_stats = summary[0]  # Assuming float32 is first

print(f"Training time: {float32_stats['train_time_mean']:.4f} "
      f"± {float32_stats['train_time_ci95']:.4f}s")
print(f"Accuracy: {float32_stats['accuracy_mean']:.2%} "
      f"± {float32_stats['accuracy_ci95']:.2%}")

CLI Usage

Run statistical benchmarks from the command line:
python statistical_analysis.py --repeats 10 --seed 42
Arguments:
  • --repeats: Number of repetitions (default: 5)
  • --seed: Random seed (default: 42)
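A minimal sketch of what this CLI entry point might look like with argparse — the flag names and defaults come from the Arguments list above; everything else is an assumption:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Run repeated statistical benchmarks"
)
parser.add_argument("--repeats", type=int, default=5,
                    help="Number of repetitions (default: 5)")
parser.add_argument("--seed", type=int, default=42,
                    help="Random seed (default: 42)")

# Parsing the documented invocation: --repeats 10 --seed 42
args = parser.parse_args(["--repeats", "10", "--seed", "42"])
print(args.repeats, args.seed)
```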

Complete Workflow Example

from statistical_analysis import run_repeated_benchmarks
from pathlib import Path
import json

# Run comprehensive benchmark
results = run_repeated_benchmarks(
    layer_sizes=[32, 64, 4],
    activations=["relu", "softmax"],
    precision_modes=["float32", "float16", "int8"],
    batch_size=32,
    n_samples=256,
    epochs=5,
    repeats=10,
    seed=123,
    output_dir=Path("results/statistical")
)

# Find best configuration for each metric
best_accuracy = max(results, key=lambda x: x["accuracy_mean"])
best_latency = min(results, key=lambda x: x["latency_mean"])
best_memory = min(results, key=lambda x: x["memory_mean"])

print("Best Configurations:")
print(f"  Accuracy: {best_accuracy['precision_mode']} "
      f"({best_accuracy['accuracy_mean']:.2%})")
print(f"  Latency: {best_latency['precision_mode']} "
      f"({best_latency['latency_mean']:.6f}s)")
print(f"  Memory: {best_memory['precision_mode']} "
      f"({best_memory['memory_mean']:.2f} MB)")

# Load the summary statistics for deeper analysis
with open("results/statistical/summary_stats.json") as f:
    summary = json.load(f)
    
for stat in summary:
    reliability = stat["accuracy_ci95"] / stat["accuracy_mean"]
    print(f"{stat['precision_mode']}: "
          f"relative CI = {reliability:.2%}")

Notes

  • Uses the benchmark_one_setup function from the benchmark module for individual runs
  • All plots are saved at 140 DPI for high quality
  • Raw data is preserved in CSV format for custom post-processing
  • Confidence intervals use the normal approximation (1.96 × SEM)
  • Pareto frontier calculation uses non-domination criterion on latency (minimize) and accuracy (maximize)