
## Overview

The benchmark module provides utilities for running repeated benchmarks with statistical confidence intervals and analyzing tabular metrics.

## Data Classes

### BenchmarkResult

Contains the results of a repeated benchmark run with confidence intervals.
```python
@dataclass
class BenchmarkResult:
    metric_mean: float
    metric_std: float
    metric_ci_margin: float
    latency_mean_ms: float
    latency_std_ms: float
    latency_ci_margin_ms: float
    runs: int
    confidence_level: float
```
| Field | Type | Description |
| --- | --- | --- |
| `metric_mean` | `float` | Mean value of the measured metric across all runs |
| `metric_std` | `float` | Standard deviation of the metric |
| `metric_ci_margin` | `float` | Confidence interval margin for the metric |
| `latency_mean_ms` | `float` | Mean latency in milliseconds across all runs |
| `latency_std_ms` | `float` | Standard deviation of latency in milliseconds |
| `latency_ci_margin_ms` | `float` | Confidence interval margin for latency in milliseconds |
| `runs` | `int` | Number of benchmark runs executed |
| `confidence_level` | `float` | Confidence level used for interval calculation (e.g., 0.95 for 95%) |
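The two `*_ci_margin` fields are half-widths of two-sided confidence intervals around the corresponding means. As an illustrative sketch (the helper name `ci_margin` and the normal approximation are assumptions; the module may instead use a t-distribution for small run counts), the margin can be derived from the standard deviation and run count:

```python
from statistics import NormalDist

def ci_margin(std: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of a two-sided confidence interval (normal approximation)."""
    # Critical value for the requested two-sided confidence level,
    # e.g. ~1.96 for confidence=0.95.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    # Standard error of the mean scales with 1/sqrt(n).
    return z * std / n ** 0.5
```

A reported value of `metric_mean ± metric_ci_margin` then covers the true mean at roughly the requested confidence level.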

## Functions

### run_repeated_benchmark

Executes a function multiple times and returns benchmark statistics with confidence intervals.
```python
def run_repeated_benchmark(
    run_fn,
    metric_key: str,
    runs: int = 5,
    confidence: float = 0.95
) -> BenchmarkResult
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `run_fn` | `callable` | required | Function to benchmark. Must return a dictionary containing the metric specified by `metric_key`. |
| `metric_key` | `str` | required | Key to extract from the function's return dictionary for metric tracking. |
| `runs` | `int` | `5` | Number of times to run the function. A minimum of 2 runs is enforced. |
| `confidence` | `float` | `0.95` | Confidence level for interval calculation (0-1). |

**Returns:** A `BenchmarkResult` containing the mean, standard deviation, and confidence interval margins for both the metric and latency measurements.
**Example:**

```python
from evaluation.benchmark import run_repeated_benchmark

def my_model():
    # Run your model
    return {"accuracy": 0.95, "f1": 0.92}

result = run_repeated_benchmark(
    run_fn=my_model,
    metric_key="accuracy",
    runs=10,
    confidence=0.95
)

print(f"Accuracy: {result.metric_mean:.3f} ± {result.metric_ci_margin:.3f}")
print(f"Latency: {result.latency_mean_ms:.2f} ± {result.latency_ci_margin_ms:.2f} ms")
```
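Conceptually, the function runs a timing loop that collects a metric value and a wall-clock latency per iteration. The sketch below is illustrative only, not the module's actual implementation; the helper name `timed_runs` and the use of `time.perf_counter` are assumptions:

```python
import time
import statistics

def timed_runs(run_fn, metric_key: str, runs: int = 5):
    """Hypothetical sketch: collect per-run metric values and latencies."""
    runs = max(runs, 2)  # the docs note a minimum of 2 runs is enforced
    metrics, latencies_ms = [], []
    for _ in range(runs):
        start = time.perf_counter()
        out = run_fn()  # must return a dict containing metric_key
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        metrics.append(out[metric_key])
    return statistics.mean(metrics), statistics.mean(latencies_ms)
```

Per-run timing (rather than timing the whole loop once) is what makes a latency standard deviation, and hence a latency confidence interval, possible.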

### benchmark_table_metrics

Computes confidence intervals for multiple metric columns in a DataFrame.
```python
def benchmark_table_metrics(
    df,
    metric_columns: list[str],
    confidence: float = 0.95
) -> dict[str, dict[str, float]]
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `df` | `pandas.DataFrame` | required | DataFrame containing the metric columns to analyze. |
| `metric_columns` | `list[str]` | required | List of column names to compute statistics for. |
| `confidence` | `float` | `0.95` | Confidence level for interval calculation (0-1). |

**Returns:** `dict[str, dict[str, float]]` mapping each metric column name to a dictionary containing:

- `mean`: Mean value
- `std`: Standard deviation
- `ci_margin`: Confidence interval margin
- `confidence`: Confidence level used
**Example:**

```python
import pandas as pd
from evaluation.benchmark import benchmark_table_metrics

df = pd.DataFrame({
    "accuracy": [0.95, 0.94, 0.96, 0.95],
    "f1_score": [0.92, 0.91, 0.93, 0.92]
})

summary = benchmark_table_metrics(
    df,
    metric_columns=["accuracy", "f1_score"],
    confidence=0.95
)

for metric, stats in summary.items():
    print(f"{metric}: {stats['mean']:.3f} ± {stats['ci_margin']:.3f}")
```
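For a single column of values, the statistics in each returned dictionary could be reproduced with the standard library alone. The helper below is a hypothetical sketch using a sample standard deviation and a normal approximation (the module may instead use a t-distribution):

```python
import statistics
from statistics import NormalDist

def column_stats(values, confidence: float = 0.95) -> dict[str, float]:
    """Hypothetical sketch of the per-column statistics described above."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided critical value
    return {
        "mean": mean,
        "std": std,
        "ci_margin": z * std / len(values) ** 0.5,
        "confidence": confidence,
    }
```

This mirrors the keys of the returned inner dictionaries (`mean`, `std`, `ci_margin`, `confidence`), so the same reporting loop as in the example works on its output.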
