
Overview

The Hospital Data Analysis Platform provides robust benchmarking utilities with statistical confidence intervals. These tools enable reliable performance measurements that account for variance and provide confidence bounds.

BenchmarkResult Dataclass

Benchmark results are returned as a BenchmarkResult dataclass (defined in evaluation/benchmark.py:10):
@dataclass
class BenchmarkResult:
    metric_mean: float              # Mean of the target metric
    metric_std: float               # Standard deviation of the metric
    metric_ci_margin: float         # Confidence interval margin
    latency_mean_ms: float          # Mean latency in milliseconds
    latency_std_ms: float           # Standard deviation of latency
    latency_ci_margin_ms: float     # Confidence interval margin for latency
    runs: int                       # Number of benchmark iterations
    confidence_level: float         # Confidence level (e.g., 0.95)

Interpreting Results

The confidence interval is expressed as mean ± margin:
  • Metric: metric_mean ± metric_ci_margin
  • Latency: latency_mean_ms ± latency_ci_margin_ms
For a 95% confidence level, we can be 95% confident the true value lies within this range.
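As an illustration, with hypothetical values metric_mean = 0.847 and metric_ci_margin = 0.018, the interval bounds work out to:

```python
# Hypothetical benchmark numbers, for illustration only
metric_mean, metric_ci_margin = 0.847, 0.018

lower = metric_mean - metric_ci_margin
upper = metric_mean + metric_ci_margin
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")  # 95% CI: [0.829, 0.865]
```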

Repeated Benchmarks

Basic Usage

The run_repeated_benchmark function (defined in evaluation/benchmark.py:21) runs a function multiple times and computes statistics:
from evaluation.benchmark import run_repeated_benchmark
from config import CONFIG

def train_and_evaluate():
    # Your training and evaluation logic
    model = train_model()
    metrics = evaluate_model(model)
    return metrics  # Must be a dict with the metric_key

result = run_repeated_benchmark(
    run_fn=train_and_evaluate,
    metric_key="accuracy",
    runs=CONFIG.benchmark_runs,
    confidence=CONFIG.confidence_level
)

print(f"Accuracy: {result.metric_mean:.3f} ± {result.metric_ci_margin:.3f}")
print(f"Latency: {result.latency_mean_ms:.1f} ± {result.latency_ci_margin_ms:.1f} ms")

Function Signature

def run_repeated_benchmark(
    run_fn,                    # Callable that returns a dict with metrics
    metric_key: str,           # Key to extract from returned dict
    runs: int = 5,             # Number of iterations (minimum 2)
    confidence: float = 0.95   # Confidence level (0.90, 0.95, or 0.99)
) -> BenchmarkResult:

Implementation Details

import time

from evaluation.statistics import confidence_interval

def run_repeated_benchmark(run_fn, metric_key: str, runs: int = 5, confidence: float = 0.95) -> BenchmarkResult:
    effective_runs = max(2, int(runs))
    metrics = []
    latencies = []
    
    for _ in range(effective_runs):
        start = time.perf_counter()
        result = run_fn()
        elapsed_ms = (time.perf_counter() - start) * 1000
        metrics.append(float(result[metric_key]))
        latencies.append(elapsed_ms)
    
    m_mean, m_std, m_ci = confidence_interval(metrics, confidence=confidence)
    l_mean, l_std, l_ci = confidence_interval(latencies, confidence=confidence)
    return BenchmarkResult(m_mean, m_std, m_ci, l_mean, l_std, l_ci, effective_runs, confidence)
Key features:
  • Uses time.perf_counter() for high-resolution timing
  • Ensures minimum 2 runs for statistical validity
  • Computes confidence intervals for both metrics and latency

Table Metrics Benchmarking

Multi-Column Analysis

The benchmark_table_metrics function (defined in evaluation/benchmark.py:37) computes statistics across multiple DataFrame columns:
import pandas as pd
from evaluation.benchmark import benchmark_table_metrics
from config import CONFIG

# Load experiment results
df = pd.read_csv(CONFIG.output_dir / "experiment_results.csv")

# Compute statistics for multiple metrics
summary = benchmark_table_metrics(
    df=df,
    metric_columns=["accuracy", "precision", "recall", "f1_score"],
    confidence=CONFIG.confidence_level
)

for metric, stats in summary.items():
    print(f"{metric}: {stats['mean']:.3f} ± {stats['ci_margin']:.3f}")

Function Signature

def benchmark_table_metrics(
    df,                           # pandas DataFrame with results
    metric_columns: list[str],    # Column names to analyze
    confidence: float = 0.95      # Confidence level
) -> dict[str, dict[str, float]]:

Return Format

{
    "accuracy": {
        "mean": 0.847,
        "std": 0.023,
        "ci_margin": 0.018,
        "confidence": 0.95
    },
    "precision": {
        "mean": 0.832,
        "std": 0.031,
        "ci_margin": 0.024,
        "confidence": 0.95
    },
    # ... other metrics
}

Statistical Confidence Intervals

Implementation

The confidence_interval function (defined in evaluation/statistics.py:7) computes mean, standard deviation, and margin:
import math
import numpy as np

def confidence_interval(values: list[float], confidence: float = 0.95) -> tuple[float, float, float]:
    arr = np.array(values, dtype=float)
    mean = float(arr.mean())
    std = float(arr.std(ddof=1)) if len(arr) > 1 else 0.0
    
    if len(arr) <= 1:
        return mean, std, 0.0
    
    z_map = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}
    z = z_map.get(round(confidence, 2), 1.96)
    margin = z * std / math.sqrt(len(arr))
    return mean, std, float(margin)

Supported Confidence Levels

Confidence Level | Z-Score
-----------------|--------
0.90 (90%)       | 1.64
0.95 (95%)       | 1.96
0.99 (99%)       | 2.58
Note: If an unsupported confidence level is provided, the function defaults to 1.96 (95% confidence).
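The lookup-with-fallback behavior can be sketched in isolation (a minimal reimplementation of the table above, not the library code itself):

```python
# Sketch of the z-score lookup with a 95% fallback
Z_MAP = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def z_for(confidence: float) -> float:
    # Round to 2 decimals so e.g. 0.950001 still matches;
    # unknown levels fall back to the 95% z-score
    return Z_MAP.get(round(confidence, 2), 1.96)

print(z_for(0.99))  # 2.58
print(z_for(0.80))  # 1.96 (unsupported level -> default)
```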

Direct Usage

from evaluation.statistics import confidence_interval

latencies = [102.3, 98.7, 105.1, 99.8, 103.2]
mean, std, margin = confidence_interval(latencies, confidence=0.95)

print(f"Latency: {mean:.1f} ± {margin:.1f} ms")
print(f"Range: [{mean - margin:.1f}, {mean + margin:.1f}] ms")
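For this input, the same numbers can be reproduced with the standard library alone (a sketch mirroring the formula above, assuming the 95% z-score of 1.96):

```python
import math
import statistics

latencies = [102.3, 98.7, 105.1, 99.8, 103.2]
mean = statistics.mean(latencies)                # 101.82
std = statistics.stdev(latencies)                # sample std (ddof=1), ~2.58
margin = 1.96 * std / math.sqrt(len(latencies))  # ~2.26
print(f"Latency: {mean:.1f} ± {margin:.1f} ms")
```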

Complete Benchmarking Example

End-to-End Performance Evaluation

import pandas as pd
from pathlib import Path
from config import CONFIG
from utils.reproducibility import set_global_seed
from evaluation.benchmark import run_repeated_benchmark, benchmark_table_metrics

def train_and_evaluate():
    """Train a model and return evaluation metrics."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score
    
    # Generate synthetic data
    X, y = make_classification(n_samples=1000, n_features=10, random_state=CONFIG.random_seed)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=CONFIG.test_size)
    
    # Train model
    model = RandomForestClassifier(random_state=CONFIG.random_seed)
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted")
    }

def main():
    # Set seed for reproducibility
    set_global_seed(CONFIG.random_seed)
    
    print(f"Running benchmark with {CONFIG.benchmark_runs} iterations...\n")
    
    # Run repeated benchmark
    result = run_repeated_benchmark(
        run_fn=train_and_evaluate,
        metric_key="accuracy",
        runs=CONFIG.benchmark_runs,
        confidence=CONFIG.confidence_level
    )
    
    # Display results
    print(f"Accuracy: {result.metric_mean:.4f} ± {result.metric_ci_margin:.4f}")
    print(f"  Standard deviation: {result.metric_std:.4f}")
    print(f"  Confidence level: {result.confidence_level * 100:.0f}%")
    print(f"  {result.runs} runs\n")
    
    print(f"Latency: {result.latency_mean_ms:.2f} ± {result.latency_ci_margin_ms:.2f} ms")
    print(f"  Standard deviation: {result.latency_std_ms:.2f} ms")
    print(f"  Range: [{result.latency_mean_ms - result.latency_ci_margin_ms:.2f}, "
          f"{result.latency_mean_ms + result.latency_ci_margin_ms:.2f}] ms")

if __name__ == "__main__":
    main()

Multi-Experiment Analysis

import pandas as pd
from config import CONFIG
from evaluation.benchmark import benchmark_table_metrics

# Simulate running experiments with different configurations
experiments = []

for memory_limit in CONFIG.experiment_memory_limits_mb:
    for compute_budget in CONFIG.experiment_compute_budgets:
        # Run experiment with specific constraints
        result = {
            "memory_limit_mb": memory_limit,
            "compute_budget": compute_budget,
            "accuracy": 0.85 + (memory_limit / 1000) + (compute_budget / 100000),
            "latency_ms": 100 + (1000 / memory_limit) + (100000 / compute_budget)
        }
        experiments.append(result)

# Convert to DataFrame
df = pd.DataFrame(experiments)

# Compute statistics
summary = benchmark_table_metrics(
    df=df,
    metric_columns=["accuracy", "latency_ms"],
    confidence=0.95
)

print("Experiment Summary:\n")
for metric, stats in summary.items():
    print(f"{metric}:")
    print(f"  Mean: {stats['mean']:.4f}")
    print(f"  Std: {stats['std']:.4f}")
    print(f"  95% CI: ± {stats['ci_margin']:.4f}")
    print(f"  Range: [{stats['mean'] - stats['ci_margin']:.4f}, "
          f"{stats['mean'] + stats['ci_margin']:.4f}]\n")

Best Practices

1. Use Sufficient Iterations

More benchmark runs provide tighter confidence intervals:
# Minimal (fast but wide intervals)
result = run_repeated_benchmark(run_fn, "accuracy", runs=5)

# Recommended (balanced)
result = run_repeated_benchmark(run_fn, "accuracy", runs=10)

# High precision (slow but tight intervals)
result = run_repeated_benchmark(run_fn, "accuracy", runs=30)
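The effect follows from the margin formula z·s/√n: the interval narrows with the square root of the run count. A quick sketch with a fixed, hypothetical standard deviation:

```python
import math

z, std = 1.96, 2.0  # 95% z-score, illustrative standard deviation
for runs in (5, 10, 30):
    margin = z * std / math.sqrt(runs)
    print(f"runs={runs:2d}  margin={margin:.3f}")  # 1.753, 1.240, 0.716
```

Going from 5 to 30 runs shrinks the margin by a factor of √6 ≈ 2.4, at 6× the cost.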

2. Control for Variance

Reduce variance by:
  • Setting a consistent random seed
  • Running benchmarks on idle systems
  • Disabling background processes
  • Using the same hardware and environment

3. Report Complete Statistics

Always report mean, standard deviation, and confidence interval:
print(f"Accuracy: {result.metric_mean:.3f} ± {result.metric_ci_margin:.3f} "
      f"(std: {result.metric_std:.3f}, n={result.runs}, CI={result.confidence_level*100:.0f}%)")

4. Save Benchmark Results

import json

result_dict = {
    "metric_mean": result.metric_mean,
    "metric_ci": [result.metric_mean - result.metric_ci_margin, 
                  result.metric_mean + result.metric_ci_margin],
    "latency_mean_ms": result.latency_mean_ms,
    "latency_ci_ms": [result.latency_mean_ms - result.latency_ci_margin_ms,
                      result.latency_mean_ms + result.latency_ci_margin_ms],
    "runs": result.runs,
    "confidence_level": result.confidence_level
}

with open(CONFIG.output_dir / "benchmark_results.json", "w") as f:
    json.dump(result_dict, f, indent=2)

Interpreting Benchmarks

Comparing Results

Two benchmarks are statistically different if their confidence intervals don’t overlap:
result_a = run_repeated_benchmark(baseline_fn, "accuracy", runs=10)
result_b = run_repeated_benchmark(optimized_fn, "accuracy", runs=10)

# Check for overlap
ci_a_lower = result_a.metric_mean - result_a.metric_ci_margin
ci_a_upper = result_a.metric_mean + result_a.metric_ci_margin
ci_b_lower = result_b.metric_mean - result_b.metric_ci_margin
ci_b_upper = result_b.metric_mean + result_b.metric_ci_margin

if ci_a_upper < ci_b_lower or ci_b_upper < ci_a_lower:
    print("Results are statistically different")
else:
    print("Results overlap - difference may not be significant")

Performance Regression Detection

def check_regression(current_result, baseline_result, tolerance=0.05):
    """Check if performance has regressed beyond tolerance (assumes higher metric values are better)."""
    
    # Use lower bound of current vs upper bound of baseline
    current_lower = current_result.metric_mean - current_result.metric_ci_margin
    baseline_upper = baseline_result.metric_mean + baseline_result.metric_ci_margin
    
    regression = (baseline_upper - current_lower) / baseline_upper
    
    if regression > tolerance:
        print(f"WARNING: Performance regression detected ({regression*100:.1f}%)")
        return True
    return False
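For example, with hypothetical intervals of 0.90 ± 0.01 (baseline) and 0.80 ± 0.01 (current), the comparison works out as follows (SimpleNamespace stands in for BenchmarkResult here, numbers are illustrative only):

```python
from types import SimpleNamespace

# Stand-ins for BenchmarkResult; values are illustrative only
baseline = SimpleNamespace(metric_mean=0.90, metric_ci_margin=0.01)
current = SimpleNamespace(metric_mean=0.80, metric_ci_margin=0.01)

current_lower = current.metric_mean - current.metric_ci_margin     # 0.79
baseline_upper = baseline.metric_mean + baseline.metric_ci_margin  # 0.91
regression = (baseline_upper - current_lower) / baseline_upper
print(f"regression: {regression:.1%}")  # ~13.2%, above the 5% tolerance
```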
