
Overview

The metrics module implements proper scoring rules and statistical tests to evaluate probabilistic predictions. These metrics quantify model calibration, discrimination ability, and serial independence.
Proper Scoring Rules: Both Brier Score and Log Loss are “proper” — they are minimized when the forecaster reports their true beliefs. This property is critical for honest calibration.
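Why properness matters can be shown in a few lines: if a forecaster's true belief is p, the expected Brier Score E[(q - o)²] = p·(q - 1)² + (1 - p)·q² is minimized exactly at q = p. An illustrative sweep (not part of the module):

```javascript
// Expected Brier Score when the true probability is p but the forecaster
// reports q: E[(q - o)^2] = p*(q - 1)^2 + (1 - p)*q^2
function expectedBrier(p, q) {
  return p * (q - 1) ** 2 + (1 - p) * q ** 2
}

const p = 0.7                                  // the forecaster's true belief
const candidates = [0.5, 0.6, 0.7, 0.8, 0.9]   // reported probabilities to compare
const scores = candidates.map(q => ({ q, score: expectedBrier(p, q) }))
const best = scores.reduce((a, b) => (b.score < a.score ? b : a))

console.log(best.q) // 0.7 — reporting the true belief minimizes expected loss
```

Hedging toward 0.5 and exaggerating toward 0.9 both cost the same here (expected score 0.25 vs 0.21 for honesty), which is exactly the incentive a proper score is designed to create.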

Brier Score

Measures: Overall prediction error (calibration + discrimination)
Range: 0 (perfect) to 1 (worst)
Baseline: 0.25 (always predicting 50% scores exactly 0.25, regardless of outcomes)

Formula

Brier Score = (1/N) × Σ(p_i - o_i)²

where:
  p_i = predicted probability
  o_i = actual outcome (0 or 1)
  N   = number of predictions

Implementation

/**
 * Brier Score: (1/N) * sum((p_i - o_i)^2)
 * Perfect = 0, Random (always 0.5) = 0.25, Worst = 1.0
 * @param {Array<{predicted: number, outcome: 0|1}>} data
 * @returns {number}
 */
export function brierScore(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    sum += (predicted - outcome) ** 2
  }
  return sum / data.length
}

Interpretation

Brier Score    Quality     Interpretation
0.00 - 0.10    Excellent   Very well calibrated
0.10 - 0.20    Good        Useful signal
0.20 - 0.25    Fair        Barely better than chance
> 0.25         Poor        Worse than random

Example

import { brierScore } from './engine/metrics.js'

const predictions = [
  { predicted: 0.70, outcome: 1 },  // Correct, confident
  { predicted: 0.45, outcome: 0 },  // Correct, uncertain
  { predicted: 0.80, outcome: 0 },  // Wrong, confident (costly)
  { predicted: 0.60, outcome: 1 },  // Correct, moderate
]

const bs = brierScore(predictions)
console.log(`Brier Score: ${bs.toFixed(4)}`)
// Output: Brier Score: 0.2731 (the confident miss pushes it above the 0.25 baseline)

Brier Skill Score (BSS)

Measures: Improvement over a baseline model
Range: -∞ to 1
Interpretation: BSS > 0 means better than baseline; BSS = 1 is perfect

Formula

BSS = 1 - (BS_model / BS_baseline)

Baseline (random 50% guess):
  BS_baseline = 0.25

Implementation

/**
 * Brier Skill Score: 1 - (BS_model / BS_baseline)
 * BSS > 0 means better than baseline. BSS = 1 is perfect.
 * @param {number} bs Model's Brier Score
 * @param {number} [baseline=0.25] Baseline Brier Score (0.25 = random 50%)
 * @returns {number}
 */
export function brierSkillScore(bs, baseline = 0.25) {
  if (baseline === 0) return NaN
  return 1 - (bs / baseline)
}

Example

import { brierScore, brierSkillScore } from './engine/metrics.js'

const bs = 0.1525
const bss = brierSkillScore(bs, 0.25)
console.log(`BSS: ${bss.toFixed(2)}`)
// Output: BSS: 0.39 (39% improvement over random)

Log Loss (Binary Cross-Entropy)

Measures: Prediction confidence penalty (heavily punishes confident mistakes)
Range: 0 (perfect) to ∞
Baseline: 0.693 (ln 2; always predicting 50%)

Formula

Log Loss = -(1/N) × Σ[o×log(p) + (1-o)×log(1-p)]

where:
  p = predicted probability (clamped to [ε, 1-ε] to avoid log(0))
  o = actual outcome (0 or 1)
  ε = 1e-15 (epsilon for numerical stability)

Implementation

const EPSILON = 1e-15

/**
 * Log Loss (Binary Cross-Entropy): -(1/N) * sum[o*log(p) + (1-o)*log(1-p)]
 * Perfect = 0, Random (always 0.5) = 0.693, Worse > 0.693
 * @param {Array<{predicted: number, outcome: 0|1}>} data
 * @returns {number}
 */
export function logLoss(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    const p = Math.max(EPSILON, Math.min(1 - EPSILON, predicted))
    sum += outcome * Math.log(p) + (1 - outcome) * Math.log(1 - p)
  }
  return -sum / data.length
}

Interpretation

Log Loss       Quality     Interpretation
0.00 - 0.30    Excellent   Very confident and accurate
0.30 - 0.60    Good        Solid predictions
0.60 - 0.693   Fair        Barely better than random
> 0.693        Poor        Worse than random
When to use Log Loss vs Brier?
  • Log Loss: Use when confident mistakes are very costly (e.g., risk management)
  • Brier Score: Use when all errors should be weighted equally

Example

import { logLoss } from './engine/metrics.js'

const predictions = [
  { predicted: 0.90, outcome: 1 },  // Very confident, correct
  { predicted: 0.90, outcome: 0 },  // Very confident, WRONG (heavy penalty)
  { predicted: 0.55, outcome: 1 },  // Weak signal, correct
]

const ll = logLoss(predictions)
console.log(`Log Loss: ${ll.toFixed(4)}`)
// Output: Log Loss: 1.0019 (worse than random due to the confident mistake)
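To make the Log Loss vs Brier trade-off concrete, here is an illustrative side-by-side (formulas inlined so the snippet is self-contained; no clamping needed since no prediction is exactly 0 or 1): two cautious 55% calls versus two overconfident 99% calls, each set with one hit and one miss.

```javascript
// Same data, two metrics — the confident miss costs proportionally more
// under Log Loss than under Brier
const calibrated = [{ predicted: 0.55, outcome: 1 }, { predicted: 0.55, outcome: 0 }]
const overconfident = [{ predicted: 0.99, outcome: 1 }, { predicted: 0.99, outcome: 0 }]

const brier = data =>
  data.reduce((s, d) => s + (d.predicted - d.outcome) ** 2, 0) / data.length
const ll = data =>
  -data.reduce(
    (s, d) => s + (d.outcome * Math.log(d.predicted) + (1 - d.outcome) * Math.log(1 - d.predicted)),
    0) / data.length

console.log(brier(calibrated).toFixed(4), brier(overconfident).toFixed(4)) // 0.2525 0.4901
console.log(ll(calibrated).toFixed(4), ll(overconfident).toFixed(4))       // 0.6982 2.3076
```

Brier roughly doubles between the two sets, while Log Loss more than triples — the single p = 0.99 miss dominates its average.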

Murphy Decomposition

Measures: Decomposes the Brier Score into three interpretable components
Formula: BS = Reliability - Resolution + Uncertainty

Components

Component     Meaning                                             Goal
Reliability   How well probabilities match observed frequencies   Minimize (0 = perfect)
Resolution    Ability to discriminate between outcomes            Maximize (higher = better)
Uncertainty   Inherent randomness in outcomes                     Constant (ō × (1 - ō))

Algorithm

  1. Bin predictions into K equal-width bins [0, 1/K), [1/K, 2/K), …, [(K-1)/K, 1]
  2. For each bin k (with n_k items), compute:
    • Average predicted probability: p̄_k = (1/n_k) × Σp_i
    • Average actual outcome: ō_k = (1/n_k) × Σo_i
  3. Compute components (ō = overall base rate, N = total predictions):
    • Reliability = Σ(n_k/N) × (p̄_k - ō_k)²
    • Resolution = Σ(n_k/N) × (ō_k - ō)²
    • Uncertainty = ō × (1 - ō)
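The steps above can be checked by hand on a tiny dataset. With four predictions of 0.70 and outcomes [1, 1, 1, 0], everything lands in a single bin and the decomposition reproduces the Brier Score exactly (illustrative sketch):

```javascript
// Four predictions of 0.70 with outcomes [1, 1, 1, 0] — all in bin [0.7, 0.8)
const data = [
  { predicted: 0.7, outcome: 1 },
  { predicted: 0.7, outcome: 1 },
  { predicted: 0.7, outcome: 1 },
  { predicted: 0.7, outcome: 0 },
]

const N = data.length
const oBar = data.reduce((s, d) => s + d.outcome, 0) / N   // base rate = 0.75

// One occupied bin, so n_k/N = 1, p̄ = 0.70, ō_k = 0.75
const reliability = (0.70 - 0.75) ** 2                     // 0.0025
const resolution = (0.75 - oBar) ** 2                      // 0: one bin cannot discriminate
const uncertainty = oBar * (1 - oBar)                      // 0.1875

// Direct Brier Score: (3 × 0.09 + 0.49) / 4 = 0.19
const bs = data.reduce((s, d) => s + (d.predicted - d.outcome) ** 2, 0) / N

// BS = Reliability - Resolution + Uncertainty: 0.0025 - 0 + 0.1875 = 0.19
console.log(Math.abs(bs - (reliability - resolution + uncertainty)) < 1e-12) // true
```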

Implementation

/**
 * Murphy (1973) decomposition: BS = Reliability - Resolution + Uncertainty
 *
 * Bins predictions into equal-width bins [0, 1/K), [1/K, 2/K), ..., [(K-1)/K, 1]
 * and computes the three components.
 *
 * @param {Array<{predicted: number, outcome: 0|1}>} data
 * @param {number} [numBins=10]
 * @returns {{ reliability: number, resolution: number, uncertainty: number }}
 */
export function murphyDecomposition(data, numBins = 10) {
  if (data.length === 0) return { reliability: NaN, resolution: NaN, uncertainty: NaN }

  const N = data.length

  // Overall base rate
  const oBar = data.reduce((s, d) => s + d.outcome, 0) / N
  const uncertainty = oBar * (1 - oBar)

  // Bin the data
  const bins = Array.from({ length: numBins }, () => ({ sumP: 0, sumO: 0, count: 0 }))

  for (const { predicted, outcome } of data) {
    let binIdx = Math.floor(predicted * numBins)
    if (binIdx >= numBins) binIdx = numBins - 1
    if (binIdx < 0) binIdx = 0
    bins[binIdx].sumP += predicted
    bins[binIdx].sumO += outcome
    bins[binIdx].count += 1
  }

  let reliability = 0
  let resolution = 0

  for (const bin of bins) {
    if (bin.count === 0) continue
    const avgP = bin.sumP / bin.count
    const avgO = bin.sumO / bin.count
    reliability += (bin.count / N) * (avgP - avgO) ** 2
    resolution += (bin.count / N) * (avgO - oBar) ** 2
  }

  return { reliability, resolution, uncertainty }
}

Example

import { murphyDecomposition } from './engine/metrics.js'

const { reliability, resolution, uncertainty } = murphyDecomposition(predictions)

console.log(`Reliability: ${reliability.toFixed(4)} (lower is better)`)
console.log(`Resolution:  ${resolution.toFixed(4)} (higher is better)`)
console.log(`Uncertainty: ${uncertainty.toFixed(4)} (constant)`)

// Example output:
// Reliability: 0.0123 (well calibrated)
// Resolution:  0.0845 (strong discrimination)
// Uncertainty: 0.2499 (base rate ≈ 50%)

// Verify: BS ≈ Reliability - Resolution + Uncertainty
// (exact when predictions within each bin are identical; otherwise binning makes it approximate)
const bs = reliability - resolution + uncertainty
console.log(`Brier Score: ${bs.toFixed(4)}`)

Runs Test (Wald-Wolfowitz)

Measures: Serial independence in a binary sequence
Purpose: Detect patterns/streaks that violate randomness assumption

Concept

A “run” is a maximal sequence of consecutive identical values:
Sequence: 1 1 1 0 0 1 0 1 1
Runs:     [1 1 1] [0 0] [1] [0] [1 1]
Count:    5 runs
Under independence: The number of runs follows an approximately normal distribution with known mean and variance.
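Counting runs is a single pass comparing each element to its neighbor — a minimal sketch:

```javascript
// Count maximal runs of identical values in a binary sequence
function countRuns(seq) {
  let runs = 1
  for (let i = 1; i < seq.length; i++) {
    if (seq[i] !== seq[i - 1]) runs++   // every neighbor change starts a new run
  }
  return runs
}

console.log(countRuns([1, 1, 1, 0, 0, 1, 0, 1, 1])) // 5
```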

Formula

Expected runs: μ = (2×n₁×n₀)/n + 1

Variance: σ² = (2×n₁×n₀×(2×n₁×n₀ - n)) / (n²×(n-1))

Z-score: z = (R - μ) / σ

where:
  R  = observed number of runs
  n₁ = count of 1's
  n₀ = count of 0's
  n  = total length
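Plugging the 9-element sequence from the Concept section into these formulas (n₁ = 6 ones, n₀ = 3 zeros, R = 5 observed runs):

```javascript
// Worked example for the sequence 1 1 1 0 0 1 0 1 1
const n1 = 6, n0 = 3, n = 9

const mu = (2 * n1 * n0) / n + 1                                       // 36/9 + 1 = 5
const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1)) // 36·27/648 = 1.5

const R = 5                                   // runs counted in the Concept section
const z = (R - mu) / Math.sqrt(variance)      // (5 - 5)/1.2247 = 0: no streak evidence
console.log({ mu, variance, z })
```

The observed run count happens to equal its expectation here, so z = 0 and the sequence shows no evidence of clustering or oscillation.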

Implementation

/**
 * Wald-Wolfowitz runs test for serial independence in a binary sequence.
 *
 * A "run" is a maximal sequence of consecutive identical values.
 * Under independence, the number of runs follows an approximately normal
 * distribution with known mean and variance.
 *
 * @param {Array<0|1>} outcomes Binary sequence
 * @returns {{ runs: number, expected: number, zScore: number, pValue: number }}
 */
export function runsTest(outcomes) {
  if (outcomes.length < 2) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }

  const n = outcomes.length
  const n1 = outcomes.filter(o => o === 1).length
  const n0 = n - n1

  if (n1 === 0 || n0 === 0) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }

  // Count runs
  let runs = 1
  for (let i = 1; i < n; i++) {
    if (outcomes[i] !== outcomes[i - 1]) runs++
  }

  // Expected runs and variance under independence
  const expected = (2 * n1 * n0) / n + 1
  const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1))

  if (variance <= 0) return { runs, expected, zScore: NaN, pValue: NaN }

  const zScore = (runs - expected) / Math.sqrt(variance)

  // Two-tailed p-value from standard normal (using the error function approximation)
  const pValue = 2 * (1 - normalCDF(Math.abs(zScore)))

  return { runs, expected: +expected.toFixed(4), zScore: +zScore.toFixed(4), pValue: +pValue.toFixed(4) }
}

// Standard normal CDF via the Zelen & Severo (Abramowitz-Stegun 26.2.17)
// polynomial approximation, |error| < 7.5e-8 — shown here so the listing is
// self-contained; the module's actual helper may differ slightly.
function normalCDF(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x))
  const d = 0.3989422804014327 * Math.exp(-x * x / 2)
  const p = d * t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
  return x >= 0 ? 1 - p : p
}

Interpretation

Z-Score    P-Value   Interpretation
-2 to +2   > 0.05    Pass: sequence appears random
< -2       < 0.05    Too few runs (clustering/streaks)
> +2       < 0.05    Too many runs (oscillation)

Example

import { runsTest } from './engine/metrics.js'

// Random sequence (should pass)
const random = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
const result1 = runsTest(random)
console.log(`Runs: ${result1.runs}, Expected: ${result1.expected}, Z: ${result1.zScore}, p: ${result1.pValue}`)
// Output: Runs: 8, Expected: 6, Z: 1.3416, p: 0.1797 (PASS: appears random)

// Streaky sequence (should fail)
const streaky = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
const result2 = runsTest(streaky)
console.log(`Runs: ${result2.runs}, Expected: ${result2.expected}, Z: ${result2.zScore}, p: ${result2.pValue}`)
// Output: Runs: 2, Expected: 6, Z: -2.6833, p: 0.0073 (FAIL: too few runs, clustering detected)
Cold Streak Detection: If the runs test shows Z < -2, the model is producing streaky predictions rather than independent ones. This is a red flag for risk management.

Band Analysis

Classifies predictions into 5 confidence bands and computes per-band accuracy, Brier score, and mean probability.

Band Definitions

Band   Label                                  Range    Confidence Distance
1      Ruido (noise)                          45-55%   |p - 0.5| < 0.05
2      Senal debil (weak signal)              55-65%   0.05 ≤ |p - 0.5| < 0.15
3      Senal moderada (moderate signal)       65-75%   0.15 ≤ |p - 0.5| < 0.25
4      Senal fuerte (strong signal)           75-85%   0.25 ≤ |p - 0.5| < 0.35
5      Senal muy fuerte (very strong signal)  85%+     |p - 0.5| ≥ 0.35
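A hypothetical classifier for these boundaries (the module does this inline inside bandAnalysis; the helper below is for illustration only):

```javascript
// Map a raw probability to its band via the |p - 0.5| distance above
const BAND_EDGES = [0.05, 0.15, 0.25, 0.35]

function bandFor(probability) {
  const distance = Math.abs(probability - 0.5)
  let band = 1
  for (const edge of BAND_EDGES) {
    if (distance >= edge) band++   // climb one band per threshold crossed
  }
  return band
}

console.log(bandFor(0.52)) // 1 — Ruido
console.log(bandFor(0.72)) // 3 — Senal moderada
console.log(bandFor(0.10)) // 5 — the distance also captures DOWN-side probabilities
```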

Implementation

/**
 * 5-band confidence analysis.
 *
 * Classifies early predictions into 5 bands based on confidence distance
 * from 0.50, then computes per-band count, accuracy, partial Brier, and
 * mean confidence.
 *
 * Band boundaries (mapped from raw probability distance from 0.5):
 *   Band 1: 45-55%  (Ruido)          — |p-0.5| < 0.05 → effective 0.50-0.55
 *   Band 2: 55-65%  (Senal debil)    — |p-0.5| 0.05-0.15 → effective 0.55-0.65
 *   Band 3: 65-75%  (Senal moderada) — |p-0.5| 0.15-0.25 → effective 0.65-0.75
 *   Band 4: 75-85%  (Senal fuerte)   — |p-0.5| 0.25-0.35 → effective 0.75-0.85
 *   Band 5: 85%+    (Senal muy fuerte) — |p-0.5| >= 0.35 → effective 0.85+
 *
 * @param {Array<Object>} records IntervalRecord objects
 * @returns {Array<{band: number, label: string, range: string, count: number, accuracy: string, brier: string, meanProb: string}>}
 */
export function bandAnalysis(records) {
  const bands = [
    { band: 1, label: 'Ruido',           range: '45-55%',  min: 0.00, max: 0.05, items: [] },
    { band: 2, label: 'Senal debil',     range: '55-65%',  min: 0.05, max: 0.15, items: [] },
    { band: 3, label: 'Senal moderada',  range: '65-75%',  min: 0.15, max: 0.25, items: [] },
    { band: 4, label: 'Senal fuerte',    range: '75-85%',  min: 0.25, max: 0.35, items: [] },
    { band: 5, label: 'Senal muy fuerte', range: '85%+',   min: 0.35, max: Infinity, items: [] },
  ]

  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue

    const confidence = Math.abs(ep.probability - 0.5)
    const correct = record.earlyPredictionCorrect

    if (correct == null) continue

    // Find the right band
    for (const b of bands) {
      if (confidence >= b.min && confidence < b.max) {
        b.items.push({ confidence, correct, record })
        break
      }
    }
  }

  // Build scoring data per band for partial Brier calculation
  return bands.map(b => {
    const count = b.items.length
    if (count === 0) {
      return {
        band: b.band, label: b.label, range: b.range,
        count: 0, accuracy: '--', brier: '--', meanProb: '--'
      }
    }

    const correctCount = b.items.filter(i => i.correct).length
    const accuracy = ((correctCount / count) * 100).toFixed(1)

    // Compute partial Brier for this band
    const scoringItems = []
    for (const item of b.items) {
      const ep = item.record.earlyPrediction
      if (ep.direction === 'UP') {
        scoringItems.push({ predicted: ep.probability, outcome: item.record.result === 'UP' ? 1 : 0 })
      } else if (ep.direction === 'DOWN') {
        scoringItems.push({ predicted: 1 - ep.probability, outcome: item.record.result === 'DOWN' ? 1 : 0 })
      }
    }
    const brier = scoringItems.length > 0 ? brierScore(scoringItems).toFixed(4) : '--'

    // Mean effective confidence
    const meanConf = b.items.reduce((s, i) => s + i.confidence, 0) / count
    const meanProb = (0.5 + meanConf).toFixed(2)

    return {
      band: b.band, label: b.label, range: b.range,
      count, accuracy, brier, meanProb
    }
  })
}

Example Output

import { bandAnalysis } from './engine/metrics.js'
import { HistoryStore } from './tracker/history.js'

const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()

const bands = bandAnalysis(records)
console.table(bands)
Band   Label             Range    Count   Accuracy   Brier    Mean Prob
1      Ruido             45-55%   23      52.2%      0.2489   0.52
2      Senal debil       55-65%   45      58.9%      0.2301   0.60
3      Senal moderada    65-75%   38      68.4%      0.1876   0.70
4      Senal fuerte      75-85%   12      75.0%      0.1123   0.80
5      Senal muy fuerte  85%+     3       100.0%     0.0289   0.91
Calibration Check: If accuracy closely matches mean probability in each band, the model is well-calibrated. Large discrepancies indicate miscalibration.
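That check can be automated. A sketch against the row shape bandAnalysis returns — the miscalibratedBands helper and its 10-percentage-point tolerance are illustrative, not part of the module:

```javascript
// Flag bands whose observed accuracy drifts more than `tolerancePct` points
// from the mean stated probability
function miscalibratedBands(bands, tolerancePct = 10) {
  return bands
    .filter(b => b.count > 0 && b.accuracy !== '--')
    .filter(b => Math.abs(parseFloat(b.accuracy) - parseFloat(b.meanProb) * 100) > tolerancePct)
    .map(b => b.label)
}

// Row shape matches what bandAnalysis returns
const rows = [
  { band: 2, label: 'Senal debil', count: 45, accuracy: '58.9', meanProb: '0.60' },
  { band: 4, label: 'Senal fuerte', count: 12, accuracy: '95.0', meanProb: '0.80' },
]
console.log(miscalibratedBands(rows)) // [ 'Senal fuerte' ] — 95% observed vs 80% stated
```

Note that an underconfident band (accuracy well above the stated probability) is still miscalibration: the model is leaving signal on the table.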

Data Conversion

Convert IntervalRecord objects into scoring data format:
/**
 * Convert closed IntervalRecord objects into scoring data format.
 *
 * Uses earlyPrediction.probability as the predicted value.
 * If direction='UP': predicted = probability, outcome = 1 when result='UP'.
 * If direction='DOWN': predicted = 1 - probability, outcome = 1 when result='DOWN'.
 * Skips records where earlyPrediction is null or has abstained flag.
 *
 * @param {Array<Object>} records IntervalRecord objects from history.json
 * @returns {Array<{predicted: number, outcome: 0|1}>}
 */
export function intervalsToScoringData(records) {
  const data = []

  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue

    const direction = ep.direction
    const probability = ep.probability

    if (direction === 'UP') {
      data.push({
        predicted: probability,
        outcome: record.result === 'UP' ? 1 : 0
      })
    } else if (direction === 'DOWN') {
      data.push({
        predicted: 1 - probability,
        outcome: record.result === 'DOWN' ? 1 : 0
      })
    }
  }

  return data
}
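Applied by hand to two hypothetical records, the mapping puts UP and DOWN calls on a single probability scale:

```javascript
// Two hypothetical records, converted by the rules above
const up = { direction: 'UP', probability: 0.70, result: 'UP' }
const down = { direction: 'DOWN', probability: 0.30, result: 'DOWN' }

// UP call: predicted stays as-is, outcome = 1 because the interval resolved UP
const upPoint = { predicted: up.probability, outcome: up.result === 'UP' ? 1 : 0 }
// DOWN call: predicted flips to 1 - probability, outcome = 1 because it resolved DOWN
const downPoint = { predicted: 1 - down.probability, outcome: down.result === 'DOWN' ? 1 : 0 }

console.log(upPoint)   // a correct 70% UP call
console.log(downPoint) // a correct 70% DOWN call, now on the same scale
```

After conversion, both records contribute identically to the Brier Score and Log Loss, which is the point: direction is normalized away so the metrics measure probability quality alone.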

Full Analysis Pipeline

import { HistoryStore } from './tracker/history.js'
import {
  intervalsToScoringData,
  brierScore,
  brierSkillScore,
  logLoss,
  murphyDecomposition,
  runsTest,
  bandAnalysis
} from './engine/metrics.js'

// Load interval history
const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()

// Convert to scoring format
const data = intervalsToScoringData(records)

// Compute all metrics
const bs = brierScore(data)
const bss = brierSkillScore(bs)
const ll = logLoss(data)
const murphy = murphyDecomposition(data)

// Extract outcomes for runs test
const outcomes = data.map(d => d.outcome)
const runs = runsTest(outcomes)

// Band analysis
const bands = bandAnalysis(records)

console.log('=== OVERALL METRICS ===')
console.log(`Brier Score:       ${bs.toFixed(4)}`)
console.log(`Brier Skill Score: ${bss.toFixed(2)}`)
console.log(`Log Loss:          ${ll.toFixed(4)}`)
console.log()
console.log('=== MURPHY DECOMPOSITION ===')
console.log(`Reliability: ${murphy.reliability.toFixed(4)}`)
console.log(`Resolution:  ${murphy.resolution.toFixed(4)}`)
console.log(`Uncertainty: ${murphy.uncertainty.toFixed(4)}`)
console.log()
console.log('=== RUNS TEST ===')
console.log(`Observed:  ${runs.runs} runs`)
console.log(`Expected:  ${runs.expected} runs`)
console.log(`Z-Score:   ${runs.zScore}`)
console.log(`P-Value:   ${runs.pValue} ${runs.pValue < 0.05 ? '(FAIL: not random)' : '(PASS: appears random)'}`)
console.log()
console.log('=== BAND ANALYSIS ===')
console.table(bands)

See Also

  • Interval Tracking — how intervals are tracked and closed
  • History Store — JSON persistence for interval records
  • Logging — structured logs and tick data
