
Overview

The metrics module implements proper scoring rules and statistical tests to evaluate probabilistic predictions. These metrics quantify model calibration, discrimination ability, and serial independence.
Proper Scoring Rules: Both Brier Score and Log Loss are “proper” — they are minimized when the forecaster reports their true beliefs. This property is critical for honest calibration.
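Why properness matters can be shown in a few lines: if a forecaster's true belief is p, the expected Brier Score E[(q - o)²] = p·(q - 1)² + (1 - p)·q² is minimized exactly at q = p. An illustrative sweep (not part of the module):

```javascript
// Expected Brier Score when the true probability is p but the forecaster
// reports q: E[(q - o)^2] = p*(q - 1)^2 + (1 - p)*q^2
function expectedBrier(p, q) {
  return p * (q - 1) ** 2 + (1 - p) * q ** 2
}

const p = 0.7                                  // the forecaster's true belief
const candidates = [0.5, 0.6, 0.7, 0.8, 0.9]   // reported probabilities to compare
const scores = candidates.map(q => ({ q, score: expectedBrier(p, q) }))
const best = scores.reduce((a, b) => (b.score < a.score ? b : a))

console.log(best.q) // 0.7 — reporting the true belief minimizes expected loss
```

Hedging toward 0.5 and exaggerating toward 0.9 both cost the same here (expected score 0.25 vs 0.21 for honesty), which is exactly the incentive a proper score is designed to create.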

Brier Score

Measures: Overall prediction error (calibration + discrimination)
Range: 0 (perfect) to 1 (worst)
Baseline: 0.25 (always predicting 50% scores exactly 0.25, regardless of outcomes)

Formula

Brier Score = (1/N) × Σ(p_i - o_i)²

where:
  p_i = predicted probability
  o_i = actual outcome (0 or 1)
  N   = number of predictions

Implementation

/**
 * Brier Score: (1/N) * sum((p_i - o_i)^2)
 * Perfect = 0, Random (always 0.5) = 0.25, Worst = 1.0
 * @param {Array<{predicted: number, outcome: 0|1}>} data
 * @returns {number}
 */
export function brierScore(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    sum += (predicted - outcome) ** 2
  }
  return sum / data.length
}

Interpretation

Brier Score    Quality     Interpretation
0.00 - 0.10    Excellent   Very well calibrated
0.10 - 0.20    Good        Useful signal
0.20 - 0.25    Fair        Barely better than chance
> 0.25         Poor        Worse than random

Example

import { brierScore } from './engine/metrics.js'

const predictions = [
  { predicted: 0.70, outcome: 1 },  // Correct, confident
  { predicted: 0.45, outcome: 0 },  // Correct, uncertain
  { predicted: 0.80, outcome: 0 },  // Wrong, confident (costly)
  { predicted: 0.60, outcome: 1 },  // Correct, moderate
]

const bs = brierScore(predictions)
console.log(`Brier Score: ${bs.toFixed(4)}`)
// Output: Brier Score: 0.2731 (the confident miss pushes it above the 0.25 baseline)

Brier Skill Score (BSS)

Measures: Improvement over a baseline model
Range: -∞ to 1
Interpretation: BSS > 0 means better than baseline; BSS = 1 is perfect

Formula

BSS = 1 - (BS_model / BS_baseline)

Baseline (random 50% guess):
  BS_baseline = 0.25

Implementation

/**
 * Brier Skill Score: 1 - (BS_model / BS_baseline)
 * BSS > 0 means better than baseline. BSS = 1 is perfect.
 * @param {number} bs Model's Brier Score
 * @param {number} [baseline=0.25] Baseline Brier Score (0.25 = random 50%)
 * @returns {number}
 */
export function brierSkillScore(bs, baseline = 0.25) {
  if (baseline === 0) return NaN
  return 1 - (bs / baseline)
}

Example

import { brierScore, brierSkillScore } from './engine/metrics.js'

const bs = 0.1525
const bss = brierSkillScore(bs, 0.25)
console.log(`BSS: ${bss.toFixed(2)}`)
// Output: BSS: 0.39 (39% improvement over random)

Log Loss (Binary Cross-Entropy)

Measures: Prediction confidence penalty (heavily punishes confident mistakes)
Range: 0 (perfect) to ∞
Baseline: 0.693 (ln 2; always predicting 50%)

Formula

Log Loss = -(1/N) × Σ[o×log(p) + (1-o)×log(1-p)]

where:
  p = predicted probability (clamped to [ε, 1-ε] to avoid log(0))
  o = actual outcome (0 or 1)
  ε = 1e-15 (epsilon for numerical stability)

Implementation

const EPSILON = 1e-15

/**
 * Log Loss (Binary Cross-Entropy): -(1/N) * sum[o*log(p) + (1-o)*log(1-p)]
 * Perfect = 0, Random (always 0.5) = 0.693, Worse > 0.693
 * @param {Array<{predicted: number, outcome: 0|1}>} data
 * @returns {number}
 */
export function logLoss(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    const p = Math.max(EPSILON, Math.min(1 - EPSILON, predicted))
    sum += outcome * Math.log(p) + (1 - outcome) * Math.log(1 - p)
  }
  return -sum / data.length
}

Interpretation

Log Loss       Quality     Interpretation
0.00 - 0.30    Excellent   Very confident and accurate
0.30 - 0.60    Good        Solid predictions
0.60 - 0.693   Fair        Barely better than random
> 0.693        Poor        Worse than random
When to use Log Loss vs Brier?
  • Log Loss: Use when confident mistakes are very costly (e.g., risk management)
  • Brier Score: Use when all errors should be weighted equally

Example

import { logLoss } from './engine/metrics.js'

const predictions = [
  { predicted: 0.90, outcome: 1 },  // Very confident, correct
  { predicted: 0.90, outcome: 0 },  // Very confident, WRONG (heavy penalty)
  { predicted: 0.55, outcome: 1 },  // Weak signal, correct
]

const ll = logLoss(predictions)
console.log(`Log Loss: ${ll.toFixed(4)}`)
// Output: Log Loss: 1.0019 (worse than random due to the confident mistake)
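To make the Log Loss vs Brier trade-off concrete, here is an illustrative side-by-side (formulas inlined so the snippet is self-contained; no clamping needed since no prediction is exactly 0 or 1): two cautious 55% calls versus two overconfident 99% calls, each set with one hit and one miss.

```javascript
// Same data, two metrics — the confident miss costs proportionally more
// under Log Loss than under Brier
const calibrated = [{ predicted: 0.55, outcome: 1 }, { predicted: 0.55, outcome: 0 }]
const overconfident = [{ predicted: 0.99, outcome: 1 }, { predicted: 0.99, outcome: 0 }]

const brier = data =>
  data.reduce((s, d) => s + (d.predicted - d.outcome) ** 2, 0) / data.length
const ll = data =>
  -data.reduce(
    (s, d) => s + (d.outcome * Math.log(d.predicted) + (1 - d.outcome) * Math.log(1 - d.predicted)),
    0) / data.length

console.log(brier(calibrated).toFixed(4), brier(overconfident).toFixed(4)) // 0.2525 0.4901
console.log(ll(calibrated).toFixed(4), ll(overconfident).toFixed(4))       // 0.6982 2.3076
```

Brier roughly doubles between the two sets, while Log Loss more than triples — the single p = 0.99 miss dominates its average.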

Murphy Decomposition

Measures: Decomposes the Brier Score into three interpretable components
Formula: BS = Reliability - Resolution + Uncertainty

Components

Component     Meaning                                             Goal
Reliability   How well probabilities match observed frequencies   Minimize (0 = perfect)
Resolution    Ability to discriminate between outcomes            Maximize (higher = better)
Uncertainty   Inherent randomness in outcomes                     Constant (ō × (1 - ō))

Algorithm

  1. Bin predictions into K equal-width bins [0, 1/K), [1/K, 2/K), …, [(K-1)/K, 1]
  2. For each bin k (with n_k items), compute:
    • Average predicted probability: p̄_k = (1/n_k) × Σp_i
    • Average actual outcome: ō_k = (1/n_k) × Σo_i
  3. Compute components (ō = overall base rate, N = total predictions):
    • Reliability = Σ(n_k/N) × (p̄_k - ō_k)²
    • Resolution = Σ(n_k/N) × (ō_k - ō)²
    • Uncertainty = ō × (1 - ō)
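The steps above can be checked by hand on a tiny dataset. With four predictions of 0.70 and outcomes [1, 1, 1, 0], everything lands in a single bin and the decomposition reproduces the Brier Score exactly (illustrative sketch):

```javascript
// Four predictions of 0.70 with outcomes [1, 1, 1, 0] — all in bin [0.7, 0.8)
const data = [
  { predicted: 0.7, outcome: 1 },
  { predicted: 0.7, outcome: 1 },
  { predicted: 0.7, outcome: 1 },
  { predicted: 0.7, outcome: 0 },
]

const N = data.length
const oBar = data.reduce((s, d) => s + d.outcome, 0) / N   // base rate = 0.75

// One occupied bin, so n_k/N = 1, p̄ = 0.70, ō_k = 0.75
const reliability = (0.70 - 0.75) ** 2                     // 0.0025
const resolution = (0.75 - oBar) ** 2                      // 0: one bin cannot discriminate
const uncertainty = oBar * (1 - oBar)                      // 0.1875

// Direct Brier Score: (3 × 0.09 + 0.49) / 4 = 0.19
const bs = data.reduce((s, d) => s + (d.predicted - d.outcome) ** 2, 0) / N

// BS = Reliability - Resolution + Uncertainty: 0.0025 - 0 + 0.1875 = 0.19
console.log(Math.abs(bs - (reliability - resolution + uncertainty)) < 1e-12) // true
```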

Implementation

/**
 * Murphy (1973) decomposition: BS = Reliability - Resolution + Uncertainty
 *
 * Bins predictions into equal-width bins [0, 1/K), [1/K, 2/K), ..., [(K-1)/K, 1]
 * and computes the three components.
 *
 * @param {Array<{predicted: number, outcome: 0|1}>} data
 * @param {number} [numBins=10]
 * @returns {{ reliability: number, resolution: number, uncertainty: number }}
 */
export function murphyDecomposition(data, numBins = 10) {
  if (data.length === 0) return { reliability: NaN, resolution: NaN, uncertainty: NaN }

  const N = data.length

  // Overall base rate
  const oBar = data.reduce((s, d) => s + d.outcome, 0) / N
  const uncertainty = oBar * (1 - oBar)

  // Bin the data
  const bins = Array.from({ length: numBins }, () => ({ sumP: 0, sumO: 0, count: 0 }))

  for (const { predicted, outcome } of data) {
    let binIdx = Math.floor(predicted * numBins)
    if (binIdx >= numBins) binIdx = numBins - 1
    if (binIdx < 0) binIdx = 0
    bins[binIdx].sumP += predicted
    bins[binIdx].sumO += outcome
    bins[binIdx].count += 1
  }

  let reliability = 0
  let resolution = 0

  for (const bin of bins) {
    if (bin.count === 0) continue
    const avgP = bin.sumP / bin.count
    const avgO = bin.sumO / bin.count
    reliability += (bin.count / N) * (avgP - avgO) ** 2
    resolution += (bin.count / N) * (avgO - oBar) ** 2
  }

  return { reliability, resolution, uncertainty }
}

Example

import { murphyDecomposition } from './engine/metrics.js'

const { reliability, resolution, uncertainty } = murphyDecomposition(predictions)

console.log(`Reliability: ${reliability.toFixed(4)} (lower is better)`)
console.log(`Resolution:  ${resolution.toFixed(4)} (higher is better)`)
console.log(`Uncertainty: ${uncertainty.toFixed(4)} (constant)`)

// Example output:
// Reliability: 0.0123 (well calibrated)
// Resolution:  0.0845 (strong discrimination)
// Uncertainty: 0.2499 (base rate ≈ 50%)

// Verify: BS ≈ Reliability - Resolution + Uncertainty
// (exact when predictions within each bin are identical; otherwise binning makes it approximate)
const bs = reliability - resolution + uncertainty
console.log(`Brier Score: ${bs.toFixed(4)}`)

Runs Test (Wald-Wolfowitz)

Measures: Serial independence in a binary sequence
Purpose: Detect patterns/streaks that violate randomness assumption

Concept

A “run” is a maximal sequence of consecutive identical values:
Sequence: 1 1 1 0 0 1 0 1 1
Runs:     [1 1 1] [0 0] [1] [0] [1 1]
Count:    5 runs
Under independence: The number of runs follows an approximately normal distribution with known mean and variance.
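Counting runs is a single pass comparing each element to its neighbor — a minimal sketch:

```javascript
// Count maximal runs of identical values in a binary sequence
function countRuns(seq) {
  let runs = 1
  for (let i = 1; i < seq.length; i++) {
    if (seq[i] !== seq[i - 1]) runs++   // every neighbor change starts a new run
  }
  return runs
}

console.log(countRuns([1, 1, 1, 0, 0, 1, 0, 1, 1])) // 5
```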

Formula

Expected runs: μ = (2×n₁×n₀)/n + 1

Variance: σ² = (2×n₁×n₀×(2×n₁×n₀ - n)) / (n²×(n-1))

Z-score: z = (R - μ) / σ

where:
  R  = observed number of runs
  n₁ = count of 1's
  n₀ = count of 0's
  n  = total length
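Plugging the 9-element sequence from the Concept section into these formulas (n₁ = 6 ones, n₀ = 3 zeros, R = 5 observed runs):

```javascript
// Worked example for the sequence 1 1 1 0 0 1 0 1 1
const n1 = 6, n0 = 3, n = 9

const mu = (2 * n1 * n0) / n + 1                                       // 36/9 + 1 = 5
const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1)) // 36·27/648 = 1.5

const R = 5                                   // runs counted in the Concept section
const z = (R - mu) / Math.sqrt(variance)      // (5 - 5)/1.2247 = 0: no streak evidence
console.log({ mu, variance, z })
```

The observed run count happens to equal its expectation here, so z = 0 and the sequence shows no evidence of clustering or oscillation.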

Implementation

/**
 * Wald-Wolfowitz runs test for serial independence in a binary sequence.
 *
 * A "run" is a maximal sequence of consecutive identical values.
 * Under independence, the number of runs follows an approximately normal
 * distribution with known mean and variance.
 *
 * @param {Array<0|1>} outcomes Binary sequence
 * @returns {{ runs: number, expected: number, zScore: number, pValue: number }}
 */
export function runsTest(outcomes) {
  if (outcomes.length < 2) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }

  const n = outcomes.length
  const n1 = outcomes.filter(o => o === 1).length
  const n0 = n - n1

  if (n1 === 0 || n0 === 0) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }

  // Count runs
  let runs = 1
  for (let i = 1; i < n; i++) {
    if (outcomes[i] !== outcomes[i - 1]) runs++
  }

  // Expected runs and variance under independence
  const expected = (2 * n1 * n0) / n + 1
  const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1))

  if (variance <= 0) return { runs, expected, zScore: NaN, pValue: NaN }

  const zScore = (runs - expected) / Math.sqrt(variance)

  // Two-tailed p-value from standard normal (using the error function approximation)
  const pValue = 2 * (1 - normalCDF(Math.abs(zScore)))

  return { runs, expected: +expected.toFixed(4), zScore: +zScore.toFixed(4), pValue: +pValue.toFixed(4) }
}

// Standard normal CDF via the Zelen & Severo (Abramowitz-Stegun 26.2.17)
// polynomial approximation, |error| < 7.5e-8 — shown here so the listing is
// self-contained; the module's actual helper may differ slightly.
function normalCDF(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x))
  const d = 0.3989422804014327 * Math.exp(-x * x / 2)
  const p = d * t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
  return x >= 0 ? 1 - p : p
}

Interpretation

Z-Score    P-Value   Interpretation
-2 to +2   > 0.05    Pass: sequence appears random
< -2       < 0.05    Too few runs (clustering/streaks)
> +2       < 0.05    Too many runs (oscillation)

Example

import { runsTest } from './engine/metrics.js'

// Random sequence (should pass)
const random = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
const result1 = runsTest(random)
console.log(`Runs: ${result1.runs}, Expected: ${result1.expected}, Z: ${result1.zScore}, p: ${result1.pValue}`)
// Output: Runs: 8, Expected: 6, Z: 1.3416, p: 0.1797 (PASS: appears random)

// Streaky sequence (should fail)
const streaky = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
const result2 = runsTest(streaky)
console.log(`Runs: ${result2.runs}, Expected: ${result2.expected}, Z: ${result2.zScore}, p: ${result2.pValue}`)
// Output: Runs: 2, Expected: 6, Z: -2.6833, p: 0.0073 (FAIL: too few runs, clustering detected)
Cold Streak Detection: If the runs test shows Z < -2, the model is producing streaky predictions rather than independent ones. This is a red flag for risk management.

Band Analysis

Classifies predictions into 5 confidence bands and computes per-band accuracy, Brier score, and mean probability.

Band Definitions

Band   Label                                  Range    Confidence Distance
1      Ruido (noise)                          45-55%   |p - 0.5| < 0.05
2      Senal debil (weak signal)              55-65%   0.05 ≤ |p - 0.5| < 0.15
3      Senal moderada (moderate signal)       65-75%   0.15 ≤ |p - 0.5| < 0.25
4      Senal fuerte (strong signal)           75-85%   0.25 ≤ |p - 0.5| < 0.35
5      Senal muy fuerte (very strong signal)  85%+     |p - 0.5| ≥ 0.35
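A hypothetical classifier for these boundaries (the module does this inline inside bandAnalysis; the helper below is for illustration only):

```javascript
// Map a raw probability to its band via the |p - 0.5| distance above
const BAND_EDGES = [0.05, 0.15, 0.25, 0.35]

function bandFor(probability) {
  const distance = Math.abs(probability - 0.5)
  let band = 1
  for (const edge of BAND_EDGES) {
    if (distance >= edge) band++   // climb one band per threshold crossed
  }
  return band
}

console.log(bandFor(0.52)) // 1 — Ruido
console.log(bandFor(0.72)) // 3 — Senal moderada
console.log(bandFor(0.10)) // 5 — the distance also captures DOWN-side probabilities
```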

Implementation

/**
 * 5-band confidence analysis.
 *
 * Classifies early predictions into 5 bands based on confidence distance
 * from 0.50, then computes per-band count, accuracy, partial Brier, and
 * mean confidence.
 *
 * Band boundaries (mapped from raw probability distance from 0.5):
 *   Band 1: 45-55%  (Ruido)          — |p-0.5| < 0.05 → effective 0.50-0.55
 *   Band 2: 55-65%  (Senal debil)    — |p-0.5| 0.05-0.15 → effective 0.55-0.65
 *   Band 3: 65-75%  (Senal moderada) — |p-0.5| 0.15-0.25 → effective 0.65-0.75
 *   Band 4: 75-85%  (Senal fuerte)   — |p-0.5| 0.25-0.35 → effective 0.75-0.85
 *   Band 5: 85%+    (Senal muy fuerte) — |p-0.5| >= 0.35 → effective 0.85+
 *
 * @param {Array<Object>} records IntervalRecord objects
 * @returns {Array<{band: number, label: string, range: string, count: number, accuracy: string, brier: string, meanProb: string}>}
 */
export function bandAnalysis(records) {
  const bands = [
    { band: 1, label: 'Ruido',           range: '45-55%',  min: 0.00, max: 0.05, items: [] },
    { band: 2, label: 'Senal debil',     range: '55-65%',  min: 0.05, max: 0.15, items: [] },
    { band: 3, label: 'Senal moderada',  range: '65-75%',  min: 0.15, max: 0.25, items: [] },
    { band: 4, label: 'Senal fuerte',    range: '75-85%',  min: 0.25, max: 0.35, items: [] },
    { band: 5, label: 'Senal muy fuerte', range: '85%+',   min: 0.35, max: Infinity, items: [] },
  ]

  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue

    const confidence = Math.abs(ep.probability - 0.5)
    const correct = record.earlyPredictionCorrect

    if (correct == null) continue

    // Find the right band
    for (const b of bands) {
      if (confidence >= b.min && confidence < b.max) {
        b.items.push({ confidence, correct, record })
        break
      }
    }
  }

  // Build scoring data per band for partial Brier calculation
  return bands.map(b => {
    const count = b.items.length
    if (count === 0) {
      return {
        band: b.band, label: b.label, range: b.range,
        count: 0, accuracy: '--', brier: '--', meanProb: '--'
      }
    }

    const correctCount = b.items.filter(i => i.correct).length
    const accuracy = ((correctCount / count) * 100).toFixed(1)

    // Compute partial Brier for this band
    const scoringItems = []
    for (const item of b.items) {
      const ep = item.record.earlyPrediction
      if (ep.direction === 'UP') {
        scoringItems.push({ predicted: ep.probability, outcome: item.record.result === 'UP' ? 1 : 0 })
      } else if (ep.direction === 'DOWN') {
        scoringItems.push({ predicted: 1 - ep.probability, outcome: item.record.result === 'DOWN' ? 1 : 0 })
      }
    }
    const brier = scoringItems.length > 0 ? brierScore(scoringItems).toFixed(4) : '--'

    // Mean effective confidence
    const meanConf = b.items.reduce((s, i) => s + i.confidence, 0) / count
    const meanProb = (0.5 + meanConf).toFixed(2)

    return {
      band: b.band, label: b.label, range: b.range,
      count, accuracy, brier, meanProb
    }
  })
}

Example Output

import { bandAnalysis } from './engine/metrics.js'
import { HistoryStore } from './tracker/history.js'

const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()

const bands = bandAnalysis(records)
console.table(bands)
Band   Label             Range    Count   Accuracy   Brier    Mean Prob
1      Ruido             45-55%   23      52.2%      0.2489   0.52
2      Senal debil       55-65%   45      58.9%      0.2301   0.60
3      Senal moderada    65-75%   38      68.4%      0.1876   0.70
4      Senal fuerte      75-85%   12      75.0%      0.1123   0.80
5      Senal muy fuerte  85%+     3       100.0%     0.0289   0.91
Calibration Check: If accuracy closely matches mean probability in each band, the model is well-calibrated. Large discrepancies indicate miscalibration.
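That check can be automated. A sketch against the row shape bandAnalysis returns — the miscalibratedBands helper and its 10-percentage-point tolerance are illustrative, not part of the module:

```javascript
// Flag bands whose observed accuracy drifts more than `tolerancePct` points
// from the mean stated probability
function miscalibratedBands(bands, tolerancePct = 10) {
  return bands
    .filter(b => b.count > 0 && b.accuracy !== '--')
    .filter(b => Math.abs(parseFloat(b.accuracy) - parseFloat(b.meanProb) * 100) > tolerancePct)
    .map(b => b.label)
}

// Row shape matches what bandAnalysis returns
const rows = [
  { band: 2, label: 'Senal debil', count: 45, accuracy: '58.9', meanProb: '0.60' },
  { band: 4, label: 'Senal fuerte', count: 12, accuracy: '95.0', meanProb: '0.80' },
]
console.log(miscalibratedBands(rows)) // [ 'Senal fuerte' ] — 95% observed vs 80% stated
```

Note that an underconfident band (accuracy well above the stated probability) is still miscalibration: the model is leaving signal on the table.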

Data Conversion

Convert IntervalRecord objects into scoring data format:
/**
 * Convert closed IntervalRecord objects into scoring data format.
 *
 * Uses earlyPrediction.probability as the predicted value.
 * If direction='UP': predicted = probability, outcome = 1 when result='UP'.
 * If direction='DOWN': predicted = 1 - probability, outcome = 1 when result='DOWN'.
 * Skips records where earlyPrediction is null or has abstained flag.
 *
 * @param {Array<Object>} records IntervalRecord objects from history.json
 * @returns {Array<{predicted: number, outcome: 0|1}>}
 */
export function intervalsToScoringData(records) {
  const data = []

  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue

    const direction = ep.direction
    const probability = ep.probability

    if (direction === 'UP') {
      data.push({
        predicted: probability,
        outcome: record.result === 'UP' ? 1 : 0
      })
    } else if (direction === 'DOWN') {
      data.push({
        predicted: 1 - probability,
        outcome: record.result === 'DOWN' ? 1 : 0
      })
    }
  }

  return data
}
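Applied by hand to two hypothetical records, the mapping puts UP and DOWN calls on a single probability scale:

```javascript
// Two hypothetical records, converted by the rules above
const up = { direction: 'UP', probability: 0.70, result: 'UP' }
const down = { direction: 'DOWN', probability: 0.30, result: 'DOWN' }

// UP call: predicted stays as-is, outcome = 1 because the interval resolved UP
const upPoint = { predicted: up.probability, outcome: up.result === 'UP' ? 1 : 0 }
// DOWN call: predicted flips to 1 - probability, outcome = 1 because it resolved DOWN
const downPoint = { predicted: 1 - down.probability, outcome: down.result === 'DOWN' ? 1 : 0 }

console.log(upPoint)   // a correct 70% UP call
console.log(downPoint) // a correct 70% DOWN call, now on the same scale
```

After conversion, both records contribute identically to the Brier Score and Log Loss, which is the point: direction is normalized away so the metrics measure probability quality alone.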

Full Analysis Pipeline

import { HistoryStore } from './tracker/history.js'
import {
  intervalsToScoringData,
  brierScore,
  brierSkillScore,
  logLoss,
  murphyDecomposition,
  runsTest,
  bandAnalysis
} from './engine/metrics.js'

// Load interval history
const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()

// Convert to scoring format
const data = intervalsToScoringData(records)

// Compute all metrics
const bs = brierScore(data)
const bss = brierSkillScore(bs)
const ll = logLoss(data)
const murphy = murphyDecomposition(data)

// Extract outcomes for runs test
const outcomes = data.map(d => d.outcome)
const runs = runsTest(outcomes)

// Band analysis
const bands = bandAnalysis(records)

console.log('=== OVERALL METRICS ===')
console.log(`Brier Score:       ${bs.toFixed(4)}`)
console.log(`Brier Skill Score: ${bss.toFixed(2)}`)
console.log(`Log Loss:          ${ll.toFixed(4)}`)
console.log()
console.log('=== MURPHY DECOMPOSITION ===')
console.log(`Reliability: ${murphy.reliability.toFixed(4)}`)
console.log(`Resolution:  ${murphy.resolution.toFixed(4)}`)
console.log(`Uncertainty: ${murphy.uncertainty.toFixed(4)}`)
console.log()
console.log('=== RUNS TEST ===')
console.log(`Observed:  ${runs.runs} runs`)
console.log(`Expected:  ${runs.expected} runs`)
console.log(`Z-Score:   ${runs.zScore}`)
console.log(`P-Value:   ${runs.pValue} ${runs.pValue < 0.05 ? '(FAIL: not random)' : '(PASS: appears random)'}`)
console.log()
console.log('=== BAND ANALYSIS ===')
console.table(bands)

See Also

  • Interval Tracking — how intervals are tracked and closed
  • History Store — JSON persistence for interval records
  • Logging — structured logs and tick data
