Overview
The metrics module implements proper scoring rules and statistical tests to evaluate probabilistic predictions. These metrics quantify model calibration, discrimination ability, and serial independence.
Proper Scoring Rules: Both Brier Score and Log Loss are "proper" scoring rules: their expected value is minimized when the forecaster reports their true beliefs. This property is critical for honest calibration.
Brier Score
Measures: Overall prediction error (calibration + discrimination)
Range: 0 (perfect) to 1 (worst)
Baseline: 0.25 (random coin flip, always predict 50%)
Brier Score = (1/N) × Σ(p_i - o_i)²
where:
p_i = predicted probability
o_i = actual outcome (0 or 1)
N = number of predictions
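Before looking at the implementation, here is the formula worked by hand on two predictions (a throwaway sketch, not part of metrics.js):

```javascript
// Two predictions: p = [0.7, 0.2], outcomes o = [1, 0].
// Brier Score = ((0.7 - 1)^2 + (0.2 - 0)^2) / 2 = (0.09 + 0.04) / 2
const sq = (x) => x * x
const bs = (sq(0.7 - 1) + sq(0.2 - 0)) / 2
console.log(bs.toFixed(3)) // "0.065"
```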
Implementation
src/engine/metrics.js (lines 8-21)
/**
* Brier Score: (1/N) * sum((p_i - o_i)^2)
* Perfect = 0, Random (always 0.5) = 0.25, Worst = 1.0
* @param {Array<{predicted: number, outcome: 0|1}>} data
* @returns {number}
*/
export function brierScore(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    sum += (predicted - outcome) ** 2
  }
  return sum / data.length
}
Interpretation
Brier Score   Quality    Interpretation
0.00 - 0.10   Excellent  Very well calibrated
0.10 - 0.20   Good       Useful signal
0.20 - 0.25   Fair       Barely better than chance
> 0.25        Poor       Worse than random
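The thresholds above can be encoded as a small lookup helper (hypothetical, not part of metrics.js; the boundary handling here is one reasonable choice):

```javascript
// Map a Brier Score to the quality labels from the table above.
function brierQuality(bs) {
  if (bs <= 0.10) return 'Excellent'
  if (bs <= 0.20) return 'Good'
  if (bs <= 0.25) return 'Fair'
  return 'Poor'
}

console.log(brierQuality(0.15)) // 'Good'
```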
Example
import { brierScore } from './engine/metrics.js'

const predictions = [
  { predicted: 0.70, outcome: 1 }, // Correct, confident
  { predicted: 0.45, outcome: 0 }, // Correct, uncertain
  { predicted: 0.80, outcome: 0 }, // Wrong, confident (costly)
  { predicted: 0.60, outcome: 1 }, // Correct, moderate
]

const bs = brierScore(predictions)
console.log(`Brier Score: ${bs.toFixed(4)}`)
// Output: Brier Score: 0.2731 (the confident mistake dominates)
Brier Skill Score (BSS)
Measures: Improvement over a baseline model
Range: -∞ to 1
Interpretation: BSS > 0 means better than baseline; BSS = 1 is perfect
BSS = 1 - (BS_model / BS_baseline)
Baseline (random 50% guess):
BS_baseline = 0.25
Implementation
src/engine/metrics.js (lines 39-49)
/**
* Brier Skill Score: 1 - (BS_model / BS_baseline)
* BSS > 0 means better than baseline. BSS = 1 is perfect.
* @param {number} bs Model's Brier Score
* @param {number} [baseline = 0.25] Baseline Brier Score (0.25 = random 50%)
* @returns {number}
*/
export function brierSkillScore(bs, baseline = 0.25) {
  if (baseline === 0) return NaN
  return 1 - (bs / baseline)
}
Example
import { brierSkillScore } from './engine/metrics.js'

const bs = 0.1525 // a model's Brier Score
const bss = brierSkillScore(bs, 0.25)
console.log(`BSS: ${bss.toFixed(2)}`)
// Output: BSS: 0.39 (39% improvement over random)
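The default baseline of 0.25 assumes a 50% base rate. When outcomes are skewed, a common alternative (an extension, not part of metrics.js) is the "climatology" baseline: a forecaster that always predicts the observed base rate oBar scores BS_baseline = oBar × (1 - oBar), which can be passed as the second argument to brierSkillScore:

```javascript
// Brier Score of the climatology forecaster (always predict the base rate).
// For a base rate oBar, that score is oBar * (1 - oBar).
function climatologyBaseline(outcomes) {
  const oBar = outcomes.reduce((s, o) => s + o, 0) / outcomes.length
  return oBar * (1 - oBar)
}

console.log(climatologyBaseline([1, 1, 1, 0])) // 0.1875
```

With a 75% base rate the baseline drops to 0.1875, so a model must beat a tougher bar than the 0.25 coin-flip default.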
Log Loss (Binary Cross-Entropy)
Measures: Prediction confidence penalty (heavily punishes confident mistakes)
Range: 0 (perfect) to ∞
Baseline: 0.693 (ln 2, from always predicting 50%)
Log Loss = -(1/N) × Σ[o×log(p) + (1-o)×log(1-p)]
where:
p = predicted probability (clamped to [ε, 1-ε] to avoid log(0))
o = actual outcome (0 or 1)
ε = 1e-15 (epsilon for numerical stability)
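Without the clamp, a maximally confident wrong prediction would contribute an infinite penalty; the epsilon bounds the per-sample loss. A quick sketch of the effect:

```javascript
// An unclamped wrong prediction at p = 1 would contribute log(0):
console.log(Math.log(0)) // -Infinity

// With the clamp, the per-sample penalty is bounded near -ln(1e-15):
const EPSILON = 1e-15
const p = Math.max(EPSILON, Math.min(1 - EPSILON, 1)) // clamp p = 1
console.log(-Math.log(1 - p)) // ≈ 34.5 instead of Infinity
```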
Implementation
src/engine/metrics.js (lines 23-37)
const EPSILON = 1e-15

/**
 * Log Loss (Binary Cross-Entropy): -(1/N) * sum[o*log(p) + (1-o)*log(1-p)]
 * Perfect = 0, Random (always 0.5) = 0.693, Worse > 0.693
 * @param {Array<{predicted: number, outcome: 0|1}>} data
 * @returns {number}
 */
export function logLoss(data) {
  if (data.length === 0) return NaN
  let sum = 0
  for (const { predicted, outcome } of data) {
    const p = Math.max(EPSILON, Math.min(1 - EPSILON, predicted))
    sum += outcome * Math.log(p) + (1 - outcome) * Math.log(1 - p)
  }
  return -sum / data.length
}
Interpretation
Log Loss      Quality    Interpretation
0.00 - 0.30   Excellent  Very confident and accurate
0.30 - 0.60   Good       Solid predictions
0.60 - 0.693  Fair       Barely better than random
> 0.693       Poor       Worse than random
When to use Log Loss vs Brier?
Log Loss: Use when confident mistakes are very costly (e.g., risk management)
Brier Score: Use when all errors should be weighted equally
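The difference is easy to see on a single wrong prediction at rising confidence (both functions condensed from metrics.js above, with the empty-input guard omitted):

```javascript
const EPSILON = 1e-15

function brierScore(data) {
  let sum = 0
  for (const { predicted, outcome } of data) sum += (predicted - outcome) ** 2
  return sum / data.length
}

function logLoss(data) {
  let sum = 0
  for (const { predicted, outcome } of data) {
    const p = Math.max(EPSILON, Math.min(1 - EPSILON, predicted))
    sum += outcome * Math.log(p) + (1 - outcome) * Math.log(1 - p)
  }
  return -sum / data.length
}

// One wrong prediction (outcome 0) at increasing confidence:
for (const p of [0.6, 0.8, 0.95, 0.99]) {
  const d = [{ predicted: p, outcome: 0 }]
  console.log(p, brierScore(d).toFixed(3), logLoss(d).toFixed(3))
}
// Brier grows quadratically (0.360 → 0.980) while
// Log Loss grows without bound (0.916 → 4.605)
```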
Example
import { logLoss } from './engine/metrics.js'

const predictions = [
  { predicted: 0.90, outcome: 1 }, // Very confident, correct
  { predicted: 0.90, outcome: 0 }, // Very confident, WRONG (heavy penalty)
  { predicted: 0.55, outcome: 1 }, // Weak signal, correct
]

const ll = logLoss(predictions)
console.log(`Log Loss: ${ll.toFixed(4)}`)
// Output: Log Loss: 1.0019 (worse than random due to the confident mistake)
Murphy Decomposition
Measures: Decomposes the Brier Score into three interpretable components
Formula: BS = Reliability - Resolution + Uncertainty
Components
Component    Meaning                                            Goal
Reliability  How well probabilities match observed frequencies  Minimize (0 = perfect)
Resolution   Ability to discriminate between outcomes           Maximize (higher = better)
Uncertainty  Inherent randomness in outcomes                    Constant (oBar × (1 - oBar))
Algorithm
Bin predictions into K equal-width bins [0, 1/K), [1/K, 2/K), …, [(K-1)/K, 1]
For each bin, compute:
Average predicted probability: p̄ = (1/n) × Σp_i
Average actual outcome: ō = (1/n) × Σo_i
Compute components:
Reliability = Σ(n_k/N) × (p̄_k - ō_k)²
Resolution = Σ(n_k/N) × (ō_k - ōBar)²
Uncertainty = ōBar × (1 - ōBar)
Implementation
src/engine/metrics.js (lines 51-94)
/**
* Murphy (1973) decomposition: BS = Reliability - Resolution + Uncertainty
*
* Bins predictions into equal-width bins [0, 1/K), [1/K, 2/K), ..., [(K-1)/K, 1]
* and computes the three components.
*
* @param {Array<{predicted: number, outcome: 0|1}>} data
* @param {number} [numBins = 10]
* @returns {{ reliability: number, resolution: number, uncertainty: number }}
*/
export function murphyDecomposition(data, numBins = 10) {
  if (data.length === 0) return { reliability: NaN, resolution: NaN, uncertainty: NaN }
  const N = data.length

  // Overall base rate
  const oBar = data.reduce((s, d) => s + d.outcome, 0) / N
  const uncertainty = oBar * (1 - oBar)

  // Bin the data
  const bins = Array.from({ length: numBins }, () => ({ sumP: 0, sumO: 0, count: 0 }))
  for (const { predicted, outcome } of data) {
    let binIdx = Math.floor(predicted * numBins)
    if (binIdx >= numBins) binIdx = numBins - 1
    if (binIdx < 0) binIdx = 0
    bins[binIdx].sumP += predicted
    bins[binIdx].sumO += outcome
    bins[binIdx].count += 1
  }

  let reliability = 0
  let resolution = 0
  for (const bin of bins) {
    if (bin.count === 0) continue
    const avgP = bin.sumP / bin.count
    const avgO = bin.sumO / bin.count
    reliability += (bin.count / N) * (avgP - avgO) ** 2
    resolution += (bin.count / N) * (avgO - oBar) ** 2
  }

  return { reliability, resolution, uncertainty }
}
Example
import { murphyDecomposition } from './engine/metrics.js'

const { reliability, resolution, uncertainty } = murphyDecomposition(predictions)
console.log(`Reliability: ${reliability.toFixed(4)} (lower is better)`)
console.log(`Resolution: ${resolution.toFixed(4)} (higher is better)`)
console.log(`Uncertainty: ${uncertainty.toFixed(4)} (constant)`)
// Example output:
// Reliability: 0.0123 (well calibrated)
// Resolution: 0.0845 (strong discrimination)
// Uncertainty: 0.2499 (base rate ≈ 50%)

// Verify: BS = Reliability - Resolution + Uncertainty
const bs = reliability - resolution + uncertainty
console.log(`Brier Score: ${bs.toFixed(4)}`)
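One caveat: with binning, the identity BS = Reliability - Resolution + Uncertainty is exact only when each bin contains a single distinct predicted value; mixed predictions inside a bin leave a small within-bin residual. A self-contained check on data where the identity is exact (functions condensed from metrics.js, guards omitted):

```javascript
function brierScore(data) {
  let sum = 0
  for (const { predicted, outcome } of data) sum += (predicted - outcome) ** 2
  return sum / data.length
}

function murphyDecomposition(data, numBins = 10) {
  const N = data.length
  const oBar = data.reduce((s, d) => s + d.outcome, 0) / N
  const uncertainty = oBar * (1 - oBar)
  const bins = Array.from({ length: numBins }, () => ({ sumP: 0, sumO: 0, count: 0 }))
  for (const { predicted, outcome } of data) {
    const idx = Math.min(numBins - 1, Math.max(0, Math.floor(predicted * numBins)))
    bins[idx].sumP += predicted
    bins[idx].sumO += outcome
    bins[idx].count += 1
  }
  let reliability = 0
  let resolution = 0
  for (const bin of bins) {
    if (bin.count === 0) continue
    reliability += (bin.count / N) * (bin.sumP / bin.count - bin.sumO / bin.count) ** 2
    resolution += (bin.count / N) * (bin.sumO / bin.count - oBar) ** 2
  }
  return { reliability, resolution, uncertainty }
}

// Predictions only at 0.25 and 0.75, so each bin holds one distinct value.
const data = [
  { predicted: 0.25, outcome: 0 }, { predicted: 0.25, outcome: 0 },
  { predicted: 0.25, outcome: 0 }, { predicted: 0.25, outcome: 1 },
  { predicted: 0.75, outcome: 1 }, { predicted: 0.75, outcome: 1 },
  { predicted: 0.75, outcome: 1 }, { predicted: 0.75, outcome: 0 },
]
const { reliability, resolution, uncertainty } = murphyDecomposition(data)
console.log(brierScore(data))                       // 0.1875
console.log(reliability - resolution + uncertainty) // 0.1875
```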
Runs Test (Wald-Wolfowitz)
Measures: Serial independence in a binary sequence
Purpose: Detect patterns/streaks that violate randomness assumption
Concept
A “run” is a maximal sequence of consecutive identical values:
Sequence: 1 1 1 0 0 1 0 1 1
Runs:     [1 1 1] [0 0] [1] [0] [1 1]
Count:    5 runs
Under independence: The number of runs follows an approximately normal distribution with known mean and variance.
Expected runs: μ = (2×n₁×n₀)/n + 1
Variance: σ² = (2×n₁×n₀×(2×n₁×n₀ - n)) / (n²×(n-1))
Z-score: z = (R - μ) / σ
where:
R = observed number of runs
n₁ = count of 1's
n₀ = count of 0's
n = total length
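The formulas worked by hand for a balanced sequence with 5 ones and 5 zeros (a throwaway sketch, matching the examples later in this section):

```javascript
const n1 = 5
const n0 = 5
const n = n1 + n0
const mu = (2 * n1 * n0) / n + 1 // expected number of runs
const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1))
console.log(mu)                  // 6
console.log(variance.toFixed(3)) // "2.222"
```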
Implementation
src/engine/metrics.js (lines 96-133)
/**
* Wald-Wolfowitz runs test for serial independence in a binary sequence.
*
* A "run" is a maximal sequence of consecutive identical values.
* Under independence, the number of runs follows an approximately normal
* distribution with known mean and variance.
*
* @param {Array<0|1>} outcomes Binary sequence
* @returns {{ runs: number, expected: number, zScore: number, pValue: number }}
*/
export function runsTest(outcomes) {
  if (outcomes.length < 2) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }
  const n = outcomes.length
  const n1 = outcomes.filter(o => o === 1).length
  const n0 = n - n1
  if (n1 === 0 || n0 === 0) return { runs: NaN, expected: NaN, zScore: NaN, pValue: NaN }

  // Count runs
  let runs = 1
  for (let i = 1; i < n; i++) {
    if (outcomes[i] !== outcomes[i - 1]) runs++
  }

  // Expected runs and variance under independence
  const expected = (2 * n1 * n0) / n + 1
  const variance = (2 * n1 * n0 * (2 * n1 * n0 - n)) / (n * n * (n - 1))
  if (variance <= 0) return { runs, expected, zScore: NaN, pValue: NaN }
  const zScore = (runs - expected) / Math.sqrt(variance)

  // Two-tailed p-value from standard normal (using the error function approximation)
  const pValue = 2 * (1 - normalCDF(Math.abs(zScore)))
  return { runs, expected: +expected.toFixed(4), zScore: +zScore.toFixed(4), pValue: +pValue.toFixed(4) }
}
Interpretation
Z-Score   P-Value  Interpretation
-2 to +2  > 0.05   Pass: sequence appears random
< -2      < 0.05   Too few runs (clustering/streaks)
> +2      < 0.05   Too many runs (oscillation)
Example
import { runsTest } from './engine/metrics.js'

// Random-looking sequence (should pass)
const random = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
const result1 = runsTest(random)
console.log(`Runs: ${result1.runs}, Expected: ${result1.expected}, Z: ${result1.zScore}, p: ${result1.pValue}`)
// Output: Runs: 8, Expected: 6, Z: 1.3416, p: 0.1797 (PASS: appears random)

// Streaky sequence (should fail)
const streaky = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
const result2 = runsTest(streaky)
console.log(`Runs: ${result2.runs}, Expected: ${result2.expected}, Z: ${result2.zScore}, p: ${result2.pValue}`)
// Output: Runs: 2, Expected: 6, Z: -2.6833, p: 0.0073 (FAIL: too few runs, clustering detected)
Cold Streak Detection: If the runs test shows Z < -2, the model is producing streaky predictions rather than independent ones. This is a red flag for risk management.
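That rule of thumb can be wrapped in a small guard (the helper name isStreaky is hypothetical, not part of metrics.js) operating on the object shape runsTest returns:

```javascript
// Treat a significantly negative Z (fewer runs than expected) as a
// streakiness red flag; NaN (degenerate sequences) never triggers it.
function isStreaky(runsResult, zThreshold = -2) {
  return Number.isFinite(runsResult.zScore) && runsResult.zScore < zThreshold
}

console.log(isStreaky({ zScore: -2.6833 })) // true
console.log(isStreaky({ zScore: 0.5 }))     // false (appears random)
console.log(isStreaky({ zScore: NaN }))     // false (degenerate sequence)
```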
Band Analysis
Classifies predictions into 5 confidence bands and computes per-band accuracy, Brier score, and mean probability.
Band Definitions
Band  Label             Range   Confidence Distance
1     Ruido             45-55%  |p - 0.5| < 0.05
2     Senal debil       55-65%  0.05 ≤ |p - 0.5| < 0.15
3     Senal moderada    65-75%  0.15 ≤ |p - 0.5| < 0.25
4     Senal fuerte      75-85%  0.25 ≤ |p - 0.5| < 0.35
5     Senal muy fuerte  85%+    |p - 0.5| ≥ 0.35
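The band lookup implied by the table, as a standalone sketch (the helper name bandFor is hypothetical; bandAnalysis below performs the same scan over band objects):

```javascript
// Map a raw probability to its band number via distance from 0.5.
function bandFor(probability) {
  const c = Math.abs(probability - 0.5)
  if (c < 0.05) return 1
  if (c < 0.15) return 2
  if (c < 0.25) return 3
  if (c < 0.35) return 4
  return 5
}

console.log(bandFor(0.70)) // 3 (Senal moderada)
```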
Implementation
export function bandAnalysis(records) {
  const bands = [
    { band: 1, label: 'Ruido', range: '45-55%', min: 0.00, max: 0.05, items: [] },
    { band: 2, label: 'Senal debil', range: '55-65%', min: 0.05, max: 0.15, items: [] },
    { band: 3, label: 'Senal moderada', range: '65-75%', min: 0.15, max: 0.25, items: [] },
    { band: 4, label: 'Senal fuerte', range: '75-85%', min: 0.25, max: 0.35, items: [] },
    { band: 5, label: 'Senal muy fuerte', range: '85%+', min: 0.35, max: Infinity, items: [] },
  ]

  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue
    const confidence = Math.abs(ep.probability - 0.5)
    const correct = record.earlyPredictionCorrect
    if (correct == null) continue

    // Find the right band
    for (const b of bands) {
      if (confidence >= b.min && confidence < b.max) {
        b.items.push({ confidence, correct, record })
        break
      }
    }
  }

  // Build scoring data per band for partial Brier calculation
  return bands.map(b => {
    const count = b.items.length
    if (count === 0) {
      return {
        band: b.band, label: b.label, range: b.range,
        count: 0, accuracy: '--', brier: '--', meanProb: '--'
      }
    }
    const correctCount = b.items.filter(i => i.correct).length
    const accuracy = ((correctCount / count) * 100).toFixed(1)

    // Compute partial Brier for this band
    const scoringItems = []
    for (const item of b.items) {
      const ep = item.record.earlyPrediction
      if (ep.direction === 'UP') {
        scoringItems.push({ predicted: ep.probability, outcome: item.record.result === 'UP' ? 1 : 0 })
      } else if (ep.direction === 'DOWN') {
        scoringItems.push({ predicted: 1 - ep.probability, outcome: item.record.result === 'DOWN' ? 1 : 0 })
      }
    }
    const brier = scoringItems.length > 0 ? brierScore(scoringItems).toFixed(4) : '--'

    // Mean effective confidence
    const meanConf = b.items.reduce((s, i) => s + i.confidence, 0) / count
    const meanProb = (0.5 + meanConf).toFixed(2)

    return {
      band: b.band, label: b.label, range: b.range,
      count, accuracy, brier, meanProb
    }
  })
}
src/engine/metrics.js (lines 192-270)
/**
* 5-band confidence analysis.
*
* Classifies early predictions into 5 bands based on confidence distance
* from 0.50, then computes per-band count, accuracy, partial Brier, and
* mean confidence.
*
* Band boundaries (mapped from raw probability distance from 0.5):
* Band 1: 45-55% (Ruido) — |p-0.5| < 0.05 → effective 0.50-0.55
* Band 2: 55-65% (Senal debil) — |p-0.5| 0.05-0.15 → effective 0.55-0.65
* Band 3: 65-75% (Senal moderada) — |p-0.5| 0.15-0.25 → effective 0.65-0.75
* Band 4: 75-85% (Senal fuerte) — |p-0.5| 0.25-0.35 → effective 0.75-0.85
* Band 5: 85%+ (Senal muy fuerte) — |p-0.5| >= 0.35 → effective 0.85+
*
* @param {Array<Object>} records IntervalRecord objects
* @returns {Array<{band: number, label: string, range: string, count: number, accuracy: string, brier: string, meanProb: string}>}
*/
export function bandAnalysis(records) {
  // [implementation shown above]
}
Example Output
import { bandAnalysis } from './engine/metrics.js'
import { HistoryStore } from './tracker/history.js'

const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()
const bands = bandAnalysis(records)
console.table(bands)

Band  Label             Range   Count  Accuracy  Brier   Mean Prob
1     Ruido             45-55%  23     52.2%     0.2489  0.52
2     Senal debil       55-65%  45     58.9%     0.2301  0.60
3     Senal moderada    65-75%  38     68.4%     0.1876  0.70
4     Senal fuerte      75-85%  12     75.0%     0.1123  0.80
5     Senal muy fuerte  85%+    3      100.0%    0.0289  0.91
Calibration Check: If accuracy closely matches mean probability in each band, the model is well-calibrated. Large discrepancies indicate miscalibration.
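That check can be made quantitative with a small post-processing step over the row shape bandAnalysis returns (the helper name calibrationGaps is hypothetical):

```javascript
// Signed gap between observed accuracy and mean predicted probability
// per populated band. Gaps near 0 indicate good calibration.
function calibrationGaps(bands) {
  return bands
    .filter(b => b.count > 0)
    .map(b => ({
      band: b.band,
      // accuracy is a percentage string, meanProb a probability string
      gap: +(parseFloat(b.accuracy) / 100 - parseFloat(b.meanProb)).toFixed(3),
    }))
}

console.log(calibrationGaps([
  { band: 3, count: 38, accuracy: '68.4', meanProb: '0.70' },
]))
// [ { band: 3, gap: -0.016 } ]
```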
Data Conversion
Convert IntervalRecord objects into scoring data format:
src/engine/metrics.js (lines 155-190)
/**
* Convert closed IntervalRecord objects into scoring data format.
*
* Uses earlyPrediction.probability as the predicted value.
* If direction='UP': predicted = probability, outcome = 1 when result='UP'.
* If direction='DOWN': predicted = 1 - probability, outcome = 1 when result='DOWN'.
* Skips records where earlyPrediction is null or has abstained flag.
*
* @param {Array<Object>} records IntervalRecord objects from history.json
* @returns {Array<{predicted: number, outcome: 0|1}>}
*/
export function intervalsToScoringData(records) {
  const data = []
  for (const record of records) {
    const ep = record.earlyPrediction
    if (!ep || ep.abstained) continue
    const direction = ep.direction
    const probability = ep.probability
    if (direction === 'UP') {
      data.push({
        predicted: probability,
        outcome: record.result === 'UP' ? 1 : 0
      })
    } else if (direction === 'DOWN') {
      data.push({
        predicted: 1 - probability,
        outcome: record.result === 'DOWN' ? 1 : 0
      })
    }
  }
  return data
}
Full Analysis Pipeline
import { HistoryStore } from './tracker/history.js'
import {
  intervalsToScoringData,
  brierScore,
  brierSkillScore,
  logLoss,
  murphyDecomposition,
  runsTest,
  bandAnalysis
} from './engine/metrics.js'

// Load interval history
const history = new HistoryStore({ filePath: 'data/history.json' })
const records = await history.load()

// Convert to scoring format
const data = intervalsToScoringData(records)

// Compute all metrics
const bs = brierScore(data)
const bss = brierSkillScore(bs)
const ll = logLoss(data)
const murphy = murphyDecomposition(data)

// Extract outcomes for runs test
const outcomes = data.map(d => d.outcome)
const runs = runsTest(outcomes)

// Band analysis
const bands = bandAnalysis(records)

console.log('=== OVERALL METRICS ===')
console.log(`Brier Score: ${bs.toFixed(4)}`)
console.log(`Brier Skill Score: ${bss.toFixed(2)}`)
console.log(`Log Loss: ${ll.toFixed(4)}`)
console.log()

console.log('=== MURPHY DECOMPOSITION ===')
console.log(`Reliability: ${murphy.reliability.toFixed(4)}`)
console.log(`Resolution: ${murphy.resolution.toFixed(4)}`)
console.log(`Uncertainty: ${murphy.uncertainty.toFixed(4)}`)
console.log()

console.log('=== RUNS TEST ===')
console.log(`Observed: ${runs.runs} runs`)
console.log(`Expected: ${runs.expected} runs`)
console.log(`Z-Score: ${runs.zScore}`)
console.log(`P-Value: ${runs.pValue} ${runs.pValue < 0.05 ? '(FAIL: not random)' : '(PASS: appears random)'}`)
console.log()

console.log('=== BAND ANALYSIS ===')
console.table(bands)
Interval Tracking: how intervals are tracked and closed
History Store: JSON persistence for interval records
Logging: structured logs and tick data