Overview
Daily reports provide deep statistical analysis of the bot’s performance. They’re generated as Markdown files with YAML frontmatter, compatible with Obsidian and other knowledge management tools.
Generating Reports
Report Structure
Each report contains 12 major sections:
- Resumen - Basic statistics and price range
- Scoring Metrics - Brier Score, Log Loss, Brier Skill Score
- Murphy Decomposition - Calibration quality breakdown
- Accuracy de predicciones - Final vs Early 1m performance
- Bandas de confianza - Performance by confidence level (5 bands)
- Runs Test - Statistical independence test
- Rachas - Win/loss streaks
- Abstenciones - Abstention analysis by reason
- Datos de Mercado - Polymarket market data coverage
- Gestion de Riesgo - Trade execution and PnL
- Fallos del Early 1m - Detailed breakdown of misses
- Observaciones - Key insights and recommendations
Section Breakdown
1. Resumen (Summary)
Intervalos analizados
Total number of 5-minute intervals that closed during this day.
Each interval represents one complete prediction cycle (5 minutes).
Resultado UP / DOWN
How many intervals closed above the strike (UP) vs below (DOWN).
What it tells you:
- Market direction bias for the day
- If heavily skewed (70%+ one direction), model may have had easier/harder predictions
- Balanced split (45-55%) is typical for binary markets
Rango de precio
Lowest and highest BTC close prices during the day.
Use this to:
- Assess volatility (wider range = more volatile)
- Correlate with prediction accuracy (extreme volatility often reduces accuracy)
2. Scoring Metrics
Brier Score
Formula:
BS = (1/N) * Σ(p - outcome)²
Measures calibration quality. Lower is better.
Interpretation:
- < 0.15 - Excellent calibration (elite performance)
- 0.15-0.20 - Good calibration (strong model)
- 0.20-0.25 - Acceptable (better than random)
- > 0.25 - Poor (no better than coin flip)
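As a minimal sketch of the formula above, the score can be computed from two parallel arrays. The function name and input shape are assumptions made for illustration and are not taken from the project's metrics.js.

```javascript
// probs: predicted probability of UP for each interval (0..1)
// outcomes: 1 if the interval closed UP, 0 if it closed DOWN
function brierScore(probs, outcomes) {
  if (probs.length === 0 || probs.length !== outcomes.length) {
    throw new Error("probs and outcomes must be non-empty and the same length");
  }
  let sum = 0;
  for (let i = 0; i < probs.length; i++) {
    const diff = probs[i] - outcomes[i];
    sum += diff * diff; // squared error of one prediction
  }
  return sum / probs.length; // mean squared error
}

// Always predicting 0.5 gives exactly 0.25, the "coin flip" threshold above.
console.log(brierScore([0.5, 0.5], [1, 0]));
```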
See src/engine/metrics.js:6 for the implementation.
Log Loss
Formula:
LL = -(1/N) * Σ[outcome*ln(p) + (1-outcome)*ln(1-p)]
Penalizes confident wrong predictions more heavily than Brier.
Interpretation:
- < 0.50 - Excellent
- 0.50-0.69 - Good (better than baseline)
- > 0.69 - Poor (worse than random)
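The sketch below shows one way to compute log loss. Probabilities are clipped away from 0 and 1 so a single fully confident miss does not produce an infinite score. The clipping constant `eps` and the function name are assumptions for illustration. The 0.69 threshold above corresponds to ln(2), the score of always predicting 0.5.

```javascript
// probs: predicted probability of UP; outcomes: 1 for UP, 0 for DOWN
function logLoss(probs, outcomes, eps = 1e-15) {
  if (probs.length === 0 || probs.length !== outcomes.length) {
    throw new Error("probs and outcomes must be non-empty and the same length");
  }
  let sum = 0;
  for (let i = 0; i < probs.length; i++) {
    // Clip to [eps, 1 - eps] so Math.log never receives 0 (assumed safeguard).
    const p = Math.min(Math.max(probs[i], eps), 1 - eps);
    sum += outcomes[i] * Math.log(p) + (1 - outcomes[i]) * Math.log(1 - p);
  }
  return -sum / probs.length;
}

// A confident miss costs more than a hesitant one.
console.log(logLoss([0.9], [0]) > logLoss([0.6], [0]));
```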
BSS (Brier Skill Score)
Formula:
BSS = 1 - (BS / BS_baseline)
Measures improvement over random baseline.
Interpretation:
- > 0.20 - Excellent skill
- 0.10-0.20 - Good skill
- 0-0.10 - Marginal skill
- < 0 - Worse than random (model is broken)
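A short sketch of the skill score follows. It assumes the random baseline is a forecaster that always predicts 0.5, which has a Brier Score of 0.25. The source does not state which baseline the bot uses, so the default is a labelled assumption and can be overridden.

```javascript
// bs: the model's Brier Score
// baselineBs: Brier Score of the reference forecast (assumed 0.25 = always 0.5)
function brierSkillScore(bs, baselineBs = 0.25) {
  if (baselineBs <= 0) {
    throw new Error("baseline Brier Score must be positive");
  }
  return 1 - bs / baselineBs;
}

console.log(brierSkillScore(0.125)); // half the baseline error
```

A model that matches the baseline scores 0, and anything worse than the baseline goes negative.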
3. Murphy Decomposition
Reliability
What it measures: How well predicted probabilities match actual outcomes.
Formula:
Reliability = (1/N) * Σ n_k * (p_k - o_k)²
where p_k is the mean forecast probability in bin k, o_k is the mean outcome in bin k, n_k is the number of predictions in bin k, and N is the total number of predictions.
Lower is better.
Interpretation:
- < 0.03 - Excellent calibration
- 0.03-0.05 - Good
- > 0.05 - Poorly calibrated
If reliability is poor:
- Enable Platt calibration in config
- Check if model is overconfident
- Review abstention thresholds
Resolution
What it measures: How well the model separates outcomes (discrimination power).
Formula:
Resolution = (1/N) * Σ n_k * (o_k - ō)²
where ō is the overall base rate of UP outcomes.
Higher is better: it means the model can distinguish between UP and DOWN outcomes.
Interpretation:
- > 0.15 - Excellent discrimination
- 0.10-0.15 - Good
- < 0.10 - Weak discrimination
If resolution is weak:
- Model may need more features (check momentum, volatility)
- May be too conservative (check abstention rate)
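The reliability and resolution terms can be computed in one pass by binning forecasts, together with the uncertainty term ō * (1 - ō) that completes the decomposition. This is a sketch only: the use of 10 equal-width bins, the function name and the return shape are assumptions, and each sum is divided by the total count N so the three terms are on the same scale as the Brier Score.

```javascript
// probs: predicted probability of UP; outcomes: 1 for UP, 0 for DOWN
function murphyDecomposition(probs, outcomes, numBins = 10) {
  const n = probs.length;
  if (n === 0 || n !== outcomes.length) {
    throw new Error("probs and outcomes must be non-empty and the same length");
  }
  const bins = Array.from({ length: numBins }, () => ({ count: 0, sumP: 0, sumO: 0 }));
  for (let i = 0; i < n; i++) {
    // Equal-width bins; a probability of exactly 1 goes into the last bin.
    const k = Math.min(Math.floor(probs[i] * numBins), numBins - 1);
    bins[k].count++;
    bins[k].sumP += probs[i];
    bins[k].sumO += outcomes[i];
  }
  const baseRate = outcomes.reduce((a, b) => a + b, 0) / n;
  let reliability = 0;
  let resolution = 0;
  for (const b of bins) {
    if (b.count === 0) continue;
    const pk = b.sumP / b.count; // mean forecast in the bin
    const ok = b.sumO / b.count; // observed UP frequency in the bin
    reliability += b.count * (pk - ok) ** 2;
    resolution += b.count * (ok - baseRate) ** 2;
  }
  return {
    reliability: reliability / n,
    resolution: resolution / n,
    uncertainty: baseRate * (1 - baseRate),
  };
}

const d = murphyDecomposition([0.25, 0.25, 0.75, 0.75], [0, 1, 1, 1]);
console.log(d.uncertainty - d.resolution + d.reliability);
```

The identity with the Brier Score is exact only when every forecast in a bin has the same value, as in this example. With varied forecasts inside a bin, the three terms approximate the Brier Score.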
Uncertainty
What it measures: Inherent unpredictability of the problem.
Formula:
Uncertainty = ō * (1 - ō)
where ō is the base rate (proportion of UP outcomes).
This is fixed based on the data - you can’t change it. It represents the difficulty of the prediction task.
Relationship: Brier Score = Uncertainty - Resolution + Reliability
4. Accuracy Comparison
Why two accuracy measures?
- Final (30s) - Captured at 30 seconds before close
  - Usually higher accuracy (more data)
  - Too late to trade on (not enough time to execute)
  - Useful for model validation
- Early 1m (60s) - Captured at 60 seconds before close
  - This is the real trading metric
  - Practical signal you can act on
  - Target: 80%+ for profitable trading
What if Early 1m < 80%?
If Early 1m accuracy drops below 80%:
- Check data quality - Missing ticks? WebSocket issues?
- Review abstention rate - Is the model being too aggressive?
- Analyze confidence bands - Is high-confidence band still strong?
- Check market conditions - High volatility day?
- Run Murphy decomposition - Calibration or discrimination issue?
5. Confidence Bands (5 bands)
How bands work
Each prediction is placed into one of 5 confidence bands based on its probability:
- Band 1: 50-60% (low confidence)
- Band 2: 60-70% (moderate)
- Band 3: 70-80% (good)
- Band 4: 80-90% (high confidence)
- Band 5: 90-100% (very high)
For each band, the report shows:
- Count - Number of predictions in this band
- Accuracy - What % were correct
- Brier - Calibration quality for this band
- Mean Prob - Average probability
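The banding and per-band metrics can be sketched as follows. Two assumptions are made that the source does not confirm: confidence is taken as max(p, 1 - p), so a 20% UP forecast counts as an 80% DOWN forecast, and Mean Prob is the average of that confidence. Field names such as `pUp` and `outcome` are hypothetical.

```javascript
// Lower bound of each band, in order (Band 1 .. Band 5).
const BAND_FLOORS = [0.5, 0.6, 0.7, 0.8, 0.9];

// Map a probability of UP to a band number 1..5.
function bandOf(pUp) {
  const confidence = Math.max(pUp, 1 - pUp);
  let band = 1;
  for (let i = 0; i < BAND_FLOORS.length; i++) {
    if (confidence >= BAND_FLOORS[i]) band = i + 1;
  }
  return band;
}

// predictions: [{ pUp: number, outcome: 0 | 1 }]
function bandStats(predictions) {
  const acc = BAND_FLOORS.map((_, i) => ({ band: i + 1, count: 0, correct: 0, sqErr: 0, confSum: 0 }));
  for (const { pUp, outcome } of predictions) {
    const s = acc[bandOf(pUp) - 1];
    const predictedUp = pUp >= 0.5;
    s.count++;
    if (predictedUp === (outcome === 1)) s.correct++;
    s.sqErr += (pUp - outcome) ** 2;
    s.confSum += Math.max(pUp, 1 - pUp);
  }
  // Empty bands report null instead of dividing by zero.
  return acc.map((s) => ({
    band: s.band,
    count: s.count,
    accuracy: s.count ? s.correct / s.count : null,
    brier: s.count ? s.sqErr / s.count : null,
    meanProb: s.count ? s.confSum / s.count : null,
  }));
}

console.log(bandStats([{ pUp: 0.92, outcome: 1 }, { pUp: 0.08, outcome: 0 }])[4]);
```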
What to look for
Ideal pattern:
- Higher bands have higher accuracy
- Band 4-5 accuracy should be 85%+
- Brier Score decreases as band increases
Warning signs:
- Band 5 accuracy < 85% (overconfident)
- Band 1 accuracy > 65% (underconfident, should be more aggressive)
- Bands 4-5 have very few predictions (too conservative)
6. Runs Test (Independence)
What is a runs test?
Tests whether prediction errors are randomly distributed or show patterns.
Run = sequence of consecutive successes or failures
Example: the sequence W W L L L W contains 3 runs (WW, LLL, W).
Too few runs = clustering (model has systematic biases)
Too many runs = alternating pattern (model overcorrects)
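One common form of this check is the Wald-Wolfowitz runs test with a normal approximation, sketched below. Whether the bot uses this exact variant is an assumption. The `erf` helper is the Abramowitz-Stegun polynomial approximation (accurate to about 1e-7), and all function names are illustrative.

```javascript
// Error function via the Abramowitz-Stegun 7.1.26 approximation.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// results: 1 for a correct prediction, 0 for an incorrect one
function countRuns(results) {
  let runs = results.length > 0 ? 1 : 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) runs++;
  }
  return runs;
}

function runsTest(results) {
  const n1 = results.filter((r) => r === 1).length;
  const n2 = results.length - n1;
  const runs = countRuns(results);
  // The test is undefined when only one kind of result is present.
  if (n1 === 0 || n2 === 0) return { runs, z: NaN, pValue: NaN };
  const n = n1 + n2;
  const mean = (2 * n1 * n2) / n + 1;
  const variance = (2 * n1 * n2 * (2 * n1 * n2 - n)) / (n * n * (n - 1));
  const z = (runs - mean) / Math.sqrt(variance);
  const pValue = 2 * (1 - normalCdf(Math.abs(z))); // two-sided
  return { runs, z, pValue };
}

console.log(runsTest([1, 0, 1, 0, 1, 0, 1, 0, 1, 0]));
```

A negative z means fewer runs than expected (clustering) and a positive z means more runs than expected (alternation). The normal approximation is rough for short sequences.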
Interpreting p-value
- p < 0.05 - "Dependencia serial detectada" (serial dependence detected)
  - Errors are not random
  - Model has systematic biases (e.g., always misses trending markets)
  - Action: Review momentum/reversion features
- p >= 0.05 - "Errores aparentemente aleatorios" (errors apparently random)
  - Good! Errors appear random
  - Model isn’t missing obvious patterns
  - Continue monitoring
7. Streaks (Rachas)
Win Streaks
Longest consecutive correct Early 1m predictions.
What it tells you:
- Model’s best performance period
- Confidence during hot streaks
- Potential for compound gains
Loss Streaks
Longest consecutive incorrect Early 1m predictions.
Critical for risk management:
- If max loss streak = 4, you need bankroll to survive 4 consecutive losses
- Cold streak abstention triggers when accuracy over the rolling window drops below 40%
- Drawdown tracking prevents catastrophic loss
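Both streak figures come from the same scan over the day's results. A minimal sketch, assuming the Early 1m results are available as an array of booleans (true = correct); the function name is illustrative.

```javascript
// results: true for a correct prediction, false for an incorrect one
// target: which value to measure the longest consecutive run of
function longestStreak(results, target) {
  let best = 0;
  let current = 0;
  for (const r of results) {
    current = r === target ? current + 1 : 0; // reset when the streak breaks
    if (current > best) best = current;
  }
  return best;
}

const day = [true, true, false, true, true, true, false, false];
console.log(longestStreak(day, true), longestStreak(day, false));
```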
8. Abstenciones
Abstention Analysis
Key metrics:
- Abstention Rate - What % of intervals we didn’t trade
- Accuracy sin abstenciones (accuracy excluding abstentions) - Accuracy on intervals we DID predict
Abstention rate interpretation:
- < 10% - Model may be too aggressive, taking marginal bets
- 10-25% - Healthy selectivity
- > 25% - Model may be too conservative, missing opportunities
Abstention Reasons
Most common reasons:
- insufficient_margin - Edge < 15pp (most common)
- dead_zone - Probability too close to 50%
- insufficient_ev - EV < 5%
- drawdown_suspension - In RED/CRITICAL drawdown
- cold_streak - Accuracy dropped below 40%
- insufficient_data - < 50 ticks collected
- anomalous_volatility - Volatility > 2x mean
What to do:
- If insufficient_margin dominates, consider lowering the threshold (but this increases risk)
- If cold_streak appears often, the model may need recalibration
- If anomalous_volatility is high, check WebSocket feed quality
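To see which reason dominates, the abstentions can be tallied per reason alongside the overall abstention rate. This sketch assumes each interval record carries a hypothetical `abstainReason` field that is null when the bot traded; the real record shape is not shown in the source.

```javascript
// intervals: [{ abstainReason: string | null }]
function abstentionSummary(intervals) {
  const counts = {};
  let abstained = 0;
  for (const { abstainReason } of intervals) {
    if (!abstainReason) continue; // the bot traded this interval
    abstained++;
    counts[abstainReason] = (counts[abstainReason] || 0) + 1;
  }
  // Most frequent reason first.
  const byReason = Object.entries(counts).sort((a, b) => b[1] - a[1]);
  const rate = intervals.length ? abstained / intervals.length : 0;
  return { rate, byReason };
}

console.log(
  abstentionSummary([
    { abstainReason: "insufficient_margin" },
    { abstainReason: null },
    { abstainReason: "dead_zone" },
    { abstainReason: "insufficient_margin" },
  ])
);
```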
9. Market Data (Polymarket)
Market Coverage
Intervalos con q_market - How many intervals had Polymarket data available.
Target: 90%+
Low coverage (<80%) indicates:
- Polymarket API connectivity issues
- Market wasn’t active during those intervals
- Bot started before market opened
q_market promedio
Average Polymarket UP token price across all intervals.
Interpretation:
- ~0.50 - Market is balanced, no directional bias
- > 0.60 - Market is bullish (expects UP outcomes)
- < 0.40 - Market is bearish (expects DOWN outcomes)
EV promedio (positivo)
Average Expected Value on intervals where we found positive EV.
Target: +5% minimum
Higher is better:
- +10%+ - Excellent edge finding
- +5-10% - Good edges
- < +5% - Marginal edges, may not cover fees
Tasa de edge positivo
Percentage of intervals where model found positive EV vs market.
Example: 64.7% = we found +EV on roughly 65% of available markets
Interpretation:
- > 60% - Model is good at finding mispriced markets
- 40-60% - Moderate edge detection
- < 40% - Model may not be adding value over market
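The market metrics in this section can be sketched together. The source does not give the bot's EV formula, so this example assumes the standard return on a binary token: buying UP at price q with model probability p has expected return p/q - 1 per unit staked, and the DOWN side is assumed to trade at 1 - q. Fees and slippage are ignored, and the field names `p` and `qMarket` are hypothetical.

```javascript
// Expected return per unit staked on the better side (assumed formula, no fees).
function bestEv(p, qMarket) {
  const evUp = p / qMarket - 1;
  const evDown = (1 - p) / (1 - qMarket) - 1;
  return Math.max(evUp, evDown);
}

// intervals: [{ p: number, qMarket: number | null }]
function marketSummary(intervals) {
  // Only intervals with a usable market price count toward coverage.
  const covered = intervals.filter((i) => i.qMarket != null && i.qMarket > 0 && i.qMarket < 1);
  const positive = covered.map((i) => bestEv(i.p, i.qMarket)).filter((ev) => ev > 0);
  const mean = (xs) => (xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : null);
  return {
    coverage: intervals.length ? covered.length / intervals.length : 0,
    meanQMarket: mean(covered.map((i) => i.qMarket)),
    meanPositiveEv: mean(positive),
    positiveEdgeRate: covered.length ? positive.length / covered.length : 0,
  };
}

console.log(marketSummary([{ p: 0.6, qMarket: 0.5 }, { p: 0.5, qMarket: 0.5 }]));
```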
10. Risk Management
Trades ejecutados
Number of intervals where bet size > 0.
Lower than total intervals due to abstentions.
Win rate (trades)
Accuracy on intervals where we actually placed bets.
Important distinction:
- Overall accuracy includes all predictions
- Trade win rate only counts predictions we bet on
Bet size promedio
Average bet size using fractional Kelly criterion.
Formula:
Bet = alpha * Kelly * bankroll * (1 - drawdown_factor)
Typical range: 100 bankroll
Factors affecting size:
- Model accuracy (Brier Score tier)
- Edge size (higher edge = bigger bet)
- Drawdown level (deeper drawdown = smaller bets)
- Bankroll size
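The sizing formula above can be sketched as below. The source does not define how the Kelly term is computed, so this example assumes the standard Kelly fraction for a binary token bought at price q with win probability p, which is (p - q) / (1 - q). The default alpha of 0.25 is also an assumption, as are the parameter names.

```javascript
// Full-Kelly fraction of bankroll for a binary token (assumed form).
// p: model win probability, q: token price; returns 0 when there is no edge.
function kellyFraction(p, q) {
  if (q <= 0 || q >= 1) return 0;
  return Math.max(0, (p - q) / (1 - q));
}

// Bet = alpha * Kelly * bankroll * (1 - drawdown_factor)
function betSize({ p, q, bankroll, alpha = 0.25, drawdownFactor = 0 }) {
  return alpha * kellyFraction(p, q) * bankroll * (1 - drawdownFactor);
}

console.log(betSize({ p: 0.7, q: 0.5, bankroll: 100 }).toFixed(2));
```

Scaling by alpha and by (1 - drawdown_factor) shrinks the bet below the full-Kelly amount, which is why deeper drawdowns lead to smaller bets.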
PnL estimado
Estimated profit/loss assuming:
- Win = +1 unit
- Loss = -1 unit
- Unit = bet size
PnL = Σ(wins * bet_size) - Σ(losses * bet_size)
Note: This is a simulation. Real P&L depends on:
- Actual token prices at execution
- Slippage and fees
- Order execution timing
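The simulated PnL described above reduces to a signed sum of bet sizes. A minimal sketch, assuming each trade record has hypothetical `won` and `betSize` fields:

```javascript
// trades: [{ won: boolean, betSize: number }]
// Each win adds one bet size and each loss subtracts one, as in the formula above.
function estimatedPnl(trades) {
  let pnl = 0;
  for (const { won, betSize } of trades) {
    pnl += won ? betSize : -betSize;
  }
  return pnl;
}

console.log(
  estimatedPnl([
    { won: true, betSize: 2 },
    { won: false, betSize: 1.5 },
    { won: true, betSize: 1 },
  ])
);
```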
11. Fallos del Early 1m
Miss Analysis
Each row shows an Early 1m prediction that was incorrect.
What to look for:
- High confidence misses (80%+) - Most concerning, indicates miscalibration
- Patterns - Do misses cluster in trending vs ranging markets?
- Strike proximity - Are misses when price is very close to strike?
12. Observaciones
Auto-generated insights based on the day’s data.
Using Reports for Optimization
Daily Review
Check Early 1m accuracy and Brier Score. If both are strong (>80% accuracy, BS <0.20), continue current config.
Weekly Trends
Compare 7 days of reports. Look for:
- Declining accuracy trends
- Increasing abstention rates
- Changing market coverage
Optimization Signals
Increase aggression if:
- Trade win rate > 85%
- Abstention rate > 30%
- High confidence bands (4-5) have excellent accuracy
Decrease aggression if:
- Trade win rate < 75%
- Brier Score > 0.22
- High confidence misses increasing
Report File Format
YAML Frontmatter
Storage Location
Reports are saved to the location set in src/reporter/daily.js:9-10.
Next Steps
Reading Output
Learn to interpret the real-time console display
Troubleshooting
Fix common issues and improve performance