Overview
Before deploying your agent to production, it’s crucial to evaluate its performance. The benchmarking system allows you to test your agent against real prediction markets where humans are trading, providing a reliable measure of accuracy without risking real funds.
Why Benchmark?
Validate Accuracy
Measure your agent’s prediction accuracy against resolved markets
Compare Performance
See how your agent stacks up against human traders
Fast Iteration
Test changes quickly without waiting for real markets to resolve
Risk-Free Testing
Evaluate performance without spending money on bets
Benchmark Script
The scripts/simple_benchmark.py script fetches markets from Manifold (where humans trade) and runs your agents against them to generate a performance report.
Script Overview
scripts/simple_benchmark.py
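The script's internals evolve with the repo, so the sketch below only illustrates its core loop — the class and field names here are placeholders, not the script's real API: fetch resolved markets, ask the agent for a prediction, and score the answers.

```python
from dataclasses import dataclass

@dataclass
class ResolvedMarket:
    # Placeholder for a resolved Manifold market: question text plus outcome.
    question: str
    resolved_yes: bool

@dataclass
class Prediction:
    p_yes: float       # agent's probability that the market resolves YES
    confidence: float  # how sure the agent is about that estimate

def run_benchmark(agent, markets):
    """Run the agent over resolved markets and score its accuracy."""
    correct, answered = 0, 0
    for market in markets:
        prediction = agent.predict(market.question)
        if prediction is None:  # the agent may decline a question
            continue
        answered += 1
        predicted_yes = prediction.p_yes > 0.5
        if predicted_yes == market.resolved_yes:
            correct += 1
    accuracy = correct / answered if answered else 0.0
    return {"markets": answered, "accuracy": accuracy}

class AlwaysYesAgent:
    # Trivial baseline: predicts YES with 60% probability for everything.
    def predict(self, question):
        return Prediction(p_yes=0.6, confidence=0.5)

markets = [
    ResolvedMarket("Will it rain tomorrow?", True),
    ResolvedMarket("Will BTC close above $100k?", False),
]
report = run_benchmark(AlwaysYesAgent(), markets)
print(report)  # {'markets': 2, 'accuracy': 0.5}
```

The real script additionally writes the markdown report and handles caching; the scoring logic is the part worth internalizing.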
Running Benchmarks
Basic Usage
Run a benchmark with 10 markets:
Command Line Options
- Number of markets to fetch and test against
- Output path for the markdown report
- Path to cache predictions for faster re-runs
- Only use cached predictions, don’t make new ones
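Flag names vary between versions of the script, so treat the ones below as assumptions and check `python scripts/simple_benchmark.py --help` for the authoritative list. The four options map naturally onto an argparse definition like:

```python
import argparse

def build_parser():
    # Hypothetical flag names mirroring the options described above.
    parser = argparse.ArgumentParser(
        description="Benchmark agents on Manifold markets")
    parser.add_argument("--markets", type=int, default=10,
                        help="Number of markets to fetch and test against")
    parser.add_argument("--output", default="benchmark_report.md",
                        help="Output path for the markdown report")
    parser.add_argument("--cache-path", default=None,
                        help="Path to cache predictions for faster re-runs")
    parser.add_argument("--only-cached", action="store_true",
                        help="Only use cached predictions, don't make new ones")
    return parser

args = build_parser().parse_args(["--markets", "50", "--only-cached"])
print(args.markets, args.only_cached)  # 50 True
```

With these (assumed) flags, a run might look like `python scripts/simple_benchmark.py --markets 10 --output report.md`; confirm the real names with `--help`.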
Examples
Adding Your Agent
To benchmark your custom agent, modify the benchmarker initialization in simple_benchmark.py:
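The exact base class and constructor depend on your version of the tooling, so the self-contained sketch below uses stand-in names to show the shape of the change: define an agent with a predict method and append it to the list handed to the benchmarker, alongside a baseline for comparison.

```python
import random

class AbstractBenchmarkedAgent:
    # Stand-in for the tooling's benchmark agent interface (name assumed).
    def __init__(self, agent_name):
        self.agent_name = agent_name

    def predict(self, market_question: str):
        raise NotImplementedError

class CoinFlipAgent(AbstractBenchmarkedAgent):
    # Baseline: a random guess — useful as a reference point.
    def predict(self, market_question: str):
        return {"p_yes": random.random(), "confidence": 0.5}

class MyCustomAgent(AbstractBenchmarkedAgent):
    # Your agent: wrap whatever prediction logic you plan to deploy.
    def predict(self, market_question: str):
        p_yes = 0.5  # replace with your real estimate
        return {"p_yes": p_yes, "confidence": 0.8}

# In simple_benchmark.py, initialize the benchmarker with both your
# agent and a baseline so the report compares them side by side:
agents = [
    CoinFlipAgent(agent_name="coin-flip-baseline"),
    MyCustomAgent(agent_name="my-agent"),
]
print([a.agent_name for a in agents])  # ['coin-flip-baseline', 'my-agent']
```

If you are using prediction-market-agent-tooling, check its benchmark module for the real base class to subclass; the shape of the change is the same.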
Understanding the Report
The benchmark generates a markdown report with detailed statistics:
Sample Report Structure
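The exact layout depends on the script version, so treat this skeleton as illustrative only — a per-agent summary followed by per-market details:

```
# Benchmark Report

## Summary

| Agent     | Markets | Accuracy | Avg. Confidence |
|-----------|---------|----------|-----------------|
| my-agent  | 10      | 60.0%    | 0.72            |
| coin-flip | 10      | 50.0%    | 0.50            |

## Per-market details
...
```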
Key Metrics
Accuracy
Percentage of markets where your agent’s prediction was correct.
Calculation: Correct predictions / Total predictions
Target: >50% (better than random), >60% (good), >70% (excellent)
Confidence
Average confidence level of your agent’s predictions.
Range: 0.0 to 1.0
Insight: High confidence with low accuracy suggests overconfidence; low confidence with high accuracy suggests your agent is too conservative.
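Both metrics are simple aggregates over the prediction log; a minimal sketch of the calculation, with made-up numbers:

```python
predictions = [
    # (agent's p_yes, agent's confidence, did the market resolve YES?)
    (0.8, 0.9, True),
    (0.3, 0.6, False),
    (0.7, 0.8, False),
    (0.2, 0.5, False),
]

# Accuracy: fraction of markets where the implied YES/NO call was right.
correct = sum(1 for p_yes, _, resolved_yes in predictions
              if (p_yes > 0.5) == resolved_yes)
accuracy = correct / len(predictions)

# Confidence: plain average of the agent's stated confidence.
avg_confidence = sum(conf for _, conf, _ in predictions) / len(predictions)

print(f"accuracy={accuracy:.0%}, avg_confidence={avg_confidence:.2f}")
# accuracy=75%, avg_confidence=0.70
```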
Markets
Number of markets your agent provided predictions for.
Note: If lower than the total number of markets, your agent returned None for some questions.
Evaluation Methods
1. Benchmark Against Manifold
The standard approach: test against Manifold markets where humans trade.
Pros:
- Fast feedback
- No cost
- Compare against human traders
Cons:
- Limited to resolved markets
- May not reflect Omen market characteristics
2. Live Trading with Small Bets
Deploy your agent with tiny bets to evaluate on real markets.
Pros:
- Tests on the actual target platform (Omen)
- Real market conditions
- Builds trading history
Cons:
- Slower feedback (days to weeks)
- Costs real money (but minimal)
- Requires deployed infrastructure
3. Manual Observation
Watch your agent’s reasoning on specific questions.
Pros:
- Deep insight into the agent’s logic
- Helps debug issues
- Can use the Streamlit app for interactive testing
Cons:
- Time-consuming
- Not statistically significant
- Subjective evaluation
Benchmarking Best Practices
Test Multiple Sizes
Run benchmarks with 10, 50, and 100 markets to ensure consistency
Use Caching
Cache results when iterating to avoid redundant API calls
Track Over Time
Keep benchmark reports to monitor improvements
Compare Baselines
Always include simple agents like CoinFlip for reference
Diverse Markets
Test on markets from different categories and time periods
Check Confidence
High accuracy with low confidence may indicate your agent could bet more
Interpreting Results
What’s a Good Score?
50-55% Accuracy
Status: Slightly better than random
Action: Your agent has potential but needs improvement. Focus on:
- Better data sources
- Improved prompt engineering
- More context for predictions
55-60% Accuracy
Status: Decent performance
Action: Deploy with small bets to test on live markets. Look for:
- Specific market types where you excel
- Opportunities to improve confidence calibration
60-70% Accuracy
Status: Good performance
Action: Deploy to production with the Kelly betting strategy. This range indicates:
- Strong prediction capability
- Potential for profitability
- Ready for larger bet sizes
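For context, the Kelly criterion sizes each bet from your edge over the market's implied probability. The sketch below is the textbook binary-outcome formula, not necessarily the tooling's exact implementation:

```python
def kelly_fraction(p: float, market_price: float) -> float:
    """Kelly-optimal fraction of bankroll to stake on YES.

    p: your estimated probability of YES
    market_price: current YES price (implied probability), 0 < price < 1
    Returns 0 when you have no positive edge.
    """
    if p <= market_price:
        return 0.0
    return (p - market_price) / (1 - market_price)

# If you think YES is 70% likely but the market prices it at 50%,
# full Kelly stakes 40% of bankroll; in practice, bet a fraction of Kelly.
print(round(kelly_fraction(0.7, 0.5), 2))  # 0.4
print(kelly_fraction(0.5, 0.6))            # 0.0  (no edge)
```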
70%+ Accuracy
Status: Excellent performance
Action: Deploy aggressively and optimize for scale. Consider:
- Increasing bet_on_n_markets_per_run
- Raising maximum bet amounts
- Sharing insights with community
Cost-Benefit Analysis
Calculate whether your agent will be profitable:
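A back-of-the-envelope sketch; every number below is an illustrative assumption (your LLM/search costs, bet sizes, and market odds will differ):

```python
# Illustrative assumptions - substitute your own numbers.
cost_per_prediction = 0.05   # LLM + web-search API cost per market, in $
bet_size = 1.00              # average stake per market, in $
accuracy = 0.60              # benchmark accuracy
payout_multiplier = 2.0      # simplification: even-odds markets pay 2x stake

# Expected value per market traded: win probability times payout,
# minus the stake, minus the cost of making the prediction.
expected_return = accuracy * bet_size * payout_multiplier - bet_size
expected_profit = expected_return - cost_per_prediction
print(f"EV per market: ${expected_profit:.2f}")  # EV per market: $0.15
```

If the expected profit is negative at your benchmark accuracy, reduce per-prediction costs or improve accuracy before deploying.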
Advanced Benchmarking
Custom Market Selection
Test on specific market types:
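One lightweight way is to filter the fetched markets by keyword before handing them to the benchmarker (the dict field name below is illustrative):

```python
def filter_markets(markets, keywords):
    """Keep only markets whose question mentions one of the keywords."""
    return [m for m in markets
            if any(k.lower() in m["question"].lower() for k in keywords)]

markets = [
    {"question": "Will ETH hit $5000 by December?"},
    {"question": "Will it snow in Paris this winter?"},
    {"question": "Will BTC dominance fall below 50%?"},
]
crypto_only = filter_markets(markets, ["ETH", "BTC", "crypto"])
print(len(crypto_only))  # 2
```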
Time-Based Analysis
Evaluate performance on recent vs. older markets:
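Splitting the prediction log by market close date shows whether performance drifts over time (dates and tuple layout below are illustrative):

```python
from datetime import datetime, timezone

results = [
    # (market close date, was the prediction correct?)
    (datetime(2023, 6, 1, tzinfo=timezone.utc), True),
    (datetime(2023, 7, 15, tzinfo=timezone.utc), False),
    (datetime(2024, 2, 1, tzinfo=timezone.utc), True),
    (datetime(2024, 3, 10, tzinfo=timezone.utc), True),
]

cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
older = [ok for date, ok in results if date < cutoff]
recent = [ok for date, ok in results if date >= cutoff]

print(f"older:  {sum(older) / len(older):.0%}")    # older:  50%
print(f"recent: {sum(recent) / len(recent):.0%}")  # recent: 100%
```

A large gap between the two splits suggests your agent relies on information that is stale for recent markets (or, conversely, that its data sources only cover recent events).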
Confidence Calibration
Check if your confidence levels match actual accuracy:
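A simple calibration check: bucket predictions by stated probability and compare each bucket against its realized hit rate (a sketch with made-up numbers — a well-calibrated agent's ~0.9 bucket should resolve YES about 90% of the time):

```python
from collections import defaultdict

predictions = [
    # (agent's p_yes, did the market resolve YES?)
    (0.9, True), (0.85, True), (0.9, False),
    (0.6, True), (0.55, False), (0.65, True),
]

buckets = defaultdict(list)
for p_yes, resolved_yes in predictions:
    bucket = int(p_yes * 10) / 10  # group into 0.1-wide buckets
    buckets[bucket].append(resolved_yes)

for bucket in sorted(buckets):
    outcomes = buckets[bucket]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated ~{bucket:.1f} -> realized {hit_rate:.0%} (n={len(outcomes)})")
```

Buckets where the realized hit rate sits well below the stated probability are where your agent is overconfident and should bet less.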
Common Issues
Agent timing out on benchmark
Cause: Agent takes too long to make predictions
Solution:
- Reduce number of web searches
- Use faster LLM models
- Implement timeout handling
- Cache intermediate results
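One way to implement timeout handling is to run each prediction in a worker thread and give up after a deadline; this is a generic pattern, not a built-in of the tooling:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def predict_with_timeout(predict_fn, question, timeout_s=30.0):
    """Return the prediction, or None if it doesn't finish within timeout_s."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(predict_fn, question)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # Note: Python can't kill a running thread; the pool's shutdown
            # still waits for the runaway call to finish before returning.
            return None

def fast_agent(question):
    return {"p_yes": 0.6}

def slow_agent(question):
    time.sleep(1)  # stands in for a long chain of web searches / LLM calls
    return {"p_yes": 0.6}

print(predict_with_timeout(fast_agent, "Will it rain?", timeout_s=5.0))  # {'p_yes': 0.6}
print(predict_with_timeout(slow_agent, "Will it rain?", timeout_s=0.2))  # None
```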
Low accuracy on benchmark but good on live markets
Cause: Manifold markets differ from Omen markets
Solution:
- Focus on live market evaluation
- Filter Manifold markets to match Omen characteristics
- Use both methods for comprehensive evaluation
Agent returns None for many markets
Cause: verify_market() or prediction logic rejecting markets
Solution:
- Review market filtering logic
- Check for API errors
- Ensure data sources are accessible
- Add better error handling
Next Steps
Deploy to Production
Ready to go live? Deploy your agent
Hackathon Guide
Building for a hackathon? Check the quickstart