Running the Analyzer
The analyze command provides a comprehensive analysis of benchmark results.

Interactive File Selection
When you run the analyzer, it uses questionary to provide an interactive file selection interface.
Analysis Types
The analyzer automatically runs three types of analysis on the selected result file:

1. Accuracy Analysis
The analyze_acc() function calculates the success rate for each benchmark:

- Counts the number of "success" status entries
- Divides by the total number of attempts
- Displays the result as a percentage
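The steps above amount to a simple ratio. A minimal sketch, assuming each result is a dict with a "status" field as the text describes (the exact record layout in the real result files is an assumption):

```python
# Hedged sketch of the accuracy calculation; the record layout is assumed.
def analyze_acc(results: list[dict]) -> float:
    """Return the success rate as a percentage of all attempts."""
    if not results:
        return 0.0
    successes = sum(1 for r in results if r.get("status") == "success")
    return 100.0 * successes / len(results)
```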
2. Reasoning Pattern Analysis
The count_reasoning_patterns() function searches for specific patterns in reasoning traces:

- "wait"
- "pause"
- "hold on"
- "actually"
- "no,"
This analysis only works with result files that contain reasoning traces. If no reasoning data is found, a message to that effect is displayed instead of pattern counts.
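The pattern search can be sketched as a case-insensitive substring count over each trace. This is an illustrative reimplementation, not the actual body of count_reasoning_patterns():

```python
# Patterns listed in the documentation above.
PATTERNS = ["wait", "pause", "hold on", "actually", "no,"]


def count_reasoning_patterns(
    traces: list[str], patterns: list[str] = PATTERNS
) -> dict[str, int]:
    """Count case-insensitive occurrences of each pattern across all traces."""
    counts = {p: 0 for p in patterns}
    for trace in traces:
        lowered = trace.lower()
        for p in patterns:
            counts[p] += lowered.count(p)
    return counts
```

Note that substring counting also matches patterns embedded in longer words (e.g. "no," inside "casino,"), which is a common trade-off of this simple approach.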
3. Reasoning Length Statistics
The reasoning_lenth_stats() function provides detailed statistics about reasoning traces:
- Mean character count across all reasoning traces
- Mean word count across all reasoning traces
- Median character count (middle value when sorted)
- Median word count (middle value when sorted)
- Shortest reasoning trace in characters
- Shortest reasoning trace in words
- Longest reasoning trace in characters
- Longest reasoning trace in words
- Mean length of individual words across all traces
- Median length of individual words
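All ten statistics above can be computed with the standard library. A sketch, assuming traces are plain strings and words are whitespace-separated; the function name follows the spelling used in the document, and the output key names are illustrative:

```python
import statistics


def reasoning_lenth_stats(traces: list[str]) -> dict[str, float]:
    """Compute the character-, word-, and word-length statistics listed above."""
    char_counts = [len(t) for t in traces]
    word_lists = [t.split() for t in traces]
    word_counts = [len(ws) for ws in word_lists]
    word_lengths = [len(w) for ws in word_lists for w in ws]
    return {
        "mean_chars": statistics.mean(char_counts),
        "mean_words": statistics.mean(word_counts),
        "median_chars": statistics.median(char_counts),
        "median_words": statistics.median(word_counts),
        "min_chars": min(char_counts),
        "min_words": min(word_counts),
        "max_chars": max(char_counts),
        "max_words": max(word_counts),
        "mean_word_len": statistics.mean(word_lengths),
        "median_word_len": statistics.median(word_lengths),
    }
```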
Customizing Pattern Analysis
You can modify the patterns being searched by editing analyze_results.py.
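For example, assuming the patterns live in a module-level list (the name PATTERNS and the extra entries below are hypothetical), the edit might look like:

```python
# Hypothetical: assumes the patterns are a module-level list in analyze_results.py.
PATTERNS = [
    "wait",
    "pause",
    "hold on",
    "actually",
    "no,",
    # Add your own markers of self-correction, for example:
    "hmm",
    "let me reconsider",
]
```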
Understanding the Results
These analyses help you understand:

- Model reliability: Higher accuracy percentages indicate better performance
- Reasoning behavior: Pattern frequency shows how the model thinks through problems
- Reasoning efficiency: Length statistics reveal whether the model is concise or verbose
- Model comparison: Run analysis on multiple result files to compare different models or configurations