Overview
olmOCR provides powerful tools for comparing and analyzing evaluation results. The system generates interactive HTML review pages for side-by-side comparisons and supports ELO scoring for ranking multiple OCR tools.
HTML Evaluation Viewer
The HTML evaluation viewer provides an interactive interface for reviewing and comparing OCR results against gold standard data.
Review Page Types
Evaluations generate two types of review pages:
Worst Results
{name}_worst.html - Shows the lowest scoring pages for debugging and identifying systematic issues
Random Sample
{name}_sample.html - Random selection for overall quality assessment
Features
The HTML viewer includes:
- PDF Preview: Rendered page image for visual reference
- Side-by-Side Comparison: Gold standard vs. evaluation output
- Diff Highlighting: Visual differences with color coding
- Alignment Scores: Quantitative similarity metrics
- Randomized Display: Prevents bias by randomly positioning gold/eval text
- Metadata Labels: Shows which method produced each output
HTML Structure
Each entry in the review page displays:
Diff Highlighting
The diff view uses color coding to show differences:
Added Content
Green highlighting indicates text present in the evaluation output but not in the reference
Removed Content
Red highlighting indicates text missing from the evaluation output
Replaced Content
Red followed by green indicates text that was changed between versions
Diff Generation
The diff is generated using Python's SequenceMatcher:
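A minimal sketch of this approach (not the exact buildelo.py implementation, and the CSS class names are illustrative): SequenceMatcher.get_opcodes() yields tagged spans that map directly onto the color coding described above.

```python
# Sketch of diff-to-HTML generation with difflib.SequenceMatcher.
# The class names (diff-add, diff-remove) are illustrative, not olmOCR's actual CSS.
from difflib import SequenceMatcher
import html

def render_diff(gold: str, eval_text: str) -> str:
    """Return HTML where inserted text gets green spans and removed text red spans."""
    out = []
    matcher = SequenceMatcher(None, gold, eval_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(html.escape(gold[i1:i2]))
        elif tag == "insert":
            out.append(f'<span class="diff-add">{html.escape(eval_text[j1:j2])}</span>')
        elif tag == "delete":
            out.append(f'<span class="diff-remove">{html.escape(gold[i1:i2])}</span>')
        elif tag == "replace":
            # Removed text first, then its replacement (red followed by green)
            out.append(f'<span class="diff-remove">{html.escape(gold[i1:i2])}</span>')
            out.append(f'<span class="diff-add">{html.escape(eval_text[j1:j2])}</span>')
    return "".join(out)
```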
ELO Scoring System
The ELO scoring system (buildelo.py) generates pairwise comparisons between different OCR methods to create relative quality rankings.
How It Works
Generate Comparisons
For each PDF, create all possible pairwise comparisons between different parsing methods:
Filter Similar Results
Remove comparisons that are too similar (alignment > 0.96), as they provide little information.
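These two steps can be sketched as follows, assuming each method's output for a page is a plain string and that alignment is computed as a SequenceMatcher ratio (the actual buildelo.py scoring may differ):

```python
# Sketch: generate all pairwise comparisons for one PDF page, then drop
# near-identical pairs (alignment > 0.96) that carry little ranking signal.
from difflib import SequenceMatcher
from itertools import combinations

def alignment(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def build_comparisons(outputs_by_method, threshold=0.96):
    """outputs_by_method maps method name -> extracted text for one PDF page."""
    comparisons = []
    for method_a, method_b in combinations(sorted(outputs_by_method), 2):
        score = alignment(outputs_by_method[method_a], outputs_by_method[method_b])
        if score <= threshold:  # keep only pairs that actually disagree
            comparisons.append((method_a, method_b, score))
    return comparisons
```

With four methods this yields up to C(4,2) = 6 candidate comparisons per page before filtering.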
Running ELO Evaluation
Command-Line Arguments
- Base name for generated HTML files
- --review_size: Number of comparisons per review page
- --comparisons: List of method names to compare (must match filename suffixes)
- --num_copies: Number of review pages to generate. If > 1, files are named {name}_0.html, {name}_1.html, etc.
- --max_workers: Number of parallel worker processes (defaults to CPU count)
- Path to folder containing comparison outputs (expects *.md, *.pdf, *.png files)
File Naming Convention
The ELO system expects files to follow this pattern: the parsing method is identified by the filename suffix (e.g., a file ending in _marker.md maps to the method marker).
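The suffix-to-method mapping can be sketched as below, assuming the method name is everything after the final underscore in the file stem (as in _marker.md → marker):

```python
# Sketch: derive the method name from a comparison file's suffix,
# e.g. report_marker.md -> "marker".
from pathlib import Path

def method_from_filename(path: str) -> str:
    stem = Path(path).stem          # "report_marker"
    return stem.rsplit("_", 1)[-1]  # text after the last underscore
```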
Example: Comparing Four Tools
- ocr_comparison_0.html - 100 pairwise comparisons
- ocr_comparison_1.html - Next 100 comparisons
- ocr_comparison_2.html - Next 100 comparisons
- ocr_comparison_3.html - Next 100 comparisons
- ocr_comparison_4.html - Next 100 comparisons
Generating multiple copies allows for distributed human evaluation across team members.
Comparison Data Structure
Each comparison contains:
Analyzing Results
Identifying Systematic Issues
Use the worst results page to identify patterns:
Categorize Errors
Group errors by type:
- Missing content (red highlighting)
- Hallucinated content (green highlighting)
- Structural issues (incorrect ordering)
- Formatting problems (table/formula extraction)
Quality Assessment Guidelines
Score interpretation:
- Excellent (0.95 - 1.00): Near-perfect extraction with minor differences in whitespace or formatting
- Good (0.85 - 0.94): High-quality extraction with some missing or extra content
- Fair (0.70 - 0.84): Acceptable quality but notable content issues
- Poor (0.50 - 0.69): Significant content missing or incorrect
- Failed (0.00 - 0.49): Severe extraction failures or wrong content
- Error (N/A): Processing error or timeout
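The bands above can be expressed as a simple lookup. This is a convenience sketch, not part of olmOCR itself; None stands in for a processing error or timeout:

```python
# Map an alignment score to the quality band described above.
from typing import Optional

def quality_band(score: Optional[float]) -> str:
    if score is None:
        return "Error"      # processing error or timeout
    if score >= 0.95:
        return "Excellent"
    if score >= 0.85:
        return "Good"
    if score >= 0.70:
        return "Fair"
    if score >= 0.50:
        return "Poor"
    return "Failed"
```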
Performance Optimization
The comparison system uses parallel processing for efficiency:
Parallel PDF Processing
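A sketch of the parallel pattern, where score_pdf is a hypothetical stand-in for the real per-page comparison work:

```python
# Sketch: fan per-PDF scoring out across worker processes.
import os
from concurrent.futures import ProcessPoolExecutor

def score_pdf(pdf_path: str) -> tuple:
    # Placeholder: a real implementation would render the page and diff outputs.
    return (pdf_path, 1.0)

def score_all(pdf_paths, max_workers=None):
    max_workers = max_workers or os.cpu_count()  # mirrors the --max_workers default
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(score_pdf, pdf_paths))

if __name__ == "__main__":
    print(score_all(["doc1.pdf", "doc2.pdf"], max_workers=2))
```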
Resource Requirements
For large evaluations:
- Memory: ~2GB per worker process
- CPU: Scales linearly with --max_workers
- Disk: Minimal (HTML outputs only)
- Network: Required for S3 access
Viewing HTML Pages
Local Viewing
Simply open the HTML files in a web browser:
Serving Over HTTP
For remote access or team collaboration:
Sharing Results
The HTML files are self-contained and can be:
- Uploaded to internal web servers
- Shared via cloud storage (Dropbox, Google Drive)
- Emailed directly (if file size permits)
- Hosted on GitHub Pages for public benchmarks
Best Practices
Sample Size
- Use --review_size 50-100 for thorough manual review
- Generate multiple copies (--num_copies 3-5) for inter-rater reliability
- Always review both worst and random samples
Comparison Selection
- Include at least 3-4 methods for meaningful ELO rankings
- Filter out very similar results (alignment > 0.96) to focus on differences
- Randomize comparison order to prevent bias
Error Analysis
- Start with worst results to identify failure modes
- Cross-reference with benchmark property tests
- Document systematic issues for targeted improvements
Result Validation
- Compare page-weighted vs char-weighted metrics
- Verify evaluation data matches expected format
- Check for processing errors and overruns
Troubleshooting
Low Alignment Scores
If seeing unexpectedly low scores:
- Check text format: Ensure evaluation output is plain text, not JSON/XML wrapped
- Verify goldkey matching: Confirm the {s3_path}-{pagenum} format is consistent
- Inspect HTML diffs: Look for systematic formatting differences
- Review empty documents: Empty gold + empty eval = 1.0, but empty gold + content eval = 0.0
Missing Comparisons
If expected comparisons don't appear:
- Verify file naming: Check that suffixes match the --comparisons argument
- Check S3 paths: Ensure glob patterns match file locations
- Review filter threshold: Lower from 0.96 if too many comparisons are filtered out
Performance Issues
For slow evaluation:
- Reduce review size: Use a smaller --review_size for testing
- Adjust workers: Try different --max_workers values
- Use local data: Download S3 data locally for faster access
- Check S3 bandwidth: Network may be the bottleneck for S3 operations
Related
olmOCR-Bench
Return to benchmark suite documentation
Running Evals
Learn about evaluation execution