Overview
The olmOCR evaluation system (runeval.py) provides detailed quantitative analysis of OCR and document parsing quality by comparing outputs against gold standard data. It uses the DocumentEditSimilarity metric from the Dolma Refine project.
Installation
Install the required evaluation dependencies.
DocumentEditSimilarity Metric
The evaluation uses DocumentEditSimilarity, a sophisticated metric that works in three stages:
Segmentation
Uses SpacySegmenter to break documents into semantic units (sentences, paragraphs)
Alignment
Employs HirschbergAligner to align gold and predicted text segments:
- Match score: +1 for matching segments
- Mismatch score: -1 for differing segments
- Indel score: -1 for insertions/deletions
Similarity Calculation
Computes edit similarity score between 0.0 and 1.0:
- 1.0: Perfect match
- 0.9-0.99: Excellent quality
- 0.7-0.89: Good quality
- Below 0.7: Poor quality
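The scoring scheme above can be sketched with a classic dynamic-programming alignment. This is an illustrative reimplementation, not the actual Dolma Refine code: the regex segmenter stands in for SpacySegmenter, full-table Needleman-Wunsch stands in for the linear-space HirschbergAligner (both compute the same score), and the rescaling of the raw alignment score into [0, 1] is an assumption.

```python
import re

def segment(text: str) -> list[str]:
    # Stand-in for SpacySegmenter: naive sentence split on end punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def edit_similarity(gold: str, pred: str) -> float:
    """Segment-level edit similarity in [0, 1] using the documented
    scoring: match +1, mismatch -1, indel -1."""
    a, b = segment(gold), segment(pred)
    n, m = len(a), len(b)
    # dp[i][j] = best alignment score of a[:i] against b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -i  # i deletions
    for j in range(1, m + 1):
        dp[0][j] = -j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1 if a[i - 1] == b[j - 1] else -1
            dp[i][j] = max(dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] - 1,
                           dp[i][j - 1] - 1)
    best, worst, perfect = dp[n][m], -(n + m), max(n, m)
    if perfect == worst:  # both documents empty
        return 1.0
    # Assumed normalization: rescale raw score from [worst, perfect] to [0, 1].
    return (best - worst) / (perfect - worst)
```

A perfect match rescales to 1.0, and every insertion, deletion, or substitution pulls the score down proportionally.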
Metric Configuration
Data Format
The evaluation system accepts multiple input formats:
Normalized Format
OpenAI Batch API Format
Birr Format
The goldkey format is {s3_path}-{pagenum}, e.g., s3://ai2-s2-pdfs/39ce/3db4.pdf-4 for page 4.
Running Evaluations
Basic Usage
Command-Line Arguments
- Name for this evaluation (used in output files)
- Number of entries to include in generated HTML review pages
- Path to gold standard JSONL files (local or S3 URL)
- Path to evaluation JSONL files to compare (local or S3 URL)
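An invocation wiring those four arguments together might look like the following. The flag names here are illustrative assumptions, not taken from the source; check `python runeval.py --help` for the actual argument names.

```shell
# Flag names are assumptions for illustration only.
python runeval.py \
  --name marker-eval \
  --review-size 20 \
  --gold-data-path s3://my-bucket/gold/ \
  --eval-data-path s3://my-bucket/marker-output/
```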
Example: Evaluating Marker Output
Example: Evaluating GOT-OCR Output
Example: Evaluating MinerU Output
Evaluation Workflow
Load Gold Data
The system loads all JSONL files from the gold data path in parallel, tracking:
- Total entries loaded
- Processing errors
- Overrun errors (incomplete generations)
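The parallel loading step can be sketched as below. This is a hypothetical helper, not the runeval.py implementation; in particular, the "goldkey" and "text" field names are assumptions about the JSONL schema.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def load_jsonl(path: Path) -> tuple[dict, int]:
    """Load one JSONL file, returning (goldkey -> text, error_count)."""
    entries, errors = {}, 0
    for line in path.read_text().splitlines():
        try:
            record = json.loads(line)
            entries[record["goldkey"]] = record["text"]  # field names assumed
        except (json.JSONDecodeError, KeyError):
            errors += 1  # malformed line counts as a processing error
    return entries, errors

def load_gold(gold_dir: str) -> tuple[dict, int]:
    """Load every .jsonl file under gold_dir with a thread pool,
    aggregating entries and processing errors across files."""
    gold, total_errors = {}, 0
    files = sorted(Path(gold_dir).glob("*.jsonl"))
    with ThreadPoolExecutor() as pool:
        for entries, errors in pool.map(load_jsonl, files):
            gold.update(entries)
            total_errors += errors
    return gold, total_errors
```

Threads are a good fit here because JSONL loading is I/O-bound (local disk or S3 reads).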
Process Evaluation Data
For each JSONL file in the evaluation path, the system computes:
- Alignment score per page
- Aggregate statistics
- Error tracking
Calculate Metrics
Two primary metrics are calculated:
- Page-Weighted Score: a simple average across all pages
- Character-Weighted Score: an average weighted by document length
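The two aggregates can be computed from per-page (score, character count) pairs; this is a sketch of the arithmetic, not the actual runeval.py code:

```python
def weighted_scores(pages: list[tuple[float, int]]) -> tuple[float, float]:
    """Compute both aggregate metrics from (score, char_count) pairs:
    page-weighted treats every page equally; char-weighted lets longer
    pages count proportionally more."""
    page_weighted = sum(score for score, _ in pages) / len(pages)
    total_chars = sum(chars for _, chars in pages)
    char_weighted = sum(score * chars for score, chars in pages) / total_chars
    return page_weighted, char_weighted
```

For example, a perfect 100-character page plus a 0.5-scoring 300-character page averages to 0.75 page-weighted but only 0.625 char-weighted, since the weaker page dominates by volume.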
Understanding Output
Console Output
The evaluation prints detailed statistics.
Key Metrics Explained
- Valid gold standard pages loaded
- Pages that failed to process in the gold data
- Pages where generation exceeded length limits (overruns)
- Total pages successfully compared
- Page-weighted score: average similarity, with each page weighted equally
- Char-weighted score: average similarity, weighted by document length
When to Use Each Metric
Page-Weighted
Use when all pages are equally important, regardless of length. Better for evaluating consistency across diverse documents.
Char-Weighted
Use when longer documents are more important. Better reflects overall quality of extracted content by volume.
Error Handling
The system gracefully handles various error conditions.
Processing Errors
A processing error is recorded when a page fails to parse.
Overrun Errors
An overrun error is recorded when generation exceeds the length limit.
Empty Documents
Empty documents are handled as a special case.
Performance Optimization
The evaluation system uses parallel processing for efficiency:
Multithreaded Loading
Multiprocess Evaluation
Working with S3 Data
The evaluation system seamlessly handles S3 data. Supported file extensions:
- .json, .jsonl
- .json.zstd, .jsonl.zstd (compressed)
- .json.zst, .jsonl.zst (compressed)
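Since S3 URLs are plain strings, filtering inputs by extension works the same for local and remote paths. A hypothetical helper (not from runeval.py) illustrating the supported suffix set:

```python
# Extensions the evaluation system accepts, per the docs above.
SUPPORTED_SUFFIXES = (
    ".json", ".jsonl",
    ".json.zstd", ".jsonl.zstd",
    ".json.zst", ".jsonl.zst",
)

def is_supported(path: str) -> bool:
    # Works for both local paths and s3:// URLs, which are plain strings.
    return path.endswith(SUPPORTED_SUFFIXES)

def is_compressed(path: str) -> bool:
    # .zstd / .zst files need zstandard decompression before JSON parsing.
    return path.endswith((".zstd", ".zst"))
```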
Next Steps
View Results
Learn how to analyze HTML review pages and compare results
ELO Scoring
Generate ELO rankings for multiple tools