Overview
The Evaluation API provides comprehensive tools for assessing OCR model quality through alignment metrics, human review interfaces, and pairwise comparisons. It supports evaluation against gold standard datasets and side-by-side method comparisons.

Core Modules
runeval - Gold Standard Evaluation
Evaluate OCR outputs against gold standard annotations using document edit similarity metrics. Source: olmocr/eval/runeval.py
buildelo - Pairwise Method Comparison
Generate comparison pages for ranking different OCR extraction methods. Source: olmocr/eval/buildelo.py
Main Functions
do_eval
Compare OCR outputs against gold standard data and generate review HTML pages.

Parameters:
- Path to gold standard JSONL files (local directory or S3 path)
- Path to evaluation JSONL files to compare
- Base name for generated HTML review pages
- Number of samples to include in review pages

Returns:
- Character-weighted mean alignment score (0-1 range)
- List of page comparison data with alignment scores

Source: olmocr/eval/runeval.py:278
Generates two HTML files:
- {review_page_name}_worst.html - Lowest alignment samples
- {review_page_name}_sample.html - Random sample of results
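A minimal sketch of how the two review pages could select their contents from scored page data. The function name and tuple layout here are illustrative assumptions, not the actual implementation in runeval.py:

```python
import random

def select_review_samples(page_data, num_samples, seed=0):
    """Pick the lowest-alignment pages plus a random sample for review.

    page_data: list of (page_key, alignment_score) tuples.
    Returns (worst, sample), mirroring the *_worst.html and
    *_sample.html pages that do_eval generates.
    """
    # Sort ascending by score so the worst-aligned pages come first
    ranked = sorted(page_data, key=lambda item: item[1])
    worst = ranked[:num_samples]
    # Seeded RNG keeps the random sample reproducible between runs
    sample = random.Random(seed).sample(page_data, min(num_samples, len(page_data)))
    return worst, sample
```

Reviewing the worst cases surfaces systematic failures, while the random sample gives an unbiased picture of typical quality.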
load_gold_data
Load gold standard annotations from JSONL files with multithreaded processing.

Parameters:
- Path to directory containing gold JSONL files
- Maximum number of threads for parallel loading

Returns:
- Dictionary mapping gold keys (e.g., "s3://path/file.pdf-4") to text content

Source: olmocr/eval/runeval.py:140
normalize_json_entry
Normalize different JSONL formats (OpenAI, Birr, SGLang) into a common structure.

Parameters:
- Raw JSONL entry from any supported format

Returns:
- Normalized entry with consistent fields

Source: olmocr/eval/runeval.py:84
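To illustrate the normalization idea, here is a sketch that coerces two differently-shaped records into one structure. The input shapes below are illustrative stand-ins (an OpenAI-batch-like nested shape and a flat shape); the real normalize_json_entry handles the exact OpenAI, Birr, and SGLang layouts:

```python
def normalize_entry(raw):
    """Coerce differently-shaped JSONL records into a common dict
    with text, finish_reason, and error fields."""
    if "response" in raw:
        # Nested, OpenAI-batch-like shape: the text lives in choices[0]
        choice = raw["response"]["body"]["choices"][0]
        return {
            "text": choice["message"]["content"],
            "finish_reason": choice["finish_reason"],
            "error": raw.get("error"),
        }
    # Flat shape: fields already at the top level
    return {
        "text": raw.get("text"),
        "finish_reason": raw.get("finish_reason"),
        "error": raw.get("error"),
    }
```

Downstream code can then score every entry the same way regardless of which backend produced it.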
process_jsonl_file
Process a single JSONL file and compute alignment scores against gold data.

Parameters:
- Path to JSONL file to process
- Gold standard data dictionary
- Metric calculator instance

Returns:
- Tuple of (total_score, char_weighted_score, total_chars, total_pages, errors, overruns, page_data)

Source: olmocr/eval/runeval.py:231
Data Classes
NormalizedEntry
Standardized representation of OCR output for any input format. Source: olmocr/eval/runeval.py:66

Fields:
- S3 path to source PDF document
- Page number (1-indexed)
- Extracted OCR text content
- Completion status ("stop", "length", etc.)
- Error message if processing failed
Properties
- Combined identifier: "{s3_path}-{pagenum}"

Static Methods
from_goldkey
Construct a NormalizedEntry from a gold key string.

Parameters:
- Gold key in format "s3://path/file.pdf-5"
- Additional fields (text, finish_reason, error)
Comparison (buildelo)
Pairwise comparison between two OCR extraction methods. Source: olmocr/eval/buildelo.py:19

Fields:
- Path to source PDF
- Path to first method's output markdown
- Path to second method's output markdown
- First method's extracted text
- Second method's extracted text
- Similarity score between the two methods (0-1)
Properties
- First method's name, extracted from its output path (e.g., "pdelf", "marker")
- Second method's name, extracted from its output path
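An illustrative stand-in for the Comparison dataclass. The field names are assumptions, and the sketch assumes each method's markdown lives under a directory named after the method (e.g., "pdelf/doc.md"); the real path convention in buildelo.py may differ:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ComparisonSketch:
    """Pairwise comparison between two OCR extraction methods."""
    pdf_path: str
    method_a_path: str
    method_b_path: str
    method_a_text: str
    method_b_text: str
    alignment: float  # similarity between the two methods, 0-1

    @property
    def method_a(self):
        # Assumed convention: the parent directory names the method
        return Path(self.method_a_path).parent.name

    @property
    def method_b(self):
        return Path(self.method_b_path).parent.name
```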
Evaluation Metrics
Document Edit Similarity
The evaluation uses the Hirschberg alignment algorithm with document edit similarity:
- Score for matching segments
- Penalty for mismatched segments
- Penalty for insertions/deletions
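To make the metric concrete, here is a plain quadratic-space edit-similarity sketch: 1 minus the edit distance divided by the longer document's length. Hirschberg's algorithm computes the same alignment in linear space, which is what matters for long documents; this stand-in is not the project's implementation:

```python
def edit_similarity(a, b):
    """Edit similarity in [0, 1]: 1 - edit_distance / max_length."""
    if not a and not b:
        return 1.0
    # Classic Levenshtein DP, keeping only the previous row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match (0) or mismatch (1)
            ))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

Identical documents score 1.0; completely disjoint documents approach 0.0.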
Scoring Interpretation
- Exact alignment, no differences
- Minor differences, high quality extraction
- Some differences, acceptable quality
- Noticeable differences, needs improvement
- Significant differences, major issues
Evaluation Workflow
Pairwise Comparison
process_single_pdf
Process a single PDF and generate all pairwise comparisons between methods.

Parameters:
- Path to PDF file
- Set of all available markdown output files
- List of method names to compare (e.g., ["pdelf", "marker", "gotocr_format"])
- Name of text segmenter to use

Returns:
- List of Comparison objects for all method pairs

Source: olmocr/eval/buildelo.py:43
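The pairwise enumeration can be sketched as follows. This is a simplified stand-in: texts are passed in directly rather than read from markdown files, and difflib's ratio substitutes for the real alignment metric:

```python
from difflib import SequenceMatcher
from itertools import combinations

def compare_all_pairs(texts_by_method):
    """Generate every unordered pairwise comparison for one PDF.

    texts_by_method: dict mapping method name -> extracted text.
    Returns (method_a, method_b, similarity) tuples.
    """
    results = []
    # combinations() yields each unordered pair exactly once
    for method_a, method_b in combinations(sorted(texts_by_method), 2):
        score = SequenceMatcher(
            None, texts_by_method[method_a], texts_by_method[method_b]
        ).ratio()
        results.append((method_a, method_b, score))
    return results
```

With n methods this produces n*(n-1)/2 comparisons per PDF, which is why filtering near-identical pairs before human review keeps the workload manageable.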
build_review_page
Generate an HTML review page from comparison results.

Parameters:
- Command-line arguments including name and configuration
- List of comparison objects to include
- Report index for multi-page generation

Source: olmocr/eval/buildelo.py:83
Supported Input Formats
OpenAI Batch API Format
Birr Format
SGLang Format
Command-Line Usage
Running Evaluation
Building Comparison Pages
Best Practices
Data Preparation
- Ensure gold data has finish_reason == "stop" for valid entries
- Filter out length overruns and errors before evaluation
- Use consistent PDF paths and page numbering (1-indexed)
- Compress large datasets with zstandard (.zst extension)
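The first two points can be sketched as a filter over normalized entries. The dict field names follow the NormalizedEntry fields described above; the function name is illustrative:

```python
def keep_valid_entries(entries):
    """Keep only usable gold entries: completed generations
    ("stop") with no recorded error."""
    return [
        e for e in entries
        if e.get("finish_reason") == "stop" and not e.get("error")
    ]
```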
Performance Optimization
- Increase max_workers for faster parallel processing
- Cache gold data locally to avoid repeated S3 downloads
- Use ProcessPoolExecutor for CPU-bound comparison tasks
- Filter high-similarity pairs (>0.96) from comparison reviews
Result Analysis
- Review both worst cases and random samples
- Look for systematic errors in low-scoring pages
- Compare character-weighted vs page-weighted metrics
- Track error rates and overrun frequencies
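The character-weighted vs page-weighted distinction can be made concrete with a small sketch (function name illustrative): page-weighted treats every page equally, while character-weighted lets long pages dominate, matching the character-weighted mean that do_eval reports:

```python
def alignment_summaries(page_scores):
    """Compute both aggregate metrics from (score, num_chars) pairs.

    Returns (page_weighted, char_weighted) means.
    """
    total_chars = sum(chars for _, chars in page_scores)
    # Page-weighted: simple mean over pages
    page_weighted = sum(score for score, _ in page_scores) / len(page_scores)
    # Character-weighted: each page contributes in proportion to its length
    char_weighted = sum(score * chars for score, chars in page_scores) / total_chars
    return page_weighted, char_weighted
```

A large gap between the two usually means quality differs systematically between short and long pages.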
Output Reports
Generated HTML review pages include:
- Side-by-side comparison of gold and predicted text
- Alignment score for each page
- PDF rendering for visual reference
- Method names and metadata
- Filterable and sortable interface
See Also
- Training API - Model training functions
- Data Loading API - Dataset preparation
- Evaluation Guide - Detailed evaluation workflows