Overview
olmOCR-Bench is an automated benchmark suite for evaluating document-level parsing and OCR across a range of tools. It tests specific “facts” or “properties” about document pages at the PDF level. We work with PDFs directly because they preserve digital metadata that is both useful and widely available, and because almost any other format can be converted to PDF, but not the reverse.
Property Classes
olmOCR-Bench evaluates four main categories of document properties:
Text Presence/Absence
Ensures that specific pieces of text (at the 1-3 sentence level) are present or absent in parsed documents with high probability.
- Tests documents with ambiguous content such as headers and footers
- Uses fuzzy matching to allow for minor variations
- Validates extraction of critical document content
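A text presence rule of this kind can be sketched as follows. This is an illustrative implementation, not the benchmark's actual code: the function name, the sliding-window approach, and the 0.9 threshold are all assumptions.

```python
from difflib import SequenceMatcher

def text_present(target: str, parsed: str, threshold: float = 0.9) -> bool:
    """Hypothetical fuzzy check: is `target` present in `parsed` text?"""
    target, parsed = target.lower(), parsed.lower()
    if target in parsed:
        return True
    # Slide a window the size of the target across the parsed text and
    # keep the best similarity ratio found.
    n = len(target)
    best = 0.0
    for i in range(max(1, len(parsed) - n + 1)):
        best = max(best, SequenceMatcher(None, target, parsed[i:i + n]).ratio())
    return best >= threshold
```

An absence rule would simply negate the result; the fuzzy threshold is what tolerates minor OCR variations such as transposed characters.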
Natural Reading Order
Verifies that blocks of text appear in the correct relative order within the document.
- Ensures proper sequencing of content (e.g., article headings before article text)
- Critical for multi-column layouts and complex page structures
- Allows flexibility in ordering independent sections
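The relative-order idea can be sketched as a check that one text anchor appears before another in the parsed output. The function name and signature here are assumptions for illustration; because only the *relative* order of two anchors is asserted, independent sections can still be ordered freely.

```python
def in_order(parsed: str, before: str, after: str) -> bool:
    """Check that `before` occurs somewhere ahead of `after` in `parsed`."""
    i, j = parsed.find(before), parsed.find(after)
    # Both anchors must be present, and the first must precede the second.
    return i != -1 and j != -1 and i < j
```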
Table Accuracy
Validates proper extraction and structuring of tabular data.
- Checks accuracy on a row/column/title basis
- Ensures table structure is preserved
- Verifies cell content extraction
Formula Accuracy
Evaluates mathematical formula extraction and rendering.
- Extracts formulas from documents
- Renders extracted formulas
- Compares rendering using foundation models
Formula accuracy testing is currently in development.
Benchmark Creation Process
The olmOCR-Bench dataset is created through a systematic process:
Document Sampling
Sample documents from the same sources as olmOCR-Mix, focusing on pages with varied complexity.
Differential Analysis
Run documents through two models and identify pages with:
- Significant plain textual differences
- Good text content (not just tables/formulas)
- Interesting structural challenges
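The differential-analysis step can be sketched as follows: compare the plain-text outputs of two models page by page and flag pages whose transcriptions diverge. The function name, the similarity threshold, and the dict-of-pages input shape are assumptions for illustration.

```python
from difflib import SequenceMatcher

def divergent_pages(outputs_a: dict[str, str], outputs_b: dict[str, str],
                    max_similarity: float = 0.85) -> list[str]:
    """Return page ids whose two transcriptions differ significantly."""
    flagged = []
    for page, text_a in outputs_a.items():
        text_b = outputs_b.get(page, "")
        # A low similarity ratio marks a page as interesting for the bench.
        if SequenceMatcher(None, text_a, text_b).ratio() < max_similarity:
            flagged.append(page)
    return flagged
```

Flagged pages would then be screened by hand for good text content and structural challenges before rules are written against them.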
Running Benchmarks
olmOCR-Bench is designed to be tool-agnostic and doesn’t depend on any specific output format.
Standard Workflow
Run Extraction
Run your OCR tool on the PDFs and save the output to a folder (e.g., olmocr-v2_1/). Expected output: pdf_page1.md for /pdfs/pdf_page1.pdf.
Running Against Marker
Running Against GOT-OCR
Running Against MinerU
Benchmark Script Usage
Run the benchmark evaluation script:
Input Folder Structure
Your benchmark folder should be organized as:
Output Format
The benchmark script outputs:
- Per-Candidate Results: Detailed pass/fail for each rule
- Overall Score: Average percentage across all rules
- Rule Type Breakdown: Performance by property class
- Failure Explanations: Specific reasons for failed rules
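The overall score and the per-class breakdown follow directly from the per-rule pass/fail results. The record shape below is an assumption for illustration, not the script's actual output format.

```python
from collections import defaultdict

def summarize(results: list[dict]) -> tuple[float, dict[str, float]]:
    """results: e.g. [{"type": "order", "passed": True}, ...] (assumed shape)."""
    # Overall score: average pass rate across all rules.
    overall = sum(r["passed"] for r in results) / len(results)
    # Breakdown: average pass rate within each property class.
    by_type = defaultdict(list)
    for r in results:
        by_type[r["type"]].append(r["passed"])
    breakdown = {t: sum(v) / len(v) for t, v in by_type.items()}
    return overall, breakdown
```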
Multiple Generations
olmOCR-Bench supports evaluating multiple runs of the same document:
- Name files with suffixes: doc1_1.md, doc1_2.md, doc1_3.md
- Scores are averaged across all generations
- Useful for evaluating consistency and stability
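Averaging across generations can be sketched by grouping files on the `_N.md` suffix described above. The per-file score dict and the function name are illustrative assumptions.

```python
import re
from collections import defaultdict

def average_generations(scores: dict[str, float]) -> dict[str, float]:
    """Map e.g. {"doc1_1.md": 0.8, "doc1_2.md": 1.0} -> {"doc1": 0.9}."""
    groups = defaultdict(list)
    for name, score in scores.items():
        # Strip the generation suffix (_1, _2, ...) to recover the base name.
        base = re.sub(r"_\d+\.md$", "", name)
        groups[base].append(score)
    return {base: sum(v) / len(v) for base, v in groups.items()}
```

Spread between generations of the same document is a useful signal of how stable a tool's output is.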
Related
Running Evaluations
Learn how to run detailed evaluations with metrics
Comparing Results
Analyze and compare evaluation results