Overview

olmOCR-Bench is an automated benchmark suite for evaluating document-level parsing and OCR across tools. It tests specific “facts” or “properties” about document pages at the PDF level.
We operate on PDFs directly because they preserve digital metadata that is both useful and commonly available. Almost any other format can be converted to PDF, but not the reverse.

Property Classes

olmOCR-Bench evaluates four main categories of document properties:

Text Presence/Absence

Ensures that specific pieces of text (at the 1-3 sentence level) are present or absent in the parsed output.
  • Targets pages with ambiguous content, such as headers and footers
  • Uses fuzzy matching to allow for minor variations
  • Validates extraction of critical document content
JSONL Format:
{
  "pdf": "document.pdf",
  "id": "rule_001",
  "type": "present",
  "text": "Expected text to find",
  "threshold": 0.9
}
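A presence rule like the one above can be evaluated with a simple sliding-window fuzzy match. The sketch below uses Python's stdlib difflib; the benchmark's actual matcher and windowing strategy may differ, and `check_present` is a hypothetical helper name:

```python
from difflib import SequenceMatcher

def check_present(rule: dict, parsed_text: str) -> bool:
    """Check a 'present' rule: the target text must appear somewhere in
    the parsed output with similarity >= the rule's threshold."""
    target = rule["text"]
    n = len(target)
    best = 0.0
    # Slide a window of the target's length over the parsed text,
    # keeping the best fuzzy-match ratio seen so far.
    for i in range(0, max(1, len(parsed_text) - n + 1), max(1, n // 4)):
        window = parsed_text[i:i + n]
        best = max(best, SequenceMatcher(None, target, window).ratio())
        if best >= rule["threshold"]:
            return True
    return best >= rule["threshold"]
```

An "absent" rule is the negation: the same check must fail for the rule to pass.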

Natural Reading Order

Verifies that blocks of text appear in the correct relative order within the document.
  • Ensures proper sequencing of content (e.g., article headings before article text)
  • Critical for multi-column layouts and complex page structures
  • Allows flexibility in ordering independent sections
JSONL Format:
{
  "pdf": "document.pdf",
  "id": "rule_002",
  "type": "order",
  "before": "Text that should appear first",
  "after": "Text that should appear second",
  "threshold": 0.95
}
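Conceptually, an order rule passes when the "before" text is located earlier in the output than the "after" text. A minimal sketch, simplified to exact substring search (the real benchmark also fuzzy-matches against its threshold; `check_order` is a hypothetical helper):

```python
def check_order(rule: dict, parsed_text: str) -> bool:
    """Check an 'order' rule: the 'before' text must occur earlier
    in the parsed output than the 'after' text."""
    i = parsed_text.find(rule["before"])
    j = parsed_text.find(rule["after"])
    # Both snippets must be found, and 'before' must come first.
    return i != -1 and j != -1 and i < j
```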

Table Accuracy

Validates proper extraction and structuring of tabular data.
  • Checks accuracy on row/column/title basis
  • Ensures table structure is preserved
  • Verifies cell content extraction
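To illustrate the row/column-based idea, here is a toy check that a value appears under a given column header in a pipe-delimited markdown table. This is a sketch only: the benchmark's actual table rules and matching logic are not shown here, and both helper names are hypothetical:

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Parse a simple pipe-delimited markdown table into row dicts keyed
    by column header. Minimal: no escaped pipes, no alignment syntax."""
    lines = [l.strip() for l in md.strip().splitlines() if l.strip().startswith("|")]
    rows = [[c.strip() for c in l.strip("|").split("|")] for l in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the --- separator row
    return [dict(zip(header, r)) for r in body]

def cell_under_header(md: str, header: str, value: str) -> bool:
    """Check that `value` appears in the column titled `header`."""
    return any(row.get(header) == value for row in parse_markdown_table(md))
```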

Formula Accuracy

Evaluates mathematical formula extraction and rendering.
  • Extracts formulas from documents
  • Renders extracted formulas
  • Compares rendering using foundation models
Formula accuracy testing is currently in development.

Benchmark Creation Process

The olmOCR-Bench dataset is created through a systematic process:
1. Document Sampling: Sample documents from the same source as olmOCR-mix, focusing on pages with varied complexity.
2. Differential Analysis: Run documents through two models and identify pages with:
  • Significant plain textual differences
  • Good text content (not just tables/formulas)
  • Interesting structural challenges
3. Property Extraction: Extract text presence/absence markers and verify them with a manual review UI.
4. Rule Generation: Write validated rules to JSONL format with embedding-based grouping for variation.

Running Benchmarks

olmOCR-Bench is designed to be tool-agnostic and doesn’t depend on any specific output format.

Standard Workflow

1. Download Dataset: Download the benchmark dataset with all PDFs (single-page) to a /pdfs folder.
2. Run Extraction: Run your OCR tool on the PDFs and save the output to a folder (e.g., olmocr-v2_1/). Expected output: pdf_page1.md for /pdfs/pdf_page1.pdf.
3. Run Evaluation: Execute the benchmark evaluation script.
4. View Results: Review the results and examine failing examples.

Running Against Marker

pip install marker-pdf==1.5.4

Running Against GOT-OCR

pip install verovio torchvision

Running Against MinerU

conda create -n MinerU python=3.10
conda activate MinerU

Benchmark Script Usage

Run the benchmark evaluation script:
python olmocr/bench/benchmark.py --input_folder path/to/benchmark/data

Input Folder Structure

Your benchmark folder should be organized as:
benchmark_data/
├── pdfs/                 # Input PDF files
│   ├── doc1.pdf
│   ├── doc2.pdf
│   └── doc3.pdf
├── rules.jsonl          # Benchmark rules
├── marker/              # Marker tool output
│   ├── doc1_1.md
│   ├── doc1_2.md        # Multiple runs for same doc
│   └── doc2_1.md
└── olmocr/              # olmOCR output
    ├── doc1_1.md
    └── doc2_1.md
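Given this layout, candidate folders can be discovered by treating every subdirectory except pdfs/ as a tool's output. A small sketch (assuming the layout above; `discover_candidates` is a hypothetical helper, not part of the benchmark's API):

```python
from pathlib import Path

def discover_candidates(bench_dir: str) -> dict[str, list[str]]:
    """Map each candidate folder name (e.g. 'marker', 'olmocr') to its
    markdown output files. rules.jsonl is a file, so is_dir() skips it."""
    root = Path(bench_dir)
    return {
        d.name: sorted(p.name for p in d.glob("*.md"))
        for d in root.iterdir()
        if d.is_dir() and d.name != "pdfs"
    }
```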

Output Format

The benchmark script outputs:
  • Per-Candidate Results: Detailed pass/fail for each rule
  • Overall Score: Average percentage across all rules
  • Rule Type Breakdown: Performance by property class
  • Failure Explanations: Specific reasons for failed rules
Example Output:
Candidate: marker
  [FAIL] Rule rule_001 on doc1 average pass ratio: 0.667 (2/3 repeats passed).
  Average Score: 87.5% over 24 rules.

Candidate: olmocr
  Average Score: 94.2% over 24 rules.

Final Summary:
marker              : Average Score: 87.5% over  24 rules
  Breakdown by rule type:
    present : 92.3% average pass rate over 12 rules
    order   : 81.7% average pass rate over 10 rules
    absent  : 95.0% average pass rate over 2 rules
    
olmocr              : Average Score: 94.2% over  24 rules
  Breakdown by rule type:
    present : 96.8% average pass rate over 12 rules
    order   : 90.5% average pass rate over 10 rules
    absent  : 97.5% average pass rate over 2 rules
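The per-type breakdown shown above is a straightforward aggregation: group each rule's pass ratio by its type, then average within each group. A minimal sketch (`breakdown_by_type` is a hypothetical helper, not the script's internal function):

```python
from collections import defaultdict

def breakdown_by_type(rule_results: list[tuple[str, float]]) -> dict[str, float]:
    """Average pass ratios per rule type from (rule_type, pass_ratio) pairs."""
    buckets = defaultdict(list)
    for rule_type, ratio in rule_results:
        buckets[rule_type].append(ratio)
    return {t: sum(v) / len(v) for t, v in buckets.items()}
```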

Multiple Generations

olmOCR-Bench supports evaluating multiple runs of the same document:
  • Name files with suffixes: doc1_1.md, doc1_2.md, doc1_3.md
  • Scores are averaged across all generations
  • Useful for evaluating consistency and stability
Running multiple generations helps identify tools that produce inconsistent results across runs.
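The averaging described above can be sketched as: strip the `_N` generation suffix from each filename, group per document, and take the mean. Assumes the `doc_N` naming convention from this section; `average_over_generations` is a hypothetical helper:

```python
from collections import defaultdict
from statistics import mean

def average_over_generations(results: dict[str, bool]) -> dict[str, float]:
    """Average pass/fail results across generations of the same document.
    Keys look like 'doc1_1', 'doc1_2'; the suffix is the generation index."""
    by_doc = defaultdict(list)
    for name, passed in results.items():
        doc, _, _gen = name.rpartition("_")  # 'doc1_2' -> ('doc1', '_', '2')
        by_doc[doc].append(1.0 if passed else 0.0)
    return {doc: mean(scores) for doc, scores in by_doc.items()}
```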

Running Evaluations

Learn how to run detailed evaluations with metrics

Comparing Results

Analyze and compare evaluation results
