
Overview

The olmOCR evaluation system (runeval.py) provides detailed quantitative analysis of OCR and document parsing quality by comparing outputs against gold standard data. It uses the DocumentEditSimilarity metric from the Dolma Refine project.

Installation

Install the required evaluation dependencies:
pip install git+https://github.com/allenai/refine.git@soldni/eval-m
This provides access to the alignment scoring and metrics used in the evaluation pipeline.

DocumentEditSimilarity Metric

The evaluation uses DocumentEditSimilarity, a segment-level metric that:
Uses SpacySegmenter to break documents into semantic units (sentences, paragraphs)
Employs HirschbergAligner to align gold and predicted text segments:
  • Match score: +1 for matching segments
  • Mismatch score: -1 for differing segments
  • Indel score: -1 for insertions/deletions
Computes an edit similarity score between 0.0 and 1.0:
  • 1.0: Perfect match
  • 0.9-0.99: Excellent quality
  • 0.7-0.89: Good quality
  • Below 0.7: Poor quality

Metric Configuration

from dolma_refine.evaluate.segmenters import SpacySegmenter
from dolma_refine.evaluate.aligners import HirschbergAligner
from dolma_refine.evaluate.metrics import DocumentEditSimilarity

segmenter = SpacySegmenter("spacy")
aligner = HirschbergAligner(
    match_score=1, 
    mismatch_score=-1, 
    indel_score=-1
)
comparer = DocumentEditSimilarity(
    segmenter=segmenter, 
    aligner=aligner
)

alignment_score = comparer.compute(gold_text, eval_text)

Data Format

The evaluation system accepts multiple input formats:

Normalized Format

{
  "s3_path": "s3://bucket/path/to/document.pdf",
  "pagenum": 1,
  "text": "Extracted text content...",
  "finish_reason": "stop",
  "error": null
}

OpenAI Batch API Format

{
  "custom_id": "s3://bucket/path/to/document.pdf-1",
  "response": {
    "body": {
      "choices": [{
        "message": {
          "content": "{\"natural_text\": \"Extracted text...\"}"
        },
        "finish_reason": "stop"
      }]
    }
  }
}
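In the OpenAI Batch API (and Birr) formats, the extracted text is itself a JSON string nested inside the response, so it must be decoded twice. A minimal sketch of unwrapping one such record (the record literal is illustrative):

```python
import json

# Illustrative record mirroring the OpenAI Batch API format above.
record = {
    "custom_id": "s3://bucket/path/to/document.pdf-1",
    "response": {"body": {"choices": [{
        "message": {"content": "{\"natural_text\": \"Extracted text...\"}"},
        "finish_reason": "stop",
    }]}},
}

choice = record["response"]["body"]["choices"][0]
# The message content is a JSON string, so decode it a second time
# to reach the natural_text payload.
text = json.loads(choice["message"]["content"])["natural_text"]
finish_reason = choice["finish_reason"]
# text == "Extracted text...", finish_reason == "stop"
```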

Birr Format

{
  "custom_id": "s3://bucket/path/to/document.pdf-1",
  "outputs": [{
    "text": "{\"natural_text\": \"Extracted text...\"}",
    "finish_reason": "stop"
  }],
  "completion_error": null
}
The goldkey format is {s3_path}-{pagenum}, e.g., s3://ai2-s2-pdfs/39ce/3db4.pdf-4 for page 4.
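The gold-key construction can be sketched as a one-line helper (the function name is illustrative, not part of runeval.py):

```python
def goldkey(s3_path: str, pagenum: int) -> str:
    # Gold keys are the source path and page number joined by a hyphen.
    return f"{s3_path}-{pagenum}"

key = goldkey("s3://ai2-s2-pdfs/39ce/3db4.pdf", 4)
# key == "s3://ai2-s2-pdfs/39ce/3db4.pdf-4"
```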

Running Evaluations

Basic Usage

python olmocr/eval/runeval.py \
  --name experiment_name \
  --review_size 20 \
  path/to/gold/data \
  path/to/eval/data

Command-Line Arguments

--name (string, default: "review_page"): Name for this evaluation (used in output files)
--review_size (integer, default: 20): Number of entries to include in generated HTML review pages
gold_data_path (string, required): Path to gold standard JSONL files (local or S3 URL)
eval_data_path (string, required): Path to evaluation JSONL files to compare (local or S3 URL)

Example: Evaluating Marker Output

python olmocr/eval/runeval.py \
  --name marker_eval \
  --review_size 50 \
  s3://my-bucket/gold-data/ \
  s3://my-bucket/marker-output/

Example: Evaluating GOT-OCR Output

python olmocr/eval/runeval.py \
  --name gotocr_eval \
  --review_size 50 \
  /path/to/gold/data \
  /path/to/gotocr/output

Example: Evaluating MinerU Output

python olmocr/eval/runeval.py \
  --name mineru_eval \
  --review_size 50 \
  /path/to/gold/data \
  /path/to/mineru/output

Evaluation Workflow

1. Load Gold Data

The system loads all JSONL files from the gold data path in parallel:
gold_data = load_gold_data(gold_data_path)
# Returns: {"s3://path/to/doc.pdf-1": "gold text..."}
Tracks:
  • Total entries loaded
  • Processing errors
  • Overrun errors (incomplete generations)
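Assuming the normalized format above, the per-file loading step can be sketched as follows (the helper name and details are illustrative, not the actual runeval.py implementation):

```python
import json

def load_gold_file(lines):
    # Illustrative: build {goldkey: text} from normalized-format JSONL lines,
    # counting records that errored or did not finish generation cleanly.
    entries, errors, overruns = {}, 0, 0
    for line in lines:
        if not line.strip():
            continue
        data = json.loads(line)
        if data.get("error") is not None:
            errors += 1
            continue
        if data.get("finish_reason") != "stop":
            overruns += 1
        entries[f"{data['s3_path']}-{data['pagenum']}"] = data["text"]
    return entries, errors, overruns
```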
2. Process Evaluation Data

For each JSONL file in the evaluation path:
process_jsonl_file(jsonl_file, gold_data, comparer)
Computes:
  • Alignment score per page
  • Aggregate statistics
  • Error tracking
3. Calculate Metrics

Two primary metrics are calculated:

Page-Weighted Score: simple average across all pages
mean_page_score = total_alignment_score / total_pages

Character-Weighted Score: weighted by document length
mean_char_score = total_char_alignment_score / total_chars
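The two aggregates differ only in their weights; a minimal sketch, assuming per-page (alignment score, character count) pairs have already been computed:

```python
# Illustrative per-page results: (alignment score, characters compared).
pages = [(0.95, 1200), (0.80, 300), (0.99, 2500)]

total_pages = len(pages)
total_chars = sum(chars for _, chars in pages)

# Page-weighted: every page contributes equally.
mean_page_score = sum(score for score, _ in pages) / total_pages
# Char-weighted: long pages dominate the average.
mean_char_score = sum(score * chars for score, chars in pages) / total_chars
# mean_page_score ≈ 0.913, mean_char_score ≈ 0.964
```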
4. Generate Review Pages

Creates HTML visualization pages:
  • {name}_worst.html: Lowest scoring pages for debugging
  • {name}_sample.html: Random sample for quality review

Understanding Output

Console Output

The evaluation prints detailed statistics:
Loaded 15,234 gold data entries for comparison
Gold processing errors: 12
Gold overrun errors: 8
-----------------------------------------------------------
Found 3 files to evaluate
Compared 15,180 pages
Found 15 errors in the eval set, and 23 cases of length overruns
Mean page-weighted alignment: 0.924
Mean char-weighted alignment: 0.931

...creating review page

Key Metrics Explained

Loaded entries (integer): Number of valid gold standard pages loaded
Processing errors (integer): Pages that failed to process in gold data
Overrun errors (integer): Pages where generation exceeded length limits
Compared pages (integer): Total pages successfully compared
Page-weighted alignment (float): Average similarity score (each page weighted equally)
Char-weighted alignment (float): Average similarity score (weighted by document length)

When to Use Each Metric

Page-Weighted

Use when all pages are equally important, regardless of length. Better for evaluating consistency across diverse documents.

Char-Weighted

Use when longer documents are more important. Better reflects overall quality of extracted content by volume.

Error Handling

The system gracefully handles various error conditions:

Processing Errors

When a page fails to process:
if data.error is not None:
    eval_text = f"[Error processing this page: {data.error}]"

Overrun Errors

When generation exceeds limits:
if data.finish_reason != "stop":
    eval_text += f"\n[Error processing this page: overrun {data.finish_reason}]"

Empty Documents

Empty documents are handled specially:
if len(gold_text.strip()) < 3 and len(eval_text.strip()) < 3:
    alignment = 1.0  # Both empty = perfect match
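Taken together, these special cases amount to a small wrapper around the comparer (the function name is illustrative):

```python
def page_alignment(gold_text: str, eval_text: str, comparer) -> float:
    # Both sides effectively empty: count as a perfect match rather than
    # asking the aligner to score two near-empty segment lists.
    if len(gold_text.strip()) < 3 and len(eval_text.strip()) < 3:
        return 1.0
    # Otherwise defer to DocumentEditSimilarity.
    return comparer.compute(gold_text, eval_text)
```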

Performance Optimization

The evaluation system uses parallel processing for efficiency:

Multithreaded Loading

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(process_file, path) for path in gold_jsonl_files]

Multiprocess Evaluation

with ProcessPoolExecutor() as executor:
    futures = [executor.submit(process_jsonl_file, file, gold_data, comparer) 
               for file in jsonl_files]
For large evaluations, the system automatically uses ProcessPoolExecutor for parallel processing. In debug mode, it switches to ThreadPoolExecutor for easier debugging.

Working with S3 Data

The evaluation system seamlessly handles S3 data:
# Both paths can be S3 URLs
python olmocr/eval/runeval.py \
  s3://bucket/gold-data/ \
  s3://bucket/eval-data/

# Mixed local and S3
python olmocr/eval/runeval.py \
  /local/path/gold-data/ \
  s3://bucket/eval-data/
Supported S3 file formats:
  • .json, .jsonl
  • .json.zstd, .jsonl.zstd (compressed)
  • .json.zst, .jsonl.zst (compressed)
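Dispatching on the file extension is enough to handle the compressed variants. A sketch, assuming the third-party zstandard package for .zst/.zstd files (imported lazily so plain JSONL needs no extra dependency; the helper name is hypothetical):

```python
def decode_jsonl_bytes(raw: bytes, filename: str) -> str:
    # Hypothetical helper: return the UTF-8 text of a possibly
    # zstd-compressed JSONL payload, based on the file extension.
    if filename.endswith((".zst", ".zstd")):
        import zstandard  # third-party; only needed for compressed files
        return zstandard.ZstdDecompressor().decompress(raw).decode("utf-8")
    return raw.decode("utf-8")
```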

Next Steps

View Results

Learn how to analyze HTML review pages and compare results

ELO Scoring

Generate ELO rankings for multiple tools
