Overview

The Evaluation API provides comprehensive tools for assessing OCR model quality through alignment metrics, human review interfaces, and pairwise comparisons. It supports evaluation against gold standard datasets and side-by-side method comparisons.

Core Modules

runeval - Gold Standard Evaluation

Evaluate OCR outputs against gold standard annotations using document edit similarity metrics. Source: olmocr/eval/runeval.py

buildelo - Pairwise Method Comparison

Generate comparison pages for ranking different OCR extraction methods. Source: olmocr/eval/buildelo.py

Main Functions

do_eval

Compare OCR outputs against gold standard data and generate review HTML pages.
Parameters:
  • gold_data_path (str, required): Path to gold standard JSONL files (local directory or S3 path)
  • eval_data_path (str, required): Path to evaluation JSONL files to compare
  • review_page_name (str, required): Base name for generated HTML review pages
  • review_page_size (int, required): Number of samples to include in review pages

Returns:
  • alignment_score (float): Character-weighted mean alignment score (0-1 range)
  • page_eval_data (list[dict]): List of page comparison data with alignment scores
Source: olmocr/eval/runeval.py:278
from olmocr.eval.runeval import do_eval

# Evaluate against gold standard
score, eval_data = do_eval(
    gold_data_path="s3://bucket/gold-data/",
    eval_data_path="s3://bucket/model-outputs/",
    review_page_name="model_v1_eval",
    review_page_size=50
)

print(f"Mean alignment: {score:.3f}")
Generates two HTML files:
  • {review_page_name}_worst.html - Lowest alignment samples
  • {review_page_name}_sample.html - Random sample of results

load_gold_data

Load gold standard annotations from JSONL files with multithreaded processing.
Parameters:
  • gold_data_path (str, required): Path to directory containing gold JSONL files
  • max_workers (int, default: 8): Maximum number of threads for parallel loading

Returns:
  • gold_data (dict[str, str]): Dictionary mapping gold keys (e.g., "s3://path/file.pdf-4") to text content
Source: olmocr/eval/runeval.py:140
from olmocr.eval.runeval import load_gold_data

gold_data = load_gold_data(
    gold_data_path="s3://bucket/gold/",
    max_workers=16
)

print(f"Loaded {len(gold_data)} gold entries")

normalize_json_entry

Normalize different JSONL formats (OpenAI, Birr, SGLang) into a common structure.
Parameters:
  • data (dict, required): Raw JSONL entry from any supported format

Returns:
  • entry (NormalizedEntry): Normalized entry with consistent fields
Source: olmocr/eval/runeval.py:84
import json
from olmocr.eval.runeval import normalize_json_entry

with open("openai_response.jsonl") as f:
    for line in f:
        data = json.loads(line)
        entry = normalize_json_entry(data)
        print(f"{entry.s3_path} page {entry.pagenum}: {entry.text[:100]}")

process_jsonl_file

Process a single JSONL file and compute alignment scores against gold data.
Parameters:
  • jsonl_file (str, required): Path to the JSONL file to process
  • gold_data (dict, required): Gold standard data dictionary
  • comparer (DocumentEditSimilarity, required): Metric calculator instance

Returns:
  • results (tuple): (total_score, char_weighted_score, total_chars, total_pages, errors, overruns, page_data)
Source: olmocr/eval/runeval.py:231
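The two aggregate scores in that tuple differ in how pages are weighted. A minimal sketch of the arithmetic (a hypothetical helper, not part of olmocr, assuming per-page results arrive as score/character-count pairs):

```python
# Sketch: aggregate per-page alignment scores two ways, as the return values
# of process_jsonl_file suggest: a plain page-weighted mean, and a mean
# weighted by character count so long pages count for more.

def aggregate_scores(page_data):
    """page_data: list of (alignment_score, num_chars) tuples."""
    total_pages = len(page_data)
    total_chars = sum(chars for _, chars in page_data)
    page_weighted = sum(score for score, _ in page_data) / total_pages
    char_weighted = sum(score * chars for score, chars in page_data) / total_chars
    return page_weighted, char_weighted

pages = [(0.95, 1200), (0.80, 300), (0.99, 2500)]
pw, cw = aggregate_scores(pages)
print(f"page-weighted: {pw:.3f}, char-weighted: {cw:.3f}")
```

A short, badly-aligned page drags the page-weighted mean down more than the character-weighted one, which is why the two metrics are worth comparing.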

Data Classes

NormalizedEntry

Standardized representation of OCR output for any input format. Source: olmocr/eval/runeval.py:66
Fields:
  • s3_path (str): S3 path to the source PDF document
  • pagenum (int): Page number (1-indexed)
  • text (str | None): Extracted OCR text content
  • finish_reason (str | None): Completion status ("stop", "length", etc.)
  • error (str | None): Error message if processing failed

Properties

  • goldkey (str): Combined identifier: "{s3_path}-{pagenum}"

Static Methods

from_goldkey
Construct NormalizedEntry from a gold key string.
Construct a NormalizedEntry from a gold key string.

Parameters:
  • goldkey (str, required): Gold key in the format "s3://path/file.pdf-5"
  • **kwargs (dict): Additional fields (text, finish_reason, error)
from olmocr.eval.runeval import NormalizedEntry

entry = NormalizedEntry.from_goldkey(
    goldkey="s3://bucket/doc.pdf-3",
    text="Extracted content...",
    finish_reason="stop"
)

print(entry.s3_path)  # "s3://bucket/doc.pdf"
print(entry.pagenum)  # 3
print(entry.goldkey)  # "s3://bucket/doc.pdf-3"
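The goldkey round trip can be sketched in plain Python. This assumes the key splits on the last hyphen (the actual parser in runeval.py may handle edge cases differently):

```python
# Sketch of goldkey handling: split "s3://bucket/doc.pdf-3" on the last
# hyphen into (s3_path, pagenum), and rebuild the key from those parts.
# Assumes the trailing page number never itself contains "-".

def parse_goldkey(goldkey: str) -> tuple[str, int]:
    s3_path, pagenum = goldkey.rsplit("-", 1)
    return s3_path, int(pagenum)

def make_goldkey(s3_path: str, pagenum: int) -> str:
    return f"{s3_path}-{pagenum}"

path, page = parse_goldkey("s3://bucket/doc.pdf-3")
print(path, page)  # s3://bucket/doc.pdf 3
```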

Comparison (buildelo)

Pairwise comparison between two OCR extraction methods. Source: olmocr/eval/buildelo.py:19
Fields:
  • pdf_path (str): Path to the source PDF
  • comparison_a_path (str): Path to the first method's output markdown
  • comparison_b_path (str): Path to the second method's output markdown
  • comparison_a_str (str): First method's extracted text
  • comparison_b_str (str): Second method's extracted text
  • alignment (float): Similarity score between the two methods (0-1)

Properties

  • comparison_a_method (str): Method name extracted from the path (e.g., "pdelf", "marker")
  • comparison_b_method (str): Method name extracted from the path
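One way to recover a method name from an output path is to check which known method name appears in the filename. This is an illustration only; the actual property in buildelo.py may use a different filename convention:

```python
# Sketch: recover the method name from an output path by checking which of
# the known method names appears in the filename. Longest names are tried
# first so e.g. "gotocr_format" is not shadowed by a shorter match.

KNOWN_METHODS = ["pdelf", "marker", "gotocr_format", "mineru"]

def method_from_path(path: str, methods=KNOWN_METHODS):
    name = path.rsplit("/", 1)[-1]
    for m in sorted(methods, key=len, reverse=True):
        if m in name:
            return m
    return None

print(method_from_path("outputs/report.pdf_marker.md"))  # marker
```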

Evaluation Metrics

Document Edit Similarity

The evaluation uses the Hirschberg alignment algorithm with document edit similarity:
Parameters:
  • match_score (int, default: 1): Score for matching segments
  • mismatch_score (int, default: -1): Penalty for mismatched segments
  • indel_score (int, default: -1): Penalty for insertions/deletions
from dolma_refine.evaluate.aligners import HirschbergAligner
from dolma_refine.evaluate.metrics import DocumentEditSimilarity
from dolma_refine.evaluate.segmenters import SpacySegmenter

# Create metric calculator
segmenter = SpacySegmenter("spacy")
aligner = HirschbergAligner(
    match_score=1,
    mismatch_score=-1,
    indel_score=-1
)
comparer = DocumentEditSimilarity(
    segmenter=segmenter,
    aligner=aligner
)

# Compute similarity
score = comparer.compute(gold_text, pred_text)
print(f"Alignment: {score:.3f}")

Scoring Interpretation

  • 1.0 - Perfect Match: Exact alignment, no differences
  • 0.90-0.99 - Excellent: Minor differences, high-quality extraction
  • 0.80-0.89 - Good: Some differences, acceptable quality
  • 0.70-0.79 - Fair: Noticeable differences, needs improvement
  • < 0.70 - Poor: Significant differences, major issues
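The bands above can be turned into a small triage helper. The thresholds come from this guide's interpretation table; the function itself is an illustration, not part of the olmocr API:

```python
# Sketch: map an alignment score (0-1) onto the quality bands defined above.

def quality_band(score: float) -> str:
    if score >= 1.0:
        return "Perfect Match"
    if score >= 0.90:
        return "Excellent"
    if score >= 0.80:
        return "Good"
    if score >= 0.70:
        return "Fair"
    return "Poor"

for s in (1.0, 0.93, 0.75, 0.5):
    print(s, quality_band(s))
```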

Evaluation Workflow

  1. Prepare Data: Organize gold standard and evaluation outputs in JSONL format
  2. Load Gold Standard: Load reference annotations with multithreaded processing
  3. Process Evaluation Files: Compare each evaluation file against gold data using parallel processing
  4. Compute Metrics: Calculate page-weighted and character-weighted alignment scores
  5. Generate Reports: Create HTML review pages for the worst cases and random samples

Pairwise Comparison

process_single_pdf

Process a single PDF and generate all pairwise comparisons between methods.
Parameters:
  • pdf_path (str, required): Path to the PDF file
  • all_mds (set[str], required): Set of all available markdown output files
  • comparisons (list[str], required): List of method names to compare (e.g., ["pdelf", "marker", "gotocr_format"])
  • segmenter_name (str, default: "spacy"): Name of the text segmenter to use

Returns:
  • result_comps (list[Comparison]): List of Comparison objects for all method pairs
Source: olmocr/eval/buildelo.py:43
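"All pairwise comparisons" means every unordered pair of methods. The pairing step can be illustrated with itertools (a sketch, not the actual buildelo implementation):

```python
# Sketch: enumerate the unordered method pairs that a single PDF would be
# scored on. With n methods there are n*(n-1)/2 pairs.
from itertools import combinations

methods = ["pdelf", "marker", "gotocr_format", "mineru"]
pairs = list(combinations(methods, 2))
print(len(pairs))  # 6 pairs for 4 methods
for a, b in pairs:
    print(f"{a} vs {b}")
```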

build_review_page

Generate HTML review page from comparison results.
Parameters:
  • args (argparse.Namespace, required): Command-line arguments including name and configuration
  • comparisons (list[Comparison], required): List of comparison objects to include
  • index (int, default: 0): Report index for multi-page generation
Source: olmocr/eval/buildelo.py:83

Supported Input Formats

OpenAI Batch API Format

{
  "custom_id": "s3://bucket/file.pdf-1",
  "response": {
    "body": {
      "choices": [{
        "message": {"content": "{\"natural_text\": \"...\"}"},
        "finish_reason": "stop"
      }]
    }
  }
}

Birr Format

{
  "custom_id": "s3://bucket/file.pdf-1",
  "outputs": [{
    "text": "{\"natural_text\": \"...\"}",
    "finish_reason": "stop"
  }],
  "completion_error": null
}

SGLang Format

{
  "custom_id": "s3://bucket/file.pdf-1",
  "response": {
    "choices": [{
      "message": {"content": "..."},
      "finish_reason": "stop"
    }]
  }
}
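The three formats differ only in where the text and finish reason are nested: Birr puts them under outputs, the OpenAI Batch API wraps choices in a body object, and SGLang puts choices directly under response. A dispatch sketch (illustrative only; normalize_json_entry in runeval.py is the canonical implementation):

```python
# Sketch: detect which of the three supported formats a raw entry uses and
# pull out (custom_id, text, finish_reason). Illustrative only.

def extract_fields(data: dict) -> tuple[str, str, str]:
    cid = data["custom_id"]
    if "outputs" in data:  # Birr: text lives under outputs[0]
        out = data["outputs"][0]
        return cid, out["text"], out["finish_reason"]
    # OpenAI Batch nests choices under response["body"];
    # SGLang puts choices directly under response.
    choice = data["response"].get("body", data["response"])["choices"][0]
    return cid, choice["message"]["content"], choice["finish_reason"]

birr = {"custom_id": "s3://b/f.pdf-1",
        "outputs": [{"text": "hello", "finish_reason": "stop"}],
        "completion_error": None}
print(extract_fields(birr))  # ('s3://b/f.pdf-1', 'hello', 'stop')
```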

Command-Line Usage

Running Evaluation

python -m olmocr.eval.runeval \
  --name experiment_v1 \
  --review_size 50 \
  s3://bucket/gold-data/ \
  s3://bucket/eval-data/

Building Comparison Pages

python -m olmocr.eval.buildelo \
  --name method_comparison \
  --review_size 100 \
  --comparisons pdelf marker gotocr_format mineru \
  --num_copies 3 \
  --max_workers 8 \
  s3://bucket/outputs/

Best Practices

  • Ensure gold data has finish_reason == "stop" for valid entries
  • Filter out length overruns and errors before evaluation
  • Use consistent PDF paths and page numbering (1-indexed)
  • Compress large datasets with zstandard (.zst extension)
  • Increase max_workers for faster parallel processing
  • Cache gold data locally to avoid repeated S3 downloads
  • Use ProcessPoolExecutor for CPU-bound comparison tasks
  • Filter high-similarity pairs (>0.96) from comparison reviews
  • Review both worst cases and random samples
  • Look for systematic errors in low-scoring pages
  • Compare character-weighted vs page-weighted metrics
  • Track error rates and overrun frequencies
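The first two practices, keeping only entries with finish_reason == "stop" and counting what was dropped, can be sketched as a pre-filter. The entry shape here is illustrative:

```python
# Sketch: filter entries down to valid ones before scoring, keeping
# finish_reason == "stop" and no error, and tallying what was excluded.

def filter_valid(entries):
    valid, overruns, errors = [], 0, 0
    for e in entries:
        if e.get("error"):
            errors += 1
        elif e.get("finish_reason") != "stop":
            overruns += 1  # e.g. "length": the model hit its token limit
        else:
            valid.append(e)
    return valid, overruns, errors

entries = [
    {"finish_reason": "stop", "text": "ok"},
    {"finish_reason": "length", "text": "truncated"},
    {"finish_reason": "stop", "error": "timeout"},
]
valid, overruns, errors = filter_valid(entries)
print(len(valid), overruns, errors)  # 1 1 1
```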

Output Reports

Generated HTML review pages include:
  • Side-by-side comparison of gold and predicted text
  • Alignment score for each page
  • PDF rendering for visual reference
  • Method names and metadata
  • Filterable and sortable interface
