Overview

The Evaluation API provides comprehensive tools for assessing OCR model quality through alignment metrics, human review interfaces, and pairwise comparisons. It supports evaluation against gold standard datasets and side-by-side method comparisons.

Core Modules

runeval - Gold Standard Evaluation

Evaluate OCR outputs against gold standard annotations using document edit similarity metrics. Source: olmocr/eval/runeval.py

buildelo - Pairwise Method Comparison

Generate comparison pages for ranking different OCR extraction methods. Source: olmocr/eval/buildelo.py

Main Functions

do_eval

Compare OCR outputs against gold standard data and generate review HTML pages.
Parameters:
  • gold_data_path (str, required): Path to gold standard JSONL files (local directory or S3 path)
  • eval_data_path (str, required): Path to evaluation JSONL files to compare
  • review_page_name (str, required): Base name for generated HTML review pages
  • review_page_size (int, required): Number of samples to include in review pages

Returns:
  • alignment_score (float): Character-weighted mean alignment score (0-1 range)
  • page_eval_data (list[dict]): List of page comparison data with alignment scores
Source: olmocr/eval/runeval.py:278
from olmocr.eval.runeval import do_eval

# Evaluate against gold standard
score, eval_data = do_eval(
    gold_data_path="s3://bucket/gold-data/",
    eval_data_path="s3://bucket/model-outputs/",
    review_page_name="model_v1_eval",
    review_page_size=50
)

print(f"Mean alignment: {score:.3f}")
Generates two HTML files:
  • {review_page_name}_worst.html - Lowest alignment samples
  • {review_page_name}_sample.html - Random sample of results

load_gold_data

Load gold standard annotations from JSONL files with multithreaded processing.
Parameters:
  • gold_data_path (str, required): Path to directory containing gold JSONL files
  • max_workers (int, default: 8): Maximum number of threads for parallel loading

Returns:
  • gold_data (dict[str, str]): Dictionary mapping gold keys (e.g., "s3://path/file.pdf-4") to text content
Source: olmocr/eval/runeval.py:140
from olmocr.eval.runeval import load_gold_data

gold_data = load_gold_data(
    gold_data_path="s3://bucket/gold/",
    max_workers=16
)

print(f"Loaded {len(gold_data)} gold entries")

normalize_json_entry

Normalize different JSONL formats (OpenAI, Birr, SGLang) into a common structure.
Parameters:
  • data (dict, required): Raw JSONL entry from any supported format

Returns:
  • entry (NormalizedEntry): Normalized entry with consistent fields
Source: olmocr/eval/runeval.py:84
import json
from olmocr.eval.runeval import normalize_json_entry

with open("openai_response.jsonl") as f:
    for line in f:
        data = json.loads(line)
        entry = normalize_json_entry(data)
        print(f"{entry.s3_path} page {entry.pagenum}: {entry.text[:100]}")

process_jsonl_file

Process a single JSONL file and compute alignment scores against gold data.
Parameters:
  • jsonl_file (str, required): Path to the JSONL file to process
  • gold_data (dict, required): Gold standard data dictionary
  • comparer (DocumentEditSimilarity, required): Metric calculator instance

Returns:
  • results (tuple): (total_score, char_weighted_score, total_chars, total_pages, errors, overruns, page_data)
Source: olmocr/eval/runeval.py:231
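The two aggregate scores in that tuple differ in how pages are weighted. A minimal sketch of the arithmetic (a hypothetical helper, not part of olmocr, assuming per-page results arrive as score/character-count pairs):

```python
# Sketch: aggregate per-page alignment scores two ways, as the return values
# of process_jsonl_file suggest: a plain page-weighted mean, and a mean
# weighted by character count so long pages count for more.

def aggregate_scores(page_data):
    """page_data: list of (alignment_score, num_chars) tuples."""
    total_pages = len(page_data)
    total_chars = sum(chars for _, chars in page_data)
    page_weighted = sum(score for score, _ in page_data) / total_pages
    char_weighted = sum(score * chars for score, chars in page_data) / total_chars
    return page_weighted, char_weighted

pages = [(0.95, 1200), (0.80, 300), (0.99, 2500)]
pw, cw = aggregate_scores(pages)
print(f"page-weighted: {pw:.3f}, char-weighted: {cw:.3f}")
```

A short, badly-aligned page drags the page-weighted mean down more than the character-weighted one, which is why the two metrics are worth comparing.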

Data Classes

NormalizedEntry

Standardized representation of OCR output for any input format. Source: olmocr/eval/runeval.py:66
Fields:
  • s3_path (str): S3 path to the source PDF document
  • pagenum (int): Page number (1-indexed)
  • text (str | None): Extracted OCR text content
  • finish_reason (str | None): Completion status ("stop", "length", etc.)
  • error (str | None): Error message if processing failed

Properties

  • goldkey (str): Combined identifier: "{s3_path}-{pagenum}"

Static Methods

from_goldkey
Construct NormalizedEntry from a gold key string.
Construct a NormalizedEntry from a gold key string.

Parameters:
  • goldkey (str, required): Gold key in the format "s3://path/file.pdf-5"
  • **kwargs (dict): Additional fields (text, finish_reason, error)
from olmocr.eval.runeval import NormalizedEntry

entry = NormalizedEntry.from_goldkey(
    goldkey="s3://bucket/doc.pdf-3",
    text="Extracted content...",
    finish_reason="stop"
)

print(entry.s3_path)  # "s3://bucket/doc.pdf"
print(entry.pagenum)  # 3
print(entry.goldkey)  # "s3://bucket/doc.pdf-3"
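The goldkey round trip can be sketched in plain Python. This assumes the key splits on the last hyphen (the actual parser in runeval.py may handle edge cases differently):

```python
# Sketch of goldkey handling: split "s3://bucket/doc.pdf-3" on the last
# hyphen into (s3_path, pagenum), and rebuild the key from those parts.
# Assumes the trailing page number never itself contains "-".

def parse_goldkey(goldkey: str) -> tuple[str, int]:
    s3_path, pagenum = goldkey.rsplit("-", 1)
    return s3_path, int(pagenum)

def make_goldkey(s3_path: str, pagenum: int) -> str:
    return f"{s3_path}-{pagenum}"

path, page = parse_goldkey("s3://bucket/doc.pdf-3")
print(path, page)  # s3://bucket/doc.pdf 3
```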

Comparison (buildelo)

Pairwise comparison between two OCR extraction methods. Source: olmocr/eval/buildelo.py:19
Fields:
  • pdf_path (str): Path to the source PDF
  • comparison_a_path (str): Path to the first method's output markdown
  • comparison_b_path (str): Path to the second method's output markdown
  • comparison_a_str (str): First method's extracted text
  • comparison_b_str (str): Second method's extracted text
  • alignment (float): Similarity score between the two methods (0-1)

Properties

  • comparison_a_method (str): Method name extracted from the path (e.g., "pdelf", "marker")
  • comparison_b_method (str): Method name extracted from the path
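One way to recover a method name from an output path is to check which known method name appears in the filename. This is an illustration only; the actual property in buildelo.py may use a different filename convention:

```python
# Sketch: recover the method name from an output path by checking which of
# the known method names appears in the filename. Longest names are tried
# first so e.g. "gotocr_format" is not shadowed by a shorter match.

KNOWN_METHODS = ["pdelf", "marker", "gotocr_format", "mineru"]

def method_from_path(path: str, methods=KNOWN_METHODS):
    name = path.rsplit("/", 1)[-1]
    for m in sorted(methods, key=len, reverse=True):
        if m in name:
            return m
    return None

print(method_from_path("outputs/report.pdf_marker.md"))  # marker
```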

Evaluation Metrics

Document Edit Similarity

The evaluation uses the Hirschberg alignment algorithm with document edit similarity:
Parameters:
  • match_score (int, default: 1): Score for matching segments
  • mismatch_score (int, default: -1): Penalty for mismatched segments
  • indel_score (int, default: -1): Penalty for insertions/deletions
from dolma_refine.evaluate.aligners import HirschbergAligner
from dolma_refine.evaluate.metrics import DocumentEditSimilarity
from dolma_refine.evaluate.segmenters import SpacySegmenter

# Create metric calculator
segmenter = SpacySegmenter("spacy")
aligner = HirschbergAligner(
    match_score=1,
    mismatch_score=-1,
    indel_score=-1
)
comparer = DocumentEditSimilarity(
    segmenter=segmenter,
    aligner=aligner
)

# Compute similarity
score = comparer.compute(gold_text, pred_text)
print(f"Alignment: {score:.3f}")

Scoring Interpretation

  • 1.0 - Perfect Match: Exact alignment, no differences
  • 0.90-0.99 - Excellent: Minor differences, high-quality extraction
  • 0.80-0.89 - Good: Some differences, acceptable quality
  • 0.70-0.79 - Fair: Noticeable differences, needs improvement
  • < 0.70 - Poor: Significant differences, major issues
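The bands above can be turned into a small triage helper. The thresholds come from this guide's interpretation table; the function itself is an illustration, not part of the olmocr API:

```python
# Sketch: map an alignment score (0-1) onto the quality bands defined above.

def quality_band(score: float) -> str:
    if score >= 1.0:
        return "Perfect Match"
    if score >= 0.90:
        return "Excellent"
    if score >= 0.80:
        return "Good"
    if score >= 0.70:
        return "Fair"
    return "Poor"

for s in (1.0, 0.93, 0.75, 0.5):
    print(s, quality_band(s))
```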

Evaluation Workflow

  1. Prepare Data: Organize gold standard and evaluation outputs in JSONL format
  2. Load Gold Standard: Load reference annotations with multithreaded processing
  3. Process Evaluation Files: Compare each evaluation file against gold data using parallel processing
  4. Compute Metrics: Calculate page-weighted and character-weighted alignment scores
  5. Generate Reports: Create HTML review pages for the worst cases and random samples

Pairwise Comparison

process_single_pdf

Process a single PDF and generate all pairwise comparisons between methods.
Parameters:
  • pdf_path (str, required): Path to the PDF file
  • all_mds (set[str], required): Set of all available markdown output files
  • comparisons (list[str], required): List of method names to compare (e.g., ["pdelf", "marker", "gotocr_format"])
  • segmenter_name (str, default: "spacy"): Name of the text segmenter to use

Returns:
  • result_comps (list[Comparison]): List of Comparison objects for all method pairs
Source: olmocr/eval/buildelo.py:43
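"All pairwise comparisons" means every unordered pair of methods. The pairing step can be illustrated with itertools (a sketch, not the actual buildelo implementation):

```python
# Sketch: enumerate the unordered method pairs that a single PDF would be
# scored on. With n methods there are n*(n-1)/2 pairs.
from itertools import combinations

methods = ["pdelf", "marker", "gotocr_format", "mineru"]
pairs = list(combinations(methods, 2))
print(len(pairs))  # 6 pairs for 4 methods
for a, b in pairs:
    print(f"{a} vs {b}")
```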

build_review_page

Generate HTML review page from comparison results.
Parameters:
  • args (argparse.Namespace, required): Command-line arguments including name and configuration
  • comparisons (list[Comparison], required): List of comparison objects to include
  • index (int, default: 0): Report index for multi-page generation
Source: olmocr/eval/buildelo.py:83

Supported Input Formats

OpenAI Batch API Format

{
  "custom_id": "s3://bucket/file.pdf-1",
  "response": {
    "body": {
      "choices": [{
        "message": {"content": "{\"natural_text\": \"...\"}"},
        "finish_reason": "stop"
      }]
    }
  }
}

Birr Format

{
  "custom_id": "s3://bucket/file.pdf-1",
  "outputs": [{
    "text": "{\"natural_text\": \"...\"}",
    "finish_reason": "stop"
  }],
  "completion_error": null
}

SGLang Format

{
  "custom_id": "s3://bucket/file.pdf-1",
  "response": {
    "choices": [{
      "message": {"content": "..."},
      "finish_reason": "stop"
    }]
  }
}
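The three formats differ only in where the text and finish reason are nested: Birr puts them under outputs, the OpenAI Batch API wraps choices in a body object, and SGLang puts choices directly under response. A dispatch sketch (illustrative only; normalize_json_entry in runeval.py is the canonical implementation):

```python
# Sketch: detect which of the three supported formats a raw entry uses and
# pull out (custom_id, text, finish_reason). Illustrative only.

def extract_fields(data: dict) -> tuple[str, str, str]:
    cid = data["custom_id"]
    if "outputs" in data:  # Birr: text lives under outputs[0]
        out = data["outputs"][0]
        return cid, out["text"], out["finish_reason"]
    # OpenAI Batch nests choices under response["body"];
    # SGLang puts choices directly under response.
    choice = data["response"].get("body", data["response"])["choices"][0]
    return cid, choice["message"]["content"], choice["finish_reason"]

birr = {"custom_id": "s3://b/f.pdf-1",
        "outputs": [{"text": "hello", "finish_reason": "stop"}],
        "completion_error": None}
print(extract_fields(birr))  # ('s3://b/f.pdf-1', 'hello', 'stop')
```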

Command-Line Usage

Running Evaluation

python -m olmocr.eval.runeval \
  --name experiment_v1 \
  --review_size 50 \
  s3://bucket/gold-data/ \
  s3://bucket/eval-data/

Building Comparison Pages

python -m olmocr.eval.buildelo \
  --name method_comparison \
  --review_size 100 \
  --comparisons pdelf marker gotocr_format mineru \
  --num_copies 3 \
  --max_workers 8 \
  s3://bucket/outputs/

Best Practices

  • Ensure gold data has finish_reason == "stop" for valid entries
  • Filter out length overruns and errors before evaluation
  • Use consistent PDF paths and page numbering (1-indexed)
  • Compress large datasets with zstandard (.zst extension)
  • Increase max_workers for faster parallel processing
  • Cache gold data locally to avoid repeated S3 downloads
  • Use ProcessPoolExecutor for CPU-bound comparison tasks
  • Filter high-similarity pairs (>0.96) from comparison reviews
  • Review both worst cases and random samples
  • Look for systematic errors in low-scoring pages
  • Compare character-weighted vs page-weighted metrics
  • Track error rates and overrun frequencies
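The first two practices, keeping only entries with finish_reason == "stop" and counting what was dropped, can be sketched as a pre-filter. The entry shape here is illustrative:

```python
# Sketch: filter entries down to valid ones before scoring, keeping
# finish_reason == "stop" and no error, and tallying what was excluded.

def filter_valid(entries):
    valid, overruns, errors = [], 0, 0
    for e in entries:
        if e.get("error"):
            errors += 1
        elif e.get("finish_reason") != "stop":
            overruns += 1  # e.g. "length": the model hit its token limit
        else:
            valid.append(e)
    return valid, overruns, errors

entries = [
    {"finish_reason": "stop", "text": "ok"},
    {"finish_reason": "length", "text": "truncated"},
    {"finish_reason": "stop", "error": "timeout"},
]
valid, overruns, errors = filter_valid(entries)
print(len(valid), overruns, errors)  # 1 1 1
```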

Output Reports

Generated HTML review pages include:
  • Side-by-side comparison of gold and predicted text
  • Alignment score for each page
  • PDF rendering for visual reference
  • Method names and metadata
  • Filterable and sortable interface
