Overview
olmOCR provides powerful tools for comparing and analyzing evaluation results. The system generates interactive HTML review pages for side-by-side comparisons and supports ELO scoring for ranking multiple OCR tools.
HTML Evaluation Viewer
The HTML evaluation viewer provides an interactive interface for reviewing and comparing OCR results against gold standard data.
Review Page Types
Evaluations generate two types of review pages:
Worst Results
{name}_worst.html - Shows the lowest scoring pages for debugging and identifying systematic issues
Random Sample
{name}_sample.html - Random selection for overall quality assessment
Features
The HTML viewer includes:
- PDF Preview: Rendered page image for visual reference
- Side-by-Side Comparison: Gold standard vs. evaluation output
- Diff Highlighting: Visual differences with color coding
- Alignment Scores: Quantitative similarity metrics
- Randomized Display: Prevents bias by randomly positioning gold/eval text
- Metadata Labels: Shows which method produced each output
HTML Structure
Each entry in the review page displays:
Diff Highlighting
The diff view uses color coding to show differences:
Added Content
Green highlighting indicates text present in the evaluation output but not in the reference
Removed Content
Red highlighting indicates text missing from the evaluation output
Replaced Content
Red followed by green indicates text that was changed between versions
Diff Generation
The diff is generated using Python's SequenceMatcher:
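A minimal sketch of this approach (not the exact buildelo.py implementation, and the CSS class names are illustrative): SequenceMatcher.get_opcodes() yields tagged spans that map directly onto the color coding described above.

```python
# Sketch of diff-to-HTML generation with difflib.SequenceMatcher.
# The class names (diff-add, diff-remove) are illustrative, not olmOCR's actual CSS.
from difflib import SequenceMatcher
import html

def render_diff(gold: str, eval_text: str) -> str:
    """Return HTML where inserted text gets green spans and removed text red spans."""
    out = []
    matcher = SequenceMatcher(None, gold, eval_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(html.escape(gold[i1:i2]))
        elif tag == "insert":
            out.append(f'<span class="diff-add">{html.escape(eval_text[j1:j2])}</span>')
        elif tag == "delete":
            out.append(f'<span class="diff-remove">{html.escape(gold[i1:i2])}</span>')
        elif tag == "replace":
            # Removed text first, then its replacement (red followed by green)
            out.append(f'<span class="diff-remove">{html.escape(gold[i1:i2])}</span>')
            out.append(f'<span class="diff-add">{html.escape(eval_text[j1:j2])}</span>')
    return "".join(out)
```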
ELO Scoring System
The ELO scoring system (buildelo.py) generates pairwise comparisons between different OCR methods to create relative quality rankings.
How It Works
Generate Comparisons
For each PDF, create all possible pairwise comparisons between different parsing methods:
Filter Similar Results
Remove comparisons that are too similar (alignment > 0.96), as they provide little information.
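These two steps can be sketched as follows, assuming each method's output for a page is a plain string and that alignment is computed as a SequenceMatcher ratio (the actual buildelo.py scoring may differ):

```python
# Sketch: generate all pairwise comparisons for one PDF page, then drop
# near-identical pairs (alignment > 0.96) that carry little ranking signal.
from difflib import SequenceMatcher
from itertools import combinations

def alignment(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def build_comparisons(outputs_by_method, threshold=0.96):
    """outputs_by_method maps method name -> extracted text for one PDF page."""
    comparisons = []
    for method_a, method_b in combinations(sorted(outputs_by_method), 2):
        score = alignment(outputs_by_method[method_a], outputs_by_method[method_b])
        if score <= threshold:  # keep only pairs that actually disagree
            comparisons.append((method_a, method_b, score))
    return comparisons
```

With four methods this yields up to C(4,2) = 6 candidate comparisons per page before filtering.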
Running ELO Evaluation
Command-Line Arguments
- Base name for generated HTML files
- --review_size: Number of comparisons per review page
- --comparisons: List of method names to compare (must match filename suffixes)
- --num_copies: Number of review pages to generate. If > 1, files are named {name}_0.html, {name}_1.html, etc.
- --max_workers: Number of parallel worker processes (defaults to CPU count)
- Path to folder containing comparison outputs (expects *.md, *.pdf, *.png files)
File Naming Convention
The ELO system expects files to follow this pattern: the parsing method is identified by the filename suffix (e.g., a file ending in _marker.md maps to the method marker).
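The suffix-to-method mapping can be sketched as below, assuming the method name is everything after the final underscore in the file stem (as in _marker.md → marker):

```python
# Sketch: derive the method name from a comparison file's suffix,
# e.g. report_marker.md -> "marker".
from pathlib import Path

def method_from_filename(path: str) -> str:
    stem = Path(path).stem          # "report_marker"
    return stem.rsplit("_", 1)[-1]  # text after the last underscore
```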
Example: Comparing Four Tools
- ocr_comparison_0.html - 100 pairwise comparisons
- ocr_comparison_1.html - Next 100 comparisons
- ocr_comparison_2.html - Next 100 comparisons
- ocr_comparison_3.html - Next 100 comparisons
- ocr_comparison_4.html - Next 100 comparisons
Generating multiple copies allows for distributed human evaluation across team members.
Comparison Data Structure
Each comparison contains:
Analyzing Results
Identifying Systematic Issues
Use the worst results page to identify patterns:
Categorize Errors
Group errors by type:
- Missing content (red highlighting)
- Hallucinated content (green highlighting)
- Structural issues (incorrect ordering)
- Formatting problems (table/formula extraction)
Quality Assessment Guidelines
Score interpretation:
- Excellent (0.95 - 1.00): Near-perfect extraction with minor differences in whitespace or formatting
- Good (0.85 - 0.94): High-quality extraction with some missing or extra content
- Fair (0.70 - 0.84): Acceptable quality but notable content issues
- Poor (0.50 - 0.69): Significant content missing or incorrect
- Failed (0.00 - 0.49): Severe extraction failures or wrong content
- Error (N/A): Processing error or timeout
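The bands above can be expressed as a simple lookup. This is a convenience sketch, not part of olmOCR itself; None stands in for a processing error or timeout:

```python
# Map an alignment score to the quality band described above.
from typing import Optional

def quality_band(score: Optional[float]) -> str:
    if score is None:
        return "Error"      # processing error or timeout
    if score >= 0.95:
        return "Excellent"
    if score >= 0.85:
        return "Good"
    if score >= 0.70:
        return "Fair"
    if score >= 0.50:
        return "Poor"
    return "Failed"
```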
Performance Optimization
The comparison system uses parallel processing for efficiency:
Parallel PDF Processing
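A sketch of the parallel pattern, where score_pdf is a hypothetical stand-in for the real per-page comparison work:

```python
# Sketch: fan per-PDF scoring out across worker processes.
import os
from concurrent.futures import ProcessPoolExecutor

def score_pdf(pdf_path: str) -> tuple:
    # Placeholder: a real implementation would render the page and diff outputs.
    return (pdf_path, 1.0)

def score_all(pdf_paths, max_workers=None):
    max_workers = max_workers or os.cpu_count()  # mirrors the --max_workers default
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(score_pdf, pdf_paths))

if __name__ == "__main__":
    print(score_all(["doc1.pdf", "doc2.pdf"], max_workers=2))
```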
Resource Requirements
For large evaluations:
- Memory: ~2GB per worker process
- CPU: Scales linearly with --max_workers
- Disk: Minimal (HTML outputs only)
- Network: Required for S3 access
Viewing HTML Pages
Local Viewing
Simply open the HTML files in a web browser:
Serving Over HTTP
For remote access or team collaboration:
Sharing Results
The HTML files are self-contained and can be:
- Uploaded to internal web servers
- Shared via cloud storage (Dropbox, Google Drive)
- Emailed directly (if file size permits)
- Hosted on GitHub Pages for public benchmarks
Best Practices
Sample Size
- Use --review_size 50-100 for thorough manual review
- Generate multiple copies (--num_copies 3-5) for inter-rater reliability
- Always review both worst and random samples
Comparison Selection
- Include at least 3-4 methods for meaningful ELO rankings
- Filter out very similar results (alignment > 0.96) to focus on differences
- Randomize comparison order to prevent bias
Error Analysis
- Start with worst results to identify failure modes
- Cross-reference with benchmark property tests
- Document systematic issues for targeted improvements
Result Validation
- Compare page-weighted vs char-weighted metrics
- Verify evaluation data matches expected format
- Check for processing errors and overruns
Troubleshooting
Low Alignment Scores
If seeing unexpectedly low scores:
- Check text format: Ensure evaluation output is plain text, not JSON/XML wrapped
- Verify goldkey matching: Confirm the {s3_path}-{pagenum} format is consistent
- Inspect HTML diffs: Look for systematic formatting differences
- Review empty documents: Empty gold + empty eval = 1.0, but empty gold + content eval = 0.0
Missing Comparisons
If expected comparisons don't appear:
- Verify file naming: Check that suffixes match the --comparisons argument
- Check S3 paths: Ensure glob patterns match file locations
- Review filter threshold: Lower from 0.96 if too many comparisons are filtered out
Performance Issues
For slow evaluation:
- Reduce review size: Use a smaller --review_size for testing
- Adjust workers: Try different --max_workers values
- Use local data: Download S3 data locally for faster access
- Check S3 bandwidth: Network may be the bottleneck for S3 operations
Related
olmOCR-Bench
Return to benchmark suite documentation
Running Evals
Learn about evaluation execution