## What is Anchor Text?

Anchor text is raw text extracted directly from PDFs using traditional text-extraction engines. It serves as a "hint" or "anchor" that guides the vision-language model (VLM) in understanding the document's content and structure. While anchor text may be messy and poorly formatted, it provides crucial context that helps the VLM produce cleaner, more accurate output.
## Why Anchor Text Matters

Vision models alone can struggle with:

- **Small fonts**: text below a certain size becomes difficult to read in rendered images
- **Complex layouts**: multi-column documents, tables, and unusual formatting
- **Reading order**: determining the correct sequence of text blocks
- **Special characters**: math symbols, Unicode characters, and ligatures
Anchor text provides textual context that complements visual understanding:

- **Content hints**: raw text gives the model clues about which words appear on the page
- **Structure context**: element positions help the model understand document layout
- **Quality improvement**: combining vision and text yields better results than either alone
- **Fallback safety**: if the vision model fails, anchor text provides a basic extraction
olmOCR supports multiple anchor text extraction engines via the `get_anchor_text()` function:

```python
def get_anchor_text(
    local_pdf_path: str,
    page: int,
    pdf_engine: Literal["pdftotext", "pdfium", "pypdf", "topcoherency", "pdfreport"],
    target_length: int = 4000,
) -> str:
    ...
```
## Available Engines

- `pdfreport` (recommended)
- `pdftotext`
- `pdfium`
- `pypdf`
- `topcoherency`
### pdfreport

The `pdfreport` engine provides rich structural information about the page:

```python
anchor_text = get_anchor_text(
    "document.pdf",
    page=1,
    pdf_engine="pdfreport",
    target_length=6000,
)
```
Output format:

```
Page dimensions: 612.0x792.0
[72x720]Introduction
[72x680]This document describes the architecture of...
[Image 100x400 to 500x600]
[72x350]The system consists of three main components:
[90x320]1. Data processing pipeline
[90x300]2. Model training infrastructure
```
Features:

- Text elements with x/y coordinates
- Image bounding boxes
- Page dimensions
- Intelligent sampling when text exceeds `target_length`
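The pdfreport format is straightforward to parse back into structured elements. A minimal sketch (the `parse_pdfreport` helper and its regex are illustrative, not part of olmOCR):

```python
import re
from typing import List, Tuple

# Matches pdfreport text lines of the form "[XxY]text";
# image lines ("[Image ...]") deliberately do not match.
TEXT_LINE = re.compile(r"^\[(\d+)x(\d+)\](.*)$")

def parse_pdfreport(anchor_text: str) -> List[Tuple[int, int, str]]:
    """Parse pdfreport-style anchor text into (x, y, text) tuples."""
    elements = []
    for line in anchor_text.splitlines():
        m = TEXT_LINE.match(line)
        if m:
            elements.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return elements

sample = (
    "Page dimensions: 612.0x792.0\n"
    "[72x720]Introduction\n"
    "[Image 100x400 to 500x600]\n"
    "[72x680]Body text"
)
print(parse_pdfreport(sample))
# [(72, 720, 'Introduction'), (72, 680, 'Body text')]
```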
### pdftotext

Uses Poppler's `pdftotext` command-line tool:

```python
anchor_text = get_anchor_text(
    "document.pdf",
    page=1,
    pdf_engine="pdftotext",
)
```
Features:

- Fast and reliable
- Good for simple documents
- No position information
- Used as a fallback when the vision model fails
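Under the hood this amounts to shelling out to Poppler. A rough sketch of a single-page invocation (the `build_pdftotext_cmd` helper is illustrative; olmOCR's actual call may differ):

```python
import subprocess

def build_pdftotext_cmd(pdf_path: str, page: int) -> list:
    # -f/-l restrict extraction to a single page; "-" writes to stdout
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]

cmd = build_pdftotext_cmd("document.pdf", 1)
print(cmd)
# Actually running it requires Poppler installed:
# text = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```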
### pdfium

Uses the pypdfium2 library:

```python
anchor_text = get_anchor_text(
    "document.pdf",
    page=1,
    pdf_engine="pdfium",
)
```
Features:

- Python-native extraction
- No external dependencies
- Moderate quality
### pypdf

Uses the pypdf library's built-in extraction:

```python
anchor_text = get_anchor_text(
    "document.pdf",
    page=1,
    pdf_engine="pypdf",
)
```
Features:

- Simple extraction
- Already used for PDF metadata
- May struggle with complex layouts
### topcoherency

Tries all engines and picks the most coherent result:

```python
anchor_text = get_anchor_text(
    "document.pdf",
    page=1,
    pdf_engine="topcoherency",
)
```
Process:

1. Extract text with pdftotext, pdfium, and pypdf
2. Calculate a coherency score for each result
3. Return the most coherent text

Note: slower than a single engine, since it runs multiple extractions per page.
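To illustrate the idea of a coherency score, here is a toy heuristic: the fraction of characters that are alphanumeric, whitespace, or common punctuation. This `coherency_score` is purely illustrative; olmOCR's actual scoring is more sophisticated.

```python
def coherency_score(text: str) -> float:
    """Toy heuristic: garbled extractions tend to contain more
    control characters and stray symbols than clean text."""
    if not text:
        return 0.0
    ok = sum(c.isalnum() or c.isspace() or c in ".,;:!?()'-" for c in text)
    return ok / len(text)

candidates = {
    "pdftotext": "Introduction. This document describes the system.",
    "pypdf": "In\x00tro\x07duc@@tion###",
}
best_engine = max(candidates, key=lambda k: coherency_score(candidates[k]))
print(best_engine)
# pdftotext
```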
## pdfreport Implementation

The pdfreport engine is the most sophisticated, providing detailed page analysis.
### Page Report Structure

```python
@dataclass(frozen=True)
class PageReport:
    mediabox: BoundingBox
    text_elements: List[TextElement]
    image_elements: List[ImageElement]

@dataclass(frozen=True)
class TextElement(Element):
    text: str
    x: float
    y: float

@dataclass(frozen=True)
class ImageElement(Element):
    name: str
    bbox: BoundingBox
```
From anchor.py:128-158:

```python
def _pdf_report(local_pdf_path: str, page_num: int) -> PageReport:
    reader = PdfReader(local_pdf_path)
    page = reader.pages[page_num - 1]
    resources = page.get("/Resources", {})
    xobjects = resources.get("/XObject", {})
    text_elements, image_elements = [], []

    def visitor_body(text, cm, tm, font_dict, font_size):
        txt2user = _mult(tm, cm)
        text_elements.append(TextElement(text, txt2user[4], txt2user[5]))

    def visitor_op(op, args, cm, tm):
        if op == b"Do":
            xobject_name = args[0]
            xobject = xobjects.get(xobject_name)
            if xobject and xobject["/Subtype"] == "/Image":
                x0, y0 = _transform_point(0, 0, cm)
                x1, y1 = _transform_point(1, 1, cm)
                image_elements.append(
                    ImageElement(xobject_name, BoundingBox(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)))
                )

    page.extract_text(visitor_text=visitor_body, visitor_operand_before=visitor_op)

    return PageReport(
        mediabox=BoundingBox.from_rectangle(page.mediabox),
        text_elements=text_elements,
        image_elements=image_elements,
    )
```
### Linearization Algorithm

The `_linearize_pdf_report()` function converts the page report to a text string with intelligent truncation.

#### Add page dimensions

```python
result = f"Page dimensions: {report.mediabox.x1:.1f}x{report.mediabox.y1:.1f}\n"
```
#### Merge overlapping images

Images that overlap or sit close together are merged into single elements:

```python
images = _merge_image_elements(report.image_elements, tolerance=0.5)
```

This prevents duplicate `[Image ...]` entries for composite images.
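One possible shape for such a merge, sketched as iterate-until-fixpoint box merging (the actual `_merge_image_elements` implementation may differ):

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

def boxes_touch(a: Box, b: Box, tol: float) -> bool:
    # Overlapping, or within `tol` of each other, on both axes
    return (a.x0 <= b.x1 + tol and b.x0 <= a.x1 + tol and
            a.y0 <= b.y1 + tol and b.y0 <= a.y1 + tol)

def merge_boxes(boxes: List[Box], tol: float = 0.5) -> List[Box]:
    """Repeatedly merge touching boxes until no more merges happen."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        out: List[Box] = []
        for b in boxes:
            for i, o in enumerate(out):
                if boxes_touch(b, o, tol):
                    out[i] = Box(min(b.x0, o.x0), min(b.y0, o.y0),
                                 max(b.x1, o.x1), max(b.y1, o.y1))
                    merged = True
                    break
            else:
                out.append(b)
        boxes = out
    return boxes

# Two adjacent image tiles collapse into one box; the distant one survives
print(merge_boxes([Box(0, 0, 10, 10), Box(10.2, 0, 20, 10), Box(100, 100, 110, 110)]))
```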
#### Format elements

Text elements:

```python
text_str = f"[{element.x:.0f}x{element.y:.0f}]{element_text}\n"
```

Image elements:

```python
image_str = f"[Image {element.bbox.x0:.0f}x{element.bbox.y0:.0f} to {element.bbox.x1:.0f}x{element.bbox.y1:.0f}]\n"
```
#### Handle length limits

If total content exceeds `max_length` (default 4000 chars):

1. **Identify edge elements**: elements with min/max x/y coordinates (usually headers, footers, margins)
2. **Include edges first**: these provide document structure context
3. **Randomly sample the rest**: fill the remaining space with randomly chosen elements
4. **Sort by position**: maintain logical reading order
```python
# Find edge elements
edge_elements = set()
if images:
    edge_elements.update([
        min(images, key=lambda e: e.bbox.x0),
        max(images, key=lambda e: e.bbox.x1),
        min(images, key=lambda e: e.bbox.y0),
        max(images, key=lambda e: e.bbox.y1),
    ])

if report.text_elements:
    text_elements = [e for e in report.text_elements if len(e.text.strip()) > 0]
    edge_elements.update([
        min(text_elements, key=lambda e: e.x),
        max(text_elements, key=lambda e: e.x),
        min(text_elements, key=lambda e: e.y),
        max(text_elements, key=lambda e: e.y),
    ])

# Add edges first, then randomly sample the remaining elements
random.shuffle(remaining_elements)
for elem in remaining_elements:
    if current_length + len(elem_str) > max_length:
        break
    selected_elements.append(elem)

# Sort by position for logical reading order
selected_elements.sort(key=lambda x: (x.position[0], x.position[1]))
```
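Putting those steps together, here is a self-contained toy version of edge-first sampling. It is deliberately simplified: elements are plain `(x, y, text)` tuples, the length budget is a character count, and the sort order is top-to-bottom then left-to-right.

```python
import random
from typing import List, Tuple

Element = Tuple[float, float, str]  # (x, y, text)

def linearize(elements: List[Element], max_length: int, seed: int = 0) -> str:
    # Edge elements: extreme x/y positions (headers, footers, margins)
    edges = {min(elements, key=lambda e: e[0]), max(elements, key=lambda e: e[0]),
             min(elements, key=lambda e: e[1]), max(elements, key=lambda e: e[1])}
    selected = list(edges)
    length = sum(len(e[2]) for e in selected)

    # Randomly sample the rest until the budget is exhausted
    rest = [e for e in elements if e not in edges]
    rng = random.Random(seed)
    rng.shuffle(rest)
    for e in rest:
        if length + len(e[2]) > max_length:
            break
        selected.append(e)
        length += len(e[2])

    # Sort by position (top-to-bottom, then left-to-right) for reading order
    selected.sort(key=lambda e: (-e[1], e[0]))
    return "\n".join(f"[{e[0]:.0f}x{e[1]:.0f}]{e[2]}" for e in selected)

elements = [(72.0, 720.0, "Header"), (72.0, 40.0, "Footer"),
            (300.0, 400.0, "Body one"), (10.0, 400.0, "Left note"),
            (500.0, 400.0, "Right note")]
print(linearize(elements, max_length=100))
```

Even with a tiny budget, the header and footer survive because they sit at the page's vertical extremes.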
#### Text cleanup

Before formatting, text elements are cleaned:

```python
def _cleanup_element_text(element_text: str) -> str:
    MAX_TEXT_ELEMENT_LENGTH = 250
    TEXT_REPLACEMENTS = {
        "[": "\\[",
        "]": "\\]",
        "\n": "\\n",
        "\r": "\\r",
        "\t": "\\t",
    }

    # Fix text encoding issues
    element_text = ftfy.fix_text(element_text).strip()

    # Escape special characters
    element_text = text_replacement_pattern.sub(
        lambda match: TEXT_REPLACEMENTS[match.group(0)],
        element_text,
    )

    # Truncate long elements
    return _cap_split_string(element_text, MAX_TEXT_ELEMENT_LENGTH)
```
Square brackets are escaped because they’re used to denote coordinates. Newlines are escaped to maintain single-line format.
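The escaping step on its own can be reproduced with a compiled pattern built from the replacement keys. A minimal stand-alone version, leaving out the ftfy fix-up and truncation steps:

```python
import re

TEXT_REPLACEMENTS = {"[": "\\[", "]": "\\]", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
# One alternation pattern covering every character to be escaped
pattern = re.compile("|".join(re.escape(k) for k in TEXT_REPLACEMENTS))

def escape_element_text(text: str) -> str:
    return pattern.sub(lambda m: TEXT_REPLACEMENTS[m.group(0)], text)

print(escape_element_text("See [Figure 1]\nfor details"))
# See \[Figure 1\]\nfor details
```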
## Usage in Pipeline

Anchor text is extracted during the `build_page_query()` function:

```python
async def build_page_query(local_pdf_path, page, target_longest_image_dim, target_anchor_text_len, image_rotation=0):
    # Render the page image in a background thread
    image_base64 = asyncio.to_thread(
        render_pdf_to_base64png,
        local_pdf_path,
        page,
        target_longest_image_dim=target_longest_image_dim,
    )

    # Extract anchor text in a process pool (CPU-bound, not thread-safe)
    loop = asyncio.get_running_loop()
    anchor_text = loop.run_in_executor(
        process_pool,
        partial(get_anchor_text, pdf_engine="pdfreport", target_length=target_anchor_text_len),
        local_pdf_path,
        page,
    )

    # Wait for both operations to finish
    image_base64, anchor_text = await asyncio.gather(image_base64, anchor_text)

    # Build the prompt with the anchor text
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": build_finetuning_prompt(anchor_text)},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
                ],
            }
        ],
        "max_tokens": 3000,
        "temperature": 0.8,
    }
```
### Process Pool Requirement

`get_anchor_text()` must run in a process pool, not a thread pool:

- The underlying PDF libraries are not thread-safe
- Multiple concurrent calls in threads can cause crashes
- Process pools provide isolated memory space
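A minimal pattern for wiring this up with the standard library (`cpu_bound_extract` is a stand-in for `get_anchor_text`; the `process_pool` name mirrors the snippets on this page):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def cpu_bound_extract(pdf_path: str, page: int, target_length: int = 4000) -> str:
    # Stand-in for get_anchor_text. Must be a top-level function so it
    # can be pickled and shipped to a worker process.
    return f"anchor text for {pdf_path} page {page}"

async def main() -> str:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=2) as process_pool:
        anchor_text = await loop.run_in_executor(
            process_pool,
            partial(cpu_bound_extract, target_length=1000),
            "document.pdf",
            1,
        )
    return anchor_text

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each worker process gets its own memory space, so a crash inside the PDF library takes down only that worker, not the whole pipeline.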
From pipeline.py:112-116:

```python
# GET ANCHOR TEXT IS NOT THREAD SAFE!! Ahhhh..... don't try to do it
# and it's also CPU bound, so it needs to run in a process pool
loop = asyncio.get_running_loop()
anchor_text = loop.run_in_executor(
    process_pool,
    partial(get_anchor_text, pdf_engine="pdfreport", target_length=target_anchor_text_len),
    local_pdf_path,
    page,
)
```
### Dynamic Length Adjustment

If the vision model input exceeds the context window, the anchor text length is automatically reduced:

```python
if base_response_data["usage"]["total_tokens"] > args.model_max_context:
    local_anchor_text_len = max(1, local_anchor_text_len // 2)
    logger.info(f"Reducing anchor text len to {local_anchor_text_len} for {pdf_orig_path}-{page_num}")
    raise ValueError("Response exceeded model_max_context, cannot use this response")
```
The pipeline retries with half the anchor text length, ensuring the request fits within the model’s context limit.
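The retry-with-halving behavior can be sketched as a simple loop. Here the hypothetical `fits(n)` callback stands in for "did the request stay within the model context at anchor length n"; the real pipeline drives this through its retry machinery instead.

```python
from typing import Callable

def shrink_anchor_len(initial_len: int, fits: Callable[[int], bool]) -> int:
    """Halve the anchor text budget until the request fits."""
    anchor_len = initial_len
    while not fits(anchor_len):
        if anchor_len == 1:
            raise RuntimeError("cannot fit even minimal anchor text")
        anchor_len = max(1, anchor_len // 2)
    return anchor_len

# E.g. if only requests with <= 1600 chars of anchor text fit:
print(shrink_anchor_len(6000, lambda n: n <= 1600))
# 1500
```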
### Fallback Behavior

If all vision model retries fail, the pipeline falls back to pure anchor text:

```python
return PageResult(
    pdf_orig_path,
    page_num,
    PageResponse(
        natural_text=get_anchor_text(pdf_local_path, page_num, pdf_engine="pdftotext"),
        primary_language=None,
        is_rotation_valid=True,
        rotation_correction=0,
        is_table=False,
        is_diagram=False,
    ),
    input_tokens=0,
    output_tokens=0,
    is_fallback=True,
)
```
Note: Uses pdftotext for fallback (fastest and most reliable).
## Example Output Comparison

pdftotext output:

```
Introduction
This document describes the architecture of our system. The system
consists of three main components:
1. Data processing pipeline
2. Model training infrastructure
3. Deployment system
Each component is described in detail below.
```

pdfreport output:

```
Page dimensions: 612.0x792.0
[72x720]Introduction
[72x680]This document describes the architecture of our system. The system
[72x666]consists of three main components:
[90x640]1. Data processing pipeline
[90x620]2. Model training infrastructure
[90x600]3. Deployment system
[Image 400x500 to 550x650]
[72x450]Each component is described in detail below.
```
The pdfreport format provides:

- **Spatial context**: x/y coordinates show text positioning
- **Image awareness**: bounding boxes indicate where images appear
- **Page structure**: dimensions help understand document layout
## Best Practices

### Choosing the Right Engine

- **Production**: use `pdfreport` for the best quality
- **Simple docs**: use `pdftotext` for speed
- **Fallback**: use `pdftotext` when the vision model fails
- **Experimentation**: use `topcoherency` to compare engines

### Setting the Target Length

- **Default**: 6000 characters works well for most documents
- **Dense pages**: increase to 8000-10000 for pages with lots of text
- **Simple pages**: reduce to 4000 for faster processing
- **Context limits**: the pipeline auto-reduces the length if it exceeds the model context
## Next Steps

- **Pipeline Architecture**: see how anchor text fits into the full pipeline
- **Prompting Strategy**: learn how anchor text is used in prompts
- **API Reference**: full API documentation for anchor text functions
- **Training**: train models to use anchor text effectively