Overview
The prompts module provides functions for building prompts for vision-language models, response format schemas for structured outputs, and dataclasses for handling OCR responses.
Prompt Construction
build_openai_silver_data_prompt
Builds the comprehensive prompt used for generating silver training data with GPT-4o.
```python
def build_openai_silver_data_prompt(base_text: str) -> str
```
Parameters: base_text - Raw anchor text extracted from the PDF page, including position information
Returns: Formatted prompt string instructing the model on OCR extraction
Example Usage
```python
from olmocr.prompts import build_openai_silver_data_prompt
from olmocr.prompts.anchor import get_anchor_text

# Extract anchor text from page
anchor_text = get_anchor_text("/path/to/doc.pdf", page=1, pdf_engine="pdfreport")

# Build the prompt
prompt = build_openai_silver_data_prompt(anchor_text)
```
Prompt Content
The generated prompt instructs the model to:
- Read naturally - Convert the document page to plain text as if reading naturally
- Format appropriately:
  - Equations → LaTeX representation
  - Tables → Markdown format
- Remove headers/footers - But keep references and footnotes
- Read handwriting - Process any natural handwriting present
- Preserve continuity - Keep partial sentences from previous/next pages intact
- Allow null - Return null if no readable text exists
- No hallucination - Only extract text that’s actually present
The prompt includes coordinate system information: origin [0x0] is in the lower left corner of the image.
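Because the origin sits at the lower left, coordinates coming from image-space tooling (which typically uses a top-left origin) must be flipped vertically before they can be compared with anchor-text positions. A minimal sketch of that conversion (the helper name is hypothetical, not part of olmocr):

```python
# Hypothetical helper (not part of olmocr): convert a point from a
# top-left-origin image coordinate system to the PDF's bottom-left-origin
# system described by the prompt.
def to_pdf_coords(x: float, y_top: float, page_height: float) -> tuple[float, float]:
    # x is unchanged; y is re-measured from the bottom edge instead of the top.
    return (x, page_height - y_top)

# A point 72pt from the top of a US Letter page (792pt tall) sits at y=720
# in PDF coordinates.
print(to_pdf_coords(72.0, 72.0, 792.0))  # → (72.0, 720.0)
```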
build_finetuning_prompt
Builds a simplified prompt for training and running fine-tuned models.
```python
def build_finetuning_prompt(base_text: str) -> str
```
Parameters: base_text - Raw anchor text extracted from the PDF page
Returns: Simplified prompt string for fine-tuned model inference
Example Usage
```python
from olmocr.prompts import build_finetuning_prompt

# For fine-tuned model inference
prompt = build_finetuning_prompt(anchor_text)
```
This prompt is simplified compared to the silver data prompt, focusing on core OCR extraction without detailed formatting instructions.
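As an illustration only, a simplified prompt of this kind can be sketched as a thin wrapper that embeds the anchor text between the raw-text markers that extract_raw_text later looks for. The wording below is invented for the sketch; the actual olmocr prompt text differs:

```python
# Illustrative sketch only: the real build_finetuning_prompt in olmocr uses
# different wording. This shows the general shape of a marker-wrapped prompt.
def build_finetuning_prompt_sketch(base_text: str) -> str:
    return (
        "Below is the image of one page of a document.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )

print(build_finetuning_prompt_sketch("Page content"))
```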
extract_raw_text
Extracts the anchor text component from an existing prompt string.
```python
def extract_raw_text(prompt: str) -> str
```
Parameters: prompt - Complete prompt string containing RAW_TEXT_START/END markers
Returns: Extracted anchor text without prompt wrapper
Raises: ValueError if prompt doesn’t contain raw text markers
Example Usage
```python
from olmocr.prompts import extract_raw_text

prompt = "Below is the image...RAW_TEXT_START\nPage content\nRAW_TEXT_END"
anchor_text = extract_raw_text(prompt)
# Returns: "Page content"
```
Response Schema
openai_response_format_schema
Returns the structured output schema for OpenAI’s JSON mode, ensuring a consistent response format.
```python
def openai_response_format_schema() -> dict
```
Returns: OpenAI-compatible JSON schema dictionary
Schema Structure
```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "page_response",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": { /* See below */ },
      "required": [
        "primary_language",
        "is_rotation_valid",
        "rotation_correction",
        "is_table",
        "is_diagram",
        "natural_text"
      ],
      "additionalProperties": false
    }
  }
}
```
Response Fields
- primary_language - Two-letter language code (e.g., “en”, “es”, “fr”) or null if no readable text exists
- is_rotation_valid - Whether the page is oriented correctly for reading. Only considers textual content, not charts/figures.
- rotation_correction - Degrees of clockwise rotation needed if the page is incorrectly oriented. Allowed values: 0, 90, 180, 270. Default: 0
- is_table - Whether the majority of page content is in tabular format
- is_diagram - Whether the majority of page content is a visual diagram
- natural_text - The natural text content extracted from the page, or null if no text should be read
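Before handing a raw response dict to downstream code, the required fields and the allowed rotation values can be pre-checked. A minimal sketch (the helper and constant names are hypothetical, not part of olmocr):

```python
# Lightweight pre-validation of a raw API response dict against the schema's
# required fields and allowed rotation values (hypothetical helper).
REQUIRED_FIELDS = {
    "primary_language", "is_rotation_valid", "rotation_correction",
    "is_table", "is_diagram", "natural_text",
}

def check_response(data: dict) -> None:
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["rotation_correction"] not in (0, 90, 180, 270):
        raise ValueError("rotation_correction must be 0, 90, 180, or 270")

check_response({
    "primary_language": "en",
    "is_rotation_valid": True,
    "rotation_correction": 0,
    "is_table": False,
    "is_diagram": False,
    "natural_text": "Hello",
})  # passes silently
```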
Example Usage
```python
from olmocr.prompts import openai_response_format_schema
from openai import OpenAI
import json

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[...],
    response_format=openai_response_format_schema()
)

# Parse the structured response
result = json.loads(response.choices[0].message.content)
print(result["natural_text"])
```
PageResponse
Typed dataclass for handling OCR responses with validation.
```python
@dataclass(frozen=True)
class PageResponse:
    primary_language: Optional[str]
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool
    natural_text: Optional[str]
```
Validation
The dataclass performs automatic validation in __post_init__:
- rotation_correction must be one of: 0, 90, 180, 270
- Type checking for all fields
- Raises ValueError or TypeError for invalid data
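The validation described above can be sketched as a frozen dataclass whose __post_init__ rejects bad values; the class below is an illustrative stand-in, and the actual olmocr implementation may differ in details:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the described validation pattern, not the olmocr source.
@dataclass(frozen=True)
class PageResponseSketch:
    primary_language: Optional[str]
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool
    natural_text: Optional[str]

    def __post_init__(self):
        # Reject rotations outside the schema's allowed set.
        if self.rotation_correction not in (0, 90, 180, 270):
            raise ValueError("rotation_correction must be 0, 90, 180, or 270")
        # Example of the field type checks the docs describe.
        if not isinstance(self.is_rotation_valid, bool):
            raise TypeError("is_rotation_valid must be a bool")
```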
Example Usage
```python
from olmocr.prompts import PageResponse
import json

# Parse API response
response_data = json.loads(api_response)

# Create validated PageResponse
page = PageResponse(
    primary_language=response_data["primary_language"],
    is_rotation_valid=response_data["is_rotation_valid"],
    rotation_correction=response_data["rotation_correction"],
    is_table=response_data["is_table"],
    is_diagram=response_data["is_diagram"],
    natural_text=response_data["natural_text"]
)

# Access fields
if page.natural_text:
    print(f"Extracted text in {page.primary_language}: {page.natural_text}")
if not page.is_rotation_valid:
    print(f"Rotate {page.rotation_correction} degrees clockwise")
```
Handling Validation Errors
```python
from olmocr.prompts import PageResponse

try:
    page = PageResponse(
        primary_language="en",
        is_rotation_valid=True,
        rotation_correction=45,  # Invalid! Must be 0, 90, 180, or 270
        is_table=False,
        is_diagram=False,
        natural_text="Sample text"
    )
except ValueError as e:
    print(f"Invalid rotation: {e}")
```
Anchor Text Generation
The anchor.py module provides get_anchor_text() for extracting positional text from PDFs:
```python
from olmocr.prompts.anchor import get_anchor_text

anchor_text = get_anchor_text(
    local_pdf_path="/path/to/doc.pdf",
    page=1,
    pdf_engine="pdfreport",
    target_length=4000
)
```
Parameters:
- local_pdf_path - Path to the local PDF file
- pdf_engine - Engine to use for text extraction:
  - "pdftotext" - Uses poppler’s pdftotext
  - "pdfium" - Uses the pdfium library
  - "pypdf" - Uses PyPDF
  - "topcoherency" - Tests multiple engines and picks the most coherent
  - "pdfreport" - Full positional report with images and text coordinates
- target_length - Target length for the anchor text output (used with the pdfreport engine)
When using pdf_engine="pdfreport", the output includes:
```
Page dimensions: 612.0x792.0
[Image 50x100 to 200x300]
[72x720]Document Title Here
[72x680]First paragraph of text...
[150x400]Table cell content
```
- Page dimensions - Width x Height in points
- Images - [Image x0xy0 to x1xy1] format with bounding boxes
- Text - [xxy]text content with position coordinates
- Origin - [0x0] is in the lower left corner
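Positional text lines in this format can be picked apart with a small regular expression. The parser below is a hypothetical sketch based on the sample output shown above; the exact report syntax in olmocr may vary:

```python
import re

# Hypothetical parser for pdfreport text lines like "[72x720]Document Title".
# Image lines ("[Image 50x100 to 200x300]") deliberately do not match.
TEXT_LINE = re.compile(r"^\[(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)\](.*)$")

def parse_text_line(line: str):
    """Return (x, y, text) for a positional text line, or None otherwise."""
    m = TEXT_LINE.match(line)
    if not m:
        return None
    return (float(m.group(1)), float(m.group(2)), m.group(3))

print(parse_text_line("[72x720]Document Title Here"))
# → (72.0, 720.0, 'Document Title Here')
```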
Complete Example
```python
from olmocr.prompts import (
    build_openai_silver_data_prompt,
    openai_response_format_schema,
    PageResponse
)
from olmocr.prompts.anchor import get_anchor_text
from openai import OpenAI
import json

# 1. Extract anchor text
anchor_text = get_anchor_text(
    "/path/to/document.pdf",
    page=1,
    pdf_engine="pdfreport",
    target_length=4000
)

# 2. Build prompt
prompt = build_openai_silver_data_prompt(anchor_text)

# 3. Call OpenAI API
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ],
    temperature=0.1,
    max_tokens=6000,
    response_format=openai_response_format_schema()
)

# 4. Parse and validate response
result_data = json.loads(response.choices[0].message.content)
page_result = PageResponse(**result_data)

# 5. Use the extracted data
if page_result.natural_text:
    print(f"Language: {page_result.primary_language}")
    print(f"Content: {page_result.natural_text}")
```