Overview
The prompts module provides functions for building prompts for vision-language models, response format schemas for structured outputs, and dataclasses for handling OCR responses.
Prompt Construction
build_openai_silver_data_prompt
Builds the comprehensive prompt used for generating silver training data with GPT-4o.
```python
def build_openai_silver_data_prompt(base_text: str) -> str
```
Parameters: base_text - Raw anchor text extracted from the PDF page, including position information
Returns: Formatted prompt string instructing the model on OCR extraction
Example Usage
```python
from olmocr.prompts import build_openai_silver_data_prompt
from olmocr.prompts.anchor import get_anchor_text

# Extract anchor text from page
anchor_text = get_anchor_text("/path/to/doc.pdf", page=1, pdf_engine="pdfreport")

# Build the prompt
prompt = build_openai_silver_data_prompt(anchor_text)
```
Prompt Content
The generated prompt instructs the model to:
- Read naturally - Convert the document page to plain text as if reading naturally
- Format appropriately:
  - Equations → LaTeX representation
  - Tables → Markdown format
- Remove headers/footers - But keep references and footnotes
- Read handwriting - Process any natural handwriting present
- Preserve continuity - Keep partial sentences from previous/next pages intact
- Allow null - Return null if no readable text exists
- No hallucination - Only extract text that’s actually present
The prompt includes coordinate system information: origin [0x0] is in the lower left corner of the image.
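Because the origin sits at the lower left, coordinates coming from image-space tooling (which typically uses a top-left origin) must be flipped vertically before they can be compared with anchor-text positions. A minimal sketch of that conversion (the helper name is hypothetical, not part of olmocr):

```python
# Hypothetical helper (not part of olmocr): convert a point from a
# top-left-origin image coordinate system to the PDF's bottom-left-origin
# system described by the prompt.
def to_pdf_coords(x: float, y_top: float, page_height: float) -> tuple[float, float]:
    # x is unchanged; y is re-measured from the bottom edge instead of the top.
    return (x, page_height - y_top)

# A point 72pt from the top of a US Letter page (792pt tall) sits at y=720
# in PDF coordinates.
print(to_pdf_coords(72.0, 72.0, 792.0))  # → (72.0, 720.0)
```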
build_finetuning_prompt
Builds a simplified prompt for training and running fine-tuned models.
```python
def build_finetuning_prompt(base_text: str) -> str
```
Parameters: base_text - Raw anchor text extracted from the PDF page
Returns: Simplified prompt string for fine-tuned model inference
Example Usage
```python
from olmocr.prompts import build_finetuning_prompt

# For fine-tuned model inference
prompt = build_finetuning_prompt(anchor_text)
```
This prompt is simplified compared to the silver data prompt, focusing on core OCR extraction without detailed formatting instructions.
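As an illustration only, a simplified prompt of this kind can be sketched as a thin wrapper that embeds the anchor text between the raw-text markers that extract_raw_text later looks for. The wording below is invented for the sketch; the actual olmocr prompt text differs:

```python
# Illustrative sketch only: the real build_finetuning_prompt in olmocr uses
# different wording. This shows the general shape of a marker-wrapped prompt.
def build_finetuning_prompt_sketch(base_text: str) -> str:
    return (
        "Below is the image of one page of a document.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )

print(build_finetuning_prompt_sketch("Page content"))
```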
extract_raw_text
Extracts the anchor text component from an existing prompt string.
```python
def extract_raw_text(prompt: str) -> str
```
Parameters: prompt - Complete prompt string containing RAW_TEXT_START/END markers
Returns: Extracted anchor text without prompt wrapper
Raises: ValueError if prompt doesn’t contain raw text markers
Example Usage
```python
from olmocr.prompts import extract_raw_text

prompt = "Below is the image...RAW_TEXT_START\nPage content\nRAW_TEXT_END"
anchor_text = extract_raw_text(prompt)
# Returns: "Page content"
```
Response Schema
openai_response_format_schema
Returns the structured output schema for OpenAI’s JSON mode, ensuring a consistent response format.
```python
def openai_response_format_schema() -> dict
```
Returns: OpenAI-compatible JSON schema dictionary
Schema Structure
```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "page_response",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": { /* See below */ },
      "required": [
        "primary_language",
        "is_rotation_valid",
        "rotation_correction",
        "is_table",
        "is_diagram",
        "natural_text"
      ],
      "additionalProperties": false
    }
  }
}
```
Response Fields
- primary_language - Two-letter language code (e.g., “en”, “es”, “fr”) or null if no readable text exists
- is_rotation_valid - Whether the page is oriented correctly for reading. Only considers textual content, not charts/figures.
- rotation_correction - Degrees of clockwise rotation needed if the page is incorrectly oriented. Allowed values: 0, 90, 180, 270. Default: 0
- is_table - Whether the majority of page content is in tabular format
- is_diagram - Whether the majority of page content is a visual diagram
- natural_text - The natural text content extracted from the page, or null if no text should be read
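Before handing a raw response dict to downstream code, the required fields and the allowed rotation values can be pre-checked. A minimal sketch (the helper and constant names are hypothetical, not part of olmocr):

```python
# Lightweight pre-validation of a raw API response dict against the schema's
# required fields and allowed rotation values (hypothetical helper).
REQUIRED_FIELDS = {
    "primary_language", "is_rotation_valid", "rotation_correction",
    "is_table", "is_diagram", "natural_text",
}

def check_response(data: dict) -> None:
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["rotation_correction"] not in (0, 90, 180, 270):
        raise ValueError("rotation_correction must be 0, 90, 180, or 270")

check_response({
    "primary_language": "en",
    "is_rotation_valid": True,
    "rotation_correction": 0,
    "is_table": False,
    "is_diagram": False,
    "natural_text": "Hello",
})  # passes silently
```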
Example Usage
```python
from olmocr.prompts import openai_response_format_schema
from openai import OpenAI
import json

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[...],
    response_format=openai_response_format_schema()
)

# Parse the structured response
result = json.loads(response.choices[0].message.content)
print(result["natural_text"])
```
PageResponse
Typed dataclass for handling OCR responses with validation.
```python
@dataclass(frozen=True)
class PageResponse:
    primary_language: Optional[str]
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool
    natural_text: Optional[str]
```
Validation
The dataclass performs automatic validation in __post_init__:
- rotation_correction must be one of: 0, 90, 180, 270
- Type checking for all fields
- Raises ValueError or TypeError for invalid data
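The validation described above can be sketched as a frozen dataclass whose __post_init__ rejects bad values; the class below is an illustrative stand-in, and the actual olmocr implementation may differ in details:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the described validation pattern, not the olmocr source.
@dataclass(frozen=True)
class PageResponseSketch:
    primary_language: Optional[str]
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool
    natural_text: Optional[str]

    def __post_init__(self):
        # Reject rotations outside the schema's allowed set.
        if self.rotation_correction not in (0, 90, 180, 270):
            raise ValueError("rotation_correction must be 0, 90, 180, or 270")
        # Example of the field type checks the docs describe.
        if not isinstance(self.is_rotation_valid, bool):
            raise TypeError("is_rotation_valid must be a bool")
```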
Example Usage
```python
from olmocr.prompts import PageResponse
import json

# Parse API response
response_data = json.loads(api_response)

# Create validated PageResponse
page = PageResponse(
    primary_language=response_data["primary_language"],
    is_rotation_valid=response_data["is_rotation_valid"],
    rotation_correction=response_data["rotation_correction"],
    is_table=response_data["is_table"],
    is_diagram=response_data["is_diagram"],
    natural_text=response_data["natural_text"]
)

# Access fields
if page.natural_text:
    print(f"Extracted text in {page.primary_language}: {page.natural_text}")
if not page.is_rotation_valid:
    print(f"Rotate {page.rotation_correction} degrees clockwise")
```
Handling Validation Errors
```python
from olmocr.prompts import PageResponse

try:
    page = PageResponse(
        primary_language="en",
        is_rotation_valid=True,
        rotation_correction=45,  # Invalid! Must be 0, 90, 180, or 270
        is_table=False,
        is_diagram=False,
        natural_text="Sample text"
    )
except ValueError as e:
    print(f"Invalid rotation: {e}")
```
Anchor Text Generation
The anchor.py module provides get_anchor_text() for extracting positional text from PDFs:
```python
from olmocr.prompts.anchor import get_anchor_text

anchor_text = get_anchor_text(
    local_pdf_path="/path/to/doc.pdf",
    page=1,
    pdf_engine="pdfreport",
    target_length=4000
)
```
Parameters:
- local_pdf_path - Path to the local PDF file
- pdf_engine - Engine to use for text extraction:
  - "pdftotext" - Uses poppler’s pdftotext
  - "pdfium" - Uses the pdfium library
  - "pypdf" - Uses PyPDF
  - "topcoherency" - Tests multiple engines and picks the most coherent
  - "pdfreport" - Full positional report with images and text coordinates
- target_length - Target length for the anchor text output (used with the pdfreport engine)
When using pdf_engine="pdfreport", the output includes:
```
Page dimensions: 612.0x792.0
[Image 50x100 to 200x300]
[72x720]Document Title Here
[72x680]First paragraph of text...
[150x400]Table cell content
```
- Page dimensions - Width x Height in points
- Images - [Image x0xy0 to x1xy1] format with bounding boxes
- Text - [xxy]text content with position coordinates
- Origin - [0x0] is in the lower left corner
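Positional text lines in this format can be picked apart with a small regular expression. The parser below is a hypothetical sketch based on the sample output shown above; the exact report syntax in olmocr may vary:

```python
import re

# Hypothetical parser for pdfreport text lines like "[72x720]Document Title".
# Image lines ("[Image 50x100 to 200x300]") deliberately do not match.
TEXT_LINE = re.compile(r"^\[(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)\](.*)$")

def parse_text_line(line: str):
    """Return (x, y, text) for a positional text line, or None otherwise."""
    m = TEXT_LINE.match(line)
    if not m:
        return None
    return (float(m.group(1)), float(m.group(2)), m.group(3))

print(parse_text_line("[72x720]Document Title Here"))
# → (72.0, 720.0, 'Document Title Here')
```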
Complete Example
```python
from olmocr.prompts import (
    build_openai_silver_data_prompt,
    openai_response_format_schema,
    PageResponse
)
from olmocr.prompts.anchor import get_anchor_text
from openai import OpenAI
import json

# 1. Extract anchor text
anchor_text = get_anchor_text(
    "/path/to/document.pdf",
    page=1,
    pdf_engine="pdfreport",
    target_length=4000
)

# 2. Build prompt
prompt = build_openai_silver_data_prompt(anchor_text)

# 3. Call OpenAI API
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ],
    temperature=0.1,
    max_tokens=6000,
    response_format=openai_response_format_schema()
)

# 4. Parse and validate response
result_data = json.loads(response.choices[0].message.content)
page_result = PageResponse(**result_data)

# 5. Use the extracted data
if page_result.natural_text:
    print(f"Language: {page_result.primary_language}")
    print(f"Content: {page_result.natural_text}")
```