Silver Data Generation

Overview

The buildsilver module generates silver training data from PDF documents using OpenAI’s GPT-4o model via the Batch API. For each page it extracts anchor text, renders the page to an image, and builds a request for vision-language model processing.

Core Function

build_page_query

Constructs an OpenAI Batch API request for processing a single PDF page.

def build_page_query(local_pdf_path: str, pretty_pdf_path: str, page: int) -> dict

local_pdf_path (str, required): Path to the local PDF file to process.

pretty_pdf_path (str, required): Display-friendly path used in the custom_id field for tracking.

page (int, required): Page number to process (1-indexed).

Returns: Dictionary containing OpenAI Batch API request format

Example Usage

import json

from olmocr.data.buildsilver import build_page_query

# Generate a batch query for page 5 of a PDF
query = build_page_query(
    local_pdf_path="/path/to/document.pdf",
    pretty_pdf_path="s3://bucket/document.pdf",
    page=5,
)

# Append the request as one line of a JSONL file for batch processing
with open("batch_requests.jsonl", "a") as f:
    f.write(json.dumps(query) + "\n")

OpenAI Batch API Format

The function returns a structured dictionary following OpenAI’s Batch API format:

{
  "custom_id": "s3://bucket/document.pdf-5",
  "method": "POST",
  "url": "/v1/chat/completions",
  "body": {
    "model": "gpt-4o-2024-08-06",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "[Prompt with anchor text]"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/png;base64,[base64_encoded_image]"
            }
          }
        ]
      }
    ],
    "temperature": 0.1,
    "max_tokens": 6000,
    "logprobs": true,
    "top_logprobs": 5,
    "response_format": { /* structured output schema */ }
  }
}
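As a rough sketch of how a dictionary of this shape might be assembled, assume the prompt text and the base64-encoded page image are already available. The helper name make_batch_request and its arguments are hypothetical, not part of olmocr, and the response_format schema is left out rather than guessed:

```python
import json


def make_batch_request(custom_id: str, prompt: str, image_base64: str) -> dict:
    """Assemble one Batch API request (hypothetical helper, not olmocr's code)."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-2024-08-06",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{image_base64}"},
                        },
                    ],
                }
            ],
            "temperature": 0.1,
            "max_tokens": 6000,
            "logprobs": True,
            "top_logprobs": 5,
            # response_format (the structured output schema) is omitted in this sketch.
        },
    }


request = make_batch_request("s3://bucket/document.pdf-5", "[Prompt with anchor text]", "aGVsbG8=")
line = json.dumps(request)  # one JSONL line, ready to append to a batch file
```

Each request must serialize to a single line, since the batch input file is JSONL.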

Request Structure

custom_id (string): Formatted as {pretty_pdf_path}-{page}; used to track results back to the source page.

body.model (string): gpt-4o-2024-08-06, a vision-capable model suited to OCR tasks.

body.temperature (float): Set to 0.1 for more deterministic outputs.

body.max_tokens (int): Maximum of 6000 tokens, enough for comprehensive page extraction.

body.logprobs (boolean): Enabled to help identify problematic responses later.

body.response_format (object): Structured output schema ensuring consistent JSON responses (see Prompts API).
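Because custom_id joins the path and page number with a hyphen, results can be mapped back to their source by splitting on the last hyphen. The following is a minimal sketch; parse_custom_id is a hypothetical helper, not a function in olmocr:

```python
from typing import Tuple


def parse_custom_id(custom_id: str) -> Tuple[str, int]:
    """Split '{pretty_pdf_path}-{page}' into (pretty_pdf_path, page)."""
    # Split on the LAST hyphen only, since the path itself may contain hyphens.
    pretty_pdf_path, page = custom_id.rsplit("-", 1)
    return pretty_pdf_path, int(page)


path, page = parse_custom_id("s3://my-bucket/annual-report.pdf-5")
```

Splitting from the right matters here: bucket names and file names often contain hyphens of their own.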

Prompt Construction

The query uses build_openai_silver_data_prompt() to create the text component:

from olmocr.prompts import build_openai_silver_data_prompt
from olmocr.prompts.anchor import get_anchor_text

# Extract anchor text from PDF
anchor_text = get_anchor_text(local_pdf_path, page, pdf_engine="pdfreport")

# Build the prompt
prompt = build_openai_silver_data_prompt(anchor_text)

See Prompts API for details on prompt construction.

Image Processing

Pages are rendered to base64-encoded PNG images, scaled so that the longest side is 2048 pixels:

from olmocr.data.renderpdf import render_pdf_to_base64png

TARGET_IMAGE_DIM = 2048

image_base64 = render_pdf_to_base64png(local_pdf_path, page, TARGET_IMAGE_DIM)
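To make the two steps concrete, here is a small standalone sketch of the arithmetic and encoding involved. It assumes the target dimension applies to the longer side of the page; scaled_size and to_png_data_url are hypothetical helpers, not olmocr functions, and the bytes used are a stand-in rather than a real rendered page:

```python
import base64
from typing import Tuple


def scaled_size(width: int, height: int, target_longest_dim: int = 2048) -> Tuple[int, int]:
    """Scale (width, height) so the longer side equals target_longest_dim."""
    scale = target_longest_dim / max(width, height)
    return round(width * scale), round(height * scale)


def to_png_data_url(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes in the data URL form used by the image_url field."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


# A US Letter page is 612 x 792 points; the longer side is scaled to 2048.
size = scaled_size(612, 792)
png_signature = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of any PNG, used as stand-in data
url = to_png_data_url(png_signature)
```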

Best Practices

Use Batch API

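The Batch API costs 50% less than synchronous API calls for the same model, so output quality is unchanged. The trade-off is latency: results are returned asynchronously, within a 24-hour completion window.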

Structured Outputs

Always use structured outputs for data processing tasks to eliminate LLM “fluff text” and ensure consistent parsing.

Schema Ordering

The model fills in schema fields in the order they are defined, so place chain-of-thought fields before the fields that depend on them for better results.
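A schema that follows this advice might look like the sketch below. The field names (reasoning, is_table, extracted_text) and the schema name are illustrative assumptions, not the schema olmocr actually uses; see the Prompts API for the real one.

```python
# Hypothetical response schema: field names are illustrative only.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "page_response",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                # Reasoning comes first so it is generated before the answer fields.
                "reasoning": {"type": "string"},
                "is_table": {"type": "boolean"},
                "extracted_text": {"type": "string"},
            },
            "required": ["reasoning", "is_table", "extracted_text"],
            "additionalProperties": False,
        },
    },
}

# Python dicts preserve insertion order, so this order is what gets serialized.
field_order = list(response_format["json_schema"]["schema"]["properties"])
```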

Request Logprobs

Enable logprobs at no extra cost to identify low-confidence or problematic responses during analysis.
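One simple way to use logprobs during analysis is to average the per-token log-probabilities of a response and flag low averages for review. The sketch below assumes the input is the chat completion body of one result; the sample values and the threshold are made up for illustration:

```python
def mean_token_logprob(response_body: dict) -> float:
    """Average log-probability of the generated tokens in one chat completion."""
    tokens = response_body["choices"][0]["logprobs"]["content"]
    if not tokens:
        raise ValueError("response contains no token logprobs")
    return sum(t["logprob"] for t in tokens) / len(tokens)


# Hand-written stand-in for a chat completion body; values are made up.
sample_body = {
    "choices": [
        {
            "logprobs": {
                "content": [
                    {"token": "Hello", "logprob": -0.01},
                    {"token": " world", "logprob": -2.30},
                ]
            }
        }
    ]
}

avg = mean_token_logprob(sample_body)
LOW_CONFIDENCE_THRESHOLD = -1.0  # arbitrary cutoff chosen for illustration
needs_review = avg < LOW_CONFIDENCE_THRESHOLD
```

A suitable threshold depends on the documents being processed and is best chosen by inspecting a sample of flagged responses.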

Check for Typos

Prompt quality directly impacts performance - review prompts carefully before processing thousands of documents.

Helper Functions

sample_pdf_pages

def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> list

Samples pages from a PDF, always including the first N pages and randomly sampling from the rest.
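The sampling behavior described above can be sketched as follows. This is an illustrative reimplementation under a different name, not the library's code, and it assumes max_sample_pages caps the total number of pages returned:

```python
import random
from typing import List


def sample_pages_sketch(num_pages: int, first_n_pages: int, max_sample_pages: int) -> List[int]:
    """Keep the first N pages, then randomly sample the rest up to a total cap."""
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))  # short document: keep every page

    pages = list(range(1, first_n_pages + 1))  # always-included leading pages
    remaining = list(range(first_n_pages + 1, num_pages + 1))
    # Assumption: max_sample_pages caps the TOTAL number of pages returned.
    extra = max(0, min(max_sample_pages - first_n_pages, len(remaining)))
    return pages + random.sample(remaining, extra)


random.seed(0)  # fixed seed so the sketch is reproducible
sampled = sample_pages_sketch(num_pages=20, first_n_pages=2, max_sample_pages=5)
```

Page numbers are 1-indexed here to match the page argument of build_page_query.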

fetch_s3_file

def fetch_s3_file(s3_url: str, local_path: str) -> str

Downloads a file from S3 to a local path.

process_pdf

def process_pdf(pdf_path: str, first_n_pages: int, max_sample_pages: int, no_filter: bool) -> Generator[dict, None, None]

Processes a complete PDF document, generating batch queries for sampled pages. Supports both local and S3 paths (s3://bucket/key).
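Telling an S3 path apart from a local one comes down to inspecting the URL scheme. The following is a minimal sketch using only the standard library; split_s3_url is a hypothetical helper and does not reflect how process_pdf is implemented:

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_s3_url(pdf_path: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, key) for an s3:// URL, or None for a local path."""
    parsed = urlparse(pdf_path)
    if parsed.scheme != "s3":
        return None
    # netloc holds the bucket; the path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


remote = split_s3_url("s3://bucket/folder/document.pdf")
local = split_s3_url("/path/to/document.pdf")
```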

Related APIs

Prompts API - Prompt construction and response schemas
Filter API - PDF filtering before processing
