
Overview

Silver training data is synthetic training data generated with GPT-4o, which converts PDF pages into clean, structured text. Generation runs through the OpenAI Batch API for cost-effective processing at scale.
The Batch API costs 50% less than the standard API with identical model quality, which adds up to significant savings for large-scale data generation.

Silver Data Generation Strategy

The silver data generation process combines:
  1. Visual rendering - PDFs rendered to PNG images at 2048px resolution
  2. Anchor text - Raw text extracted with positional information
  3. GPT-4o - Vision-capable model that generates the clean text output
  4. Structured outputs - JSON schema for consistent responses

Why “Silver” Data?

Silver data sits between gold (human-annotated) and bronze (raw) data:
  • More scalable than gold data (no manual annotation)
  • Higher quality than bronze data (AI-cleaned)
  • Cost-effective for training OCR models

Using buildsilver.py

Basic Usage

python -m olmocr.data.buildsilver \
  --glob_path "data/*.pdf" \
  --num_sample_docs 5000 \
  --max_sample_pages 15 \
  --output openai_batch_data

Arguments

  • glob_path (string) - Local or S3 path glob (e.g., *.pdf or s3://bucket/pdfs/*.pdf)
  • path_list (string) - Path to a file containing PDF paths, one per line
  • num_sample_docs (integer, default: 5000) - Number of PDF documents to sample
  • first_n_pages (integer, default: 0) - Always sample the first N pages of each PDF
  • max_sample_pages (integer, default: 15) - Max number of pages to sample per PDF
  • no_filter (boolean) - Disable basic spam/language filtering
  • output (string, default: openai_batch_data) - Output directory for JSONL batch files
  • reservoir_size (integer) - Size of the reservoir for sampling paths; defaults to 10x num_sample_docs

S3 Integration

Process PDFs directly from S3:
python -m olmocr.data.buildsilver \
  --glob_path "s3://my-bucket/pdfs/*.pdf" \
  --num_sample_docs 10000 \
  --output batch_requests
The script automatically:
  • Downloads PDFs from S3 to /tmp
  • Processes them locally
  • Cleans up temporary files

Page Sampling Strategy

The script uses intelligent page sampling (see sample_pdf_pages in buildsilver.py:87-94):
def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> list:
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))
    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))
    if remaining_pages:
        sample_pages += random.sample(
            remaining_pages, 
            min(max_sample_pages - first_n_pages, len(remaining_pages))
        )
    return sample_pages
This ensures you always get the first pages (often most important) plus random samples from the rest.
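To make the behavior concrete, here is a standalone run of the function above (reproduced so the snippet executes on its own):

```python
import random

# sample_pdf_pages as shown above, reproduced for a standalone run
def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> list:
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))
    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))
    if remaining_pages:
        sample_pages += random.sample(
            remaining_pages,
            min(max_sample_pages - first_n_pages, len(remaining_pages)),
        )
    return sample_pages

# A 3-page PDF with first_n_pages=5: every page is kept
print(sample_pdf_pages(3, 5, 15))   # [1, 2, 3]

# A 100-page PDF: pages 1-2 always included, plus 13 random pages
pages = sample_pdf_pages(100, 2, 15)
print(len(pages), pages[:2])        # 15 [1, 2]
```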

GPT-4o Prompting Approach

Prompt Structure

The prompt (from prompts.py:7-17) asks GPT-4o to:
  1. Read the PDF page image naturally
  2. Convert equations to LaTeX
  3. Convert tables to markdown
  4. Remove headers/footers but keep references and footnotes
  5. Read any handwriting
  6. Preserve sentences that span pages
  7. Return null if no readable text exists
  8. Not hallucinate
The request builder pairs this prompt and the page's anchor text with a rendered image of the page:
def build_page_query(local_pdf_path: str, pretty_pdf_path: str, page: int) -> dict:
    image_base64 = render_pdf_to_base64png(local_pdf_path, page, TARGET_IMAGE_DIM)
    anchor_text = get_anchor_text(local_pdf_path, page, pdf_engine="pdfreport")
    
    return {
        "custom_id": f"{pretty_pdf_path}-{page}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-2024-08-06",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": build_openai_silver_data_prompt(anchor_text)},
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
                    ],
                }
            ],
            "temperature": 0.1,
            "max_tokens": 6000,
            "logprobs": True,
            "top_logprobs": 5,
            "response_format": openai_response_format_schema(),
        },
    }

Structured Output Schema

The response follows a strict JSON schema (prompts.py:49-96):
{
  "primary_language": "en",
  "is_rotation_valid": true,
  "rotation_correction": 0,
  "is_table": false,
  "is_diagram": false,
  "natural_text": "The extracted clean text..."
}
  • primary_language: Two-letter language code or null if no text
  • is_rotation_valid: Whether the page is oriented correctly for reading
  • rotation_correction: Degrees of clockwise rotation needed (0, 90, 180, or 270)
  • is_table: Whether majority of content is tabular
  • is_diagram: Whether majority of content is a visual diagram
  • natural_text: The cleaned, natural text content
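The actual schema helper lives in prompts.py:49-96; the sketch below shows roughly what a strict-mode response_format of this shape looks like. The field names come from the example above, but everything else (the wrapper structure and schema details) is an assumption about the helper, not its exact code:

```python
def openai_response_format_schema() -> dict:
    # Sketch only: the real helper is defined in prompts.py:49-96.
    # Field names and order match the example response above.
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "page_response",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "primary_language": {"type": ["string", "null"]},
                    "is_rotation_valid": {"type": "boolean"},
                    "rotation_correction": {"type": "integer"},
                    "is_table": {"type": "boolean"},
                    "is_diagram": {"type": "boolean"},
                    "natural_text": {"type": ["string", "null"]},
                },
                # Strict mode requires every property to be listed
                "required": [
                    "primary_language",
                    "is_rotation_valid",
                    "rotation_correction",
                    "is_table",
                    "is_diagram",
                    "natural_text",
                ],
                "additionalProperties": False,
            },
        },
    }
```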

OpenAI Batch API Usage

Output Format

The script generates JSONL files in the OpenAI Batch API format (buildsilver.py:206-266):
{"custom_id": "document.pdf-1", "method": "POST", "url": "/v1/chat/completions", "body": {...}}
{"custom_id": "document.pdf-2", "method": "POST", "url": "/v1/chat/completions", "body": {...}}
Files are automatically split at 99MB to stay under OpenAI’s 100MB batch file limit.
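buildsilver.py performs the split itself; as a rough illustration, the rollover logic amounts to something like this (write_batch_files and its max_bytes parameter are illustrative names, not the script's API):

```python
import json
import os

def write_batch_files(requests, output_dir, max_bytes=99 * 1024**2):
    """Write requests as JSONL, rolling to a new file before max_bytes."""
    os.makedirs(output_dir, exist_ok=True)
    file_index, written = 0, 0
    out = open(os.path.join(output_dir, f"output_{file_index}.jsonl"), "w")
    for req in requests:
        line = json.dumps(req) + "\n"
        # Roll over to the next file if this line would exceed the cap
        if written + len(line) > max_bytes and written > 0:
            out.close()
            file_index += 1
            written = 0
            out = open(os.path.join(output_dir, f"output_{file_index}.jsonl"), "w")
        out.write(line)
        written += len(line)
    out.close()
    return file_index + 1  # number of files written
```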

Submitting Batches

After generating batch files, use the runopenaibatch.py script to automatically submit, monitor, and download results:
# Set your OpenAI API key
export OPENAI_API_KEY=your_api_key_here

# Submit batches (automatically manages 100GB storage limit)
python -m olmocr.data.runopenaibatch batch_requests

# Process with custom storage limit (default: 25GB at a time)
python -m olmocr.data.runopenaibatch --max_gb 50 batch_requests
The script automatically handles OpenAI’s 100GB storage limit by uploading batches incrementally, waiting for completion, and downloading results before uploading more.
Alternatively, use the OpenAI CLI directly:
# Submit batch
openai api batches.create \
  --input-file batch_requests/output_0.jsonl \
  --endpoint /v1/chat/completions

# Check status
openai api batches.retrieve <batch_id>

# Download results
openai api batches.download <batch_id> > results.jsonl

Processing Results

Once you have results, extract the natural text:
import json

with open('results.jsonl', 'r') as f:
    for line in f:
        result = json.loads(line)
        response = result['response']['body']['choices'][0]['message']['content']
        data = json.loads(response)

        custom_id = result['custom_id']
        natural_text = data['natural_text']

        # natural_text is null when the page has no readable text
        if natural_text is not None:
            print(f"{custom_id}: {natural_text[:100]}...")

Expert Tips on Structured Outputs

These tips come directly from the code comments in buildsilver.py:56-62 and are based on real experience processing thousands of documents.

1. Use Batch API for Cost Savings

“Use the batch query system, it’s 1/2 the price and exactly the same performance”
The Batch API provides 50% cost reduction with no performance penalty.

2. Always Use Structured Outputs

“If your application is not an actual chatbot, use structured outputs!”
Even if the last 10 queries returned perfect results, unstructured outputs won’t scale to thousands of queries. You’ll get inconsistent formatting and LLM fluff text.

3. Field Order Matters

“The order in which fields are in your schema, is the order in which the model will answer them”
Place preparatory or chain-of-thought fields first:
{
  "is_rotation_valid": bool,  // Answer first
  "rotation_correction": int,  // Then this
  "natural_text": str          // Finally the main content
}
This gives better answers for later fields.

4. Check for Typos

“Check your prompt for typos, it makes a performance difference!”
Typos in prompts degrade model performance. Review prompts carefully.

5. Request Logprobs

“Ask for logprobs, it’s not any more expensive and you can use them later to help identify problematic responses”
Set "logprobs": True and "top_logprobs": 5 to get confidence scores for filtering low-quality responses later.
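As an illustration of how those logprobs can be used downstream, the sketch below averages the per-token logprobs in one batch result line and flags low-confidence pages. The helper names and the -0.3 cutoff are illustrative choices, not part of olmocr:

```python
def mean_logprob(result: dict) -> float:
    """Average token logprob for one batch result line.

    Assumes the request asked for logprobs, as in the
    query builder shown earlier.
    """
    tokens = result["response"]["body"]["choices"][0]["logprobs"]["content"]
    return sum(t["logprob"] for t in tokens) / len(tokens)

def is_suspect(result: dict, threshold: float = -0.3) -> bool:
    # Flag pages whose average token confidence falls below the cutoff;
    # -0.3 is an illustrative starting point, not a tuned value
    return mean_logprob(result) < threshold
```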

Output Structure

The script creates:
openai_batch_data/
├── output_0.jsonl  (up to 99MB)
├── output_1.jsonl  (up to 99MB)
└── output_2.jsonl  (up to 99MB)
Each file contains batch requests ready for OpenAI.

Filtering Integration

By default, the script uses PdfFilter to exclude:
  • Non-English documents
  • PDF forms
  • SEO spam PDFs
See the Filtering guide for details. To disable filtering:
python -m olmocr.data.buildsilver \
  --glob_path "data/*.pdf" \
  --no_filter

Performance Considerations

Parallel Processing

The script uses ProcessPoolExecutor for parallel PDF processing (buildsilver.py:222-264):
with ProcessPoolExecutor() as executor:
    futures = []
    for pdf_path in pdf_paths:
        futures.append(executor.submit(
            process_pdf, 
            pdf_path, 
            args.first_n_pages, 
            args.max_sample_pages, 
            args.no_filter
        ))

Reservoir Sampling

For large datasets, reservoir sampling (buildsilver.py:152-201) ensures uniform random sampling without loading all paths into memory:
# Only keep reservoir_size paths in memory;
# n counts the paths seen so far
if len(pdf_paths) < args.reservoir_size:
    pdf_paths.append(path)
else:
    s = random.randint(1, n)
    if s <= args.reservoir_size:
        pdf_paths[s - 1] = path
Default reservoir size is 10x the target document count.
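The excerpt above omits the surrounding bookkeeping; a complete, self-contained version of the same algorithm looks like this (the function name is illustrative):

```python
import random

def reservoir_sample(paths, k):
    """Uniform random sample of k items from an iterable of unknown
    length, holding at most k items in memory (same idea as the
    buildsilver.py excerpt above)."""
    reservoir = []
    for n, path in enumerate(paths, start=1):
        if len(reservoir) < k:
            # Fill the reservoir with the first k items
            reservoir.append(path)
        else:
            # Replace a random slot with probability k/n, which keeps
            # every item seen so far equally likely to be retained
            s = random.randint(1, n)
            if s <= k:
                reservoir[s - 1] = path
    return reservoir
```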

Next Steps

  • Build Test Sets - Create evaluation datasets
  • Filter PDFs - Configure PDF filtering
