
What is olmOCR?

olmOCR is a comprehensive toolkit for training language models to work with PDF documents in the wild. It transforms PDFs into high-quality training data by combining vision-language models with intelligent text extraction techniques.

PDF Processing

Convert millions of PDFs to LLM-ready text using fine-tuned vision models

Training Pipeline

Fine-tune Qwen2-VL and Molmo-O models on your PDF data

Evaluation Tools

Side-by-side comparison of different pipeline versions

Distributed Processing

Scale to millions of documents with S3 and work queue coordination

Core Components

olmOCR consists of four main components that work together to create a complete PDF-to-training-data pipeline:

1. Data Preparation (buildsilver.py)

Creates high-quality silver training data using GPT-4o with a specialized prompting strategy. This component:
  • Extracts natural text from PDF pages using vision models
  • Combines anchor text hints with visual understanding
  • Produces ground truth examples for fine-tuning
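The silver-data strategy above pairs a rendered page image with anchor-text hints in a single vision-model request. A minimal sketch of that request shape, assuming an OpenAI-style chat payload; the function name and prompt wording are illustrative, not the exact prompt used by buildsilver.py:

```python
import base64

def build_silver_prompt(page_png: bytes, anchor_text: str) -> list:
    """Assemble a vision-model request that combines a rendered PDF page
    with anchor-text hints. Illustrative sketch only; the real prompt in
    buildsilver.py differs."""
    image_b64 = base64.b64encode(page_png).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Below is an image of a PDF page, plus raw text "
                        "extracted by traditional tools. Using both, output "
                        "the natural reading-order text of the page.\n\n"
                        f"RAW_TEXT:\n{anchor_text}"
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ]
```

Giving the model both modalities lets the raw text disambiguate hard-to-read glyphs while the image supplies layout and reading order.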

2. Inference Pipeline (pipeline.py)

The heart of olmOCR. It processes millions of PDFs through fine-tuned models:
  • Local mode: Process PDFs on a single GPU machine
  • Distributed mode: Coordinate work across multiple nodes via S3
  • Work queue: Manages job distribution and prevents duplicate processing
  • Filtering: Removes low-quality documents (SEO spam, non-English, forms)
The pipeline runs on a single machine or scales to millions of documents using S3 coordination and distributed workers on Beaker.

3. Training (train.py)

Fine-tunes vision-language models to understand PDF layouts:
  • Supports Qwen2-VL and Molmo-O architectures
  • Trains models to extract natural text from rendered PDF images
  • Uses anchor text as context to improve extraction quality
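A training example therefore has three parts: the rendered page image, the anchor-text hint, and the target natural text. A sketch of that shape in a generic chat fine-tuning layout; the field names and `<image>` placeholder are assumptions, not the exact schema used by train.py:

```python
def make_training_example(image_path: str, anchor_text: str,
                          target_text: str) -> dict:
    """Shape of a single fine-tuning example: the model sees the rendered
    page plus anchor-text hints and learns to emit the natural text.
    Illustrative layout only."""
    return {
        "messages": [
            {"role": "user",
             "content": f"<image>\nRAW_TEXT:\n{anchor_text}\n"
                        "Output the natural text of this page."},
            {"role": "assistant", "content": target_text},
        ],
        # Image paths referenced by the <image> placeholder above
        "images": [image_path],
    }
```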

4. Evaluation (runeval.py)

Compares different pipeline versions side-by-side:
  • Visual comparison of original PDFs and extracted text
  • Metrics for comparing extraction quality
  • Dolma viewer for inspecting results

How It Works

The olmOCR workflow follows these steps:

Step-by-Step Process

  1. PDF Rendering: Each page is rendered to a PNG image at a target resolution (default 1024px longest side)
  2. Anchor Text Extraction: Multiple extraction methods (pdftotext, pdfium, pypdf) extract raw text to provide context hints
  3. Vision Model Inference: A fine-tuned VLM (Qwen2-VL or Molmo-O) processes the image and anchor text to produce natural text
  4. Quality Control: Filtering removes documents that are:
    • Non-English
    • SEO spam or low coherency
    • Forms with mostly structured data
    • Pages with excessive errors
  5. Output Format: Results are saved in Dolma JSONL format with metadata:
{
  "id": "<sha1-hash>",
  "text": "Extracted natural text...",
  "source": "olmocr",
  "metadata": {
    "Source-File": "s3://bucket/path/to/file.pdf",
    "olmocr-version": "0.1.0",
    "pdf-total-pages": 10,
    "total-input-tokens": 15000,
    "total-output-tokens": 3000,
    "total-fallback-pages": 0
  },
  "attributes": {
    "pdf_page_numbers": [[0, 500, 1], [501, 1200, 2]]
  }
}
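The `pdf_page_numbers` spans map extracted text back to its source pages. A minimal reader for one Dolma JSONL record, assuming the spans are half-open `[start, end)` character offsets (adjust the slice if the offsets turn out to be inclusive):

```python
import json

def pages_from_dolma(jsonl_line: str) -> dict:
    """Split a Dolma record's text into per-page strings using the
    [start, end, page_number] spans in attributes.pdf_page_numbers.
    Assumes half-open [start, end) offsets."""
    doc = json.loads(jsonl_line)
    spans = doc["attributes"]["pdf_page_numbers"]
    return {page: doc["text"][start:end] for start, end, page in spans}
```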

Architecture Principles

Fault Tolerance

The pipeline is designed to handle failures gracefully:
  • Page-level retries: Failed pages are retried up to 8 times with exponential backoff
  • Fallback extraction: If the vision model fails, the pipeline falls back to pdftotext output
  • Error rate threshold: Documents with >0.4% failed pages are discarded
  • Lock files: Prevent duplicate processing in distributed environments
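The page-level retry policy can be sketched as a standard exponential-backoff loop. The base delay and jitter here are kept tiny for illustration; the actual delays in pipeline.py differ:

```python
import asyncio
import random

async def process_with_retries(page, process_fn, max_attempts: int = 8):
    """Retry a flaky page up to max_attempts times with exponential
    backoff plus jitter, mirroring the pipeline's page-level retry
    policy. Illustrative sketch; real delays are longer."""
    for attempt in range(max_attempts):
        try:
            return await process_fn(page)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; caller counts a fallback page
            # 0.01s base keeps this example fast; doubles each attempt
            delay = 0.01 * (2 ** attempt) + random.random() * 0.01
            await asyncio.sleep(delay)
```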

Scalability

Async/Await

Asynchronous processing maximizes GPU utilization

Work Queues

Distributes work across multiple GPU workers

S3 Coordination

Shared state via S3 for multi-node processing

Process Pools

CPU-bound tasks (anchor text) run in separate processes
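The async/await and work-queue pieces combine in a worker loop: a fixed pool of coroutines drains a shared queue so the GPU always has a request in flight. A minimal sketch, with names and structure that are illustrative rather than pipeline.py's actual code:

```python
import asyncio

async def run_workers(work_items, handle, num_workers: int = 8):
    """Drain a shared queue with a fixed pool of async workers, the same
    shape as the per-node worker loop (default 8 workers).
    Illustrative sketch only."""
    queue: asyncio.Queue = asyncio.Queue()
    for item in work_items:
        queue.put_nowait(item)
    results = []

    async def worker():
        while True:
            try:
                item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # queue drained; this worker exits
            results.append(await handle(item))

    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return results
```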

Efficiency

Key optimizations for processing millions of PDFs:
  • Batching: Groups ~500 pages per work item for optimal GPU utilization
  • Concurrent workers: Default 8 workers per GPU node
  • Semaphore control: Ensures GPU stays saturated without memory overflow
  • Manual HTTP: Custom async HTTP implementation avoids connection pool deadlocks at scale
At 100M+ requests, standard HTTP libraries can experience deadlocks. olmOCR uses a custom apost() function to avoid this issue.
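The batching optimization above is a simple chunking step: page references are grouped into fixed-size work items before being queued. A sketch, where the 500-page default comes from the text but the function name is hypothetical:

```python
def make_work_items(pages: list, pages_per_item: int = 500) -> list:
    """Group page references into ~500-page work items so each batch
    is large enough to keep the GPU saturated."""
    return [pages[i:i + pages_per_item]
            for i in range(0, len(pages), pages_per_item)]
```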

Output and Results

Processed documents are stored in Dolma format with:
  • Text: Natural reading-order text extracted from the PDF
  • Page spans: Character offsets mapping text back to original page numbers
  • Metadata: Token counts, version info, source file location
  • Quality metrics: Fallback page counts, total pages processed

Viewing Results

Use the built-in Dolma viewer to see side-by-side comparisons:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
This generates HTML files showing the original PDF alongside the extracted text.

Next Steps

Pipeline Details

Learn about the inference pipeline architecture

Anchor Text

Understand how anchor text improves extraction

Quick Start

Start processing PDFs in minutes

API Reference

Explore the API documentation
