What is olmOCR?
olmOCR is a comprehensive toolkit for training language models to work with PDF documents in the wild. It transforms PDFs into high-quality training data by combining vision-language models with intelligent text extraction techniques.
PDF Processing
Convert millions of PDFs to LLM-ready text using fine-tuned vision models
Training Pipeline
Fine-tune Qwen2-VL and Molmo-O models on your PDF data
Evaluation Tools
Side-by-side comparison of different pipeline versions
Distributed Processing
Scale to millions of documents with S3 and work queue coordination
Core Components
olmOCR consists of four main components that work together to create a complete PDF-to-training-data pipeline:
1. Data Preparation (buildsilver.py)
Creates high-quality silver training data using GPT-4o with a specialized prompting strategy. This component:
- Extracts natural text from PDF pages using vision models
- Combines anchor text hints with visual understanding
- Produces ground truth examples for fine-tuning
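The prompting strategy can be sketched as a chat request that pairs the rendered page image with the anchor-text hints. This is an illustrative sketch, not the exact buildsilver.py code; the function name, prompt wording, and `RAW_TEXT_START`/`RAW_TEXT_END` markers are assumptions here.

```python
import base64

def build_silver_prompt(anchor_text: str, page_png: bytes) -> list:
    """Build a chat-style vision request combining a rendered page image
    with anchor-text hints (names and wording are illustrative)."""
    image_b64 = base64.b64encode(page_png).decode("ascii")
    instruction = (
        "Below is the image of one page of a PDF document, along with raw text "
        "extracted from it. Return the natural text of the page in reading order."
        f"\n\nRAW_TEXT_START\n{anchor_text}\nRAW_TEXT_END"
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ]

messages = build_silver_prompt("Page 1\nIntroduction...", b"\x89PNG...")
```

The model's response to such a request becomes one ground-truth example for fine-tuning.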
2. Inference Pipeline (pipeline.py)
The heart of olmOCR, processing millions of PDFs through fine-tuned models:
- Local mode: Process PDFs on a single GPU machine
- Distributed mode: Coordinate work across multiple nodes via S3
- Work queue: Manages job distribution and prevents duplicate processing
- Filtering: Removes low-quality documents (SEO spam, non-English, forms)
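The work-queue idea can be sketched as follows. In the real pipeline this state lives in S3 and claims are recorded as lock files; here it is held in memory, and the class and method names are illustrative, not olmOCR's actual API.

```python
import hashlib

class WorkQueue:
    """Minimal in-memory sketch of olmOCR-style work coordination:
    each work item (a batch of PDFs) gets a content-hash id, and
    already-claimed ids are skipped so no batch is processed twice."""

    def __init__(self, pdf_paths, batch_size=500):
        self.items = [pdf_paths[i:i + batch_size]
                      for i in range(0, len(pdf_paths), batch_size)]
        self.claimed = set()

    @staticmethod
    def item_id(batch):
        return hashlib.sha1("\n".join(batch).encode()).hexdigest()[:16]

    def claim(self):
        """Return the next unclaimed batch, marking it as taken
        (the in-memory analogue of writing a lock file)."""
        for batch in self.items:
            wid = self.item_id(batch)
            if wid not in self.claimed:
                self.claimed.add(wid)
                return wid, batch
        return None, []
```

With a shared store in place of the `claimed` set, many workers can pull from the same queue without duplicating work.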
3. Training (train.py)
Fine-tunes vision-language models to understand PDF layouts:
- Supports Qwen2-VL and Molmo-O architectures
- Trains models to extract natural text from rendered PDF images
- Uses anchor text as context to improve extraction quality
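A single training example can be pictured as an (image, prompt, target) triple, where the prompt carries truncated anchor text as context. The field names and truncation limit below are illustrative, not the exact olmOCR schema.

```python
def make_training_example(image_path: str, anchor_text: str,
                          natural_text: str, max_anchor_chars: int = 4000) -> dict:
    """Sketch of one fine-tuning sample: the model sees the rendered page
    plus truncated anchor-text hints, and learns to emit the natural text."""
    return {
        "image": image_path,                       # rendered PNG of the page
        "prompt": ("Extract the natural text from this page.\n"
                   "Anchor text:\n" + anchor_text[:max_anchor_chars]),
        "target": natural_text,                    # ground truth from silver data
    }
```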
4. Evaluation (runeval.py)
Compares different pipeline versions side-by-side:
- Visual comparison of original PDFs and extracted text
- Metrics for comparing extraction quality
- Dolma viewer for inspecting results
How It Works
The olmOCR workflow follows these steps:
Step-by-Step Process
- PDF Rendering: Each page is rendered to a PNG image at a target resolution (default 1024px longest side)
- Anchor Text Extraction: Multiple extraction methods (pdftotext, pdfium, pypdf) extract raw text to provide context hints
- Vision Model Inference: A fine-tuned VLM (Qwen2-VL or Molmo-O) processes the image and anchor text to produce natural text
- Quality Control: Filtering removes documents that are:
  - Non-English
  - SEO spam or low coherency
  - Forms with mostly structured data
  - Pages with excessive errors
- Output Format: Results are saved in Dolma JSONL format with metadata
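The per-page flow can be sketched as a small function. The three callables stand in for olmOCR internals (renderer, anchor-text extractor, VLM client); their names and signatures are assumptions for illustration.

```python
def process_page(pdf_path, page_num, render, extract_anchor, vlm_infer):
    """Sketch of the per-page workflow described above."""
    png = render(pdf_path, page_num, longest_side=1024)   # step 1: render to PNG
    anchor = extract_anchor(pdf_path, page_num)           # step 2: pdftotext/pdfium/pypdf hints
    return vlm_infer(image=png, anchor_text=anchor)       # step 3: VLM produces natural text
```

Quality control and Dolma serialization then run over the collected page results.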
Architecture Principles
Fault Tolerance
The pipeline is designed to handle failures gracefully:
- Page-level retries: Failed pages are retried up to 8 times with exponential backoff
- Fallback extraction: If vision model fails, falls back to pdftotext
- Error rate threshold: Documents with >0.4% failed pages are discarded
- Lock files: Prevent duplicate processing in distributed environments
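The retry-with-backoff behavior can be sketched synchronously (the real pipeline is async). The function name and jitter factor are illustrative; only the "up to 8 attempts, exponentially growing delay" shape comes from the description above.

```python
import random

def retry_page(process, max_attempts=8, base_delay=1.0, sleep=lambda s: None):
    """Retry `process` up to max_attempts times; the delay doubles each
    attempt, with random jitter to spread out retries across workers."""
    for attempt in range(max_attempts):
        try:
            return process()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # give up; caller may fall back to pdftotext
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            sleep(delay)
```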
Scalability
Async/Await
Asynchronous processing maximizes GPU utilization
Work Queues
Distributes work across multiple GPU workers
S3 Coordination
Shared state via S3 for multi-node processing
Process Pools
CPU-bound tasks (anchor text) run in separate processes
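The async/await pattern behind this can be sketched as semaphore-bounded workers: enough requests in flight to keep the GPU busy, with a hard cap so memory does not overflow. This is a sketch of the pattern, not olmOCR's actual worker loop; `infer_page` stands in for an async inference request.

```python
import asyncio

async def infer_page(page_id: int) -> str:
    await asyncio.sleep(0)          # stand-in for an async GPU inference request
    return f"page-{page_id}"

async def run_pages(page_ids, max_in_flight=8):
    """Launch all page tasks at once, but let only max_in_flight run
    concurrently; asyncio.gather preserves input order in the results."""
    sem = asyncio.Semaphore(max_in_flight)
    async def one(pid):
        async with sem:
            return await infer_page(pid)
    return await asyncio.gather(*(one(p) for p in page_ids))

results = asyncio.run(run_pages(range(20)))
```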
Efficiency
Key optimizations for processing millions of PDFs:
- Batching: Groups ~500 pages per work item for optimal GPU utilization
- Concurrent workers: Default 8 workers per GPU node
- Semaphore control: Ensures GPU stays saturated without memory overflow
- Manual HTTP: Custom async HTTP implementation avoids connection pool deadlocks at scale
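The manual-HTTP idea can be sketched as one fresh connection per request, so there is no shared connection-pool state to deadlock on. This is not the actual apost() implementation, just a minimal illustration of the approach; the function names are assumptions.

```python
import asyncio
import json

def build_post_request(host: str, path: str, payload: dict) -> bytes:
    """Serialize a bare HTTP/1.1 POST with a JSON body."""
    body = json.dumps(payload).encode()
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n\r\n"
    )
    return head.encode() + body

async def apost_sketch(host: str, port: int, path: str, payload: dict) -> bytes:
    """One socket per request: no pool, so nothing to deadlock on."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(build_post_request(host, path, payload))
    await writer.drain()
    response = await reader.read()      # Connection: close, so read to EOF
    writer.close()
    await writer.wait_closed()
    return response
```

The trade-off is per-request connection setup cost, which is acceptable when it buys reliability at this scale.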
At 100M+ requests, standard HTTP libraries can experience deadlocks. olmOCR uses a custom apost() function to avoid this issue.
Output and Results
Processed documents are stored in Dolma format with:
- Text: Natural reading-order text extracted from the PDF
- Page spans: Character offsets mapping text back to original page numbers
- Metadata: Token counts, version info, source file location
- Quality metrics: Fallback page counts, total pages processed
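A record combining these fields might look like the following. The field names here are approximate, assembled from the list above; check actual olmOCR output for the exact schema.

```python
import json

# Illustrative Dolma-style record (field names are approximate).
record = {
    "id": "sha1-of-source-pdf",
    "text": "Natural reading-order text of the document...",
    "source": "olmocr",
    "metadata": {
        "Source-File": "s3://bucket/path/doc.pdf",     # source file location
        "pdf-total-pages": 12,                         # total pages processed
        "total-fallback-pages": 0,                     # pages that used pdftotext fallback
    },
    "attributes": {
        # [start_char, end_char, page]: maps text spans back to page numbers
        "pdf_page_numbers": [[0, 1523, 1]],
    },
}
line = json.dumps(record)   # one JSON object per line in the .jsonl output
```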
Viewing Results
Use the built-in Dolma viewer to see side-by-side comparisons.
Next Steps
Pipeline Details
Learn about the inference pipeline architecture
Anchor Text
Understand how anchor text improves extraction
Quick Start
Start processing PDFs in minutes
API Reference
Explore the API documentation