Welcome to olmOCR
olmOCR is a comprehensive toolkit for training language models to work with PDF documents in the wild. Built by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), olmOCR provides everything you need to extract natural text from complex PDFs at scale.
Try the online demo at https://olmocr.allenai.org/
Get Started
Installation
Set up olmOCR with GPU support and all dependencies
Quickstart
Convert your first PDF in minutes
Pipeline Documentation
Full reference for the PDF processing pipeline
GitHub Repository
View source code and contribute
Key Features
Complete PDF Processing Pipeline
olmOCR provides a full end-to-end pipeline for converting millions of PDFs into high-quality text:
- Smart OCR with ChatGPT 4o - Prompting strategy for natural text parsing using ChatGPT 4o (buildsilver.py:31)
- Side-by-side Evaluation - Compare different pipeline versions with built-in eval toolkit (runeval.py:32)
- Intelligent Filtering - Basic filtering by language and SEO spam removal (filter.py:33)
- Model Fine-tuning - Fine-tuning code for Qwen2-VL and Molmo-O models (train.py:34)
- Scalable Inference - Process millions of PDFs using SGLang-powered GPU inference (pipeline.py:35)
- Visual Results Viewer - View Dolma docs side-by-side with original PDFs (dolmaviewer.py:36)
Built for Scale
Local Processing
Process PDFs on a single GPU for quick testing
Multi-node Support
Scale to multiple nodes with AWS S3 coordination
Cluster Integration
Built-in Beaker support for massive parallel processing
High-Quality Output
Extracted text is stored in Dolma-style JSONL format, making it easy to integrate with existing NLP pipelines and datasets.
What’s Included
The toolkit includes everything you need for PDF processing:
- Data Processing - Prompting strategies and silver data generation
- Evaluation Tools - Side-by-side comparison of pipeline outputs
- Content Filtering - Language detection and spam removal
- Training Code - Fine-tune vision-language models on your data
- Inference Pipeline - Batch processing with GPU acceleration
- Visualization - Preview extracted text alongside original PDFs
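As a rough illustration of how the Dolma-style JSONL output mentioned under High-Quality Output might be consumed downstream, the sketch below parses one JSON object per line and totals the extracted text length. The field names (`id`, `text`, `source`, `metadata`) and the sample records are assumptions for illustration; check the files your pipeline run actually produces for the exact schema.

```python
import json
from typing import Iterable, Iterator


def iter_dolma_docs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical sample records; field names are assumed, not taken from olmOCR.
sample_lines = [
    '{"id": "doc-1", "text": "First page text.", "source": "olmocr", "metadata": {}}',
    "",
    '{"id": "doc-2", "text": "Second document.", "source": "olmocr", "metadata": {}}',
]

docs = list(iter_dolma_docs(sample_lines))
total_chars = sum(len(doc["text"]) for doc in docs)
```

To read a real output file, pass an open file handle instead of the sample list, for example `iter_dolma_docs(open(path, encoding="utf-8"))`, where `path` is wherever your run wrote its results.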
Requirements
GPU
Recent NVIDIA GPU (RTX 4090, L40S, A100, or H100)
Storage
30GB of free disk space
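Before installing, it can help to sanity-check a machine against the requirements above. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, the 30 GB threshold is read as decimal gigabytes (an assumption), and finding `nvidia-smi` on the PATH only suggests that an NVIDIA driver is installed, not that the GPU is one of the listed models.

```python
import shutil

REQUIRED_FREE_GB = 30  # from the Requirements list; decimal GB is an assumption


def free_disk_gb(path: str = ".") -> float:
    """Return free space, in decimal gigabytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def meets_disk_requirement(free_gb: float, required_gb: float = REQUIRED_FREE_GB) -> bool:
    """True when the free space is at least the required amount."""
    return free_gb >= required_gb


def has_nvidia_smi() -> bool:
    """True when the `nvidia-smi` tool is found on PATH (a weak proxy for an NVIDIA GPU)."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    gb = free_disk_gb(".")
    print(f"Free disk: {gb:.1f} GB, enough: {meets_disk_requirement(gb)}")
    print(f"nvidia-smi found: {has_nvidia_smi()}")
```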