
Welcome to olmOCR

olmOCR is a comprehensive toolkit for training language models to work with PDF documents in the wild. Built by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), olmOCR provides everything you need to extract natural text from complex PDFs at scale.
Try the online demo at https://olmocr.allenai.org/

Get Started

Installation

Set up olmOCR with GPU support and all dependencies

Quickstart

Convert your first PDF in minutes

Pipeline Documentation

Full reference for the PDF processing pipeline

GitHub Repository

View source code and contribute

Key Features

Complete PDF Processing Pipeline

olmOCR provides a full end-to-end pipeline for converting millions of PDFs into high-quality text:
  • Smart OCR with ChatGPT 4o - A prompting strategy for natural text parsing, used to generate silver data (buildsilver.py)
  • Side-by-side Evaluation - Compare different pipeline versions with the built-in eval toolkit (runeval.py)
  • Intelligent Filtering - Basic filtering by language and SEO spam removal (filter.py)
  • Model Fine-tuning - Fine-tuning code for Qwen2-VL and Molmo-O models (train.py)
  • Scalable Inference - Process millions of PDFs using SGLang-powered GPU inference (pipeline.py)
  • Visual Results Viewer - View Dolma docs side-by-side with the original PDFs (dolmaviewer.py)

Built for Scale

Local Processing

Process PDFs on a single GPU for quick testing

Multi-node Support

Scale to multiple nodes with AWS S3 coordination

Cluster Integration

Built-in Beaker support for massive parallel processing

High-Quality Output

Extracted text is stored in Dolma-style JSONL format, making it easy to integrate with existing NLP pipelines and datasets.
olmOCR handles complex PDFs including scanned documents, multi-column layouts, tables, diagrams, and rotated pages.
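
Because the output is JSONL (one JSON object per line), it can be consumed with nothing more than the standard library. The following is a minimal sketch, not part of olmOCR itself: the helper names are hypothetical, and the `text` field is an assumption based on the Dolma document convention, so check the keys in your own output before relying on it.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(jsonl_path: Path) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def total_characters(jsonl_path: Path) -> int:
    """Sum the length of the text across all documents in the file."""
    # Assumption: each record stores its extracted text under "text".
    # A record without that key is counted as empty rather than failing.
    return sum(len(doc.get("text", "")) for doc in iter_documents(jsonl_path))
```

For example, `total_characters(Path("results/output.jsonl"))` (a hypothetical path) gives a quick sanity check on how much text a run produced, and `iter_documents` can feed records straight into a tokenizer or a dataset builder.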

What’s Included

The toolkit includes everything you need for PDF processing:
  • Data Processing - Prompting strategies and silver data generation
  • Evaluation Tools - Side-by-side comparison of pipeline outputs
  • Content Filtering - Language detection and spam removal
  • Training Code - Fine-tune vision-language models on your data
  • Inference Pipeline - Batch processing with GPU acceleration
  • Visualization - Preview extracted text alongside original PDFs

Requirements

GPU

Recent NVIDIA GPU (RTX 4090, L40S, A100, or H100)

Storage

30GB of free disk space
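
Before installing, it can help to confirm that the machine meets these requirements. The sketch below is an illustrative preflight check using only the Python standard library; the function names are hypothetical, it treats 1 GB as 10^9 bytes (an assumption), and finding `nvidia-smi` on the PATH only suggests that an NVIDIA driver is installed, not that the GPU is one of the models listed above.

```python
import shutil
from pathlib import Path

REQUIRED_FREE_GB = 30  # storage requirement stated above


def free_disk_gb(path: Path) -> float:
    """Return free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path: Path, required_gb: float = REQUIRED_FREE_GB) -> bool:
    """Check whether the filesystem containing `path` has the required free space."""
    return free_disk_gb(path) >= required_gb


def nvidia_smi_available() -> bool:
    """Return True if the `nvidia-smi` executable is found on the PATH."""
    return shutil.which("nvidia-smi") is not None
```

Calling `has_enough_disk(Path.home())` and `nvidia_smi_available()` before setup gives an early warning about the two most common blockers.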

License

olmOCR is licensed under Apache 2.0. A full copy of the license can be found on GitHub.

Team

olmOCR is developed and maintained by the AllenNLP team, backed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering. For a list of the individual contributors to this codebase, see our contributors page.
