Introduction

Welcome to olmOCR

olmOCR is a comprehensive toolkit for training language models to work with PDF documents in the wild. Built by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), olmOCR provides everything you need to extract natural text from complex PDFs at scale.

Try the online demo at https://olmocr.allenai.org/

Get Started

Installation

Set up olmOCR with GPU support and all dependencies

Quickstart

Convert your first PDF in minutes

Pipeline Documentation

Full reference for the PDF processing pipeline

GitHub Repository

View source code and contribute

Key Features

Complete PDF Processing Pipeline

olmOCR provides an end-to-end pipeline for converting millions of PDFs into high-quality text:

Smart OCR with ChatGPT 4o - Prompting strategy for natural text parsing (buildsilver.py)
Side-by-side Evaluation - Compare different pipeline versions with the built-in eval toolkit (runeval.py)
Intelligent Filtering - Basic filtering by language and SEO spam removal (filter.py)
Model Fine-tuning - Fine-tuning code for Qwen2-VL and Molmo-O models (train.py)
Scalable Inference - Process millions of PDFs using SGLang-powered GPU inference (pipeline.py)
Visual Results Viewer - View Dolma docs side-by-side with original PDFs (dolmaviewer.py)

Built for Scale

Local Processing

Process PDFs on a single GPU for quick testing

Multi-node Support

Scale to multiple nodes with AWS S3 coordination

Cluster Integration

Built-in Beaker support for massive parallel processing
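
This page does not describe how olmOCR coordinates work across nodes, so the following is only a generic sketch of one way to split a list of PDFs across workers that share a bucket. It assigns each path to a worker by hashing the path, which means every node computes the same assignment without talking to the others. The function names, the bucket name, and the hashing scheme are all hypothetical and are not taken from the olmOCR codebase.

```python
import hashlib


def shard_for(pdf_path: str, num_workers: int) -> int:
    """Map a PDF path to a worker index in [0, num_workers).

    Hypothetical helper: uses a stable hash so the result is the same
    on every machine and in every Python process.
    """
    digest = hashlib.sha256(pdf_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(pdf_paths, num_workers):
    """Split pdf_paths into num_workers disjoint lists."""
    shards = [[] for _ in range(num_workers)]
    for path in pdf_paths:
        shards[shard_for(path, num_workers)].append(path)
    return shards


# Illustrative paths only; "example-bucket" is a placeholder.
paths = [f"s3://example-bucket/pdfs/doc{i}.pdf" for i in range(100)]
shards = partition(paths, 4)
```

A static split like this is simple but does not rebalance when one worker falls behind; a shared work queue is the usual alternative when document sizes vary widely.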

High-Quality Output

Extracted text is stored in Dolma-style JSONL format, making it easy to integrate with existing NLP pipelines and datasets.

olmOCR handles complex PDFs including scanned documents, multi-column layouts, tables, diagrams, and rotated pages.
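
As a rough illustration of consuming JSONL output, the sketch below reads one JSON object per line and collects the extracted text. It assumes each record has an "id" and a "text" field, as Dolma documents typically do; the exact fields olmOCR writes are not listed on this page, so check a real output file before relying on them. The sample records are made up.

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc


# Made-up sample data; field names "id" and "text" are an assumption.
sample = io.StringIO(
    '{"id": "doc-1", "text": "First page text."}\n'
    "\n"
    '{"id": "doc-2", "text": "Second document."}\n'
)
texts = {doc["id"]: doc["text"] for doc in iter_documents(sample)}
```

To read a file on disk, pass an open file handle instead of the in-memory sample, for example `with open(path, encoding="utf-8") as f: ...`.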

What’s Included

The toolkit includes everything you need for PDF processing:

Data Processing - Prompting strategies and silver data generation
Evaluation Tools - Side-by-side comparison of pipeline outputs
Content Filtering - Language detection and spam removal
Training Code - Fine-tune vision-language models on your data
Inference Pipeline - Batch processing with GPU acceleration
Visualization - Preview extracted text alongside original PDFs

Requirements

GPU

Recent NVIDIA GPU (RTX 4090, L40S, A100, or H100)

Storage

30 GB of free disk space
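
A quick pre-flight check for these requirements might look like the sketch below. It uses only the Python standard library and is not part of olmOCR. It assumes "30 GB" means 30 × 10^9 bytes, and it only tests whether the `nvidia-smi` tool is on the PATH, which suggests an NVIDIA driver is installed but says nothing about the GPU model or its memory.

```python
import shutil

# Assumption: the 30 GB requirement is read as decimal gigabytes.
REQUIRED_FREE_BYTES = 30 * 10**9


def has_enough_disk(path: str = ".", required: int = REQUIRED_FREE_BYTES) -> bool:
    """Return True if the filesystem holding `path` has `required` bytes free."""
    return shutil.disk_usage(path).free >= required


def nvidia_smi_available() -> bool:
    """Return True if the nvidia-smi executable can be found on PATH."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("Enough disk space:", has_enough_disk())
    print("nvidia-smi found:", nvidia_smi_available())
```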

License

olmOCR is licensed under Apache 2.0. A full copy of the license can be found on GitHub.

Team

olmOCR is developed and maintained by the AllenNLP team, backed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering. To learn more about who specifically contributed to this codebase, see our contributors page.
