
Overview

This guide covers how to use olmOCR locally to convert PDFs to structured text. Local usage is ideal for testing, small batches, or when you have a single GPU machine available.

Prerequisites

Before you begin, ensure you have:
  • A recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with sufficient memory (40GB+ recommended for optimal performance)
  • At least 30GB of free disk space
  • Properly configured olmOCR environment (see Installation)
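
Before starting a run, you can sanity-check the disk-space requirement from Python. This is an illustrative stdlib helper, not part of olmOCR:

```python
import shutil

def check_disk_space(path: str = ".", required_gb: float = 30.0) -> bool:
    # Confirm the filesystem containing `path` has at least required_gb free.
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb
```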

Setting Up Your Local Workspace

The pipeline uses a local workspace directory to store:
  • Temporary processing files
  • Work queue state
  • Output results in Dolma format
1. Create a workspace directory

Choose a location with sufficient disk space:
mkdir ./localworkspace
2. Verify your GPU is available

The pipeline automatically checks for GPU availability. If no GPU is detected, you’ll see an error message.
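You can also check manually before launching. A minimal sketch, assuming PyTorch is installed alongside olmOCR (the pipeline's own check may differ):

```python
import importlib.util

def gpu_available() -> bool:
    # True only if PyTorch is installed and reports at least one CUDA device.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()
```

Alternatively, running nvidia-smi on the command line lists visible GPUs and their memory.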
3. Choose your model

By default, olmOCR uses allenai/olmOCR-7B-0225-preview from Hugging Face. You can specify a different model with the --model flag.

Converting PDFs

Single PDF Conversion

Convert a single PDF document:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
This will:
  1. Start an sglang inference server on port 30024
  2. Process each page of the PDF
  3. Save results to ./localworkspace/results/output_*.jsonl

Multiple PDF Conversion

Convert multiple PDFs using glob patterns:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
You can also specify multiple individual files:
python -m olmocr.pipeline ./localworkspace --pdfs file1.pdf file2.pdf file3.pdf

Converting from a File List

If you have a text file with one PDF path per line:
python -m olmocr.pipeline ./localworkspace --pdfs pdf_list.txt
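If you need to build such a list, a small helper like the following (hypothetical, not part of olmOCR) collects every PDF under a directory into a newline-separated file:

```python
from pathlib import Path

def write_pdf_list(root: str, out_path: str) -> int:
    # Gather every *.pdf under `root` (recursively), one absolute path per line,
    # in a format suitable for --pdfs pdf_list.txt. Returns the number of PDFs found.
    pdfs = sorted(str(p.resolve()) for p in Path(root).rglob("*.pdf"))
    Path(out_path).write_text("\n".join(pdfs) + ("\n" if pdfs else ""))
    return len(pdfs)
```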

Configuration Options

Model Configuration

--model
string
default:"allenai/olmOCR-7B-0225-preview"
Path to the model on Hugging Face or local filesystem. The model will be downloaded and cached automatically.
--model_max_context
integer
default:"8192"
Maximum context length for the model. Requests exceeding this will automatically reduce anchor text length.
--model_chat_template
string
default:"qwen2-vl"
Chat template format to use with the sglang server. Must match your model’s expected format.

Image Processing

--target_longest_image_dim
integer
default:"1024"
The longest dimension (width or height) for rendered PDF page images. Higher values provide better quality but require more GPU memory and processing time.

Recommendations:
  • 1024: Good balance for most documents (default)
  • 1536: Better quality for dense or small text
  • 768: Faster processing for simple documents
--target_anchor_text_len
integer
default:"6000"
Maximum characters of anchor text to provide as context to the model. Anchor text is extracted using pdftotext and helps the model understand the page content. This value is automatically reduced if the total context exceeds --model_max_context.
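
As an illustration of how --target_longest_image_dim constrains rendering, a page is scaled so its longest side hits the target while the aspect ratio is preserved. This is a sketch of the idea, not the pipeline's actual rendering code:

```python
def scaled_page_size(width_pts: float, height_pts: float, target_longest: int = 1024):
    # Scale a page so its longest side equals target_longest pixels,
    # keeping the aspect ratio. A US Letter page (612x792 pts) at the
    # default 1024 therefore renders taller than it is wide.
    scale = target_longest / max(width_pts, height_pts)
    return round(width_pts * scale), round(height_pts * scale)
```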

Processing Control

--workers
integer
default:"8"
Number of concurrent workers processing PDFs. More workers can improve throughput but also increase memory usage.

Recommendations:
  • 8-16: Single GPU with good memory (40GB+)
  • 4-8: GPUs with limited memory (24GB)
  • 1-4: For debugging or memory-constrained environments
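Conceptually, --workers bounds how many PDFs are in flight at once, much like limiting concurrency with a semaphore. An illustrative sketch, not olmOCR's implementation:

```python
import asyncio

async def run_workers(items, handle, workers: int = 8):
    # Process items concurrently, but never more than `workers` at a time.
    sem = asyncio.Semaphore(workers)

    async def one(item):
        async with sem:
            return await handle(item)

    return await asyncio.gather(*(one(i) for i in items))
```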
--pages_per_group
integer
default:"500"
Target number of PDF pages per work item group. The pipeline samples your PDFs to estimate pages per document and groups work accordingly.
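The grouping behavior can be pictured as greedy packing of documents into work items of roughly equal page counts. Illustrative only; the pipeline's actual grouping logic may differ:

```python
def group_pdfs(page_counts: dict, pages_per_group: int = 500) -> list:
    # Greedily pack PDFs into groups of at most ~pages_per_group pages.
    groups, current, current_pages = [], [], 0
    for pdf, pages in page_counts.items():
        if current and current_pages + pages > pages_per_group:
            groups.append(current)
            current, current_pages = [], 0
        current.append(pdf)
        current_pages += pages
    if current:
        groups.append(current)
    return groups
```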
--max_page_retries
integer
default:"8"
Maximum retry attempts for processing a page before falling back to simple text extraction.
--max_page_error_rate
float
default:"0.004"
Maximum allowable rate of failed pages (1/250 by default). Documents exceeding this error rate are discarded.
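The error-rate check amounts to a simple threshold comparison. A sketch of the described behavior, not the pipeline's code:

```python
def exceeds_error_rate(failed_pages: int, total_pages: int, max_rate: float = 0.004) -> bool:
    # A document is discarded when its fraction of failed pages exceeds
    # max_rate (1/250 by default).
    return total_pages > 0 and failed_pages / total_pages > max_rate
```

With the default rate, one failed page in a 250-page document is exactly at the limit and kept; a second failure pushes it over and the document is discarded.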

Filtering

--apply_filter
boolean
Apply automatic filtering to skip:
  • Non-English documents
  • Form documents
  • Likely SEO spam content
python -m olmocr.pipeline ./localworkspace --pdfs *.pdf --apply_filter

Example Configurations

High Quality Processing

For academic papers or documents with dense text:
python -m olmocr.pipeline ./localworkspace \
  --pdfs academic_papers/*.pdf \
  --target_longest_image_dim 1536 \
  --target_anchor_text_len 8000 \
  --workers 4

Fast Processing

For simple documents where speed is a priority:
python -m olmocr.pipeline ./localworkspace \
  --pdfs simple_docs/*.pdf \
  --target_longest_image_dim 768 \
  --workers 16 \
  --apply_filter

Memory-Constrained GPU

For GPUs with limited memory (e.g., 24GB):
python -m olmocr.pipeline ./localworkspace \
  --pdfs documents/*.pdf \
  --workers 2 \
  --target_longest_image_dim 1024

Understanding the Output

Results are saved in Dolma format as JSONL files:
./localworkspace/
├── results/
│   ├── output_abc123.jsonl
│   ├── output_def456.jsonl
│   └── ...
└── work_queue/
    └── ...
Each JSONL file contains documents with:
  • id: SHA-1 hash of the document text
  • text: Extracted text content
  • metadata: Processing statistics (tokens, pages, etc.)
  • attributes.pdf_page_numbers: Character spans for each page
See Viewing Results for how to visualize and analyze the output.
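Because the output is plain JSONL, the records can also be inspected with nothing but the standard library (field names as described above):

```python
import json
from pathlib import Path

def iter_dolma_docs(results_dir: str):
    # Yield each document from every output_*.jsonl file in the results directory.
    for path in sorted(Path(results_dir).glob("output_*.jsonl")):
        with path.open() as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)
```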

Monitoring Progress

The pipeline provides real-time progress information:
2026-03-03 10:15:23 - Pipeline started with PID 12345
2026-03-03 10:15:45 - sglang server is ready
2026-03-03 10:15:46 - Worker 0 processing work item abc123
2026-03-03 10:15:48 - Got 15 pages to do for s3://bucket/doc.pdf in worker 0
2026-03-03 10:16:02 - Got 1 docs for abc123
Logs are saved to olmocr-pipeline-debug.log in the current directory.
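To peek at recent activity without reading the whole log, a small stdlib tail helper works (illustrative; `tail -f` on the command line does the same job):

```python
from collections import deque

def tail(path: str, n: int = 10) -> list:
    # Return the last n lines of a file, e.g. olmocr-pipeline-debug.log.
    with open(path) as f:
        return list(deque(f, maxlen=n))
```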

Troubleshooting

Out of Memory Errors

If you encounter GPU out-of-memory errors:
  1. Reduce --workers to 1 or 2
  2. Decrease --target_longest_image_dim to 768 or 512
  3. Ensure no other processes are using the GPU

Server Connection Errors

If workers can’t connect to the sglang server:
  1. Check that port 30024 is not in use
  2. Verify the model loaded successfully in the logs
  3. Ensure sglang is installed correctly (see Installation)

Page Processing Failures

When pages fail to process:
  • The pipeline automatically retries each failed page up to --max_page_retries times
  • If a page's rotation is detected as incorrect, the pipeline retries with rotation correction
  • After the maximum number of retries, the page falls back to simple pdftotext extraction
  • Documents with too many failed pages (exceeding --max_page_error_rate) are discarded
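
The retry-then-fallback flow can be sketched as follows (illustrative control flow only; the real pipeline also applies rotation correction on retries and tracks the per-document error rate):

```python
def process_page_with_retries(process, fallback, page, max_retries: int = 8):
    # Try the model-based `process` up to max_retries times; on persistent
    # failure, fall back to the simpler `fallback` extraction.
    for _ in range(max_retries):
        try:
            return process(page)
        except Exception:
            continue
    return fallback(page)
```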

Next Steps

View Results

Learn how to visualize and analyze your converted documents

Cluster Processing

Scale up to process millions of PDFs using multiple nodes
