Overview
This guide covers how to use olmOCR locally to convert PDFs to structured text. Local usage is ideal for testing, small batches, or when you have a single GPU machine available.
Prerequisites
Before you begin, ensure you have:
- A recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100)
- At least 30GB of free disk space
- GPU with sufficient memory (40GB+ recommended for optimal performance)
- Properly configured olmOCR environment (see Installation)
Setting Up Your Local Workspace
The pipeline uses a local workspace directory to store:
- Temporary processing files
- Work queue state
- Output results in Dolma format
Verify your GPU is available
The pipeline automatically checks for GPU availability. If no GPU is detected, you’ll see an error message.
Converting PDFs
Single PDF Conversion
Converting a single PDF document will:
- Start an sglang inference server on port 30024
- Process each page of the PDF
- Save results to ./localworkspace/results/output_*.jsonl
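The conversion described above could be launched from Python roughly as follows. This is a sketch under stated assumptions: the module name `olmocr.pipeline` and the `--pdfs` flag do not appear in this guide and should be checked against your installed version, `build_pipeline_command` is a hypothetical helper, and `sample.pdf` is a placeholder path. Only the workspace path `./localworkspace` comes from the text above.

```python
import sys


def build_pipeline_command(workspace, pdf_paths):
    """Return the argv list for one pipeline run over the given PDFs."""
    if not pdf_paths:
        raise ValueError("at least one PDF path is required")
    # Assumption: the pipeline is started as `python -m olmocr.pipeline`
    # and receives its input PDFs through a `--pdfs` flag.
    return [sys.executable, "-m", "olmocr.pipeline", workspace, "--pdfs", *pdf_paths]


# "sample.pdf" is a placeholder used only for illustration.
command = build_pipeline_command("./localworkspace", ["sample.pdf"])
print(" ".join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would start the run. Building the argument list separately makes it easy to inspect the exact command before spending GPU time on it.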
Multiple PDF Conversion
Convert multiple PDFs using glob patterns.
Converting from a File List
If you have a text file with one PDF path per line, you can convert the PDFs it lists.
Configuration Options
Model Configuration
The model option takes a path to a model on Hugging Face or the local filesystem. The model is downloaded and cached automatically.
--model_max_context sets the maximum context length for the model. When a request exceeds it, the anchor text length is reduced automatically.
The chat template option sets the chat template format used with the sglang server. It must match the format your model expects.
Image Processing
--target_longest_image_dim sets the longest dimension (width or height) of rendered PDF page images. Higher values provide better quality but require more GPU memory and processing time.
Recommendations:
- 1024: Good balance for most documents (default)
- 1536: Better quality for dense or small text
- 768: Faster processing for simple documents
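To make the "longest dimension" idea concrete, the sketch below scales a page so that its longer side matches the target while preserving the aspect ratio. It only illustrates the arithmetic; how the pipeline actually renders pages is not described here, and `scaled_size` is a hypothetical helper. The 612 x 792 page size (US Letter in points) is an example input.

```python
def scaled_size(width, height, target_longest_dim):
    """Scale (width, height) so the longer side equals target_longest_dim."""
    scale = target_longest_dim / max(width, height)
    return round(width * scale), round(height * scale)


# Compare the three recommended settings for a portrait US Letter page.
for target in (768, 1024, 1536):
    w, h = scaled_size(612, 792, target)
    print(f"target={target}: {w} x {h} ({w * h} pixels)")
```

Because the pixel count grows with the square of the target dimension, moving from 1024 to 1536 more than doubles the image area, which is consistent with the higher memory and time cost noted above.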
The anchor text option sets the maximum number of characters of anchor text provided as context to the model. Anchor text is extracted using pdftotext and helps the model understand the page content. This value is automatically reduced if the total context exceeds --model_max_context.
Processing Control
--workers sets the number of concurrent workers processing PDFs. More workers can improve throughput but also increase memory usage.
Recommendations:
- 8-16: Single GPU with good memory (40GB+)
- 4-8: GPUs with limited memory (24GB)
- 1-4: For debugging or memory-constrained environments
The pages-per-group option sets the target number of PDF pages per work item group. The pipeline samples your PDFs to estimate pages per document and groups work accordingly.
--max_page_retries sets the maximum number of retry attempts for a page before the pipeline falls back to simple text extraction.
--max_page_error_rate sets the maximum allowable rate of failed pages (1/250 by default). Documents exceeding this error rate are discarded.
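The effect of the 1/250 default is easier to see with numbers. The sketch below assumes that "exceeding" means a strict greater-than comparison of failed pages over total pages; that reading, and the helper name `exceeds_error_rate`, are assumptions rather than the pipeline's confirmed logic.

```python
from fractions import Fraction

DEFAULT_MAX_PAGE_ERROR_RATE = Fraction(1, 250)  # default stated in this guide


def exceeds_error_rate(failed_pages, total_pages, max_rate=DEFAULT_MAX_PAGE_ERROR_RATE):
    """Return True if the share of failed pages is strictly above max_rate."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    # Fractions avoid floating-point rounding at the exact threshold.
    return Fraction(failed_pages, total_pages) > max_rate


print(exceeds_error_rate(1, 250))  # exactly at the threshold
print(exceeds_error_rate(2, 250))  # above the threshold
print(exceeds_error_rate(1, 100))  # one failure in a short document
```

Under this reading, a single failed page is enough to discard any document shorter than 250 pages, so it may be worth raising the rate when working with short documents that occasionally fail.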
Filtering
Apply automatic filtering to skip:
- Non-English documents
- Form documents
- Likely SEO spam content
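The options in this section can be combined in one run. The sketch below turns a settings dictionary into command-line flags, restricted to the flag names that this guide spells out. The helper `option_flags` and the two presets are illustrative assumptions built from the recommendations above, not configurations taken from the project.

```python
# Flag names that appear verbatim in this guide.
KNOWN_OPTIONS = {
    "workers",
    "target_longest_image_dim",
    "max_page_retries",
    "max_page_error_rate",
    "model_max_context",
}


def option_flags(settings):
    """Convert {"workers": 2} style settings into ["--workers", "2"]."""
    flags = []
    for name, value in settings.items():
        if name not in KNOWN_OPTIONS:
            raise ValueError(f"unknown option: {name}")
        flags += [f"--{name}", str(value)]
    return flags


# Hypothetical presets derived from the recommendations in this section.
dense_text = {"target_longest_image_dim": 1536}
limited_memory = {"workers": 2, "target_longest_image_dim": 768}

print(option_flags(dense_text))
print(option_flags(limited_memory))
```

Rejecting unknown names early catches typos before a long run starts; extend `KNOWN_OPTIONS` once you have confirmed other flag names against your installed version.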
Example Configurations
High Quality Processing
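For academic papers or documents with dense text, a larger image dimension such as 1536 gives better quality.
Fast Processing
For simple documents where speed is the priority, a smaller image dimension such as 768 processes faster.
Memory-Constrained GPU
For GPUs with limited memory (e.g., 24GB), use fewer workers and a smaller image dimension.
Understanding the Output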
Results are saved in Dolma format as JSONL files. Each record contains:
- id: SHA-1 hash of the document text
- text: Extracted text content
- metadata: Processing statistics (tokens, pages, etc.)
- attributes.pdf_page_numbers: Character spans for each page
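The page spans make it possible to recover per-page text from a record. The sketch below assumes each entry of attributes.pdf_page_numbers is a `[start, end, page_number]` triple of character offsets into text; that layout is an assumption, so inspect one of your own records before relying on it. The helper names and the sample record are made up for illustration.

```python
import json


def read_records(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def pages_from_record(record):
    """Map page number to the slice of text covered by its character span."""
    text = record["text"]
    # Assumed layout: [start_offset, end_offset, page_number] per page.
    spans = record["attributes"]["pdf_page_numbers"]
    return {page: text[start:end] for start, end, page in spans}


# Hand-written sample record, not real pipeline output.
sample = {
    "id": "example",
    "text": "Hello world",
    "attributes": {"pdf_page_numbers": [[0, 6, 1], [6, 11, 2]]},
}
print(pages_from_record(sample))
```

For real output, iterate over `read_records(...)` for each file matching ./localworkspace/results/output_*.jsonl and call `pages_from_record` on every record.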
Monitoring Progress
The pipeline provides real-time progress information. Logs are written to olmocr-pipeline-debug.log in the current directory.
Troubleshooting
Out of Memory Errors
If you encounter GPU out-of-memory errors:
- Reduce --workers to 1 or 2
- Decrease --target_longest_image_dim to 768 or 512
- Ensure no other processes are using the GPU
Server Connection Errors
If workers can’t connect to the sglang server:
- Check that port 30024 is not in use
- Verify the model loaded successfully in the logs
- Ensure sglang is installed correctly (see Installation)
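The first check above can be done before starting the pipeline. The sketch below tests whether something is already accepting TCP connections on a local port. It uses only the Python standard library; `port_in_use` is a hypothetical helper, and checking on 127.0.0.1 assumes the server listens on the local machine.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if port_in_use(30024):
    print("Port 30024 is already in use; stop the other process first.")
else:
    print("Port 30024 is free.")
```

If the port is busy before you start a run, a leftover server from an earlier run is a likely cause.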
Page Processing Failures
When pages fail to process:
- The pipeline automatically retries up to --max_page_retries times
- If rotation is detected as incorrect, it will retry with rotation correction
- After the maximum number of retries, it falls back to simple pdftotext extraction
- Documents with too many failures (exceeding --max_page_error_rate) are discarded
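The retry-then-fallback behavior in the list above follows a common pattern, sketched below. This is not the pipeline's code: `process_with_retries`, `flaky_model`, and `plain_text` are hypothetical stand-ins, rotation correction is left out, and the sketch treats the retry limit as the total number of attempts, which is an assumption.

```python
def process_with_retries(page, run_model, fallback, max_page_retries):
    """Try run_model up to max_page_retries times, then use fallback.

    Returns a (text, used_fallback) tuple.
    """
    for attempt in range(max_page_retries):
        try:
            return run_model(page, attempt), False
        except Exception:
            continue  # a real pipeline would log the failure here
    return fallback(page), True


def flaky_model(page, attempt):
    """Stand-in for model inference that fails on its first two attempts."""
    if attempt < 2:
        raise RuntimeError("simulated failure")
    return f"model text for {page}"


def plain_text(page):
    """Stand-in for simple text extraction."""
    return f"plain text for {page}"


print(process_with_retries("page-1", flaky_model, plain_text, max_page_retries=3))
print(process_with_retries("page-1", flaky_model, plain_text, max_page_retries=2))
```

Returning the `used_fallback` flag alongside the text lets a caller count failed pages per document, which is what a per-document error-rate check needs.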
Next Steps
View Results
Learn how to visualize and analyze your converted documents
Cluster Processing
Scale up to process millions of PDFs using multiple nodes