The olmocr.pipeline module provides the core functionality for running batch inference on PDF documents using vision-language models.

Overview

The pipeline module orchestrates the entire OCR workflow:
  • Downloads PDFs from S3 or local storage
  • Renders PDF pages to images
  • Extracts anchor text for context
  • Runs vision-language model inference
  • Outputs results in Dolma document format
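The stages above can be sketched as a single loop per document. This is an illustrative outline only; the helper functions are stand-ins, not olmocr's actual API:

```python
# Illustrative sketch of the five pipeline stages. Every helper here is a
# stand-in for the real olmocr machinery, not its actual API.

def download_pdf(path):
    return path  # stand-in: pretend the file is already local

def render_page(pdf, page_num):
    return f"image-of-{pdf}-p{page_num}"  # stand-in for page rendering

def extract_anchor_text(pdf, page_num):
    return f"anchor-text-p{page_num}"  # stand-in for anchor extraction

def run_vlm(image, anchor_text):
    # stand-in for vision-language model inference
    return {"natural_text": f"text from {image}", "anchor": anchor_text}

def run_pipeline(pdf_path, num_pages):
    local = download_pdf(pdf_path)                # 1. download
    pages = []
    for n in range(1, num_pages + 1):
        img = render_page(local, n)               # 2. render to image
        anchor = extract_anchor_text(local, n)    # 3. anchor text for context
        pages.append(run_vlm(img, anchor))        # 4. inference
    return {"source": pdf_path, "pages": pages}   # 5. Dolma-style output

doc = run_pipeline("doc.pdf", 2)
```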

Command Line Interface

Run the pipeline using:
python -m olmocr.pipeline <workspace> [options]

Required Arguments

workspace
string
required
The filesystem path where work will be stored. Can be a local folder or an S3 path for coordinating work across multiple workers.
Example: s3://bucket/prefix/ or /local/path/workspace

PDF Input Options

--pdfs
list[string]
Path(s) of PDFs to add to the workspace. Can be:
  • S3 glob pattern: s3://bucket/prefix/*.pdf
  • Path to file containing list of PDF paths
  • Direct path to a single PDF file
Example: --pdfs s3://my-bucket/documents/*.pdf
--workspace_profile
string
S3 configuration profile for accessing the workspace.
Default: None (uses default AWS credentials)
--pdf_profile
string
S3 configuration profile for accessing the raw PDF documents.
Default: None (uses default AWS credentials)

Work Queue Configuration

--pages_per_group
int
Target number of PDF pages per work item group. The pipeline samples PDFs to estimate average page count and groups work items accordingly.
Default: 500
--max_page_retries
int
Maximum number of times to retry rendering a page if it fails or produces invalid results.
Default: 8
--max_page_error_rate
float
Maximum rate of allowable failed pages in a document. Documents exceeding this rate are discarded.
Default: 0.004 (1 in 250 pages)
--workers
int
Number of concurrent workers to run at a time. Each worker processes a batch of PDFs in parallel.
Default: 8
--apply_filter
boolean
Apply basic filtering to keep only English PDFs that are not forms and are unlikely to be SEO spam.
Default: False

Model Configuration

--model
string
Path or identifier for the vision-language model. Can be a Hugging Face model ID or a custom path.
Default: "allenai/olmOCR-7B-0225-preview"
--model_max_context
int
Maximum context length that the model was fine-tuned under.
Default: 8192
--model_chat_template
string
Chat template to pass to the SGLang server for formatting prompts.
Default: "qwen2-vl"
--target_longest_image_dim
int
Dimension on the longest side to use when rendering PDF pages to images.
Default: 1024
--target_anchor_text_len
int
Maximum amount of anchor text to extract (in characters) for providing context to the model.
Default: 6000

Beaker Integration

--beaker
boolean
Submit this job to Beaker instead of running locally.
Default: False
--beaker_workspace
string
Beaker workspace to submit jobs to.
Default: "ai2/olmocr"
--beaker_cluster
list[string]
Beaker cluster(s) to run on.
Default: ["ai2/jupiter-cirrascale-2", "ai2/ceres-cirrascale", ...]
--beaker_gpus
int
Number of GPU replicas to run.
Default: 1
--beaker_priority
string
Beaker priority level for the job.
Default: "normal"

Statistics

--stats
boolean
Instead of running any jobs, report statistics about the current workspace, including completed items, token counts, and progress.
Default: False

Core Classes

PageResult

Represents the result of processing a single PDF page.
s3_path
string
Original S3 path or local path to the PDF document.
page_num
int
Page number within the PDF (1-indexed).
response
PageResponse
The structured response from the vision-language model containing:
  • natural_text: Extracted text content
  • primary_language: Detected language
  • is_rotation_valid: Whether page rotation is correct
  • rotation_correction: Degrees to rotate if invalid
  • is_table: Whether page contains tables
  • is_diagram: Whether page contains diagrams
input_tokens
int
Number of input tokens used for this page.
output_tokens
int
Number of output tokens generated for this page.
is_fallback
boolean
Whether this page used fallback extraction (pdftotext) due to processing failures.
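The fields above map naturally onto a pair of dataclasses. The following sketch infers the field types from the descriptions; it is a shape illustration, not the module's actual class definitions:

```python
from dataclasses import dataclass

@dataclass
class PageResponse:
    # Structured model output for one page (shape inferred from the docs above)
    natural_text: str
    primary_language: str
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool

@dataclass
class PageResult:
    s3_path: str           # original S3 or local path to the PDF
    page_num: int          # 1-indexed
    response: PageResponse
    input_tokens: int
    output_tokens: int
    is_fallback: bool      # True when the pdftotext fallback was used

result = PageResult(
    s3_path="s3://bucket/doc.pdf",
    page_num=1,
    response=PageResponse("Hello world", "en", True, 0, False, False),
    input_tokens=1200,
    output_tokens=300,
    is_fallback=False,
)
```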

Key Functions

build_page_query

Builds a query payload for processing a single PDF page.
async def build_page_query(
    local_pdf_path: str,
    page: int,
    target_longest_image_dim: int,
    target_anchor_text_len: int,
    image_rotation: int = 0
) -> dict
local_pdf_path
string
required
Path to the PDF file on local disk.
page
int
required
Page number to process (1-indexed).
target_longest_image_dim
int
required
Target dimension for the longest side of the rendered image.
target_anchor_text_len
int
required
Maximum characters of anchor text to extract.
image_rotation
int
Rotation angle to apply to the image (0, 90, 180, or 270 degrees).
Default: 0
Returns: Dictionary containing the API request payload with model, messages, max_tokens, and temperature.
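The returned dictionary follows the OpenAI-style chat-completions shape, combining the anchor text and the rendered page image. This sketch approximates that structure; the exact prompt wording, max_tokens, and temperature values here are illustrative, not olmocr's actual settings:

```python
import base64

def sketch_page_query(image_png: bytes, anchor_text: str, model: str) -> dict:
    # Illustrative approximation of the payload shape build_page_query returns.
    # The prompt text and sampling parameters below are placeholders.
    image_b64 = base64.b64encode(image_png).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Anchor text:\n{anchor_text}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 3000,      # placeholder value
        "temperature": 0.8,      # placeholder value
    }

payload = sketch_page_query(b"fake-png-bytes", "Page 1 heading",
                            "allenai/olmOCR-7B-0225-preview")
```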

process_page

Processes a single PDF page through the inference pipeline.
async def process_page(
    args,
    worker_id: int,
    pdf_orig_path: str,
    pdf_local_path: str,
    page_num: int
) -> PageResult
Handles retries, rotation correction, and fallback to pdftotext if processing fails.
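A simplified view of that retry-and-fallback logic looks like the loop below. The control flow is a sketch under the assumptions stated in the comments; the real function's error classes and rotation handling are more involved:

```python
def sketch_process_page(query_fn, fallback_fn, max_retries=8):
    # Simplified retry loop with rotation correction and a pdftotext-style
    # fallback. query_fn(rotation) stands in for one model inference attempt
    # and returns a dict shaped like PageResponse.
    rotation = 0
    for _ in range(max_retries):
        try:
            resp = query_fn(rotation)
        except Exception:
            continue  # transient failure: retry the page
        if resp.get("is_rotation_valid", True):
            return resp, False  # success; not a fallback result
        # Model says the page is rotated: apply its suggested correction.
        rotation = (rotation + resp.get("rotation_correction", 0)) % 360
    return fallback_fn(), True  # give up on the VLM; use extracted raw text

# Example: the first attempt reports a 90-degree rotation, the second succeeds.
calls = []
def fake_query(rotation):
    calls.append(rotation)
    if rotation == 0:
        return {"is_rotation_valid": False, "rotation_correction": 90}
    return {"is_rotation_valid": True, "natural_text": "ok"}

resp, used_fallback = sketch_process_page(fake_query, lambda: {"natural_text": "raw"})
```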

process_pdf

Processes an entire PDF document.
async def process_pdf(
    args,
    worker_id: int,
    pdf_orig_path: str
) -> Optional[dict]
Downloads the PDF, processes all pages concurrently, applies filtering, and builds a Dolma document. Returns: Dolma document dict or None if processing failed or document was filtered out.

build_dolma_document

Constructs a Dolma-format document from page results.
def build_dolma_document(
    pdf_orig_path: str,
    page_results: List[PageResult]
) -> Optional[dict]
Returns: Dictionary with structure:
{
  "id": "<sha1-hash>",
  "text": "<full-document-text>",
  "source": "olmocr",
  "added": "2024-01-01",
  "created": "2024-01-01",
  "metadata": {
    "Source-File": "s3://...",
    "olmocr-version": "0.1.0",
    "pdf-total-pages": 10,
    "total-input-tokens": 5000,
    "total-output-tokens": 3000,
    "total-fallback-pages": 0
  },
  "attributes": {
    "pdf_page_numbers": [[0, 100, 1], [100, 200, 2], ...]
  }
}
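The pdf_page_numbers attribute maps character spans of the concatenated text back to page numbers, and the id is a hash over the document. A sketch of how such a document could be assembled (field details beyond the structure shown above are illustrative):

```python
import hashlib

def sketch_build_dolma_document(pdf_path: str, page_texts: list) -> dict:
    # Illustrative assembly: concatenate page texts and record, for each page,
    # the [start, end, page_number] character span it occupies in the full text.
    spans, pieces, offset = [], [], 0
    for page_num, text in enumerate(page_texts, start=1):
        pieces.append(text)
        spans.append([offset, offset + len(text), page_num])
        offset += len(text)
    full_text = "".join(pieces)
    return {
        "id": hashlib.sha1(full_text.encode()).hexdigest(),
        "text": full_text,
        "source": "olmocr",
        "metadata": {"Source-File": pdf_path, "pdf-total-pages": len(page_texts)},
        "attributes": {"pdf_page_numbers": spans},
    }

doc = sketch_build_dolma_document("s3://bucket/doc.pdf", ["abc", "defg"])
# doc["attributes"]["pdf_page_numbers"] == [[0, 3, 1], [3, 7, 2]]
```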

Usage Examples

Basic Local Processing

python -m olmocr.pipeline ./workspace \
  --pdfs document.pdf \
  --workers 4

S3 Workspace with Multiple Workers

python -m olmocr.pipeline s3://my-bucket/workspace \
  --pdfs "s3://pdf-bucket/documents/*.pdf" \
  --workers 16 \
  --pages_per_group 1000

Using Custom Model

python -m olmocr.pipeline ./workspace \
  --pdfs "s3://my-pdfs/*.pdf" \
  --model "allenai/custom-ocr-model" \
  --model_max_context 16384 \
  --target_longest_image_dim 2048

Submitting to Beaker

python -m olmocr.pipeline s3://my-bucket/workspace \
  --pdfs "s3://pdf-bucket/documents/*.pdf" \
  --beaker \
  --beaker_gpus 8 \
  --beaker_workspace "ai2/olmocr" \
  --beaker_priority "high"

Checking Workspace Statistics

python -m olmocr.pipeline s3://my-bucket/workspace --stats
Output:
Work Items Status:
Total work items: 1,000
Completed items: 750
Remaining items: 250

Results:
Total documents processed: 7,200
Total pages processed: 72,000
Average pages per doc: 10.0
Average output tokens per doc: 4,250.5

Architecture

The pipeline uses an async architecture with:
  1. SGLang Server: Manages the vision-language model inference
  2. Work Queue: Distributes PDF processing tasks across workers
  3. Workers: Process PDFs concurrently, each handling multiple pages in parallel
  4. Process Pool: Offloads CPU-bound tasks (anchor text extraction)
  5. Metrics System: Tracks token usage and throughput

Workflow

Each worker claims a group of work items from the queue, downloads the PDFs, renders and processes each page through the model (with anchor text as context), and writes the resulting Dolma documents back to the workspace.

Error Handling

The pipeline implements robust error handling:
  • Page-level retries: Up to --max_page_retries attempts per page
  • Rotation correction: Automatically detects and corrects rotated pages
  • Fallback extraction: Uses pdftotext when VLM processing fails
  • Document-level filtering: Discards documents exceeding error rate threshold
  • Exponential backoff: For server connection issues
  • Graceful degradation: Continues processing other documents on failure
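Exponential backoff for connection issues follows the standard doubling pattern. A generic sketch (the delay constants and error type are illustrative, not olmocr's actual values):

```python
import time

def with_backoff(op, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    # Retry op(), doubling the wait after each failure (exponential backoff).
    # Delay values here are illustrative placeholders.
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(delay)
            delay *= 2

# Example: fail twice, then succeed; waits 1s then 2s (recorded, not slept).
waits, state = [], {"fails": 2}
def flaky():
    if state["fails"] > 0:
        state["fails"] -= 1
        raise ConnectionError
    return "connected"

result = with_backoff(flaky, sleep=waits.append)
```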

Performance Considerations

  • Concurrent processing: Multiple workers process PDFs in parallel
  • Async I/O: Non-blocking downloads and uploads
  • Process pool: CPU-bound anchor text extraction runs in separate processes
  • Semaphore control: Prevents queue saturation while maximizing GPU utilization
  • Batch grouping: Groups PDFs by estimated page count for balanced workloads
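Semaphore-based concurrency control in asyncio looks roughly like this; the limit and the per-item work are stand-ins for the pipeline's actual request handling:

```python
import asyncio

async def process_all(items, limit=8):
    # Cap the number of in-flight tasks so the request queue never saturates
    # the inference server, while keeping `limit` tasks running at all times.
    sem = asyncio.Semaphore(limit)
    active, peak = 0, 0

    async def worker(item):
        nonlocal active, peak
        async with sem:              # at most `limit` workers inside at once
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)   # stand-in for the real per-item work
            active -= 1
            return item * 2

    results = await asyncio.gather(*(worker(i) for i in items))
    return results, peak

results, peak = asyncio.run(process_all(range(20), limit=8))
```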
