The olmocr.pipeline module provides the core functionality for running batch inference on PDF documents using vision-language models.

Overview

The pipeline module orchestrates the entire OCR workflow:
  • Downloads PDFs from S3 or local storage
  • Renders PDF pages to images
  • Extracts anchor text for context
  • Runs vision-language model inference
  • Outputs results in Dolma document format
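The stages above can be sketched as a single loop per document. This is an illustrative outline only; the helper functions are stand-ins, not olmocr's actual API:

```python
# Illustrative sketch of the five pipeline stages. Every helper here is a
# stand-in for the real olmocr machinery, not its actual API.

def download_pdf(path):
    return path  # stand-in: pretend the file is already local

def render_page(pdf, page_num):
    return f"image-of-{pdf}-p{page_num}"  # stand-in for page rendering

def extract_anchor_text(pdf, page_num):
    return f"anchor-text-p{page_num}"  # stand-in for anchor extraction

def run_vlm(image, anchor_text):
    # stand-in for vision-language model inference
    return {"natural_text": f"text from {image}", "anchor": anchor_text}

def run_pipeline(pdf_path, num_pages):
    local = download_pdf(pdf_path)                # 1. download
    pages = []
    for n in range(1, num_pages + 1):
        img = render_page(local, n)               # 2. render to image
        anchor = extract_anchor_text(local, n)    # 3. anchor text for context
        pages.append(run_vlm(img, anchor))        # 4. inference
    return {"source": pdf_path, "pages": pages}   # 5. Dolma-style output

doc = run_pipeline("doc.pdf", 2)
```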

Command Line Interface

Run the pipeline using:
python -m olmocr.pipeline <workspace> [options]

Required Arguments

workspace
string
required
The filesystem path where work will be stored. Can be a local folder or an S3 path for coordinating work across multiple workers.
Example: s3://bucket/prefix/ or /local/path/workspace

PDF Input Options

--pdfs
list[string]
Path(s) of PDFs to add to the workspace. Can be:
  • S3 glob pattern: s3://bucket/prefix/*.pdf
  • Path to file containing list of PDF paths
  • Direct path to a single PDF file
Example: --pdfs s3://my-bucket/documents/*.pdf
--workspace_profile
string
S3 configuration profile for accessing the workspace.
Default: None (uses default AWS credentials)
--pdf_profile
string
S3 configuration profile for accessing the raw PDF documents.
Default: None (uses default AWS credentials)

Work Queue Configuration

--pages_per_group
int
Target number of PDF pages per work item group. The pipeline samples PDFs to estimate average page count and groups work items accordingly.
Default: 500
--max_page_retries
int
Maximum number of times to retry rendering a page if it fails or produces invalid results.
Default: 8
--max_page_error_rate
float
Maximum rate of allowable failed pages in a document. Documents exceeding this rate are discarded.
Default: 0.004 (1 in 250 pages)
--workers
int
Number of concurrent workers to run at a time. Each worker processes a batch of PDFs in parallel.
Default: 8
--apply_filter
boolean
Apply basic filtering to keep only English PDFs that are not forms and are unlikely to be SEO spam.
Default: False

Model Configuration

--model
string
Path or identifier for the vision-language model. Can be a Hugging Face model ID or a custom path.
Default: "allenai/olmOCR-7B-0225-preview"
--model_max_context
int
Maximum context length that the model was fine-tuned under.
Default: 8192
--model_chat_template
string
Chat template to pass to the SGLang server for formatting prompts.
Default: "qwen2-vl"
--target_longest_image_dim
int
Dimension on the longest side to use when rendering PDF pages to images.
Default: 1024
--target_anchor_text_len
int
Maximum amount of anchor text to extract (in characters) for providing context to the model.
Default: 6000

Beaker Integration

--beaker
boolean
Submit this job to Beaker instead of running locally.
Default: False
--beaker_workspace
string
Beaker workspace to submit jobs to.
Default: "ai2/olmocr"
--beaker_cluster
list[string]
Beaker cluster(s) to run on.
Default: ["ai2/jupiter-cirrascale-2", "ai2/ceres-cirrascale", ...]
--beaker_gpus
int
Number of GPU replicas to run.
Default: 1
--beaker_priority
string
Beaker priority level for the job.
Default: "normal"

Statistics

--stats
boolean
Instead of running any jobs, report statistics about the current workspace, including completed items, token counts, and progress.
Default: False

Core Classes

PageResult

Represents the result of processing a single PDF page.
s3_path
string
Original S3 path or local path to the PDF document.
page_num
int
Page number within the PDF (1-indexed).
response
PageResponse
The structured response from the vision-language model containing:
  • natural_text: Extracted text content
  • primary_language: Detected language
  • is_rotation_valid: Whether page rotation is correct
  • rotation_correction: Degrees to rotate if invalid
  • is_table: Whether page contains tables
  • is_diagram: Whether page contains diagrams
input_tokens
int
Number of input tokens used for this page.
output_tokens
int
Number of output tokens generated for this page.
is_fallback
boolean
Whether this page used fallback extraction (pdftotext) due to processing failures.
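The fields above map naturally onto a pair of dataclasses. The following sketch infers the field types from the descriptions; it is a shape illustration, not the module's actual class definitions:

```python
from dataclasses import dataclass

@dataclass
class PageResponse:
    # Structured model output for one page (shape inferred from the docs above)
    natural_text: str
    primary_language: str
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool

@dataclass
class PageResult:
    s3_path: str           # original S3 or local path to the PDF
    page_num: int          # 1-indexed
    response: PageResponse
    input_tokens: int
    output_tokens: int
    is_fallback: bool      # True when the pdftotext fallback was used

result = PageResult(
    s3_path="s3://bucket/doc.pdf",
    page_num=1,
    response=PageResponse("Hello world", "en", True, 0, False, False),
    input_tokens=1200,
    output_tokens=300,
    is_fallback=False,
)
```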

Key Functions

build_page_query

Builds a query payload for processing a single PDF page.
async def build_page_query(
    local_pdf_path: str,
    page: int,
    target_longest_image_dim: int,
    target_anchor_text_len: int,
    image_rotation: int = 0
) -> dict
local_pdf_path
string
required
Path to the PDF file on local disk.
page
int
required
Page number to process (1-indexed).
target_longest_image_dim
int
required
Target dimension for the longest side of the rendered image.
target_anchor_text_len
int
required
Maximum characters of anchor text to extract.
image_rotation
int
Rotation angle to apply to the image (0, 90, 180, or 270 degrees).
Default: 0
Returns: Dictionary containing the API request payload with model, messages, max_tokens, and temperature.
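The returned dictionary follows the OpenAI-style chat-completions shape, combining the anchor text and the rendered page image. This sketch approximates that structure; the exact prompt wording, max_tokens, and temperature values here are illustrative, not olmocr's actual settings:

```python
import base64

def sketch_page_query(image_png: bytes, anchor_text: str, model: str) -> dict:
    # Illustrative approximation of the payload shape build_page_query returns.
    # The prompt text and sampling parameters below are placeholders.
    image_b64 = base64.b64encode(image_png).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Anchor text:\n{anchor_text}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 3000,      # placeholder value
        "temperature": 0.8,      # placeholder value
    }

payload = sketch_page_query(b"fake-png-bytes", "Page 1 heading",
                            "allenai/olmOCR-7B-0225-preview")
```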

process_page

Processes a single PDF page through the inference pipeline.
async def process_page(
    args,
    worker_id: int,
    pdf_orig_path: str,
    pdf_local_path: str,
    page_num: int
) -> PageResult
Handles retries, rotation correction, and fallback to pdftotext if processing fails.
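A simplified view of that retry-and-fallback logic looks like the loop below. The control flow is a sketch under the assumptions stated in the comments; the real function's error classes and rotation handling are more involved:

```python
def sketch_process_page(query_fn, fallback_fn, max_retries=8):
    # Simplified retry loop with rotation correction and a pdftotext-style
    # fallback. query_fn(rotation) stands in for one model inference attempt
    # and returns a dict shaped like PageResponse.
    rotation = 0
    for _ in range(max_retries):
        try:
            resp = query_fn(rotation)
        except Exception:
            continue  # transient failure: retry the page
        if resp.get("is_rotation_valid", True):
            return resp, False  # success; not a fallback result
        # Model says the page is rotated: apply its suggested correction.
        rotation = (rotation + resp.get("rotation_correction", 0)) % 360
    return fallback_fn(), True  # give up on the VLM; use extracted raw text

# Example: the first attempt reports a 90-degree rotation, the second succeeds.
calls = []
def fake_query(rotation):
    calls.append(rotation)
    if rotation == 0:
        return {"is_rotation_valid": False, "rotation_correction": 90}
    return {"is_rotation_valid": True, "natural_text": "ok"}

resp, used_fallback = sketch_process_page(fake_query, lambda: {"natural_text": "raw"})
```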

process_pdf

Processes an entire PDF document.
async def process_pdf(
    args,
    worker_id: int,
    pdf_orig_path: str
) -> Optional[dict]
Downloads the PDF, processes all pages concurrently, applies filtering, and builds a Dolma document. Returns: Dolma document dict or None if processing failed or document was filtered out.

build_dolma_document

Constructs a Dolma-format document from page results.
def build_dolma_document(
    pdf_orig_path: str,
    page_results: List[PageResult]
) -> Optional[dict]
Returns: Dictionary with structure:
{
  "id": "<sha1-hash>",
  "text": "<full-document-text>",
  "source": "olmocr",
  "added": "2024-01-01",
  "created": "2024-01-01",
  "metadata": {
    "Source-File": "s3://...",
    "olmocr-version": "0.1.0",
    "pdf-total-pages": 10,
    "total-input-tokens": 5000,
    "total-output-tokens": 3000,
    "total-fallback-pages": 0
  },
  "attributes": {
    "pdf_page_numbers": [[0, 100, 1], [100, 200, 2], ...]
  }
}
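The pdf_page_numbers attribute maps character spans of the concatenated text back to page numbers, and the id is a hash over the document. A sketch of how such a document could be assembled (field details beyond the structure shown above are illustrative):

```python
import hashlib

def sketch_build_dolma_document(pdf_path: str, page_texts: list) -> dict:
    # Illustrative assembly: concatenate page texts and record, for each page,
    # the [start, end, page_number] character span it occupies in the full text.
    spans, pieces, offset = [], [], 0
    for page_num, text in enumerate(page_texts, start=1):
        pieces.append(text)
        spans.append([offset, offset + len(text), page_num])
        offset += len(text)
    full_text = "".join(pieces)
    return {
        "id": hashlib.sha1(full_text.encode()).hexdigest(),
        "text": full_text,
        "source": "olmocr",
        "metadata": {"Source-File": pdf_path, "pdf-total-pages": len(page_texts)},
        "attributes": {"pdf_page_numbers": spans},
    }

doc = sketch_build_dolma_document("s3://bucket/doc.pdf", ["abc", "defg"])
# doc["attributes"]["pdf_page_numbers"] == [[0, 3, 1], [3, 7, 2]]
```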

Usage Examples

Basic Local Processing

python -m olmocr.pipeline ./workspace \
  --pdfs document.pdf \
  --workers 4

S3 Workspace with Multiple Workers

python -m olmocr.pipeline s3://my-bucket/workspace \
  --pdfs "s3://pdf-bucket/documents/*.pdf" \
  --workers 16 \
  --pages_per_group 1000

Using Custom Model

python -m olmocr.pipeline ./workspace \
  --pdfs "s3://my-pdfs/*.pdf" \
  --model "allenai/custom-ocr-model" \
  --model_max_context 16384 \
  --target_longest_image_dim 2048

Submitting to Beaker

python -m olmocr.pipeline s3://my-bucket/workspace \
  --pdfs "s3://pdf-bucket/documents/*.pdf" \
  --beaker \
  --beaker_gpus 8 \
  --beaker_workspace "ai2/olmocr" \
  --beaker_priority "high"

Checking Workspace Statistics

python -m olmocr.pipeline s3://my-bucket/workspace --stats
Output:
Work Items Status:
Total work items: 1,000
Completed items: 750
Remaining items: 250

Results:
Total documents processed: 7,200
Total pages processed: 72,000
Average pages per doc: 10.0
Average output tokens per doc: 4,250.5

Architecture

The pipeline uses an async architecture with:
  1. SGLang Server: Manages the vision-language model inference
  2. Work Queue: Distributes PDF processing tasks across workers
  3. Workers: Process PDFs concurrently, each handling multiple pages in parallel
  4. Process Pool: Offloads CPU-bound tasks (anchor text extraction)
  5. Metrics System: Tracks token usage and throughput

Workflow

Each worker claims a group of work items from the queue, downloads the PDFs, renders and processes each page through the model (with anchor text as context), and writes the resulting Dolma documents back to the workspace.

Error Handling

The pipeline implements robust error handling:
  • Page-level retries: Up to --max_page_retries attempts per page
  • Rotation correction: Automatically detects and corrects rotated pages
  • Fallback extraction: Uses pdftotext when VLM processing fails
  • Document-level filtering: Discards documents exceeding error rate threshold
  • Exponential backoff: For server connection issues
  • Graceful degradation: Continues processing other documents on failure
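Exponential backoff for connection issues follows the standard doubling pattern. A generic sketch (the delay constants and error type are illustrative, not olmocr's actual values):

```python
import time

def with_backoff(op, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    # Retry op(), doubling the wait after each failure (exponential backoff).
    # Delay values here are illustrative placeholders.
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(delay)
            delay *= 2

# Example: fail twice, then succeed; waits 1s then 2s (recorded, not slept).
waits, state = [], {"fails": 2}
def flaky():
    if state["fails"] > 0:
        state["fails"] -= 1
        raise ConnectionError
    return "connected"

result = with_backoff(flaky, sleep=waits.append)
```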

Performance Considerations

  • Concurrent processing: Multiple workers process PDFs in parallel
  • Async I/O: Non-blocking downloads and uploads
  • Process pool: CPU-bound anchor text extraction runs in separate processes
  • Semaphore control: Prevents queue saturation while maximizing GPU utilization
  • Batch grouping: Groups PDFs by estimated page count for balanced workloads
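Semaphore-based concurrency control in asyncio looks roughly like this; the limit and the per-item work are stand-ins for the pipeline's actual request handling:

```python
import asyncio

async def process_all(items, limit=8):
    # Cap the number of in-flight tasks so the request queue never saturates
    # the inference server, while keeping `limit` tasks running at all times.
    sem = asyncio.Semaphore(limit)
    active, peak = 0, 0

    async def worker(item):
        nonlocal active, peak
        async with sem:              # at most `limit` workers inside at once
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)   # stand-in for the real per-item work
            active -= 1
            return item * 2

    results = await asyncio.gather(*(worker(i) for i in items))
    return results, peak

results, peak = asyncio.run(process_all(range(20), limit=8))
```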
