# olmocr.pipeline

The `olmocr.pipeline` module provides the core functionality for running batch inference on PDF documents using vision-language models.
## Overview

The pipeline module orchestrates the entire OCR workflow:

- Downloads PDFs from S3 or local storage
- Renders PDF pages to images
- Extracts anchor text for context
- Runs vision-language model inference
- Outputs results in Dolma document format
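The Dolma document format mentioned above is a JSON-lines schema. The record below is a rough illustration only: the exact `metadata` and `attributes` keys that olmocr emits may differ, so treat every field here as a hypothetical example.

```python
import json

# Hypothetical sketch of a Dolma-style JSON-lines record; the exact
# keys olmocr writes may differ from these illustrative ones.
record = {
    "id": "doc-0001",                      # unique document id
    "text": "Extracted page text...",      # concatenated per-page text
    "source": "olmocr",                    # producing pipeline
    "metadata": {"Source-File": "s3://bucket/prefix/doc.pdf"},
    "attributes": {"pdf_page_numbers": [[0, 25, 1]]},
}

# One record per line in a .jsonl output file
line = json.dumps(record)
print(json.loads(line)["source"])
```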
## Command Line Interface

Run the pipeline via the `python -m olmocr.pipeline` entry point; see Usage Examples below for complete invocations.

### Required Arguments
#### `workspace`

The filesystem path where work will be stored. Can be a local folder or an S3 path for coordinating work across multiple workers.

**Example:** `s3://bucket/prefix/` or `/local/path/workspace`

### PDF Input Options
#### `--pdfs`

Path(s) of PDFs to add to the workspace. Can be:

- An S3 glob pattern: `s3://bucket/prefix/*.pdf`
- A path to a file containing a list of PDF paths
- A direct path to a single PDF file

**Example:** `--pdfs s3://my-bucket/documents/*.pdf`

#### `--workspace_profile`

S3 configuration profile for accessing the workspace.

**Default:** `None` (uses default AWS credentials)

#### `--pdf_profile`

S3 configuration profile for accessing the raw PDF documents.

**Default:** `None` (uses default AWS credentials)

### Work Queue Configuration
#### `--pages_per_group`

Target number of PDF pages per work item group. The pipeline samples PDFs to estimate the average page count and groups work items accordingly.

**Default:** `500`

#### `--max_page_retries`

Maximum number of times to retry rendering a page if it fails or produces invalid results.

**Default:** `8`

#### `--max_page_error_rate`

Maximum allowable rate of failed pages in a document. Documents exceeding this rate are discarded.

**Default:** `0.004` (1 in 250 pages)

#### `--workers`

Number of concurrent workers to run at a time. Each worker processes a batch of PDFs in parallel.

**Default:** `8`

#### `--apply_filter`

Apply basic filtering to keep only English PDFs that are not forms and not likely SEO spam.

**Default:** `False`

### Model Configuration
#### `--model`

Path or identifier for the vision-language model. Can be a Hugging Face model ID or a custom path.

**Default:** `"allenai/olmOCR-7B-0225-preview"`

#### `--model_max_context`

Maximum context length that the model was fine-tuned under.

**Default:** `8192`

#### `--model_chat_template`

Chat template to pass to the SGLang server for formatting prompts.

**Default:** `"qwen2-vl"`

#### `--target_longest_image_dim`

Dimension of the longest side to use when rendering PDF pages to images.

**Default:** `1024`

#### `--target_anchor_text_len`

Maximum amount of anchor text to extract (in characters) for providing context to the model.

**Default:** `6000`

### Beaker Integration
#### `--beaker`

Submit this job to Beaker instead of running locally.

**Default:** `False`

#### `--beaker_workspace`

Beaker workspace to submit jobs to.

**Default:** `"ai2/olmocr"`

#### `--beaker_cluster`

Beaker cluster(s) to run on.

**Default:** `["ai2/jupiter-cirrascale-2", "ai2/ceres-cirrascale", ...]`

#### `--beaker_gpus`

Number of GPU replicas to run.

**Default:** `1`

#### `--beaker_priority`

Beaker priority level for the job.

**Default:** `"normal"`

### Statistics

#### `--stats`

Instead of running any job, report statistics about the current workspace, including completed items, token counts, and progress.

**Default:** `False`

## Core Classes
### PageResult

Represents the result of processing a single PDF page.

- `s3_path`: Original S3 path or local path to the PDF document.
- `page_num`: Page number within the PDF (1-indexed).
- `response`: The structured response from the vision-language model, containing:
  - `natural_text`: Extracted text content
  - `primary_language`: Detected language
  - `is_rotation_valid`: Whether the page rotation is correct
  - `rotation_correction`: Degrees to rotate if invalid
  - `is_table`: Whether the page contains tables
  - `is_diagram`: Whether the page contains diagrams
- `input_tokens`: Number of input tokens used for this page.
- `output_tokens`: Number of output tokens generated for this page.
- `is_fallback`: Whether this page used fallback extraction (pdftotext) due to processing failures.
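As a rough sketch, the per-page result described above can be modeled with dataclasses. The class and field names below mirror the descriptions in this section but are illustrative stand-ins, not the actual definitions from `olmocr.pipeline`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PageResponseSketch:
    """Illustrative stand-in for the model's structured page response."""
    natural_text: str
    primary_language: str
    is_rotation_valid: bool
    rotation_correction: int   # degrees: 0, 90, 180, or 270
    is_table: bool
    is_diagram: bool

@dataclass(frozen=True)
class PageResultSketch:
    """Illustrative stand-in for the per-page processing result."""
    s3_path: str
    page_num: int              # 1-indexed
    response: PageResponseSketch
    input_tokens: int
    output_tokens: int
    is_fallback: bool

result = PageResultSketch(
    s3_path="s3://bucket/doc.pdf",
    page_num=1,
    response=PageResponseSketch("Hello world", "en", True, 0, False, False),
    input_tokens=1200,
    output_tokens=80,
    is_fallback=False,
)
print(result.page_num, result.response.primary_language)
```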
## Key Functions

### build_page_query

Builds a query payload for processing a single PDF page.

- `local_pdf_path`: Path to the PDF file on local disk.
- `page`: Page number to process (1-indexed).
- `target_longest_image_dim`: Target dimension for the longest side of the rendered image.
- `target_anchor_text_len`: Maximum characters of anchor text to extract.
- `image_rotation`: Rotation angle to apply to the image (0, 90, 180, or 270 degrees). **Default:** `0`

### process_page
Processes a single PDF page through the inference pipeline.

### process_pdf

Processes an entire PDF document. Returns `None` if processing failed or the document was filtered out.

### build_dolma_document

Constructs a Dolma-format document from page results.

## Usage Examples
### Basic Local Processing
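A minimal sketch, assuming olmocr is installed and a PDF sits on local disk (the workspace and file names are placeholders):

```shell
# Process one local PDF into a local workspace
python -m olmocr.pipeline ./localworkspace --pdfs document.pdf
```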
### S3 Workspace with Multiple Workers
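A sketch of coordinating several machines through a shared S3 workspace (bucket names are placeholders; run the same command on each machine and the shared work queue distributes the load):

```shell
python -m olmocr.pipeline s3://my-bucket/workspace \
    --pdfs "s3://my-bucket/documents/*.pdf" \
    --workers 8
```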
### Using Custom Model
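A sketch using the `--model` option documented above (the model identifier is a placeholder):

```shell
python -m olmocr.pipeline ./localworkspace \
    --pdfs document.pdf \
    --model my-org/my-finetuned-ocr-model
```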
### Submitting to Beaker
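A sketch using the Beaker options documented above (the workspace path and GPU count are placeholders):

```shell
python -m olmocr.pipeline s3://my-bucket/workspace \
    --pdfs "s3://my-bucket/documents/*.pdf" \
    --beaker \
    --beaker_workspace ai2/olmocr \
    --beaker_gpus 4
```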
### Checking Workspace Statistics
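A sketch using the `--stats` option documented above; no processing runs, only a progress report is printed:

```shell
python -m olmocr.pipeline s3://my-bucket/workspace --stats
```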
## Architecture

The pipeline uses an async architecture with:

- SGLang Server: Manages the vision-language model inference
- Work Queue: Distributes PDF processing tasks across workers
- Workers: Process PDFs concurrently, each handling multiple pages in parallel
- Process Pool: Offloads CPU-bound tasks (anchor text extraction)
- Metrics System: Tracks token usage and throughput
## Workflow
## Error Handling

The pipeline implements robust error handling:

- Page-level retries: Up to `--max_page_retries` attempts per page
- Rotation correction: Automatically detects and corrects rotated pages
- Fallback extraction: Uses pdftotext when VLM processing fails
- Document-level filtering: Discards documents exceeding the error rate threshold
- Exponential backoff: For server connection issues
- Graceful degradation: Continues processing other documents on failure
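Exponential backoff for transient server errors can be sketched as follows. This is a generic illustration with made-up helper names, not olmocr's actual implementation:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=6, base_delay=0.1, max_delay=10.0):
    """Call fn(), retrying on ConnectionError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt, is capped, and gets jitter
            # so many workers do not retry in lockstep
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))

# Toy "server call" that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not ready")
    return "ok"

result = retry_with_backoff(flaky)
print(result)  # succeeds on the third attempt, prints "ok"
```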
## Performance Considerations
- Concurrent processing: Multiple workers process PDFs in parallel
- Async I/O: Non-blocking downloads and uploads
- Process pool: CPU-bound anchor text extraction runs in separate processes
- Semaphore control: Prevents queue saturation while maximizing GPU utilization
- Batch grouping: Groups PDFs by estimated page count for balanced workloads
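The semaphore-control point above can be sketched with asyncio. The task names below are made up for illustration; olmocr's actual worker code differs:

```python
import asyncio

async def process_page_stub(page: int) -> int:
    # Stand-in for a real inference request to the model server
    await asyncio.sleep(0.001)
    return page

async def run_all(pages, max_in_flight=4):
    # The semaphore bounds how many requests are in flight at once,
    # keeping the server busy without saturating its queue
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(page):
        async with sem:
            return await process_page_stub(page)

    return await asyncio.gather(*(bounded(p) for p in pages))

results = asyncio.run(run_all(range(10)))
print(results)  # gather preserves order: [0, 1, ..., 9]
```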
## Related
- Work Queue API - Queue management system
- Rendering API - PDF rendering utilities