Overview
For large-scale PDF processing, olmOCR supports distributed processing across multiple GPU nodes. This mode uses S3 both for input/output and for work-queue coordination, enabling you to process millions of PDFs efficiently.

Architecture
Cluster mode uses:

- S3 workspace: Central coordination point for the work queue and results
- S3 PDF storage: Source PDFs stored in S3 buckets
- Multiple workers: Independent GPU nodes processing from the shared queue
- Work queue: Automatic work distribution and progress tracking
Prerequisites
S3-Based Processing
Setting Up a Workspace
Create a workspace in S3 and add PDFs to process. The initial run:

- Creates a work queue in the S3 workspace
- Samples PDFs to estimate pages per document
- Groups PDFs into work items (default: ~500 pages per group)
- Starts processing on the current node
The first time you run this command, it will populate the work queue. This can take several minutes for millions of PDFs as it samples documents and creates work groups.
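As a sketch, the first run over S3-hosted PDFs might look like the following (bucket and path names are placeholders; the workspace path matches the example used elsewhere on this page):

```shell
# First run: samples PDFs, builds the work queue in the S3 workspace,
# then starts processing on this node. Paths are illustrative.
python -m olmocr.pipeline \
  s3://my_bucket/pdfworkspaces/exampleworkspace \
  --pdfs "s3://my_bucket/pdfs/*.pdf"
```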
Adding Worker Nodes
On any additional GPU node, simply point to the same workspace. Each worker will:

- Automatically pull work items from the queue
- Process PDFs independently
- Upload results to `s3://my_bucket/pdfworkspaces/exampleworkspace/results/`
- Mark work items as complete
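Joining a worker can be as simple as re-running the pipeline with only the workspace argument and no `--pdfs` flag, so the node pulls from the existing queue rather than re-populating it:

```shell
# On each additional GPU node: same workspace path, no --pdfs flag.
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/exampleworkspace
```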
Multiple PDF Sources
You can add PDFs from multiple sources by pointing additional runs at the same workspace with different `--pdfs` patterns.

S3 Configuration
Separate S3 Profiles
If your workspace and PDFs are in different AWS accounts:

- `--workspace_profile`: AWS profile name for accessing the workspace bucket
- `--pdf_profile`: AWS profile name for accessing the PDF source buckets
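For example, a run whose workspace and source PDFs live in different AWS accounts might look like this (bucket and profile names are placeholders):

```shell
# Workspace in one AWS account, source PDFs in another.
python -m olmocr.pipeline \
  s3://my_bucket/pdfworkspaces/exampleworkspace \
  --pdfs "s3://partner-bucket/pdfs/*.pdf" \
  --workspace_profile workspace-account \
  --pdf_profile pdf-account
```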
Beaker Integration (Ai2 Users)
For users at the Allen Institute for AI, olmOCR provides native Beaker integration to easily launch distributed processing jobs.

Quick Start
Launch a cluster job with a single command. The pipeline will:

- Prepare the workspace locally (sample PDFs, create the work queue)
- Submit a Beaker experiment with 4 GPU replicas
- Each replica will independently process work items from the queue
- Return a Beaker experiment URL for monitoring
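A minimal submission might look like the following sketch (flag names follow the options described in this section; confirm against `python -m olmocr.pipeline --help` for your version):

```shell
# Prepare the workspace locally, then submit 4 GPU replicas to Beaker
# instead of running inference on this machine.
python -m olmocr.pipeline \
  s3://my_bucket/pdfworkspaces/exampleworkspace \
  --pdfs "s3://my_bucket/pdfs/*.pdf" \
  --beaker \
  --beaker_gpus 4
```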
Beaker Configuration Options
- `--beaker`: Enable Beaker submission mode. The pipeline will prepare the workspace locally, then submit to Beaker instead of running inference locally.
- `--beaker_workspace`: Beaker workspace to submit the experiment to.
- `--beaker_cluster`: Beaker cluster(s) to run on. Can specify multiple clusters for better availability. Default clusters:
  - ai2/jupiter-cirrascale-2
  - ai2/ceres-cirrascale
  - ai2/neptune-cirrascale
  - ai2/saturn-cirrascale
  - ai2/augusta-google-1
- `--beaker_gpus`: Number of GPU replicas to run in parallel. More replicas = faster processing.
- `--beaker_priority`: Beaker job priority level: low, normal, high, or urgent.

Beaker Secrets
On first run, the pipeline will prompt you to save AWS and Weka credentials as Beaker secrets:

- {owner}-WEKA_ACCESS_KEY_ID
- {owner}-WEKA_SECRET_ACCESS_KEY
- {owner}-AWS_CREDENTIALS_FILE
- OLMOCR_PREVIEW_HF_TOKEN: For accessing gated Hugging Face models
- OE_DATA_GCS_SA_KEY: For loading models from Google Cloud Storage
Example Beaker Workflow
Process 1 million PDFs with 20 parallel workers by submitting a single Beaker run with 20 GPU replicas.

Monitoring Progress
Using the --stats Flag
Get detailed statistics about your workspace by running the pipeline with the `--stats` flag.

Monitoring Individual Workers
Each worker prints progress information to its logs as it processes work items.

Beaker Web Interface
For Beaker jobs, monitor all replicas from the Beaker experiment page, where you can:

- View logs from each replica
- Monitor resource usage
- Check job status and failures
- Cancel or preempt jobs
Work Queue Management
How the Queue Works
- Initialization: The first worker to access a workspace populates the queue
- Work items: PDFs are grouped into items (~500 pages each by default)
- Distribution: Workers atomically claim work items from S3
- Results: Each completed item produces one `output_*.jsonl` file
- Completion tracking: Work items are marked complete to prevent duplicate processing
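A back-of-the-envelope calculation shows how queue size scales. Assuming 1,000,000 PDFs averaging 10 pages each (illustrative numbers) and the default of ~500 pages per group:

```shell
# total pages / pages-per-group = number of work items in the queue
echo $(( 1000000 * 10 / 500 ))
# prints 20000
```

So a million-document run produces on the order of tens of thousands of work items, each yielding one output file.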
Adjusting Work Group Size
Target number of pages per work group. Larger groups = fewer S3 operations, but less granular progress tracking.
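Assuming this option is exposed as `--pages_per_group` (check `python -m olmocr.pipeline --help` for your version), a run with larger groups might look like:

```shell
# Larger work groups: fewer S3 operations, but coarser progress tracking.
python -m olmocr.pipeline \
  s3://my_bucket/pdfworkspaces/exampleworkspace \
  --pdfs "s3://my_bucket/pdfs/*.pdf" \
  --pages_per_group 1000
```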
Scaling Best Practices
Optimal Worker Count
For efficient processing:

- Start with 10-20 workers for most workloads
- Monitor the queue depletion rate using `--stats`
- Add more workers if the queue is draining too slowly
- Reduce workers if approaching S3 rate limits
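To gauge the depletion rate, you can poll `--stats` periodically and compare successive readings; a simple sketch, assuming `--stats` takes the workspace as its positional argument:

```shell
# Poll workspace statistics every 10 minutes to watch queue depletion.
while true; do
  python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/exampleworkspace --stats
  sleep 600
done
```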
Cost Optimization
Handling Failures
- Worker crashes: Other workers continue processing; restart crashed workers
- Page failures: Automatic retry up to `--max_page_retries` times
- Document failures: Documents exceeding `--max_page_error_rate` are skipped
- Work item failures: Incomplete items remain in the queue for retry
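The retry thresholds above can be tuned on the command line; for example (the values shown are illustrative, not the defaults):

```shell
# Retry each failed page up to 5 times; skip any document where more
# than 10% of its pages fail. Values are illustrative.
python -m olmocr.pipeline \
  s3://my_bucket/pdfworkspaces/exampleworkspace \
  --pdfs "s3://my_bucket/pdfs/*.pdf" \
  --max_page_retries 5 \
  --max_page_error_rate 0.1
```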
Output Structure
Results are organized under the workspace's `results/` prefix in S3. Each `output_*.jsonl` file contains one or more documents in Dolma format. See Viewing Results for analysis and visualization.
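A quick way to inspect the output is to list that prefix (the path is a placeholder matching the example workspace used on this page):

```shell
# List result files produced by completed work items.
aws s3 ls s3://my_bucket/pdfworkspaces/exampleworkspace/results/
```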
Example: Processing 1M+ PDFs
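A hedged end-to-end sketch, assuming the flags described earlier and placeholder bucket names: start one node to build the queue, then add workers until 20 are running.

```shell
# Node 1: sample ~1M PDFs, build the work queue, and start processing.
python -m olmocr.pipeline \
  s3://my_bucket/pdfworkspaces/million-run \
  --pdfs "s3://my_bucket/pdfs/*.pdf"

# Nodes 2-20: join the same workspace; no --pdfs flag needed.
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/million-run

# Any node: check overall progress.
python -m olmocr.pipeline s3://my_bucket/pdfworkspaces/million-run --stats
```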
Troubleshooting
S3 Access Issues
If workers can't access S3:

- Verify AWS credentials: `aws s3 ls s3://your-bucket/`
- Check bucket permissions (read/write for the workspace, read for PDFs)
- Use `--workspace_profile` and `--pdf_profile` if using multiple accounts
Work Queue Not Populating
- Ensure the `--pdfs` glob pattern matches files: `aws s3 ls s3://bucket/path/`
- Check for errors during the PDF sampling phase
- Verify sufficient permissions to write to workspace
Duplicate Processing
If documents are processed multiple times:

- Check that all workers use the same workspace path (exact match)
- Ensure workers aren't restarted with the `--pdfs` flag (only use it on the first run)
- Verify S3 consistency (a rare issue in older regions)
Next Steps
View Results
Learn how to visualize and analyze your converted documents
Local Usage
Process PDFs on a single machine for testing and development