Overview
olmOCR outputs converted PDFs in Dolma format, a structured JSONL format designed for large-scale text processing. This guide covers how to access, analyze, and visualize your results.
File Structure
Results are saved as JSONL (JSON Lines) files:
workspace/
└── results/
├── output_abc123.jsonl
├── output_def456.jsonl
└── output_xyz789.jsonl
Each file contains one JSON object per line, representing a processed document.
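The JSON Lines convention is simple enough to demonstrate with the standard library alone. The short sketch below (the documents and the example.jsonl filename are made up for illustration) writes and re-reads a file with one JSON object per line:

```python
import json

# Two minimal, hypothetical documents in the Dolma shape described below.
docs = [
    {"id": "a1", "text": "Page one text", "source": "olmocr"},
    {"id": "b2", "text": "Another document", "source": "olmocr"},
]

# Write one complete JSON object per line (JSONL).
with open("example.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Read it back line by line; each line parses independently,
# so files can be streamed without loading them whole.
with open("example.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```

Because each line is independent, JSONL files can be concatenated, split, or processed in parallel without any global parsing state.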
Document Schema
Each Dolma document has this structure:
{
  "id": "a1b2c3d4e5f6...",
  "text": "Full extracted text from all pages...",
  "source": "olmocr",
  "added": "2026-03-03",
  "created": "2026-03-03",
  "metadata": {
    "Source-File": "s3://bucket/path/document.pdf",
    "olmocr-version": "0.2.5",
    "pdf-total-pages": 12,
    "total-input-tokens": 45678,
    "total-output-tokens": 8234,
    "total-fallback-pages": 0
  },
  "attributes": {
    "pdf_page_numbers": [
      [0, 1234, 1],
      [1234, 2456, 2],
      [2456, 3789, 3]
    ]
  }
}
Field Descriptions
id
SHA-1 hash of the document text, providing a unique identifier for deduplication.
text
The complete extracted text from all pages, with pages separated by newlines.
source
Always set to "olmocr" for documents processed by olmOCR.
metadata.Source-File
Original path to the PDF file (local path or S3 URI).
metadata.olmocr-version
Version of olmOCR used to process the document.
metadata.pdf-total-pages
Total number of pages in the PDF.
metadata.total-input-tokens
Total input tokens used for processing all pages (images + anchor text).
metadata.total-output-tokens
Total output tokens generated by the model for all pages.
metadata.total-fallback-pages
Number of pages that failed model processing and used fallback pdftotext extraction.
attributes.pdf_page_numbers
Array of [start_char, end_char, page_number] tuples indicating which character ranges in the text correspond to which PDF pages. For example, [0, 1234, 1] means characters 0 through 1233 (an end-exclusive range) came from page 1; the next span starts at character 1234.
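Because id is a content hash, documents with identical text share the same id, which makes deduplication across output files straightforward. The sketch below is illustrative; the deduplicate and iter_docs helpers are not part of olmOCR:

```python
import json
import glob

def deduplicate(docs):
    """Keep only the first document seen for each id."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # same text hash already kept, skip
        seen.add(doc["id"])
        unique.append(doc)
    return unique

def iter_docs(pattern):
    """Yield parsed documents from all matching JSONL files."""
    for filepath in glob.glob(pattern):
        with open(filepath, "r") as f:
            for line in f:
                yield json.loads(line)

unique_docs = deduplicate(iter_docs("workspace/results/output_*.jsonl"))
print(f"Kept {len(unique_docs)} unique documents")
```

Since iter_docs is a generator, only one document is held in memory at a time (plus the set of seen ids), so this scales to large result sets.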
Reading Output Files
Using Command Line
View the raw JSONL output:
# View all results
cat localworkspace/results/output_*.jsonl

# View a single document (pretty-printed)
head -n 1 localworkspace/results/output_abc123.jsonl | jq .

# Extract just the text
jq -r '.text' localworkspace/results/output_*.jsonl

# Count total documents
cat localworkspace/results/output_*.jsonl | wc -l
Using Python
Read and process results programmatically:
import json
import glob

# Read all output files
for filepath in glob.glob("localworkspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)

            # Access document fields
            doc_id = doc['id']
            text = doc['text']
            source_file = doc['metadata']['Source-File']
            num_pages = doc['metadata']['pdf-total-pages']

            print(f"Document: {source_file}")
            print(f"Pages: {num_pages}")
            print(f"Preview: {text[:200]}...\n")
Reading from S3
For cluster workspaces, read directly from S3:
import json
import smart_open

s3_path = "s3://my-bucket/workspace/results/output_abc123.jsonl"

with smart_open.open(s3_path, 'r') as f:
    for line in f:
        doc = json.loads(line)
        print(doc['metadata']['Source-File'])
Using the HTML Viewer
The dolmaviewer tool creates side-by-side HTML visualizations showing the original PDF pages alongside extracted text.
Basic Usage
Generate HTML previews
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Open in browser
HTML files are saved to ./dolma_previews/:
open dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html
Viewer Features
The HTML viewer displays:
Side-by-side view : Original PDF page images on the left, extracted text on the right
Page navigation : Scroll through all pages in a single HTML file
Formatted text : Markdown tables and formatting are rendered as HTML
Page numbers : Clear indication of which text corresponds to which page
Viewer Options
Path(s) to JSONL files. Supports glob patterns for local files and S3 URIs.
# Single file
python -m olmocr.viewer.dolmaviewer output.jsonl
# Multiple files with glob
python -m olmocr.viewer.dolmaviewer "results/*.jsonl"
# S3 path
python -m olmocr.viewer.dolmaviewer s3://bucket/workspace/results/output_*.jsonl
--output_dir
string
default: "dolma_previews"
Directory where HTML files will be saved.
python -m olmocr.viewer.dolmaviewer output.jsonl --output_dir ./previews
--template_path
string
default: "dolmaviewer_template.html"
Path to custom Jinja2 template file for HTML generation.
--s3_profile
string
AWS profile name for accessing S3 source documents. Required when viewing results from S3-based workspaces.
python -m olmocr.viewer.dolmaviewer \
  s3://bucket/workspace/results/output_*.jsonl \
  --s3_profile my-aws-profile
Viewing S3 Results
For cluster-based processing, the viewer can fetch PDFs from S3:
python -m olmocr.viewer.dolmaviewer \
  s3://my-bucket/workspace/results/output_*.jsonl \
  --s3_profile production \
  --output_dir ./s3_previews
The viewer generates pre-signed S3 URLs (valid for 1 week) for the original PDF files, allowing you to download them directly from the HTML viewer.
Analyzing Results
Get processing statistics:
import json
import glob

total_docs = 0
total_pages = 0
total_output_tokens = 0
total_fallback_pages = 0

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            total_docs += 1
            total_pages += doc['metadata']['pdf-total-pages']
            total_output_tokens += doc['metadata']['total-output-tokens']
            total_fallback_pages += doc['metadata']['total-fallback-pages']

print(f"Total documents: {total_docs:,}")
print(f"Total pages: {total_pages:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Average tokens per page: {total_output_tokens / total_pages:.1f}")
print(f"Fallback pages: {total_fallback_pages:,} ({100 * total_fallback_pages / total_pages:.2f}%)")
Finding Long Documents
Identify documents with many tokens:
import json
import glob

long_docs = []
THRESHOLD = 32768  # tokens

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            tokens = doc['metadata']['total-output-tokens']
            if tokens > THRESHOLD:
                long_docs.append({
                    'source': doc['metadata']['Source-File'],
                    'tokens': tokens,
                    'pages': doc['metadata']['pdf-total-pages']
                })

long_docs.sort(key=lambda x: x['tokens'], reverse=True)

print(f"Found {len(long_docs)} documents with >{THRESHOLD:,} tokens")
for doc in long_docs[:10]:
    print(f"  {doc['tokens']:,} tokens, {doc['pages']} pages: {doc['source']}")
Identifying Problematic Documents
Find documents with high fallback rates:
import json
import glob

problematic = []

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            total_pages = doc['metadata']['pdf-total-pages']
            fallback_pages = doc['metadata']['total-fallback-pages']
            if total_pages > 0:
                fallback_rate = fallback_pages / total_pages
                if fallback_rate > 0.1:  # More than 10% fallback
                    problematic.append({
                        'source': doc['metadata']['Source-File'],
                        'fallback_rate': fallback_rate,
                        'total_pages': total_pages
                    })

problematic.sort(key=lambda x: x['fallback_rate'], reverse=True)

print(f"Found {len(problematic)} documents with >10% fallback rate")
for doc in problematic[:10]:
    print(f"  {doc['fallback_rate'] * 100:.1f}% fallback ({doc['total_pages']} pages): {doc['source']}")
Working with Page Spans
Extract text for specific pages:
import json

def get_page_text(doc, page_number):
    """Extract text for a specific page from a Dolma document."""
    for start, end, page_num in doc['attributes']['pdf_page_numbers']:
        if page_num == page_number:
            return doc['text'][start:end]
    return None

# Example usage
with open('workspace/results/output_abc123.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)

        # Get text from page 1
        page1_text = get_page_text(doc, 1)
        print(f"Page 1 text: {page1_text[:200]}...")

        # Iterate through all pages
        for start, end, page_num in doc['attributes']['pdf_page_numbers']:
            page_text = doc['text'][start:end]
            print(f"Page {page_num}: {len(page_text)} characters")
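If downstream processing depends on the spans, a quick sanity check is to confirm they tile the text contiguously with no gaps or overlaps. The spans_cover_text helper below is a sketch, not part of olmOCR, and assumes the spans are sorted by start offset:

```python
def spans_cover_text(doc):
    """Return True if page spans are contiguous and cover the full text."""
    pos = 0
    for start, end, _page in doc["attributes"]["pdf_page_numbers"]:
        if start != pos:
            return False  # gap or overlap before this span
        pos = end
    return pos == len(doc["text"])

# Hypothetical example document: two pages tiling a 10-character text
doc = {
    "text": "abcdefghij",
    "attributes": {"pdf_page_numbers": [[0, 4, 1], [4, 10, 2]]},
}
print(spans_cover_text(doc))  # True
```

A False result does not necessarily indicate corruption, but it does mean per-page slicing with these spans will drop or duplicate characters, so such documents are worth inspecting.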
Exporting Results
Export to Plain Text
Convert all documents to plain text files:
import json
import glob
import os

output_dir = "exported_texts"
os.makedirs(output_dir, exist_ok=True)

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)

            # Create filename from document ID
            filename = f"{doc['id']}.txt"
            output_path = os.path.join(output_dir, filename)

            # Write text to file
            with open(output_path, 'w') as out:
                out.write(doc['text'])

print(f"Exported texts to {output_dir}/")
Export to CSV
Create a CSV index of all documents:
import json
import glob
import csv

with open('document_index.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID', 'Source File', 'Pages', 'Output Tokens', 'Fallback Pages'])

    for filepath in glob.glob("workspace/results/output_*.jsonl"):
        with open(filepath, 'r') as f:
            for line in f:
                doc = json.loads(line)
                writer.writerow([
                    doc['id'],
                    doc['metadata']['Source-File'],
                    doc['metadata']['pdf-total-pages'],
                    doc['metadata']['total-output-tokens'],
                    doc['metadata']['total-fallback-pages']
                ])

print("Created document_index.csv")
Next Steps
Local Usage Learn how to process PDFs on a single machine
Cluster Usage Scale up to process millions of PDFs