
Overview

olmOCR outputs converted PDFs in Dolma format, a structured JSONL format designed for large-scale text processing. This guide covers how to access, analyze, and visualize your results.

Dolma Output Format

File Structure

Results are saved as JSONL (JSON Lines) files:
workspace/
└── results/
    ├── output_abc123.jsonl
    ├── output_def456.jsonl
    └── output_xyz789.jsonl
Each file contains one JSON object per line, representing a processed document.

Document Schema

Each Dolma document has this structure:
{
  "id": "a1b2c3d4e5f6...",
  "text": "Full extracted text from all pages...",
  "source": "olmocr",
  "added": "2026-03-03",
  "created": "2026-03-03",
  "metadata": {
    "Source-File": "s3://bucket/path/document.pdf",
    "olmocr-version": "0.2.5",
    "pdf-total-pages": 12,
    "total-input-tokens": 45678,
    "total-output-tokens": 8234,
    "total-fallback-pages": 0
  },
  "attributes": {
    "pdf_page_numbers": [
      [0, 1234, 1],
      [1234, 2456, 2],
      [2456, 3789, 3]
    ]
  }
}

Field Descriptions

  • id (string): SHA-1 hash of the document text, providing a unique identifier for deduplication.
  • text (string): The complete extracted text from all pages, with pages separated by newlines.
  • source (string): Always set to "olmocr" for documents processed by olmOCR.
  • metadata.Source-File (string): Original path to the PDF file (local path or S3 URI).
  • metadata.olmocr-version (string): Version of olmOCR used to process the document.
  • metadata.pdf-total-pages (integer): Total number of pages in the PDF.
  • metadata.total-input-tokens (integer): Total input tokens used for processing all pages (images + anchor text).
  • metadata.total-output-tokens (integer): Total output tokens generated by the model for all pages.
  • metadata.total-fallback-pages (integer): Number of pages that failed model processing and used fallback pdftotext extraction.
  • attributes.pdf_page_numbers (array): Array of [start_char, end_char, page_number] tuples indicating which character ranges in the text correspond to which PDF pages. For example, [0, 1234, 1] means text[0:1234] came from page 1.
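
Because id is a SHA-1 hash of the extracted text, it doubles as a deduplication key across output files. A minimal sketch (the sample records below are invented for illustration):

```python
import hashlib
import json

def dedupe(lines):
    """Yield only the first document seen for each id."""
    seen = set()
    for line in lines:
        doc = json.loads(line)
        if doc['id'] not in seen:
            seen.add(doc['id'])
            yield doc

# Two records with identical text produce identical SHA-1 ids
text = "Same extracted text"
doc_id = hashlib.sha1(text.encode()).hexdigest()
records = [json.dumps({"id": doc_id, "text": text}) for _ in range(2)]
unique = list(dedupe(records))
print(len(unique))  # 1
```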

Reading Output Files

Using Command Line

View the raw JSONL output:
# View all results
cat localworkspace/results/output_*.jsonl

# View a single document (pretty-printed)
head -n 1 localworkspace/results/output_abc123.jsonl | jq .

# Extract just the text
jq -r '.text' localworkspace/results/output_*.jsonl

# Count total documents
cat localworkspace/results/output_*.jsonl | wc -l

Using Python

Read and process results programmatically:
import json
import glob

# Read all output files
for filepath in glob.glob("localworkspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            
            # Access document fields
            doc_id = doc['id']
            text = doc['text']
            source_file = doc['metadata']['Source-File']
            num_pages = doc['metadata']['pdf-total-pages']
            
            print(f"Document: {source_file}")
            print(f"Pages: {num_pages}")
            print(f"Preview: {text[:200]}...\n")

Reading from S3

For cluster workspaces, read directly from S3:
import json
import smart_open

s3_path = "s3://my-bucket/workspace/results/output_abc123.jsonl"

with smart_open.open(s3_path, 'r') as f:
    for line in f:
        doc = json.loads(line)
        print(doc['metadata']['Source-File'])
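
To enumerate result files before streaming them, boto3's list_objects_v2 paginator can walk the prefix. The bucket and prefix below are placeholders, and parse_s3_uri is a small helper written here for illustration, not part of olmOCR:

```python
def parse_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix)."""
    assert uri.startswith("s3://"), "not an S3 URI"
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

bucket, prefix = parse_s3_uri("s3://my-bucket/workspace/results/")

def list_result_keys(bucket, prefix):
    """Yield .jsonl keys under a prefix (requires AWS credentials)."""
    import boto3  # imported here so the parser above needs no AWS deps
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".jsonl"):
                yield obj["Key"]
```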

Using the HTML Viewer

The dolmaviewer tool creates side-by-side HTML visualizations showing the original PDF pages alongside extracted text.

Basic Usage

1. Generate HTML previews:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
2. Open in browser. HTML files are saved to ./dolma_previews/:
open dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html

Viewer Features

The HTML viewer displays:
  • Side-by-side view: Original PDF page images on the left, extracted text on the right
  • Page navigation: Scroll through all pages in a single HTML file
  • Formatted text: Markdown tables and formatting are rendered as HTML
  • Page numbers: Clear indication of which text corresponds to which page

Viewer Options

jsonl_paths (string | list, required)
Path(s) to JSONL files. Supports glob patterns for local files and S3 URIs.
# Single file
python -m olmocr.viewer.dolmaviewer output.jsonl

# Multiple files with glob
python -m olmocr.viewer.dolmaviewer "results/*.jsonl"

# S3 path
python -m olmocr.viewer.dolmaviewer s3://bucket/workspace/results/output_*.jsonl
--output_dir (string, default: "dolma_previews")
Directory where HTML files will be saved.
python -m olmocr.viewer.dolmaviewer output.jsonl --output_dir ./previews
--template_path (string, default: "dolmaviewer_template.html")
Path to a custom Jinja2 template file for HTML generation.
--s3_profile (string)
AWS profile name for accessing S3 source documents. Required when viewing results from S3-based workspaces.
python -m olmocr.viewer.dolmaviewer \
  s3://bucket/workspace/results/output_*.jsonl \
  --s3_profile my-aws-profile

Viewing S3 Results

For cluster-based processing, the viewer can fetch PDFs from S3:
python -m olmocr.viewer.dolmaviewer \
  s3://my-bucket/workspace/results/output_*.jsonl \
  --s3_profile production \
  --output_dir ./s3_previews
The viewer generates pre-signed S3 URLs (valid for 1 week) for the original PDF files, allowing you to download them directly from the HTML viewer.

Analyzing Results

Extracting Statistics

Get processing statistics:
import json
import glob

total_docs = 0
total_pages = 0
total_output_tokens = 0
total_fallback_pages = 0

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            total_docs += 1
            total_pages += doc['metadata']['pdf-total-pages']
            total_output_tokens += doc['metadata']['total-output-tokens']
            total_fallback_pages += doc['metadata']['total-fallback-pages']

print(f"Total documents: {total_docs:,}")
print(f"Total pages: {total_pages:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Average tokens per page: {total_output_tokens/total_pages:.1f}")
print(f"Fallback pages: {total_fallback_pages:,} ({100*total_fallback_pages/total_pages:.2f}%)")
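
The token totals also make rough cost estimates easy. The per-million-token prices below are placeholders, not real rates; substitute your provider's pricing:

```python
# Placeholder prices (USD per 1M tokens) -- substitute your actual rates
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 1.50

def estimate_cost(input_tokens, output_tokens):
    """Rough inference-cost estimate from token counts."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + \
           (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# Using the token counts from the sample document in the schema above
cost = estimate_cost(45678, 8234)
print(f"Estimated cost: ${cost:.4f}")
```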

Finding Long Documents

Identify documents with many tokens:
import json
import glob

long_docs = []
THRESHOLD = 32768  # tokens

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            tokens = doc['metadata']['total-output-tokens']
            if tokens > THRESHOLD:
                long_docs.append({
                    'source': doc['metadata']['Source-File'],
                    'tokens': tokens,
                    'pages': doc['metadata']['pdf-total-pages']
                })

long_docs.sort(key=lambda x: x['tokens'], reverse=True)

print(f"Found {len(long_docs)} documents with >{THRESHOLD:,} tokens")
for doc in long_docs[:10]:
    print(f"{doc['tokens']:,} tokens, {doc['pages']} pages: {doc['source']}")

Identifying Problematic Documents

Find documents with high fallback rates:
import json
import glob

problematic = []

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            total_pages = doc['metadata']['pdf-total-pages']
            fallback_pages = doc['metadata']['total-fallback-pages']
            
            if total_pages > 0:
                fallback_rate = fallback_pages / total_pages
                if fallback_rate > 0.1:  # More than 10% fallback
                    problematic.append({
                        'source': doc['metadata']['Source-File'],
                        'fallback_rate': fallback_rate,
                        'total_pages': total_pages
                    })

problematic.sort(key=lambda x: x['fallback_rate'], reverse=True)

print(f"Found {len(problematic)} documents with >10% fallback rate")
for doc in problematic[:10]:
    print(f"{doc['fallback_rate']*100:.1f}% fallback ({doc['total_pages']} pages): {doc['source']}")

Working with Page Spans

Extract text for specific pages:
import json

def get_page_text(doc, page_number):
    """Extract text for a specific page from a Dolma document."""
    for start, end, page_num in doc['attributes']['pdf_page_numbers']:
        if page_num == page_number:
            return doc['text'][start:end]
    return None

# Example usage
with open('workspace/results/output_abc123.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)
        
        # Get text from page 1 (may be None if the page is missing)
        page1_text = get_page_text(doc, 1)
        if page1_text:
            print(f"Page 1 text: {page1_text[:200]}...")
        
        # Iterate through all pages
        for start, end, page_num in doc['attributes']['pdf_page_numbers']:
            page_text = doc['text'][start:end]
            print(f"Page {page_num}: {len(page_text)} characters")
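
The same span tuples can also be turned into a page_number → text mapping in one pass; get_page_text above does a linear scan per call, so a dict is handier when you need many pages. The sample document here is synthetic but matches the schema:

```python
def split_into_pages(doc):
    """Map each page number to its slice of the document text."""
    return {
        page_num: doc['text'][start:end]
        for start, end, page_num in doc['attributes']['pdf_page_numbers']
    }

# Synthetic example following the Dolma schema
doc = {
    'text': "page one text" + "\n" + "page two text",
    'attributes': {'pdf_page_numbers': [[0, 13, 1], [14, 27, 2]]},
}
pages = split_into_pages(doc)
print(pages[2])  # page two text
```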

Exporting Results

Export to Plain Text

Convert all documents to plain text files:
import json
import glob
import os

output_dir = "exported_texts"
os.makedirs(output_dir, exist_ok=True)

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            
            # Create filename from document ID
            filename = f"{doc['id']}.txt"
            output_path = os.path.join(output_dir, filename)
            
            # Write text to file
            with open(output_path, 'w') as out:
                out.write(doc['text'])

print(f"Exported texts to {output_dir}/")

Export to CSV

Create a CSV index of all documents:
import json
import glob
import csv

with open('document_index.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID', 'Source File', 'Pages', 'Output Tokens', 'Fallback Pages'])
    
    for filepath in glob.glob("workspace/results/output_*.jsonl"):
        with open(filepath, 'r') as f:
            for line in f:
                doc = json.loads(line)
                writer.writerow([
                    doc['id'],
                    doc['metadata']['Source-File'],
                    doc['metadata']['pdf-total-pages'],
                    doc['metadata']['total-output-tokens'],
                    doc['metadata']['total-fallback-pages']
                ])

print("Created document_index.csv")

Next Steps

Local Usage

Learn how to process PDFs on a single machine

Cluster Usage

Scale up to process millions of PDFs
