Overview
olmOCR outputs converted PDFs in Dolma format, a structured JSONL format designed for large-scale text processing. This guide covers how to access, analyze, and visualize your results.
File Structure
Results are saved as JSONL (JSON Lines) files:
workspace/
└── results/
├── output_abc123.jsonl
├── output_def456.jsonl
└── output_xyz789.jsonl
Each file contains one JSON object per line, representing a processed document.
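The JSON Lines convention is simple enough to demonstrate with the standard library alone. The short sketch below (the documents and the example.jsonl filename are made up for illustration) writes and re-reads a file with one JSON object per line:

```python
import json

# Two minimal, hypothetical documents in the Dolma shape described below.
docs = [
    {"id": "a1", "text": "Page one text", "source": "olmocr"},
    {"id": "b2", "text": "Another document", "source": "olmocr"},
]

# Write one complete JSON object per line (JSONL).
with open("example.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Read it back line by line; each line parses independently,
# so files can be streamed without loading them whole.
with open("example.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```

Because each line is independent, JSONL files can be concatenated, split, or processed in parallel without any global parsing state.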
Document Schema
Each Dolma document has this structure:
{
  "id": "a1b2c3d4e5f6...",
  "text": "Full extracted text from all pages...",
  "source": "olmocr",
  "added": "2026-03-03",
  "created": "2026-03-03",
  "metadata": {
    "Source-File": "s3://bucket/path/document.pdf",
    "olmocr-version": "0.2.5",
    "pdf-total-pages": 12,
    "total-input-tokens": 45678,
    "total-output-tokens": 8234,
    "total-fallback-pages": 0
  },
  "attributes": {
    "pdf_page_numbers": [
      [0, 1234, 1],
      [1234, 2456, 2],
      [2456, 3789, 3]
    ]
  }
}
Field Descriptions
id
SHA-1 hash of the document text, providing a unique identifier for deduplication.
text
The complete extracted text from all pages, with pages separated by newlines.
source
Always set to "olmocr" for documents processed by olmOCR.
metadata.Source-File
Original path to the PDF file (local path or S3 URI).
metadata.olmocr-version
Version of olmOCR used to process the document.
metadata.pdf-total-pages
Total number of pages in the PDF.
metadata.total-input-tokens
Total input tokens used for processing all pages (images + anchor text).
metadata.total-output-tokens
Total output tokens generated by the model for all pages.
metadata.total-fallback-pages
Number of pages that failed model processing and used fallback pdftotext extraction.
attributes.pdf_page_numbers
Array of [start_char, end_char, page_number] tuples indicating which character ranges in the text correspond to which PDF pages. For example, [0, 1234, 1] means characters 0 through 1233 (an end-exclusive range) came from page 1; the next span starts at character 1234.
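Because id is a content hash, documents with identical text share the same id, which makes deduplication across output files straightforward. The sketch below is illustrative; the deduplicate and iter_docs helpers are not part of olmOCR:

```python
import json
import glob

def deduplicate(docs):
    """Keep only the first document seen for each id."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # same text hash already kept, skip
        seen.add(doc["id"])
        unique.append(doc)
    return unique

def iter_docs(pattern):
    """Yield parsed documents from all matching JSONL files."""
    for filepath in glob.glob(pattern):
        with open(filepath, "r") as f:
            for line in f:
                yield json.loads(line)

unique_docs = deduplicate(iter_docs("workspace/results/output_*.jsonl"))
print(f"Kept {len(unique_docs)} unique documents")
```

Since iter_docs is a generator, only one document is held in memory at a time (plus the set of seen ids), so this scales to large result sets.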
Reading Output Files
Using Command Line
View the raw JSONL output:
# View all results
cat localworkspace/results/output_*.jsonl

# View a single document (pretty-printed)
head -n 1 localworkspace/results/output_abc123.jsonl | jq .

# Extract just the text
jq -r '.text' localworkspace/results/output_*.jsonl

# Count total documents
cat localworkspace/results/output_*.jsonl | wc -l
Using Python
Read and process results programmatically:
import json
import glob

# Read all output files
for filepath in glob.glob("localworkspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)

            # Access document fields
            doc_id = doc['id']
            text = doc['text']
            source_file = doc['metadata']['Source-File']
            num_pages = doc['metadata']['pdf-total-pages']

            print(f"Document: {source_file}")
            print(f"Pages: {num_pages}")
            print(f"Preview: {text[:200]}...\n")
Reading from S3
For cluster workspaces, read directly from S3:
import json
import smart_open

s3_path = "s3://my-bucket/workspace/results/output_abc123.jsonl"

with smart_open.open(s3_path, 'r') as f:
    for line in f:
        doc = json.loads(line)
        print(doc['metadata']['Source-File'])
Using the HTML Viewer
The dolmaviewer tool creates side-by-side HTML visualizations showing the original PDF pages alongside extracted text.
Basic Usage
Generate HTML previews
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Open in browser
HTML files are saved to ./dolma_previews/:
open dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html
Viewer Features
The HTML viewer displays:
Side-by-side view : Original PDF page images on the left, extracted text on the right
Page navigation : Scroll through all pages in a single HTML file
Formatted text : Markdown tables and formatting are rendered as HTML
Page numbers : Clear indication of which text corresponds to which page
Viewer Options
Path(s) to JSONL files. Supports glob patterns for local files and S3 URIs.
# Single file
python -m olmocr.viewer.dolmaviewer output.jsonl
# Multiple files with glob
python -m olmocr.viewer.dolmaviewer "results/*.jsonl"
# S3 path
python -m olmocr.viewer.dolmaviewer s3://bucket/workspace/results/output_*.jsonl
--output_dir
string
default: "dolma_previews"
Directory where HTML files will be saved.
python -m olmocr.viewer.dolmaviewer output.jsonl --output_dir ./previews
--template_path
string
default: "dolmaviewer_template.html"
Path to custom Jinja2 template file for HTML generation.
--s3_profile
string
AWS profile name for accessing S3 source documents. Required when viewing results from S3-based workspaces.
python -m olmocr.viewer.dolmaviewer \
  s3://bucket/workspace/results/output_*.jsonl \
  --s3_profile my-aws-profile
Viewing S3 Results
For cluster-based processing, the viewer can fetch PDFs from S3:
python -m olmocr.viewer.dolmaviewer \
  s3://my-bucket/workspace/results/output_*.jsonl \
  --s3_profile production \
  --output_dir ./s3_previews
The viewer generates pre-signed S3 URLs (valid for 1 week) for the original PDF files, allowing you to download them directly from the HTML viewer.
Analyzing Results
Get processing statistics:
import json
import glob

total_docs = 0
total_pages = 0
total_output_tokens = 0
total_fallback_pages = 0

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            total_docs += 1
            total_pages += doc['metadata']['pdf-total-pages']
            total_output_tokens += doc['metadata']['total-output-tokens']
            total_fallback_pages += doc['metadata']['total-fallback-pages']

print(f"Total documents: {total_docs:,}")
print(f"Total pages: {total_pages:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Average tokens per page: {total_output_tokens / total_pages:.1f}")
print(f"Fallback pages: {total_fallback_pages:,} ({100 * total_fallback_pages / total_pages:.2f}%)")
Finding Long Documents
Identify documents with many tokens:
import json
import glob

long_docs = []
THRESHOLD = 32768  # tokens

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            tokens = doc['metadata']['total-output-tokens']
            if tokens > THRESHOLD:
                long_docs.append({
                    'source': doc['metadata']['Source-File'],
                    'tokens': tokens,
                    'pages': doc['metadata']['pdf-total-pages']
                })

long_docs.sort(key=lambda x: x['tokens'], reverse=True)

print(f"Found {len(long_docs)} documents with >{THRESHOLD:,} tokens")
for doc in long_docs[:10]:
    print(f"  {doc['tokens']:,} tokens, {doc['pages']} pages: {doc['source']}")
Identifying Problematic Documents
Find documents with high fallback rates:
import json
import glob

problematic = []

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)
            total_pages = doc['metadata']['pdf-total-pages']
            fallback_pages = doc['metadata']['total-fallback-pages']
            if total_pages > 0:
                fallback_rate = fallback_pages / total_pages
                if fallback_rate > 0.1:  # More than 10% fallback
                    problematic.append({
                        'source': doc['metadata']['Source-File'],
                        'fallback_rate': fallback_rate,
                        'total_pages': total_pages
                    })

problematic.sort(key=lambda x: x['fallback_rate'], reverse=True)

print(f"Found {len(problematic)} documents with >10% fallback rate")
for doc in problematic[:10]:
    print(f"  {doc['fallback_rate'] * 100:.1f}% fallback ({doc['total_pages']} pages): {doc['source']}")
Working with Page Spans
Extract text for specific pages:
import json

def get_page_text(doc, page_number):
    """Extract text for a specific page from a Dolma document."""
    for start, end, page_num in doc['attributes']['pdf_page_numbers']:
        if page_num == page_number:
            return doc['text'][start:end]
    return None

# Example usage
with open('workspace/results/output_abc123.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)

        # Get text from page 1
        page1_text = get_page_text(doc, 1)
        print(f"Page 1 text: {page1_text[:200]}...")

        # Iterate through all pages
        for start, end, page_num in doc['attributes']['pdf_page_numbers']:
            page_text = doc['text'][start:end]
            print(f"Page {page_num}: {len(page_text)} characters")
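If downstream processing depends on the spans, a quick sanity check is to confirm they tile the text contiguously with no gaps or overlaps. The spans_cover_text helper below is a sketch, not part of olmOCR, and assumes the spans are sorted by start offset:

```python
def spans_cover_text(doc):
    """Return True if page spans are contiguous and cover the full text."""
    pos = 0
    for start, end, _page in doc["attributes"]["pdf_page_numbers"]:
        if start != pos:
            return False  # gap or overlap before this span
        pos = end
    return pos == len(doc["text"])

# Hypothetical example document: two pages tiling a 10-character text
doc = {
    "text": "abcdefghij",
    "attributes": {"pdf_page_numbers": [[0, 4, 1], [4, 10, 2]]},
}
print(spans_cover_text(doc))  # True
```

A False result does not necessarily indicate corruption, but it does mean per-page slicing with these spans will drop or duplicate characters, so such documents are worth inspecting.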
Exporting Results
Export to Plain Text
Convert all documents to plain text files:
import json
import glob
import os

output_dir = "exported_texts"
os.makedirs(output_dir, exist_ok=True)

for filepath in glob.glob("workspace/results/output_*.jsonl"):
    with open(filepath, 'r') as f:
        for line in f:
            doc = json.loads(line)

            # Create filename from document ID
            filename = f"{doc['id']}.txt"
            output_path = os.path.join(output_dir, filename)

            # Write text to file
            with open(output_path, 'w') as out:
                out.write(doc['text'])

print(f"Exported texts to {output_dir}/")
Export to CSV
Create a CSV index of all documents:
import json
import glob
import csv

with open('document_index.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID', 'Source File', 'Pages', 'Output Tokens', 'Fallback Pages'])

    for filepath in glob.glob("workspace/results/output_*.jsonl"):
        with open(filepath, 'r') as f:
            for line in f:
                doc = json.loads(line)
                writer.writerow([
                    doc['id'],
                    doc['metadata']['Source-File'],
                    doc['metadata']['pdf-total-pages'],
                    doc['metadata']['total-output-tokens'],
                    doc['metadata']['total-fallback-pages']
                ])

print("Created document_index.csv")
Next Steps
Local Usage Learn how to process PDFs on a single machine
Cluster Usage Scale up to process millions of PDFs