PDF Rendering

The olmocr.data.renderpdf module provides functions for rendering PDF pages to images with precise dimension control.

Overview

This module uses Poppler utilities (pdfinfo, pdftoppm) to:

Extract PDF page dimensions
Render pages to PNG format
Convert to WebP format
Encode as base64 strings
Extract image dimensions efficiently

Functions

render_pdf_to_base64png

Renders a PDF page to a base64-encoded PNG image.

def render_pdf_to_base64png(
    local_pdf_path: str,
    page_num: int,
    target_longest_image_dim: int = 2048
) -> str

local_pdf_path (string, required)

Path to the PDF file on local disk.

page_num (int, required)

Page number to render (1-indexed).

target_longest_image_dim (int, default: 2048)

Target dimension (in pixels) for the longest side of the output image. The image is rendered at the DPI needed to reach this dimension while maintaining aspect ratio.

Returns: Base64-encoded PNG image as a string.

Raises:

AssertionError: If the pdftoppm command fails
ValueError: If page dimensions cannot be determined
How It Works

Extracts the PDF page’s MediaBox dimensions using pdfinfo
Calculates the longest dimension (width or height)
Computes required DPI: target_longest_image_dim * 72 / longest_dim
- PDF dimensions are in points (1 point = 1/72 inch)
- DPI controls pixel density during rendering
Renders page to PNG using pdftoppm at calculated DPI
Encodes PNG bytes as base64 string
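
The last steps above can be sketched without running Poppler. The helper names build_pdftoppm_command and encode_png_bytes are hypothetical, and the argument layout is an assumption about how pdftoppm might be invoked, not a copy of the module's implementation. The flags -png, -f, -l and -r are standard pdftoppm options.

```python
import base64


def build_pdftoppm_command(
    pdf_path: str,
    page_num: int,
    longest_dim_points: float,
    target_longest_image_dim: int = 2048,
) -> list[str]:
    """Build a pdftoppm argument list that renders one page as PNG."""
    if longest_dim_points <= 0:
        raise ValueError("page dimensions must be positive")
    # PDF units are points (72 per inch), so this DPI maps the longest
    # side of the page onto the requested number of pixels.
    dpi = target_longest_image_dim * 72 / longest_dim_points
    return [
        "pdftoppm",
        "-png",                # emit PNG instead of PPM
        "-f", str(page_num),   # first page to render
        "-l", str(page_num),   # last page to render (same page)
        "-r", str(dpi),        # resolution in DPI for both axes
        pdf_path,
    ]


def encode_png_bytes(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as an ASCII base64 string."""
    return base64.b64encode(png_bytes).decode("utf-8")


# Letter page (longest side 792 pt) rendered to 1024 px
command = build_pdftoppm_command("document.pdf", 1, 792.0, 1024)
print(command)
```

Under these assumptions, passing the list to subprocess.run with stdout captured should yield the PNG bytes for encode_png_bytes, since pdftoppm writes to standard output when no output prefix is given.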

Example Usage

from olmocr.data.renderpdf import render_pdf_to_base64png

# Render page 1 with longest side = 1024px
base64_image = render_pdf_to_base64png(
    local_pdf_path="/path/to/document.pdf",
    page_num=1,
    target_longest_image_dim=1024
)

print(f"Image length: {len(base64_image)} characters")
print(f"First 50 chars: {base64_image[:50]}")

Example output (exact values depend on the document):

Image length: 245832 characters
First 50 chars: iVBORw0KGgoAAAANSUhEUgAABQAAAASwCAYAAAAvZgCeAA

Using with Vision Models

from olmocr.data.renderpdf import render_pdf_to_base64png

# Render page for model input
base64_png = render_pdf_to_base64png(
    local_pdf_path="document.pdf",
    page_num=1,
    target_longest_image_dim=1024
)

# Use in API request
payload = {
    "model": "vision-model",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract text from this page"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_png}"
                    }
                }
            ]
        }
    ]
}

render_pdf_to_base64webp

Renders a PDF page to a base64-encoded WebP image.

def render_pdf_to_base64webp(
    local_pdf_path: str,
    page: int,
    target_longest_image_dim: int = 1024
) -> str

local_pdf_path (string, required)

Path to the PDF file on local disk.

page (int, required)

Page number to render (1-indexed).

target_longest_image_dim (int, default: 1024)

Target dimension (in pixels) for the longest side of the output image.

Returns: Base64-encoded WebP image as a string.

Example Usage

from olmocr.data.renderpdf import render_pdf_to_base64webp

# Render as WebP (typically smaller file size)
base64_webp = render_pdf_to_base64webp(
    local_pdf_path="document.pdf",
    page=1,
    target_longest_image_dim=1024
)

# WebP usually produces smaller base64 strings
print(f"WebP image length: {len(base64_webp)}")

get_pdf_media_box_width_height

Extracts the MediaBox dimensions for a specific PDF page.

def get_pdf_media_box_width_height(
    local_pdf_path: str,
    page_num: int
) -> tuple[float, float]

local_pdf_path (string, required)

Path to the PDF file.

page_num (int, required)

Page number to query (1-indexed).

Returns: Tuple of (width, height) in points (1 point = 1/72 inch).

Raises:

ValueError: If the pdfinfo command fails or no MediaBox is found

Example Usage

from olmocr.data.renderpdf import get_pdf_media_box_width_height

# Get dimensions for page 1
width, height = get_pdf_media_box_width_height(
    local_pdf_path="document.pdf",
    page_num=1
)

print(f"Page dimensions: {width} x {height} points")
print(f"Aspect ratio: {width/height:.2f}")

# Determine orientation
if width > height:
    print("Landscape orientation")
else:
    print("Portrait orientation")

Output:

Page dimensions: 612.0 x 792.0 points
Aspect ratio: 0.77
Portrait orientation

Understanding MediaBox

The MediaBox defines the boundaries of the physical medium (paper size):

Letter: 612 x 792 points (8.5 in x 11 in)
A4: 595 x 842 points (210 mm x 297 mm)
Legal: 612 x 1008 points (8.5 in x 14 in)
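
A MediaBox is stored as four numbers: the lower-left and upper-right corners of the page. The sketch below shows one way to turn such a line into a width and height. The function name parse_media_box is hypothetical, and the sample text is an illustrative approximation of what pdfinfo prints with its -box option, not captured output.

```python
def parse_media_box(pdfinfo_output: str) -> tuple[float, float]:
    """Return (width, height) in points from the first MediaBox line found."""
    for line in pdfinfo_output.splitlines():
        if "MediaBox" in line:
            # The four numbers are the lower-left and upper-right corners.
            llx, lly, urx, ury = (float(v) for v in line.split(":", 1)[1].split())
            return abs(urx - llx), abs(ury - lly)
    raise ValueError("MediaBox not found")


# Illustrative text, shaped like the per-page box report for a Letter page.
sample = """Page    1 size:      612 x 792 pts (letter)
Page    1 MediaBox:      0.00     0.00   612.00   792.00
Page    1 CropBox:       0.00     0.00   612.00   792.00"""

print(parse_media_box(sample))  # (612.0, 792.0)
```

Taking the absolute difference of the corners keeps the result positive even when a PDF lists the corners in an unusual order.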

get_png_dimensions_from_base64

Extracts PNG image dimensions from base64 data without full decoding.

def get_png_dimensions_from_base64(
    base64_data: str
) -> tuple[int, int]

base64_data (string, required)

Base64-encoded PNG image data.

Returns: Tuple of (width, height) in pixels.

Raises:

ValueError: If the data is not a valid PNG or the dimensions cannot be extracted

How It Works

This function avoids decoding the full image:

Validates the PNG signature without decoding the entire base64 string
Computes the minimal byte range needed for the dimensions (bytes 16 through 23, the width and height fields of the IHDR chunk)
Decodes only those bytes
Extracts width and height as big-endian 32-bit integers

Performance: Because only the header is decoded, extraction takes microseconds rather than the milliseconds a full decode would need.
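
The header-only approach described above can be sketched with the standard library. This is a minimal illustration, not the module's implementation. The name png_dimensions_from_prefix is hypothetical. It relies on the PNG layout: an 8-byte signature, a 4-byte chunk length, the 4-byte type "IHDR", then the width and height.

```python
import base64
import binascii
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions_from_prefix(base64_data: str) -> tuple[int, int]:
    """Read PNG width and height by decoding only the first 24 bytes."""
    # 24 bytes encode to exactly 32 base64 characters, so this slice
    # decodes cleanly without needing any padding.
    prefix = base64_data[:32]
    if len(prefix) < 32:
        raise ValueError("data too short to contain a PNG header")
    try:
        header = base64.b64decode(prefix, validate=True)
    except binascii.Error as exc:
        raise ValueError("invalid base64 data") from exc
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a valid PNG")
    # Width and height are big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


# Build a header-only stand-in for a 791 x 1024 PNG to exercise the helper.
fake_header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 791, 1024)
encoded = base64.b64encode(fake_header + b"\x00" * 10).decode("ascii")
print(png_dimensions_from_prefix(encoded))  # (791, 1024)
```

The cost stays constant regardless of image size because only 32 characters are ever decoded.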

Example Usage

from olmocr.data.renderpdf import (
    render_pdf_to_base64png,
    get_png_dimensions_from_base64
)

# Render PDF page
base64_png = render_pdf_to_base64png(
    local_pdf_path="document.pdf",
    page_num=1,
    target_longest_image_dim=2048
)

# Extract dimensions efficiently
width, height = get_png_dimensions_from_base64(base64_png)
print(f"Image dimensions: {width} x {height}")
print(f"Megapixels: {(width * height) / 1_000_000:.2f}")

# Verify target dimension
longest_side = max(width, height)
print(f"Longest side: {longest_side} (target: 2048)")

Output:

Image dimensions: 1583 x 2048
Megapixels: 3.24
Longest side: 2048 (target: 2048)

Image Dimension Handling

Target Dimension Calculation

The rendering process maintains aspect ratio while targeting a specific longest dimension:

def calculate_dpi(
    pdf_width: float,
    pdf_height: float,
    target_longest_dim: int
) -> float:
    """Calculate DPI needed to achieve target dimension"""
    longest_dim_points = max(pdf_width, pdf_height)
    # 72 points per inch is PDF standard
    dpi = (target_longest_dim * 72) / longest_dim_points
    return dpi

# Example: Letter page (612x792 pts) to 1024px longest side
dpi = calculate_dpi(612, 792, 1024)
# dpi = (1024 * 72) / 792 = 93.1 DPI

# Result: 792 points * (93.1 / 72) = 1024 pixels (height)
#         612 points * (93.1 / 72) = 791 pixels (width)

Example Dimensions

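PDF Size	Points (W×H)	Target (px)	DPI	Output Pixels (W×H)
Letter	612 × 792	1024	93	791 × 1024
Letter	612 × 792	2048	186	1583 × 2048
A4	595 × 842	1024	88	724 × 1024
Legal	612 × 1008	1024	73	622 × 1024

DPI and pixel values are rounded to the nearest integer; the renderer's own rounding may shift the shorter side by one pixel.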
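
The figures for each page size can be recomputed with a few lines of Python. The helper output_pixel_size is a hypothetical name used only for this sketch. It rounds to the nearest pixel, which is an assumption and may differ from the renderer's rounding by one pixel.

```python
def output_pixel_size(
    width_pts: float,
    height_pts: float,
    target_longest_dim: int,
) -> tuple[int, int]:
    """Scale a page so its longest side equals target_longest_dim pixels."""
    scale = target_longest_dim / max(width_pts, height_pts)  # pixels per point
    return round(width_pts * scale), round(height_pts * scale)


# Page sizes in points, taken from the MediaBox examples in this document
pages = {"Letter": (612, 792), "A4": (595, 842), "Legal": (612, 1008)}
for name, (w, h) in pages.items():
    dpi = 1024 * 72 / max(w, h)
    print(name, round(dpi), output_pixel_size(w, h, 1024))
```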

Common Patterns

Adaptive Image Sizing

from olmocr.data.renderpdf import (
    get_pdf_media_box_width_height,
    render_pdf_to_base64png,
    get_png_dimensions_from_base64
)

def render_with_constraints(
    pdf_path: str,
    page: int,
    min_pixels: int = 1024,
    max_pixels: int = 2048
) -> str:
    """Render page with a target size chosen from the page's aspect ratio"""
    
    # Get original dimensions
    width, height = get_pdf_media_box_width_height(pdf_path, page)
    aspect = width / height
    
    # Use the larger target for extreme aspect ratios
    if aspect > 2.0 or aspect < 0.5:  # Very wide or very tall page
        target = max_pixels
    else:
        target = min_pixels
    
    # Render
    base64_img = render_pdf_to_base64png(
        local_pdf_path=pdf_path,
        page_num=page,
        target_longest_image_dim=target
    )
    
    # Verify output
    w, h = get_png_dimensions_from_base64(base64_img)
    print(f"Rendered {w}×{h} from {width:.0f}×{height:.0f}pt page")
    
    return base64_img

Batch Processing with Progress

import base64
import os

from pypdf import PdfReader
from tqdm import tqdm
from olmocr.data.renderpdf import render_pdf_to_base64png

def render_all_pages(pdf_path: str, output_dir: str) -> None:
    """Render all pages in a PDF to individual PNG files"""
    
    reader = PdfReader(pdf_path)
    num_pages = len(reader.pages)
    
    os.makedirs(output_dir, exist_ok=True)
    
    for page_num in tqdm(range(1, num_pages + 1), desc="Rendering"):
        # Render page
        base64_png = render_pdf_to_base64png(
            local_pdf_path=pdf_path,
            page_num=page_num,
            target_longest_image_dim=1024
        )
        
        # Save to file
        output_path = os.path.join(
            output_dir,
            f"page_{page_num:04d}.png"
        )
        with open(output_path, "wb") as f:
            f.write(base64.b64decode(base64_png))

Requirements

This module requires Poppler utilities to be installed:

Ubuntu/Debian

sudo apt-get install poppler-utils

macOS

brew install poppler

Verify Installation

pdfinfo -v
pdftoppm -v
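
The same check can be made from Python before any rendering starts. This is a small sketch using only the standard library; missing_poppler_tools is a hypothetical helper, not part of olmocr.

```python
import shutil

REQUIRED_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools() -> list[str]:
    """Return the required Poppler executables that are not on PATH."""
    return [tool for tool in REQUIRED_TOOLS if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Missing Poppler tools: " + ", ".join(missing))
else:
    print("Poppler tools found")
```

Failing early this way gives a clearer message than a failed subprocess call in the middle of a batch.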

Performance Notes

Rendering speed: Typically 50-200ms per page depending on complexity and target size
Memory usage: Peak memory ~3-5x uncompressed image size
Base64 overhead: Base64 encoding increases size by ~33%
WebP vs PNG: WebP typically 25-35% smaller than PNG
Dimension extraction: Sub-millisecond using get_png_dimensions_from_base64
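
The base64 overhead figure follows from the encoding itself: every 3 input bytes become 4 output characters. A short check, using a zero-filled buffer as a stand-in for image data (base64_length is a hypothetical helper name):

```python
import base64
import math


def base64_length(num_bytes: int) -> int:
    """Length of the padded base64 encoding of num_bytes bytes."""
    return 4 * math.ceil(num_bytes / 3)


raw = bytes(300_000)  # stand-in for 300 kB of image data
encoded = base64.b64encode(raw)
assert len(encoded) == base64_length(len(raw))
print(len(encoded) / len(raw))  # ratio of encoded size to raw size, about 4/3
```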

Error Handling

import subprocess
from typing import Optional

from olmocr.data.renderpdf import render_pdf_to_base64png

def safe_render(pdf_path: str, page: int) -> Optional[str]:
    """Render with error handling"""
    try:
        return render_pdf_to_base64png(
            local_pdf_path=pdf_path,
            page_num=page,
            target_longest_image_dim=1024
        )
    except ValueError as e:
        print(f"Failed to get page dimensions: {e}")
        return None
    except AssertionError as e:
        print(f"Rendering failed: {e}")
        return None
    except subprocess.TimeoutExpired:
        print(f"Rendering timed out for page {page}")
        return None

Related

Pipeline Module - Uses rendering functions for page processing
Work Queue API - Coordinates distributed processing
