The olmocr.data.renderpdf module provides functions for rendering PDF pages to images with precise dimension control.

Overview

This module uses Poppler utilities (pdfinfo, pdftoppm) to:
  • Extract PDF page dimensions
  • Render pages to PNG format
  • Convert to WebP format
  • Encode as base64 strings
  • Extract image dimensions efficiently

Functions

render_pdf_to_base64png

Renders a PDF page to a base64-encoded PNG image.
def render_pdf_to_base64png(
    local_pdf_path: str,
    page_num: int,
    target_longest_image_dim: int = 2048
) -> str
Parameters:
  • local_pdf_path (string, required): Path to the PDF file on local disk.
  • page_num (int, required): Page number to render (1-indexed).
  • target_longest_image_dim (int, default 2048): Target size in pixels for the longest side of the output image. The page is rendered at the DPI needed to reach this size while preserving the aspect ratio.
Returns: Base64-encoded PNG image as a string.
Raises:
  • AssertionError: If pdftoppm command fails
  • ValueError: If page dimensions cannot be determined

How It Works

  1. Extracts the PDF page’s MediaBox dimensions using pdfinfo
  2. Calculates the longest dimension (width or height)
  3. Computes required DPI: target_longest_image_dim * 72 / longest_dim
    • PDF dimensions are in points (1 point = 1/72 inch)
    • DPI controls pixel density during rendering
  4. Renders page to PNG using pdftoppm at calculated DPI
  5. Encodes PNG bytes as base64 string
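
The steps above can be sketched in a few lines. This is a minimal illustration, not the module's actual implementation: the helper names compute_render_dpi and build_pdftoppm_command are hypothetical, and the sketch only builds the pdftoppm argument list (using the standard -png, -f, -l and -r options) rather than running it.

```python
def compute_render_dpi(
    width_pts: float,
    height_pts: float,
    target_longest_image_dim: int,
) -> float:
    """Step 2-3: DPI that maps the longest page side to the target pixel size."""
    longest_dim = max(width_pts, height_pts)
    if longest_dim <= 0:
        raise ValueError("Page dimensions must be positive")
    # PDF units are points, and there are 72 points per inch
    return target_longest_image_dim * 72 / longest_dim


def build_pdftoppm_command(local_pdf_path: str, page_num: int, dpi: float) -> list[str]:
    """Step 4: argument list that renders a single page as PNG (hypothetical helper)."""
    return [
        "pdftoppm",
        "-png",               # PNG output
        "-f", str(page_num),  # first page to render
        "-l", str(page_num),  # last page to render (same page)
        "-r", str(dpi),       # resolution in DPI
        local_pdf_path,
    ]


# Letter page (612 x 792 points) with a 1024 px longest side
dpi = compute_render_dpi(612.0, 792.0, 1024)
command = build_pdftoppm_command("document.pdf", 1, dpi)
print(command)
```

Running such a command with subprocess.run, capturing stdout, and passing the bytes through base64.b64encode would cover steps 4 and 5.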

Example Usage

from olmocr.data.renderpdf import render_pdf_to_base64png

# Render page 1 with longest side = 1024px
base64_image = render_pdf_to_base64png(
    local_pdf_path="/path/to/document.pdf",
    page_num=1,
    target_longest_image_dim=1024
)

print(f"Image length: {len(base64_image)} characters")
print(f"First 50 chars: {base64_image[:50]}")
Output:
Image length: 245832 characters
First 50 chars: iVBORw0KGgoAAAANSUhEUgAABQAAAASwCAYAAAAvZgCeAA

Using with Vision Models

from olmocr.data.renderpdf import render_pdf_to_base64png

# Render page for model input
base64_png = render_pdf_to_base64png(
    local_pdf_path="document.pdf",
    page_num=1,
    target_longest_image_dim=1024
)

# Use in API request
payload = {
    "model": "vision-model",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract text from this page"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_png}"
                    }
                }
            ]
        }
    ]
}

render_pdf_to_base64webp

Renders a PDF page to a base64-encoded WebP image.
def render_pdf_to_base64webp(
    local_pdf_path: str,
    page: int,
    target_longest_image_dim: int = 1024
) -> str
Parameters:
  • local_pdf_path (string, required): Path to the PDF file on local disk.
  • page (int, required): Page number to render (1-indexed).
  • target_longest_image_dim (int, default 1024): Target size in pixels for the longest side of the output image.
Returns: Base64-encoded WebP image as a string.

Example Usage

from olmocr.data.renderpdf import render_pdf_to_base64webp

# Render as WebP (typically smaller file size)
base64_webp = render_pdf_to_base64webp(
    local_pdf_path="document.pdf",
    page=1,
    target_longest_image_dim=1024
)

# WebP usually produces smaller base64 strings
print(f"WebP image length: {len(base64_webp)}")

get_pdf_media_box_width_height

Extracts the MediaBox dimensions for a specific PDF page.
def get_pdf_media_box_width_height(
    local_pdf_path: str,
    page_num: int
) -> tuple[float, float]
Parameters:
  • local_pdf_path (string, required): Path to the PDF file.
  • page_num (int, required): Page number to query (1-indexed).
Returns: Tuple of (width, height) in points (1 point = 1/72 inch).
Raises:
  • ValueError: If pdfinfo command fails or MediaBox not found

Example Usage

from olmocr.data.renderpdf import get_pdf_media_box_width_height

# Get dimensions for page 1
width, height = get_pdf_media_box_width_height(
    local_pdf_path="document.pdf",
    page_num=1
)

print(f"Page dimensions: {width} x {height} points")
print(f"Aspect ratio: {width/height:.2f}")

# Determine orientation
if width > height:
    print("Landscape orientation")
else:
    print("Portrait orientation")
Output:
Page dimensions: 612.0 x 792.0 points
Aspect ratio: 0.77
Portrait orientation

Understanding MediaBox

The MediaBox defines the boundaries of the physical medium (the paper size) for a page. Common sizes:
  • Letter: 612 x 792 points (8.5 in x 11 in)
  • A4: 595 x 842 points (210 mm x 297 mm)
  • Legal: 612 x 1008 points (8.5 in x 14 in)
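
As a rough sketch of how MediaBox values can be turned into a (width, height) pair, the snippet below parses text in the style that pdfinfo prints when box information is requested. The function name parse_media_box and the sample output are assumptions made for illustration; the module's own parsing may differ.

```python
def parse_media_box(pdfinfo_output: str) -> tuple[float, float]:
    """Return (width, height) in points from pdfinfo-style text (hypothetical helper)."""
    for line in pdfinfo_output.splitlines():
        if "MediaBox" in line:
            # Everything after the first colon is "x0 y0 x1 y1"
            x0, y0, x1, y1 = (float(v) for v in line.split(":", 1)[1].split())
            return abs(x1 - x0), abs(y1 - y0)
    raise ValueError("MediaBox not found in pdfinfo output")


# Illustrative sample only; real pdfinfo output contains more fields
sample_output = """Pages:          3
Page    1 size: 612 x 792 pts (letter)
Page    1 MediaBox:     0.00     0.00   612.00   792.00
Page    1 CropBox:      0.00     0.00   612.00   792.00
"""

width, height = parse_media_box(sample_output)
print(f"{width} x {height} points")
```

Taking absolute differences keeps the result positive even if a file stores the box corners in an unusual order.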

get_png_dimensions_from_base64

Extracts PNG image dimensions from base64 data without full decoding.
def get_png_dimensions_from_base64(
    base64_data: str
) -> tuple[int, int]
Parameters:
  • base64_data (string, required): Base64-encoded PNG image data.
Returns: Tuple of (width, height) in pixels.
Raises:
  • ValueError: If data is not a valid PNG or dimensions cannot be extracted

How It Works

This function avoids decoding the full image:
  1. Validates the PNG signature without decoding the entire base64 string
  2. Computes the minimal byte range needed for the dimensions (bytes 16 through 23, inside the IHDR chunk)
  3. Decodes only those specific bytes
  4. Extracts width and height as big-endian 32-bit integers
Performance: Processes images in microseconds, versus milliseconds for a full decode.
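
The idea can be illustrated with a small sketch that relies only on the PNG file layout: an 8-byte signature, then the IHDR chunk's 4-byte length and 4-byte type, then width and height as 4-byte big-endian integers. The function name png_dimensions_from_base64_prefix is hypothetical and this is not the module's own code; it simply decodes the first 32 base64 characters, which correspond to exactly the first 24 bytes.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions_from_base64_prefix(base64_data: str) -> tuple[int, int]:
    """Read PNG width/height from the first 24 bytes only (hypothetical helper)."""
    # Every 4 base64 characters encode 3 bytes, so 32 characters -> 24 bytes
    prefix = base64_data[:32]
    if len(prefix) < 32:
        raise ValueError("Not enough data for a PNG header")
    header = base64.b64decode(prefix)
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("Not a valid PNG")
    # Bytes 16-19 hold the width, bytes 20-23 the height, both big-endian
    width, height = struct.unpack(">II", header[16:24])
    return width, height


# Build a fake PNG header for demonstration (not a complete, viewable image)
fake_png = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)           # IHDR data length
    + b"IHDR"
    + struct.pack(">II", 1583, 2048)  # width, height
    + b"\x08\x02\x00\x00\x00"         # remaining IHDR fields
)
demo_b64 = base64.b64encode(fake_png).decode("ascii")
dims = png_dimensions_from_base64_prefix(demo_b64)
print(dims)
```

Because only a fixed-size prefix is decoded, the cost does not grow with the size of the image.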

Example Usage

from olmocr.data.renderpdf import (
    render_pdf_to_base64png,
    get_png_dimensions_from_base64
)

# Render PDF page
base64_png = render_pdf_to_base64png(
    local_pdf_path="document.pdf",
    page_num=1,
    target_longest_image_dim=2048
)

# Extract dimensions efficiently
width, height = get_png_dimensions_from_base64(base64_png)
print(f"Image dimensions: {width} x {height}")
print(f"Megapixels: {(width * height) / 1_000_000:.2f}")

# Verify target dimension
longest_side = max(width, height)
print(f"Longest side: {longest_side} (target: 2048)")
Output:
Image dimensions: 1583 x 2048
Megapixels: 3.24
Longest side: 2048 (target: 2048)

Image Dimension Handling

Target Dimension Calculation

The rendering process maintains aspect ratio while targeting a specific longest dimension:
def calculate_dpi(
    pdf_width: float,
    pdf_height: float,
    target_longest_dim: int
) -> float:
    """Calculate DPI needed to achieve target dimension"""
    longest_dim_points = max(pdf_width, pdf_height)
    # 72 points per inch is PDF standard
    dpi = (target_longest_dim * 72) / longest_dim_points
    return dpi

# Example: Letter page (612x792 pts) to 1024px longest side
dpi = calculate_dpi(612, 792, 1024)
# dpi = (1024 * 72) / 792 = 93.1 DPI

# Result: 792 points * (93.1 / 72) = 1024 pixels (height)
#         612 points * (93.1 / 72) = 791 pixels (width)

Example Dimensions

PDF Size   Points (W×H)   Target   DPI (rounded)   Output Pixels (W×H, approx.)
Letter     612 × 792      1024     93              791 × 1024
Letter     612 × 792      2048     186             1582 × 2048
A4         595 × 842      1024     88              723 × 1024
Legal      612 × 1008     1024     73              621 × 1024

Common Patterns

Adaptive Image Sizing

from olmocr.data.renderpdf import (
    get_pdf_media_box_width_height,
    render_pdf_to_base64png,
    get_png_dimensions_from_base64
)

def render_with_constraints(
    pdf_path: str,
    page: int,
    min_pixels: int = 1024,
    max_pixels: int = 2048
) -> str:
    """Render a page, choosing the target size from its aspect ratio"""
    
    # Get original dimensions
    width, height = get_pdf_media_box_width_height(pdf_path, page)
    aspect = width / height
    
    # Use the larger target for extreme aspect ratios (very wide or very tall pages)
    if aspect > 2.0 or aspect < 0.5:
        target = max_pixels
    else:
        target = min_pixels
    
    # Render
    base64_img = render_pdf_to_base64png(
        local_pdf_path=pdf_path,
        page_num=page,
        target_longest_image_dim=target
    )
    
    # Verify output
    w, h = get_png_dimensions_from_base64(base64_img)
    print(f"Rendered {w}×{h} from {width:.0f}×{height:.0f}pt page")
    
    return base64_img

Batch Processing with Progress

import base64
import os

from pypdf import PdfReader
from tqdm import tqdm
from olmocr.data.renderpdf import render_pdf_to_base64png

def render_all_pages(pdf_path: str, output_dir: str) -> None:
    """Render all pages in a PDF to individual PNG files"""
    
    reader = PdfReader(pdf_path)
    num_pages = len(reader.pages)
    
    os.makedirs(output_dir, exist_ok=True)
    
    for page_num in tqdm(range(1, num_pages + 1), desc="Rendering"):
        # Render page
        base64_png = render_pdf_to_base64png(
            local_pdf_path=pdf_path,
            page_num=page_num,
            target_longest_image_dim=1024
        )
        
        # Save to file
        output_path = os.path.join(
            output_dir,
            f"page_{page_num:04d}.png"
        )
        with open(output_path, "wb") as f:
            f.write(base64.b64decode(base64_png))

Requirements

This module requires Poppler utilities to be installed:

Ubuntu/Debian

sudo apt-get install poppler-utils

macOS

brew install poppler

Verify Installation

pdfinfo -v
pdftoppm -v
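
The same check can be made from Python before calling the module. The sketch below uses the standard library's shutil.which to look for the two executables on PATH; the helper name find_missing_tools is hypothetical and not part of olmocr.

```python
import shutil


def find_missing_tools(tools: tuple[str, ...] = ("pdfinfo", "pdftoppm")) -> list[str]:
    """Return the names of tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print(f"Missing Poppler utilities: {', '.join(missing)}")
else:
    print("Poppler utilities found on PATH")
```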

Performance Notes

  • Rendering speed: Typically 50-200ms per page depending on complexity and target size
  • Memory usage: Peak memory ~3-5x uncompressed image size
  • Base64 overhead: Base64 encoding increases size by ~33%
  • WebP vs PNG: WebP typically 25-35% smaller than PNG
  • Dimension extraction: Sub-millisecond using get_png_dimensions_from_base64
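
The base64 overhead figure follows from the encoding itself: every 3 input bytes become 4 output characters. A quick check with random bytes standing in for image data:

```python
import base64
import os

# Random bytes stand in for PNG data; the length is a multiple of 3, so no padding is added
raw = os.urandom(300_000)
encoded = base64.b64encode(raw)

overhead = len(encoded) / len(raw) - 1
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} chars, overhead: {overhead:.1%}")
```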

Error Handling

import subprocess
from typing import Optional

from olmocr.data.renderpdf import render_pdf_to_base64png

def safe_render(pdf_path: str, page: int) -> Optional[str]:
    """Render with error handling"""
    try:
        return render_pdf_to_base64png(
            local_pdf_path=pdf_path,
            page_num=page,
            target_longest_image_dim=1024
        )
    except ValueError as e:
        print(f"Failed to get page dimensions: {e}")
        return None
    except AssertionError as e:
        print(f"Rendering failed: {e}")
        return None
    except subprocess.TimeoutExpired:
        print(f"Rendering timed out for page {page}")
        return None
