The olmocr.data.renderpdf module provides functions for rendering PDF pages to images with precise dimension control.
Overview
This module uses Poppler utilities (pdfinfo, pdftoppm) to:
- Extract PDF page dimensions
- Render pages to PNG format
- Convert to WebP format
- Encode as base64 strings
- Extract image dimensions efficiently
Functions
render_pdf_to_base64png
Renders a PDF page to a base64-encoded PNG image.
def render_pdf_to_base64png(
    local_pdf_path: str,
    page_num: int,
    target_longest_image_dim: int = 2048
) -> str
Parameters:
local_pdf_path (str): Path to the PDF file on local disk.
page_num (int): Page number to render (1-indexed).
target_longest_image_dim (int): Target length, in pixels, of the longest side of the output image. The page is rendered at the DPI needed to reach this length while preserving the aspect ratio. Default: 2048
Returns: Base64-encoded PNG image as a string.
Raises:
AssertionError: If pdftoppm command fails
ValueError: If page dimensions cannot be determined
How It Works
- Extracts the PDF page’s MediaBox dimensions using pdfinfo
- Calculates the longest dimension (width or height)
- Computes the required DPI: target_longest_image_dim * 72 / longest_dim
  - PDF dimensions are in points (1 point = 1/72 inch)
  - DPI controls pixel density during rendering
- Renders the page to PNG using pdftoppm at the calculated DPI
- Encodes the PNG bytes as a base64 string
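The steps above can be sketched as a small helper that computes the DPI and assembles a pdftoppm argument list. This is an illustrative sketch, not the module's actual source: build_pdftoppm_command is a hypothetical name, and it assumes the page size in points is already known (for example from get_pdf_media_box_width_height).

```python
def build_pdftoppm_command(
    local_pdf_path: str,
    page_num: int,
    page_width_pt: float,
    page_height_pt: float,
    target_longest_image_dim: int = 2048,
) -> list[str]:
    """Assemble a pdftoppm call that renders a single page as PNG (hypothetical helper)."""
    longest_dim_pt = max(page_width_pt, page_height_pt)
    if longest_dim_pt <= 0:
        raise ValueError("page dimensions must be positive")
    # PDF points are 1/72 inch, so pixels = points * dpi / 72.
    dpi = target_longest_image_dim * 72 / longest_dim_pt
    return [
        "pdftoppm",
        "-png",               # PNG output
        "-f", str(page_num),  # first page to render
        "-l", str(page_num),  # last page to render (the same page)
        "-r", str(dpi),       # resolution in DPI for both axes
        local_pdf_path,
    ]


# Letter page (612 x 792 pt) rendered with a 1024 px longest side
cmd = build_pdftoppm_command("document.pdf", 1, 612, 792, 1024)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run with stdout captured and a timeout set. The command has no output prefix, which is expected to make pdftoppm write the image to standard output, and the captured bytes can then be base64-encoded.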
Example Usage
from olmocr.data.renderpdf import render_pdf_to_base64png
# Render page 1 with longest side = 1024px
base64_image = render_pdf_to_base64png(
    local_pdf_path="/path/to/document.pdf",
    page_num=1,
    target_longest_image_dim=1024
)
print(f"Image length: {len(base64_image)} characters")
print(f"First 50 chars: {base64_image[:50]}")
Output:
Image length: 245832 characters
First 50 chars: iVBORw0KGgoAAAANSUhEUgAABQAAAASwCAYAAAAvZgCeAA
Using with Vision Models
from olmocr.data.renderpdf import render_pdf_to_base64png
# Render page for model input
base64_png = render_pdf_to_base64png(
    local_pdf_path="document.pdf",
    page_num=1,
    target_longest_image_dim=1024
)
# Use in API request: the image is passed as a PNG data URL
payload = {
    "model": "vision-model",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract text from this page"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_png}"
                    }
                }
            ]
        }
    ]
}
render_pdf_to_base64webp
Renders a PDF page to a base64-encoded WebP image.
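def render_pdf_to_base64webp(
    local_pdf_path: str,
    page: int,
    target_longest_image_dim: int = 1024
) -> str
Parameters:
local_pdf_path (str): Path to the PDF file on local disk.
page (int): Page number to render (1-indexed). Note that this parameter is named page, not page_num as in render_pdf_to_base64png.
target_longest_image_dim (int): Target length, in pixels, of the longest side of the output image. Default: 1024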
Returns: Base64-encoded WebP image as a string.
Example Usage
from olmocr.data.renderpdf import render_pdf_to_base64webp
# Render as WebP (typically smaller file size)
base64_webp = render_pdf_to_base64webp(
    local_pdf_path="document.pdf",
    page=1,
    target_longest_image_dim=1024
)
# WebP usually produces smaller base64 strings
print(f"WebP image length: {len(base64_webp)}")
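When both PNG and WebP strings are passed around, the data URL sent to a model should carry the matching MIME type. The following is a minimal sketch of one way to pick it by inspecting the first bytes of the image. The helper names sniff_image_mime and to_data_url are hypothetical and not part of olmocr.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_image_mime(base64_image: str) -> str:
    """Return 'image/png' or 'image/webp' based on the file's magic bytes."""
    # 16 base64 characters decode to exactly 12 bytes, enough for both checks.
    header = base64.b64decode(base64_image[:16])
    if header.startswith(PNG_SIGNATURE):
        return "image/png"
    # WebP is a RIFF container: 'RIFF', a 4-byte size, then 'WEBP'.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("not a PNG or WebP image")


def to_data_url(base64_image: str) -> str:
    """Build a data URL whose MIME type matches the image bytes."""
    return f"data:{sniff_image_mime(base64_image)};base64,{base64_image}"


# Hand-built WebP header, used here only to exercise the helper
fake_webp = base64.b64encode(b"RIFF\x00\x00\x00\x00WEBPVP8 ").decode("ascii")
print(sniff_image_mime(fake_webp))  # image/webp
```

The same helper accepts the output of render_pdf_to_base64png or render_pdf_to_base64webp, so calling code does not need to track which renderer produced a given string.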
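get_pdf_media_box_width_height
Extracts the MediaBox dimensions for a specific PDF page.
def get_pdf_media_box_width_height(
    local_pdf_path: str,
    page_num: int
) -> tuple[float, float]
Parameters:
local_pdf_path (str): Path to the PDF file on local disk.
page_num (int): Page number to query (1-indexed).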
Returns: Tuple of (width, height) in points (1 point = 1/72 inch).
Raises:
ValueError: If pdfinfo command fails or MediaBox not found
Example Usage
from olmocr.data.renderpdf import get_pdf_media_box_width_height
# Get dimensions for page 1
width, height = get_pdf_media_box_width_height(
    local_pdf_path="document.pdf",
    page_num=1
)
print(f"Page dimensions: {width} x {height} points")
print(f"Aspect ratio: {width/height:.2f}")
# Determine orientation
if width > height:
    print("Landscape orientation")
else:
    print("Portrait orientation")
Output:
Page dimensions: 612.0 x 792.0 points
Aspect ratio: 0.77
Portrait orientation
The MediaBox defines the boundaries of the physical medium (paper size):
- Letter: 612 x 792 points (8.5” x 11”)
- A4: 595 x 842 points (210mm x 297mm)
- Legal: 612 x 1008 points (8.5” x 14”)
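To illustrate how a MediaBox turns into a width and height, here is a minimal sketch that parses the text printed by pdfinfo. It assumes the output contains a line with "MediaBox:" followed by four numbers (x0, y0, x1, y1). The sample text and the name parse_media_box are illustrative assumptions, not the module's actual code.

```python
def parse_media_box(pdfinfo_output: str) -> tuple[float, float]:
    """Return (width, height) in points from the first MediaBox line."""
    for line in pdfinfo_output.splitlines():
        if "MediaBox" in line:
            # The four numbers after the colon are x0, y0, x1, y1 in points.
            x0, y0, x1, y1 = (float(v) for v in line.split(":", 1)[1].split())
            return abs(x1 - x0), abs(y1 - y0)
    raise ValueError("MediaBox not found in pdfinfo output")


# Assumed shape of the pdfinfo output for a Letter page
sample = (
    "Page    1 size: 612 x 792 pts (letter)\n"
    "Page    1 MediaBox:     0.00     0.00   612.00   792.00\n"
)
print(parse_media_box(sample))  # (612.0, 792.0)
```

Taking the absolute difference of the corners keeps the result positive even when a PDF lists the corners in an unusual order.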
get_png_dimensions_from_base64
Extracts PNG image dimensions from base64 data without full decoding.
def get_png_dimensions_from_base64(
    base64_data: str
) -> tuple[int, int]
Parameters:
base64_data (str): Base64-encoded PNG image data.
Returns: Tuple of (width, height) in pixels.
Raises:
ValueError: If data is not a valid PNG or dimensions cannot be extracted
How It Works
This function avoids decoding the whole image:
- Validates the PNG signature without decoding the entire base64 string
- Computes the minimal byte range needed for the dimensions (bytes 16-23, zero-indexed: width in bytes 16-19 and height in bytes 20-23 of the IHDR chunk)
- Decodes only the base64 characters that cover those bytes
- Extracts width and height as big-endian 32-bit integers
Performance: Reading the header takes microseconds, versus milliseconds for a full decode.
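The header-only approach described above can be sketched with the standard library alone. This is an illustration of the technique rather than the module's actual implementation; the name png_dimensions_from_base64_prefix is hypothetical.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions_from_base64_prefix(base64_data: str) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk using only the first 24 bytes."""
    # 32 base64 characters decode to exactly 24 bytes:
    # 8-byte signature, 4-byte chunk length, 4-byte 'IHDR', 4-byte width, 4-byte height.
    if len(base64_data) < 32:
        raise ValueError("base64 data too short to contain a PNG header")
    header = base64.b64decode(base64_data[:32])
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a valid PNG header")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


# Hand-built PNG header for a 791 x 1024 image (no pixel data needed)
ihdr = struct.pack(">II", 791, 1024) + b"\x08\x02\x00\x00\x00"
fake_png = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + ihdr
print(png_dimensions_from_base64_prefix(base64.b64encode(fake_png).decode("ascii")))  # (791, 1024)
```

Because 24 bytes map to exactly 32 base64 characters, the slice never cuts a base64 group in half, and the cost stays constant regardless of image size.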
Example Usage
from olmocr.data.renderpdf import (
    render_pdf_to_base64png,
    get_png_dimensions_from_base64
)
# Render PDF page
base64_png = render_pdf_to_base64png(
    local_pdf_path="document.pdf",
    page_num=1,
    target_longest_image_dim=2048
)
# Extract dimensions efficiently
width, height = get_png_dimensions_from_base64(base64_png)
print(f"Image dimensions: {width} x {height}")
print(f"Megapixels: {(width * height) / 1_000_000:.2f}")
# Verify target dimension
longest_side = max(width, height)
print(f"Longest side: {longest_side} (target: 2048)")
Output:
Image dimensions: 1583 x 2048
Megapixels: 3.24
Longest side: 2048 (target: 2048)
Image Dimension Handling
Target Dimension Calculation
The rendering process maintains aspect ratio while targeting a specific longest dimension:
def calculate_dpi(
    pdf_width: float,
    pdf_height: float,
    target_longest_dim: int
) -> float:
    """Calculate DPI needed to achieve target dimension"""
    longest_dim_points = max(pdf_width, pdf_height)
    # 72 points per inch is PDF standard
    dpi = (target_longest_dim * 72) / longest_dim_points
    return dpi

# Example: Letter page (612x792 pts) to 1024px longest side
dpi = calculate_dpi(612, 792, 1024)
# dpi = (1024 * 72) / 792 = 93.1 DPI
# Result: 792 points * (93.1 / 72) = 1024 pixels (height)
#         612 points * (93.1 / 72) = 791 pixels (width)
Example Dimensions
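| PDF Size | Points (W×H) | Target (px) | DPI | Output Pixels (W×H) |
|---|---|---|---|---|
| Letter | 612 × 792 | 1024 | 93 | 791 × 1024 |
| Letter | 612 × 792 | 2048 | 186 | 1583 × 2048 |
| A4 | 595 × 842 | 1024 | 88 | 724 × 1024 |
| Legal | 612 × 1008 | 1024 | 73 | 622 × 1024 |
DPI and pixel values are rounded to the nearest integer; the rendered image may differ by one pixel depending on how pdftoppm rounds.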
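The arithmetic behind these figures can be checked with a short script. This is a sketch that rounds to the nearest pixel, and the renderer's own rounding may differ by one pixel. The names expected_output_size and PAGE_SIZES_PT are hypothetical.

```python
def expected_output_size(
    width_pt: float,
    height_pt: float,
    target_longest_dim: int,
) -> tuple[float, int, int]:
    """Return (dpi, width_px, height_px) for a page scaled to the target longest side."""
    longest_pt = max(width_pt, height_pt)
    dpi = target_longest_dim * 72 / longest_pt  # 72 points per inch
    scale = target_longest_dim / longest_pt     # pixels per point
    return dpi, round(width_pt * scale), round(height_pt * scale)


# Common page sizes in points (width, height)
PAGE_SIZES_PT = {"Letter": (612, 792), "A4": (595, 842), "Legal": (612, 1008)}

for name, (w_pt, h_pt) in PAGE_SIZES_PT.items():
    dpi, w_px, h_px = expected_output_size(w_pt, h_pt, 1024)
    print(f"{name}: {dpi:.1f} DPI -> {w_px} x {h_px} px")
```

Only the longest side is guaranteed to land on the target; the shorter side follows from the aspect ratio.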
Common Patterns
Adaptive Image Sizing
from olmocr.data.renderpdf import (
    get_pdf_media_box_width_height,
    render_pdf_to_base64png,
    get_png_dimensions_from_base64
)

def render_with_constraints(
    pdf_path: str,
    page: int,
    min_pixels: int = 1024,
    max_pixels: int = 2048
) -> str:
    """Render a page, choosing the target size from its aspect ratio"""
    # Get original dimensions in points
    width, height = get_pdf_media_box_width_height(pdf_path, page)
    aspect = width / height
    # Very wide (> 2:1) or very tall (< 1:2) pages get the larger target
    if aspect > 2.0 or aspect < 0.5:
        target = max_pixels
    else:
        target = min_pixels
    # Render
    base64_img = render_pdf_to_base64png(
        local_pdf_path=pdf_path,
        page_num=page,
        target_longest_image_dim=target
    )
    # Verify output
    w, h = get_png_dimensions_from_base64(base64_img)
    print(f"Rendered {w}×{h} from {width:.0f}×{height:.0f}pt page")
    return base64_img
Batch Processing with Progress
import base64
import os

from pypdf import PdfReader
from tqdm import tqdm
from olmocr.data.renderpdf import render_pdf_to_base64png

def render_all_pages(pdf_path: str, output_dir: str):
    """Render all pages in a PDF to individual images"""
    reader = PdfReader(pdf_path)
    num_pages = len(reader.pages)
    os.makedirs(output_dir, exist_ok=True)
    # Pages are 1-indexed, matching render_pdf_to_base64png
    for page_num in tqdm(range(1, num_pages + 1), desc="Rendering"):
        # Render page
        base64_png = render_pdf_to_base64png(
            local_pdf_path=pdf_path,
            page_num=page_num,
            target_longest_image_dim=1024
        )
        # Save to file
        output_path = os.path.join(
            output_dir,
            f"page_{page_num:04d}.png"
        )
        with open(output_path, "wb") as f:
            f.write(base64.b64decode(base64_png))
Requirements
This module requires Poppler utilities to be installed:
Ubuntu/Debian
sudo apt-get install poppler-utils
macOS
brew install poppler
Verify Installation
pdftoppm -v
pdfinfo -v
Performance Considerations
- Rendering speed: Typically 50-200ms per page, depending on page complexity and target size
- Memory usage: Peak memory is roughly 3-5x the uncompressed image size
- Base64 overhead: Base64 encoding increases size by ~33% (4 output characters for every 3 input bytes)
- WebP vs PNG: WebP output is typically 25-35% smaller than PNG
- Dimension extraction: Sub-millisecond using get_png_dimensions_from_base64
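The base64 overhead figure follows directly from the encoding: every 3 input bytes become 4 output characters. A short sketch makes this concrete. The helper name base64_encoded_length is hypothetical, and the zero-filled buffer stands in for real image bytes.

```python
import base64


def base64_encoded_length(num_bytes: int) -> int:
    """Length of standard padded base64 output for num_bytes of input."""
    # Each group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((num_bytes + 2) // 3)


raw = bytes(300_000)  # stand-in for 300,000 bytes of image data
encoded = base64.b64encode(raw)
print(len(encoded), base64_encoded_length(len(raw)))  # 400000 400000
print(f"overhead: {len(encoded) / len(raw) - 1:.1%}")  # overhead: 33.3%
```

This is useful for estimating request payload sizes before rendering, since the encoded length depends only on the byte count of the image.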
Error Handling
import subprocess
from typing import Optional

from olmocr.data.renderpdf import render_pdf_to_base64png

def safe_render(pdf_path: str, page: int) -> Optional[str]:
    """Render with error handling; returns None on failure"""
    try:
        return render_pdf_to_base64png(
            local_pdf_path=pdf_path,
            page_num=page,
            target_longest_image_dim=1024
        )
    except ValueError as e:
        # Raised when page dimensions cannot be determined
        print(f"Failed to get page dimensions: {e}")
        return None
    except AssertionError as e:
        # Raised when the pdftoppm command fails
        print(f"Rendering failed: {e}")
        return None
    except subprocess.TimeoutExpired:
        print(f"Rendering timed out for page {page}")
        return None