PDF Filtering

Overview

The filter module provides the PdfFilter class for identifying and filtering out low-quality, spam, or unsuitable PDF documents before processing. It performs language detection, form detection, and spam detection.

PdfFilter Class

from lingua import Language
from olmocr.filter import PdfFilter

# Named pdf_filter to avoid shadowing Python's built-in filter()
pdf_filter = PdfFilter(
    languages_to_keep=[Language.ENGLISH],
    apply_form_check=True,
    apply_download_spam_check=True,
    download_spam_threshold=0.004,
)

Constructor Parameters

languages_to_keep (List[Language], default: [Language.ENGLISH])

List of languages to accept. PDFs in other languages are filtered out. Include None in the list to keep documents where language detection fails (potentially OCR’d documents).

apply_form_check (bool, default: True)

Enables detection and filtering of PDF forms.

apply_download_spam_check (bool, default: True)

Enables detection of SEO/download spam documents.

download_spam_threshold (float, default: 0.004)

Threshold for the ratio of spam words to total words (0.004 means 0.4% of words are spam-related). Documents whose ratio exceeds this value are filtered out.

Methods

filter_out_pdf

Main filtering method that determines whether a PDF should be excluded from processing.

def filter_out_pdf(self, local_pdf_path: str) -> bool

local_pdf_path (str, required)

Path to the local PDF file to analyze.

Returns: True if the PDF should be filtered out (excluded), False if it should be kept

Filter Criteria

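The checks below are listed in order, and the first one that applies decides the outcome:

Read Error - The PDF cannot be opened or read: filtered out
Form Detection - The PDF contains form fields (when apply_form_check=True): filtered out
Text Length - The extracted text is shorter than 200 characters, too little to analyze: kept on the “safe side”
Alpha Ratio - Less than 50% of the extracted characters are alphabetic: kept on the “safe side”
Language - The primary language is not in languages_to_keep: filtered out
Spam Detection - The SEO/download spam score exceeds download_spam_threshold (when apply_download_spam_check=True): filtered out

A PDF that triggers none of these checks is kept.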
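The text-based criteria above can be sketched as a small decision function. This is an illustration under stated assumptions, not the olmocr implementation: should_filter_out, detect_language, and is_spam are hypothetical names, languages are plain strings standing in for lingua Language values, and the read-error and form checks are left out because they need a real PDF file.

```python
from typing import Callable, Optional, Sequence

MIN_TEXT_CHARS = 200    # text-length limit taken from the criteria above
MIN_ALPHA_RATIO = 0.50  # alpha-ratio limit taken from the criteria above


def should_filter_out(
    text: str,
    languages_to_keep: Sequence[Optional[str]],
    detect_language: Callable[[str], Optional[str]],  # hypothetical detector hook
    is_spam: Callable[[str], bool],                   # hypothetical spam hook
) -> bool:
    """Return True to exclude the document, False to keep it."""
    # Too little text to judge: keep on the safe side.
    if len(text) < MIN_TEXT_CHARS:
        return False

    # Mostly non-alphabetic text (possibly a poor OCR layer): keep on the safe side.
    alpha_count = sum(ch.isalpha() for ch in text)
    if alpha_count / len(text) < MIN_ALPHA_RATIO:
        return False

    # A detector may return None when it cannot decide; None is kept only
    # if the caller listed it explicitly.
    if detect_language(text) not in languages_to_keep:
        return True

    # Last check: spam decides the outcome.
    return is_spam(text)
```

Passing the detector and the spam check as callables keeps the sketch testable without lingua or a PDF on disk.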

Example Usage

from olmocr.filter import PdfFilter
from lingua import Language

# Initialize filter for English documents
pdf_filter = PdfFilter(
    languages_to_keep=[Language.ENGLISH],
    apply_form_check=True,
    apply_download_spam_check=True,
    download_spam_threshold=0.004
)

# Check if a PDF should be filtered
if pdf_filter.filter_out_pdf("/path/to/document.pdf"):
    print("PDF filtered out")
else:
    print("PDF accepted for processing")

Multi-language Example

from olmocr.filter import PdfFilter
from lingua import Language

# Accept English and Spanish documents, plus undetectable languages
pdf_filter = PdfFilter(
    languages_to_keep=[Language.ENGLISH, Language.SPANISH, None],
    apply_form_check=True,
    apply_download_spam_check=True
)

Disable Specific Checks

# Only filter by language, skip form and spam checks
pdf_filter = PdfFilter(
    languages_to_keep=[Language.ENGLISH],
    apply_form_check=False,
    apply_download_spam_check=False
)

Language Enum

The filter uses the lingua library for language detection. Common language values:

from lingua import Language

# Common languages
Language.ENGLISH
Language.SPANISH
Language.FRENCH
Language.GERMAN
Language.CHINESE
Language.JAPANESE
# ... and many more

See the lingua-language-detector documentation for the full list of supported languages.

Internal Methods

_is_form

def _is_form(self, pdf_reader) -> bool

Detects if a PDF contains form fields using PyPDF’s form detection.
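A form check of this kind might look like the sketch below. It assumes the reader object exposes get_fields(), as pypdf's PdfReader does; which reader call olmocr actually uses is not stated here, and has_form_fields and SupportsFormFields are hypothetical names.

```python
from typing import Any, Dict, Optional, Protocol


class SupportsFormFields(Protocol):
    """Anything with a pypdf-style get_fields() method (assumed interface)."""

    def get_fields(self) -> Optional[Dict[str, Any]]: ...


def has_form_fields(reader: SupportsFormFields) -> bool:
    """Return True if the reader reports at least one interactive form field."""
    # get_fields() returns None (or an empty mapping) when no form data is found.
    fields = reader.get_fields()
    return bool(fields)
```

With pypdf installed, this could be called as has_form_fields(PdfReader(path)).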

_is_download_spam

def _is_download_spam(self, base_text: str) -> bool

Analyzes text for spam keywords and calculates a spam score.

Spam keywords detected:

download, pdf, epub, mobi
free, ebook, file, save
casino, viagra, cialis, ciprofloxacin

A document is considered spam if the ratio of spam words to total words exceeds download_spam_threshold.
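The ratio test can be illustrated with the keyword list above. This is a minimal sketch, not the library's code: spam_ratio and looks_like_download_spam are hypothetical names, and the tokenization (lowercasing, then taking runs of ASCII letters as words) is an assumption.

```python
import re
from collections import Counter

# Keyword list copied from the section above.
SPAM_WORDS = frozenset({
    "download", "pdf", "epub", "mobi",
    "free", "ebook", "file", "save",
    "casino", "viagra", "cialis", "ciprofloxacin",
})


def spam_ratio(text: str) -> float:
    """Fraction of words in the text that are spam keywords."""
    # Assumed tokenization: lowercase, then runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    spam_hits = sum(counts[word] for word in SPAM_WORDS)
    return spam_hits / len(words)


def looks_like_download_spam(text: str, threshold: float = 0.004) -> bool:
    """Flag the text only when the ratio strictly exceeds the threshold."""
    return spam_ratio(text) > threshold
```

Under these assumptions and the default threshold, a 1,000-word sample is flagged once it contains more than four spam keywords.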

Filtering Workflow

Text Extraction

The filter uses pdftotext to extract text from the first 5 pages:

subprocess.run(
    ["pdftotext", "-f", "1", "-l", "5", local_pdf_path, "-"],
    timeout=60,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

Requires pdftotext to be installed on the system (part of poppler-utils).
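When calling pdftotext directly, the failure modes are worth handling explicitly. The sketch below wraps the same command; build_pdftotext_command and extract_leading_text are hypothetical helpers, and returning None on a missing binary, a timeout, or a non-zero exit code is a choice made for this example, not documented PdfFilter behavior.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_command(
    pdf_path: str,
    first_page: int = 1,
    last_page: int = 5,
    binary: str = "pdftotext",
) -> List[str]:
    """Build the argument list for extracting a page range to stdout."""
    # "-f" and "-l" select the first and last page; the trailing "-" writes to stdout.
    return [binary, "-f", str(first_page), "-l", str(last_page), pdf_path, "-"]


def extract_leading_text(
    pdf_path: str,
    timeout: float = 60.0,
    binary: str = "pdftotext",
) -> Optional[str]:
    """Return text from the first pages, or None if extraction fails."""
    try:
        result = subprocess.run(
            build_pdftotext_command(pdf_path, binary=binary),
            timeout=timeout,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except FileNotFoundError:
        return None  # pdftotext is not installed or not on PATH
    except subprocess.TimeoutExpired:
        return None  # extraction ran past the timeout
    if result.returncode != 0:
        return None  # pdftotext reported an error
    # Replace undecodable bytes rather than raising on malformed output.
    return result.stdout.decode("utf-8", errors="replace")
```

The caller decides how to treat None, for example by excluding the file or logging it for review.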

Performance Considerations

Safe-side Filtering

When text analysis is ambiguous (too little text or a low alpha ratio), the filter errs on the side of keeping the document, so that usable PDFs are not discarded by mistake.

First 5 Pages Only

Language and spam detection only analyze the first 5 pages for performance. This is typically sufficient for accurate classification.

Timeout Protection

Text extraction has a 60-second timeout to prevent hanging on corrupted files.

Batch Processing Example

from olmocr.filter import PdfFilter
from lingua import Language
from tqdm import tqdm

pdf_filter = PdfFilter(
    languages_to_keep=[Language.ENGLISH, None],
    apply_download_spam_check=True,
    apply_form_check=True,
)

pdf_paths = [...]  # Your list of PDF paths

keep_paths = []
remove_paths = []

for pdf_path in tqdm(pdf_paths, desc="Filtering PDFs"):
    if pdf_filter.filter_out_pdf(pdf_path):
        remove_paths.append(pdf_path)
    else:
        keep_paths.append(pdf_path)

print(f"Kept: {len(keep_paths)}, Removed: {len(remove_paths)}")
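For long batch runs it can help to isolate per-file failures so one bad file does not stop the loop. The sketch below is an assumption-laden variant of the loop above: partition_paths is a hypothetical helper, and whether filter_out_pdf can raise (for example on a timeout) is not stated in this page, so the third bucket is purely defensive.

```python
from typing import Callable, Iterable, List, Tuple


def partition_paths(
    paths: Iterable[str],
    should_filter_out: Callable[[str], bool],
) -> Tuple[List[str], List[str], List[Tuple[str, Exception]]]:
    """Split paths into kept, removed, and failed (path, exception) buckets."""
    kept: List[str] = []
    removed: List[str] = []
    failed: List[Tuple[str, Exception]] = []
    for path in paths:
        try:
            if should_filter_out(path):
                removed.append(path)
            else:
                kept.append(path)
        except Exception as exc:
            # Record the failure and keep going with the next file.
            failed.append((path, exc))
    return kept, removed, failed
```

It could be called as partition_paths(pdf_paths, pdf_filter.filter_out_pdf), with the failed bucket reviewed separately.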

Related APIs

Silver Data Generation - Uses PdfFilter before processing
Prompts API - Prompt construction for accepted PDFs
