Overview
The filter module provides the PdfFilter class for identifying and filtering out low-quality, spam, or otherwise unsuitable PDF documents before processing. It performs language detection, form detection, and spam detection.
PdfFilter Class
Constructor Parameters
- List of languages to accept. PDFs in other languages are filtered out. Include None in the list to keep documents where language detection fails (potentially OCR’d documents).
- Enable detection and filtering of PDF forms.
- Enable detection of SEO/download spam documents.
- Threshold for the spam word ratio (0.004 means 0.4% of words are spam-related).
Methods
filter_out_pdf
Main filtering method that determines whether a PDF should be excluded from processing.
- Parameter - Path to the local PDF file to analyze
- Returns - True if the PDF should be filtered out (excluded), False if it should be kept
Filter Criteria
The filter applies the following checks:
- Read Error - Filtered out if the PDF cannot be opened or read
- Form Detection - Filtered out if the PDF contains form fields (when apply_form_check=True)
- Text Length - Kept if the document has fewer than 200 characters of text, since language and spam analysis need at least that much
- Alpha Ratio - Kept if less than 50% of the characters are alphabetic (kept on the “safe side”)
- Language - Filtered out if the primary language is not in languages_to_keep
- Spam Detection - Filtered out if the SEO/download spam score exceeds the threshold
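The criteria above can be read as a single decision function. The following is a minimal sketch under stated assumptions, not the PdfFilter implementation: `should_filter_out` is a hypothetical helper, languages are represented as plain strings rather than enum values, the order of the checks is assumed to follow the list, and the read-error case is left out because the inputs are taken as already extracted.

```python
from typing import Optional, Sequence

MIN_TEXT_CHARS = 200    # minimum text length, from the criteria above
MIN_ALPHA_RATIO = 0.50  # minimum share of alphabetic characters, from the criteria above


def should_filter_out(
    text: str,
    has_form_fields: bool,
    detected_language: Optional[str],
    spam_ratio: float,
    languages_to_keep: Sequence[Optional[str]],
    apply_form_check: bool = True,            # illustrative default
    download_spam_threshold: float = 0.004,   # example value from this page
) -> bool:
    """Return True if the document should be excluded (hypothetical helper)."""
    if apply_form_check and has_form_fields:
        return True

    # Too little text to analyze: keep the document.
    if len(text) < MIN_TEXT_CHARS:
        return False

    # Mostly non-alphabetic text: keep the document.
    alpha_ratio = sum(ch.isalpha() for ch in text) / len(text)
    if alpha_ratio < MIN_ALPHA_RATIO:
        return False

    # detected_language is None when detection fails; such documents pass
    # only if None is itself listed in languages_to_keep.
    if detected_language not in languages_to_keep:
        return True

    return spam_ratio > download_spam_threshold
```

Note how a short or mostly numeric document returns False before the language check is reached, which matches the "safe side" behavior described for the text-length and alpha-ratio criteria.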
Example Usage
Multi-language Example
Disable Specific Checks
Language Enum
The filter uses the lingua library for language detection. Common language values:
Internal Methods
_is_form
_is_download_spam
- download, pdf, epub, mobi
- free, ebook, file, save
- casino, viagra, cialis, ciprofloxacin
A PDF is flagged as spam when the ratio of these words to the total word count exceeds download_spam_threshold.
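As an illustration of a spam word ratio, the sketch below counts how many words of a text belong to the keyword list above and compares that fraction to a threshold. The function names and the tokenization (lowercased runs of ASCII letters) are assumptions made for this example; the actual `_is_download_spam` method may tokenize and count differently.

```python
import re

# Keywords copied from the list above.
SPAM_WORDS = frozenset({
    "download", "pdf", "epub", "mobi",
    "free", "ebook", "file", "save",
    "casino", "viagra", "cialis", "ciprofloxacin",
})


def spam_word_ratio(text: str) -> float:
    """Fraction of words in `text` that are spam keywords (0.0 for empty text)."""
    # Assumed tokenization: lowercase the text and take runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in SPAM_WORDS for word in words) / len(words)


def looks_like_download_spam(text: str, download_spam_threshold: float = 0.004) -> bool:
    """True when the spam word ratio exceeds the threshold."""
    return spam_word_ratio(text) > download_spam_threshold
```

With a threshold of 0.004, a 1,000-word document needs more than four keyword hits to be flagged, so an occasional "file" or "save" in ordinary prose does not trip the check on its own.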
Filtering Workflow
Text Extraction
The filter uses pdftotext to extract text from the first 5 pages:
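A minimal sketch of such an extraction call, assuming the Poppler `pdftotext` binary is on the PATH. The helper names are hypothetical, and the exact flags PdfFilter passes are not shown on this page; `-f` and `-l` are the standard pdftotext options for first and last page, and `-` sends the output to stdout. The 60-second default mirrors the timeout described under Performance Considerations.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, last_page: int = 5) -> List[str]:
    """Build a pdftotext invocation that writes pages 1..last_page to stdout."""
    return ["pdftotext", "-f", "1", "-l", str(last_page), pdf_path, "-"]


def extract_first_pages(pdf_path: str, timeout_seconds: float = 60) -> str:
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the call exceeds timeout_seconds.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        timeout=timeout_seconds,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```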
Performance Considerations
Safe-side Filtering
When text analysis is ambiguous (too short, low alpha ratio), the filter errs on the side of keeping documents to avoid discarding usable ones.
First 5 Pages Only
Language and spam detection only analyze the first 5 pages for performance. This is typically sufficient for accurate classification.
Timeout Protection
Text extraction has a 60-second timeout to prevent hanging on corrupted files.
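One way to get this kind of protection in Python is the `timeout` argument of `subprocess.run`, which kills the child process and raises `subprocess.TimeoutExpired` when the limit is reached. The helper below is a hypothetical sketch of that pattern, not the filter's own code; how PdfFilter reacts to a timeout is not specified here, so the sketch simply returns None and leaves the decision to the caller.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(command: List[str], timeout_seconds: float) -> Optional[str]:
    """Return the command's stdout, or None if it fails or exceeds the timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, timeout=timeout_seconds, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in for a hanging extraction: a child process that sleeps too long.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout_seconds=0.5))
```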
Batch Processing Example
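A minimal sketch of batch filtering, written against any callable with the same shape as filter_out_pdf (a path in, True for "exclude" out). `partition_pdfs` is a hypothetical helper introduced for this example, which keeps it runnable without the library installed.

```python
from typing import Callable, Iterable, List, Tuple


def partition_pdfs(
    pdf_paths: Iterable[str],
    filter_out_pdf: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split paths into (kept, filtered_out) using a filter_out_pdf-style callable."""
    kept: List[str] = []
    filtered_out: List[str] = []
    for path in pdf_paths:
        if filter_out_pdf(path):
            filtered_out.append(path)
        else:
            kept.append(path)
    return kept, filtered_out
```

With a real PdfFilter instance, the bound method can be passed directly as the second argument, for example `partition_pdfs(paths, pdf_filter.filter_out_pdf)`, where `pdf_filter` is an instance constructed by the caller.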
Related APIs
- Silver Data Generation - Uses PdfFilter before processing
- Prompts API - Prompt construction for accepted PDFs