The PdfFilter class provides automated filtering to exclude low-quality or irrelevant PDFs from your dataset. It combines language detection, form detection, and spam filtering to ensure high-quality training and test data.
Filtering is enabled by default in both buildsilver.py and buildtestset.py. Use --no_filter to disable.
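How the flag is wired into the scripts is not shown here. A minimal sketch, assuming an `argparse` parser with a `store_true` option: only the `--no_filter` name comes from the text, while the parser layout and the `filtering_enabled` helper are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --no_filter flag name is taken from the docs.
    parser = argparse.ArgumentParser(description="Illustrative CLI with an opt-out filter flag")
    parser.add_argument(
        "--no_filter",
        action="store_true",
        help="Disable PDF filtering (filtering is on by default)",
    )
    return parser


def filtering_enabled(argv: list[str]) -> bool:
    # Filtering stays on unless --no_filter is passed explicitly.
    args = build_parser().parse_args(argv)
    return not args.no_filter
```

With this shape, omitting the flag leaves filtering on, which matches the default described above.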
    from olmocr.filter import PdfFilter
    from lingua import Language

    # Create filter with default settings
    filter = PdfFilter()

    # Check if a PDF should be filtered out
    if filter.filter_out_pdf("/path/to/document.pdf"):
        print("PDF filtered out")
    else:
        print("PDF is good to use")
Extract first 5 pages of text using pdftotext (filter.py:79-89)
Detect language using Lingua
Filter out if language not in languages_to_keep
    # Read the first five pages of text for language calculation
    pdftotext_result = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "5", local_pdf_path, "-"],
        timeout=60,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if pdftotext_result.returncode != 0:
        return True  # Filter out

    base_text = pdftotext_result.stdout.decode("utf-8")

    # Check minimum text requirements
    if len(base_text) < 200:
        return False  # Keep - not enough text to analyze

    alpha_count = sum(c.isalpha() for c in base_text)
    if alpha_count / len(base_text) < 0.50:
        return False  # Keep - might be OCRed badly

    # Language check
    language = self.language_detector.detect_language_of(base_text)
    if language not in self.languages_to_keep:
        logger.info(f"Filtering out {local_pdf_path} because language was {language}")
        return True  # Filter out
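The two pre-checks in the excerpt above (minimum length and alphabetic ratio) can be pulled out and tried on their own. The following is a minimal sketch, not the library's code: `has_enough_text` is a hypothetical helper name, and the thresholds are copied from the excerpt.

```python
def has_enough_text(text: str, min_chars: int = 200, min_alpha_ratio: float = 0.50) -> bool:
    """Return True when text is long and alphabetic enough for language detection."""
    if len(text) < min_chars:
        return False  # Too short to analyze
    alpha_count = sum(c.isalpha() for c in text)
    # len(text) >= min_chars here, so the division is safe for min_chars > 0.
    return alpha_count / len(text) >= min_alpha_ratio
```

When this returns False, the excerpt keeps the PDF rather than filtering it, since a scanned or badly OCRed document may simply have no usable text layer.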
    # Keep more documents, including possible OCR failures
    filter = PdfFilter(
        languages_to_keep=[Language.ENGLISH, None],
        apply_form_check=False,
        download_spam_threshold=0.01,
    )
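    # Disable all filters (not recommended)
    # The language check is `language not in self.languages_to_keep`, so an empty
    # list would reject every document with enough text. List every language instead.
    filter = PdfFilter(
        languages_to_keep=[*Language.all(), None],  # Keep all languages, plus undetected
        apply_form_check=False,
        apply_download_spam_check=False,
    )

    # Or use the --no_filter flag in scripts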
Filtering is applied during PDF processing (buildsilver.py:25, 114-116):
    pdf_filter = PdfFilter()

    def process_pdf(pdf_path: str, ..., no_filter: bool):
        if (not no_filter) and pdf_filter.filter_out_pdf(local_pdf_path):
            print(f"Skipping {local_pdf_path} due to common filter")
            return []
        # Continue processing...
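The same guard can be applied to a whole batch of paths. Below is a minimal sketch under stated assumptions: `partition_pdfs` and `fake_filter` are hypothetical names, and the predicate is a stand-in for `pdf_filter.filter_out_pdf` so the example runs without olmocr installed.

```python
from typing import Callable, Iterable


def partition_pdfs(
    pdf_paths: Iterable[str],
    filter_out: Callable[[str], bool],
    no_filter: bool = False,
) -> tuple[list[str], list[str]]:
    """Split paths into (kept, skipped) using a filter predicate."""
    kept: list[str] = []
    skipped: list[str] = []
    for path in pdf_paths:
        # The predicate is only consulted when filtering is enabled.
        if (not no_filter) and filter_out(path):
            skipped.append(path)
        else:
            kept.append(path)
    return kept, skipped


# Stand-in predicate for illustration only; a real run would pass
# pdf_filter.filter_out_pdf instead.
def fake_filter(path: str) -> bool:
    return path.endswith("spam.pdf")
```

Passing `no_filter=True` short-circuits the predicate entirely, so every path lands in the kept list.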
The filter is defensive and filters out problematic PDFs (filter.py:66-76):
    try:
        pdf_reader = PdfReader(local_pdf_path)
        if self.apply_form_check and self._is_form(pdf_reader):
            return True  # Filter out
    except Exception as e:
        logger.warning(f"Error reading PDF {local_pdf_path}: {e}")
        return True  # Filter out the PDF if an exception occurs
Corrupt or unreadable PDFs are always filtered out.
    INFO:olmocr.filter.filter:Filtering out document.pdf because it's a form
    INFO:olmocr.filter.filter:Filtering out spam.pdf because of SEO/download spam
    INFO:olmocr.filter.filter:Filtering out french.pdf because language was FRENCH
    INFO:olmocr.filter.filter:Keeping short.pdf on the safe side because not enough text exists
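To see which check removes the most documents, log lines like the ones above can be tallied by reason. This is a rough sketch: the regular expression is an assumption based only on the sample messages shown here, and `tally_filter_reasons` is a hypothetical helper, not part of olmocr.

```python
import re
from collections import Counter

# Matches "Filtering out <file> because <reason>" as in the sample log lines.
# Assumes the file path contains no spaces.
FILTER_LINE = re.compile(r"Filtering out (?P<path>\S+) because (?P<reason>.+)$")


def tally_filter_reasons(log_lines: list[str]) -> Counter:
    """Count how often each filtering reason appears in the given log lines."""
    reasons: Counter = Counter()
    for line in log_lines:
        match = FILTER_LINE.search(line.rstrip())
        if match:
            reasons[match.group("reason").strip()] += 1
    return reasons
```

Lines that report a kept document do not match the pattern and are ignored.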