Overview
The PDF converter uses a hybrid approach combining pdfplumber for structured content and pdfminer.six for plain text, automatically selecting the best method for each page.
Dependencies
Features
Smart Table Detection
Automatically identifies and extracts tables without visible borders
Form Processing
Recognizes form-style layouts and converts to structured tables
Text Extraction
Preserves paragraph structure and text spacing
Hybrid Processing
Uses best extraction method per page
Basic Usage
Conversion Strategies
The PDF converter employs three extraction strategies:
1. Form-Style Content
For documents with column-aligned text (invoices, forms, structured reports):
- Multiple column alignments detected
- At least 20% of rows are table-like
- Words align to consistent X-positions
- Reasonable column density (not too many columns)
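The criteria above can be sketched as a simple predicate. This is only an illustration, not the converter's actual code: the function name looks_form_style, the max_columns cap, and the rule that a row is "table-like" when it spans at least two columns are all assumptions. Only the 20% threshold comes from the list above.

```python
def looks_form_style(row_column_counts, num_columns,
                     max_columns=10, min_table_ratio=0.2):
    """Return True if a page plausibly contains form-style content.

    row_column_counts: for each text row, how many detected columns it occupies.
    num_columns: number of column alignments detected on the page.
    max_columns: hypothetical cap standing in for "reasonable column density".
    """
    if not row_column_counts:
        return False
    # Multiple column alignments must be present, but not too many.
    if num_columns < 2 or num_columns > max_columns:
        return False
    # Assumption: a row is table-like if it spans at least two columns.
    table_like = sum(1 for n in row_column_counts if n >= 2)
    return table_like / len(row_column_counts) >= min_table_ratio
```

For example, a page where 2 of 5 rows span several columns passes the 20% test, while a page where only 1 of 10 rows does is treated as ordinary text.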
2. Borderless Tables
For tables without visible borders:
- 3-10 consistent columns
- At least 3 rows
- Short cell content (not prose paragraphs)
- Words align to column positions
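A minimal sketch of how these checks might be combined, assuming the rows have already been split into cell strings. The name is_borderless_table and the max_cell_len cutoff of 50 characters are hypothetical; the 3-10 column and 3-row limits are taken from the list above. The alignment check (words matching column positions) is assumed to have happened upstream and is not shown.

```python
def is_borderless_table(rows, min_cols=3, max_cols=10,
                        min_rows=3, max_cell_len=50):
    """Check whether candidate rows look like a borderless table.

    rows: list of rows, each a list of cell strings.
    max_cell_len: assumed cutoff for "short cell content".
    """
    if len(rows) < min_rows:
        return False
    # "Consistent columns": every row must have the same number of cells.
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        return False
    n_cols = widths.pop()
    if not (min_cols <= n_cols <= max_cols):
        return False
    # Reject prose paragraphs that happen to align into columns.
    return all(len(cell) <= max_cell_len for row in rows for cell in row)
```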
3. Plain Text
For traditional paragraph-based documents, uses pdfminer.six for superior text spacing and line breaks.
Advanced Features
MasterFormat Support
The PDF converter includes special handling for MasterFormat-style numbered lists. Partial numbering patterns (.1, .2, etc.) are automatically merged with the following text.
Adaptive Column Detection
The converter uses statistical analysis to determine optimal column boundaries:
- Calculates gaps between text positions
- Uses 70th percentile of gaps as clustering threshold
- Adapts to page width and content density
- Prevents false positives in dense text
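The gap-percentile idea can be illustrated with a small, self-contained sketch. The helper names percentile and cluster_column_starts are hypothetical, and linear interpolation is an assumed way of computing the percentile. The sketch covers only the first two bullets; adapting to page width and guarding against dense text are not shown.

```python
def percentile(values, pct):
    """Linear-interpolated percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def cluster_column_starts(x_positions, pct=70):
    """Group word X-positions into columns; return one start per column."""
    xs = sorted(set(x_positions))
    if len(xs) < 2:
        return xs
    # Gaps between neighbouring X-positions.
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    threshold = percentile(gaps, pct)
    columns = [xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        # A gap strictly larger than the threshold starts a new column.
        if cur - prev > threshold:
            columns.append(cur)
    return columns
```

With X-positions clustered around 50, 200 and 400, the small gaps inside each cluster fall below the 70th-percentile threshold, so only the two large jumps open new columns.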
Hybrid Page Processing
The converter tracks extraction success per page:
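One way such per-page bookkeeping could look is sketched below. The function name summarize_page_methods and the labels "form", "plain" and "failed" are assumptions made for illustration, not names taken from the converter.

```python
from collections import Counter


def summarize_page_methods(page_methods):
    """Count how many pages each extraction method handled.

    page_methods: one label per page naming the method that succeeded,
    or None if no method produced content for that page.
    """
    counts = Counter(m if m is not None else "failed" for m in page_methods)
    return dict(counts)
```

A summary like this makes it easy to see whether any page benefited from structured extraction at all, or whether the whole document is effectively plain text.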
Implementation Details
Source Location
Converter Class
- Class Name: PdfConverter
- Accepted Extensions: .pdf
- MIME Types: application/pdf, application/x-pdf
Key Functions
_extract_form_content_from_words(page)
Extracts form-style content by analyzing word positions:
- Groups words by Y-position (rows)
- Identifies column boundaries through X-position clustering
- Distinguishes between table rows and paragraphs
- Returns None if the page is not form-style
- Extract all words with positions
- Group words by Y-coordinate (rows)
- Cluster X-positions to find columns
- Classify rows as table/paragraph/list
- Detect table regions (consecutive table rows)
- Format tables with proper column alignment
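Two of the steps above, grouping words into rows and finding runs of consecutive table rows, can be sketched as follows. The word dictionaries use the "text", "x0" and "top" keys that pdfplumber's page.extract_words() returns. The function names, the y_tolerance of 3 points and the min_rows default are assumptions, not the converter's actual values.

```python
def group_words_into_rows(words, y_tolerance=3):
    """Group word dicts into rows by their 'top' coordinate.

    words: dicts with 'text', 'x0' and 'top' keys.
    y_tolerance: assumed vertical slack for words on the same line.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Compare against the first word of the current row.
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right.
    return [sorted(row, key=lambda w: w["x0"]) for row in rows]


def find_table_regions(row_kinds, min_rows=2):
    """Return (start, end) pairs, end exclusive, for runs of 'table' rows."""
    regions, start = [], None
    # The sentinel closes a run that reaches the last row.
    for i, kind in enumerate(list(row_kinds) + ["end"]):
        if kind == "table":
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_rows:
                regions.append((start, i))
            start = None
    return regions
```

Row classification itself (table, paragraph or list) is left out here; find_table_regions simply consumes whatever labels that step produces.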
_extract_tables_from_words(page)
Extracts borderless tables:
- Identifies column starts across all rows
- Requires 3-10 columns
- Validates cell content length (not long prose)
- Returns empty list if no tables found
_merge_partial_numbering_lines(text)
Post-processes text to merge MasterFormat-style numbering:
- Matches pattern: ^\.\d+$
- Merges with following non-empty line
- Preserves other content unchanged
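The merge behaviour described above can be approximated with the sketch below. It is an illustrative reimplementation named merge_partial_numbering, not the converter's own _merge_partial_numbering_lines. Joining with a single space and dropping blank lines between the number and its text are assumptions.

```python
import re

# A line consisting only of a dot followed by digits, e.g. ".1" or ".12".
PARTIAL_NUMBER = re.compile(r"^\.\d+$")


def merge_partial_numbering(text):
    """Merge lines like '.1' with the next non-empty line."""
    lines = text.split("\n")
    merged = []
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if PARTIAL_NUMBER.match(stripped):
            # Look ahead for the next non-empty line.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                merged.append(f"{stripped} {lines[j].strip()}")
                i = j + 1
                continue
        # Anything else (including a trailing '.3' with no text) is kept as-is.
        merged.append(lines[i])
        i += 1
    return "\n".join(merged)


print(merge_partial_numbering("Section\n.1\nGeneral requirements\n.2\n\nProducts"))
```

Lines such as "1.5 text" do not match the anchored pattern and pass through unchanged.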
Dependencies Used
Examples
Converting a Research Paper
Converting an Invoice
Converting a Form
Error Handling
- Try form-style extraction with pdfplumber
- Fall back to pdfplumber basic extraction
- Fall back to pdfminer.six if pdfplumber fails
- Return empty string if all methods fail
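The fallback order above amounts to trying extractors in sequence until one yields text. A minimal sketch, assuming each extractor is a callable that takes a page and returns a string or None; the name extract_with_fallbacks and the (name, callable) pairing are hypothetical.

```python
def extract_with_fallbacks(page, extractors):
    """Try each (name, extractor) pair in order.

    Returns (text, name_of_method_used), or ("", None) if every method
    fails or returns nothing.
    """
    for name, extract in extractors:
        try:
            text = extract(page)
        except Exception:
            # A failing extractor should not abort the conversion.
            continue
        if text:
            return text, name
    return "", None
```

Returning the method name alongside the text lets the caller record which strategy succeeded for each page.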
Performance Considerations
- Memory: Loads entire PDF into memory (BytesIO)
- Speed: Borderless table detection is computationally intensive
- Large Files: May be slow on PDFs with many pages
- For pure text documents, consider pre-converting with pdfminer
- For known table-heavy documents, the form detection provides best results
- Processing time scales linearly with page count
Limitations
Next Steps
Images with OCR
Use the image converter for scanned PDFs
Document Intelligence
Use Azure Document Intelligence for advanced PDF processing