MarkItDown provides sophisticated PDF conversion with automatic table detection, form-style document handling, and fallback mechanisms for reliable text extraction.

Overview

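The PDF converter uses a hybrid approach: pdfplumber handles structured content (forms and tables), and pdfminer.six handles plain text. Each page is first tried as form-style content; if most pages turn out to be plain text, the whole document is re-extracted with pdfminer.six.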

Dependencies

pip install pdfminer.six pdfplumber
Or install with the pdf extra (quoted so that shells such as zsh do not expand the brackets):
pip install 'markitdown[pdf]'

Features

Smart Table Detection

Automatically identifies and extracts tables without visible borders

Form Processing

Recognizes form-style layouts and converts to structured tables

Text Extraction

Preserves paragraph structure and text spacing

Hybrid Processing

Uses best extraction method per page

Basic Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.markdown)

Conversion Strategies

The PDF converter employs three extraction strategies:

1. Form-Style Content

For documents with column-aligned text (invoices, forms, structured reports):
| Field Name          | Value             |
|---------------------|-------------------|
| Customer Name       | John Doe          |
| Invoice Number      | INV-2024-001      |
| Date                | 2024-02-15        |
| Total Amount        | $1,234.56         |
Detection Criteria:
  • Multiple column alignments detected
  • At least 20% of rows are table-like
  • Words align to consistent X-positions
  • Reasonable column density (not too many columns)
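The criteria above can be sketched as a simple ratio check. This is an illustrative sketch only, not the converter's code: the function name `looks_form_style`, the 5-unit alignment tolerance, and the rule that a row is table-like when its words hit at least two columns are all assumptions. Only the 20% threshold comes from the list above.

```python
def looks_form_style(rows, column_starts, tolerance=5.0, min_table_fraction=0.2):
    """Return True when enough rows align to the candidate columns.

    rows: list of rows, each a list of word X-positions.
    column_starts: candidate column X-positions.
    All names and the tolerance value are hypothetical.
    """
    if not rows:
        return False

    def is_table_like(row):
        # Assumption: a row is table-like when its words align
        # to at least two distinct column positions.
        hit = {c for c in column_starts for x in row if abs(x - c) <= tolerance}
        return len(hit) >= 2

    table_rows = sum(1 for row in rows if is_table_like(row))
    # The 20% threshold is the one stated in the detection criteria.
    return table_rows / len(rows) >= min_table_fraction


# Two of five rows have words in both columns (40% >= 20%).
rows = [[50, 300], [50, 300], [50], [50], [50]]
print(looks_form_style(rows, column_starts=[50, 300]))
```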

2. Borderless Tables

For tables without visible borders:
| Product | Quantity | Price |
|---------|----------|-------|
| Widget  | 10       | $5.00 |
| Gadget  | 5        | $10.00|
Detection Criteria:
  • 3-10 consistent columns
  • At least 3 rows
  • Short cell content (not prose paragraphs)
  • Words align to column positions
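As a rough illustration of these criteria, the check below validates a candidate table that has already been split into cells. It is a sketch under stated assumptions: `is_plausible_table` is a hypothetical helper, and the 40-character cell limit is an invented stand-in for "short cell content". The column and row limits are the ones listed above.

```python
def is_plausible_table(rows, min_cols=3, max_cols=10, min_rows=3, max_cell_chars=40):
    """Check a candidate table (list of rows, each a list of cell strings).

    max_cell_chars is an assumed value, not taken from the converter.
    """
    if len(rows) < min_rows:
        return False
    col_count = len(rows[0])
    if not (min_cols <= col_count <= max_cols):
        return False
    for row in rows:
        # Every row must have the same number of columns.
        if len(row) != col_count:
            return False
        # Long cells suggest prose paragraphs rather than table data.
        if any(len(cell) > max_cell_chars for cell in row):
            return False
    return True


candidate = [
    ["Product", "Quantity", "Price"],
    ["Widget", "10", "$5.00"],
    ["Gadget", "5", "$10.00"],
]
print(is_plausible_table(candidate))
```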

3. Plain Text

For traditional paragraph-based documents, the converter uses pdfminer.six, which gives better text spacing and line breaks.

Advanced Features

The PDF converter includes special handling for MasterFormat-style numbered lists:
Input PDF:
.1
The intent of this Request for Proposal...
.2  
Available information relative to...
Output Markdown:
.1 The intent of this Request for Proposal...
.2 Available information relative to...
Partial numbering patterns (.1, .2, etc.) are automatically merged with following text.

The converter uses statistical analysis to determine optimal column boundaries:
  • Calculates gaps between text positions
  • Uses 70th percentile of gaps as clustering threshold
  • Adapts to page width and content density
  • Prevents false positives in dense text
Tolerance Range: 25-50 pixels (adaptive)
Max Columns: Scales with page width (standard: 15-20)
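One way to picture the gap-based clustering is the sketch below. It assumes the 70th-percentile gap is used directly as the clustering tolerance and is clamped to the 25-50 range; the source states both numbers but not how they are combined, so treat that combination, the nearest-rank percentile, and the function names as assumptions.

```python
import math


def adaptive_tolerance(x_positions, percentile=70, low=25.0, high=50.0):
    """Derive a clustering tolerance from the gaps between X-positions."""
    xs = sorted(set(x_positions))
    gaps = sorted(b - a for a, b in zip(xs, xs[1:]))
    if not gaps:
        return low
    # Nearest-rank percentile of the gaps (an assumed percentile method).
    rank = max(1, math.ceil(percentile / 100 * len(gaps)))
    # Clamp to the documented 25-50 tolerance range.
    return min(high, max(low, gaps[rank - 1]))


def cluster_columns(x_positions, tolerance):
    """Group X-positions whose neighbours are within `tolerance`;
    return the left-most position of each group as the column start."""
    clusters = []
    for x in sorted(x_positions):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [cluster[0] for cluster in clusters]


xs = [50, 52, 51, 200, 203, 400, 401]
tol = adaptive_tolerance(xs)
print(tol, cluster_columns(xs, tol))
```

With these sample positions the small gaps inside each column fall below the tolerance while the large gaps between columns exceed it, so three column starts come out.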
The converter counts form-style and plain-text pages as it goes, then decides which extractor produces the final output:
# Pseudo-code of the logic
plain_pages = 0
form_pages = 0

for page in pdf.pages:
    form_content = extract_form_content(page)

    if form_content is None:
        # Not a form-style page: use pdfplumber basic extraction
        plain_pages += 1
        text = page.extract_text()
    else:
        # Successfully extracted as form/table
        form_pages += 1
        text = form_content

# If most pages are plain text, re-extract the whole document with pdfminer
if plain_pages > form_pages:
    use_pdfminer_for_all()

Implementation Details

Source Location

packages/markitdown/src/markitdown/converters/_pdf_converter.py

Converter Class

  • Class Name: PdfConverter
  • Accepted Extensions: .pdf
  • MIME Types: application/pdf, application/x-pdf

Key Functions

_extract_form_content_from_words(page)

Extracts form-style content by analyzing word positions:
  • Groups words by Y-position (rows)
  • Identifies column boundaries through X-position clustering
  • Distinguishes between table rows and paragraphs
  • Returns None if page is not form-style
Algorithm:
  1. Extract all words with positions
  2. Group words by Y-coordinate (rows)
  3. Cluster X-positions to find columns
  4. Classify rows as table/paragraph/list
  5. Detect table regions (consecutive table rows)
  6. Format tables with proper column alignment
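Steps 2, 3 (column assignment), and 6 can be sketched in plain Python. The word dictionaries use the `text`, `x0`, and `top` keys that pdfplumber's `extract_words` returns; the helper names, the 3-unit row tolerance, and the rule of assigning each word to the nearest column start on its left are illustrative assumptions, not the converter's implementation.

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Group word dicts into rows by their `top` (Y) coordinate."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order words left to right within each row.
    return [sorted(row, key=lambda w: w["x0"]) for row in rows]


def assign_to_columns(row, column_starts):
    """Place each word in the right-most column that starts at or before it."""
    cells = [""] * len(column_starts)
    for word in row:
        idx = max(
            (i for i, start in enumerate(column_starts) if start <= word["x0"] + 1),
            default=0,
        )
        cells[idx] = (cells[idx] + " " + word["text"]).strip()
    return cells


def to_markdown_table(rows):
    """Render rows of cell strings as a Markdown table (first row is the header)."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


words = [
    {"text": "Customer", "x0": 50, "top": 100},
    {"text": "Name", "x0": 95, "top": 100.5},
    {"text": "John", "x0": 300, "top": 100},
    {"text": "Doe", "x0": 325, "top": 101},
    {"text": "Date", "x0": 50, "top": 120},
    {"text": "2024-02-15", "x0": 300, "top": 120},
]
cells = [assign_to_columns(r, [50, 300]) for r in group_words_into_rows(words)]
print(to_markdown_table([["Field Name", "Value"]] + cells))
```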

_extract_tables_from_words(page)

Extracts borderless tables:
  • Identifies column starts across all rows
  • Requires 3-10 columns
  • Validates cell content length (not long prose)
  • Returns empty list if no tables found

_merge_partial_numbering_lines(text)

Post-processes text to merge MasterFormat-style numbering:
  • Matches pattern: ^\.\d+$
  • Merges with following non-empty line
  • Preserves other content unchanged
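A minimal sketch of this post-processing step, using the `^\.\d+$` pattern listed above. The function below is a hypothetical stand-in written for illustration; details such as skipping blank lines before the merge target and leaving a trailing marker untouched are assumptions.

```python
import re

# Matches a line consisting only of a partial number such as ".1" or ".12".
PARTIAL_NUMBER = re.compile(r"^\.\d+$")


def merge_partial_numbering(text):
    """Join lines like '.1' with the next non-empty line."""
    lines = text.split("\n")
    merged = []
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if PARTIAL_NUMBER.match(stripped):
            # Look ahead for the next non-empty line to merge with.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                merged.append(f"{stripped} {lines[j].strip()}")
                i = j + 1
                continue
        # Everything else is kept unchanged.
        merged.append(lines[i])
        i += 1
    return "\n".join(merged)


sample = ".1\nThe intent of this Request for Proposal...\n.2  \nAvailable information relative to..."
print(merge_partial_numbering(sample))
```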

Dependencies Used

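import io
import pdfplumber

# pdf_bytes holds the whole PDF in memory (see Performance Considerations)
with open("document.pdf", "rb") as f:
    pdf_bytes = io.BytesIO(f.read())

with pdfplumber.open(pdf_bytes) as pdf:
    for page in pdf.pages:
        # Word positions drive row and column detection
        words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
        text = page.extract_text()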

Examples

Converting a Research Paper

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")

# Output will use pdfminer for clean paragraph extraction
print(result.markdown)

Converting an Invoice

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("invoice.pdf")

# Output will detect form structure and create tables:
# | Item | Quantity | Price | Total |
# |------|----------|-------|-------|
# | ... | ... | ... | ... |
print(result.markdown)

Converting a Form

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("application_form.pdf")

# Output preserves field/value structure:
# | Field Name | Value |
# |------------|-------|
# | Name | John Doe |
# | Address | 123 Main St |
print(result.markdown)

Error Handling

from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown()
try:
    result = md.convert("document.pdf")
except MissingDependencyException:
    print("Install PDF dependencies: pip install markitdown[pdf]")
except Exception as e:
    print(f"Conversion error: {e}")
The converter includes multiple fallback layers:
  1. Try form-style extraction with pdfplumber
  2. Fall back to pdfplumber basic extraction
  3. Fall back to pdfminer.six if pdfplumber fails
  4. Return empty string if all methods fail
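The layering above follows a common "try each extractor in order" pattern, sketched here with stand-in callables. The extractor names and the `extract_with_fallbacks` helper are hypothetical; the real converter's control flow may differ, for example in how it treats exceptions versus empty results.

```python
def extract_with_fallbacks(source, extractors):
    """Try each (name, callable) pair in order and return the first non-empty text.

    Returns ("none", "") when every extractor fails or yields nothing.
    """
    for name, extract in extractors:
        try:
            text = extract(source)
        except Exception:
            # This layer failed; move on to the next one.
            continue
        if text:
            return name, text
    return "none", ""


# Stand-in extractors that simulate each layer's outcome.
def form_style(source):
    raise ValueError("simulated pdfplumber failure")


def basic_text(source):
    return None  # simulated: nothing extracted


def pdfminer_text(source):
    return "plain text from " + source


layers = [("form", form_style), ("basic", basic_text), ("pdfminer", pdfminer_text)]
print(extract_with_fallbacks("document.pdf", layers))
```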

Performance Considerations

  • Memory: Loads entire PDF into memory (BytesIO)
  • Speed: Borderless table detection is computationally intensive
  • Large Files: May be slow on PDFs with many pages; processing time scales linearly with page count
Optimization Tips:
  • For pure text documents, consider extracting the text with pdfminer.six directly
  • For known table-heavy documents, the form detection provides best results

Limitations

  • No OCR: Cannot extract text from scanned/image PDFs (use image converter with LLM instead)
  • No Image Extraction: Images are not extracted or converted
  • No Metadata: PDF metadata (author, title, etc.) is not extracted
  • Complex Layouts: Multi-column newspaper-style layouts may not be perfectly preserved

Next Steps

Images with OCR

Use the image converter for scanned PDFs

Document Intelligence

Use Azure Document Intelligence for advanced PDF processing
