PDF Documents

MarkItDown converts PDF files to Markdown with automatic table detection, form-style document handling, and fallback mechanisms for reliable text extraction.

Overview

The PDF converter uses a hybrid approach combining pdfplumber for structured content and pdfminer.six for plain text, automatically selecting the best method for each page.

Dependencies

pip install pdfminer.six pdfplumber

Or install with the pdf extras:

pip install markitdown[pdf]

Features

Smart Table Detection

Automatically identifies and extracts tables without visible borders

Form Processing

Recognizes form-style layouts and converts to structured tables

Text Extraction

Preserves paragraph structure and text spacing

Hybrid Processing

Uses the best extraction method for each page

Basic Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.markdown)

Conversion Strategies

The PDF converter employs three extraction strategies:

1. Form-Style Content

For documents with column-aligned text (invoices, forms, structured reports):

| Field Name          | Value             |
|---------------------|-------------------|
| Customer Name       | John Doe          |
| Invoice Number      | INV-2024-001      |
| Date                | 2024-02-15        |
| Total Amount        | $1,234.56         |

Detection Criteria:

Multiple column alignments detected
At least 20% of rows are table-like
Words align to consistent X-positions
Reasonable column density (not too many columns)
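The 20% criterion above can be illustrated with a minimal sketch. It assumes each row has already been reduced to the column indices its words fall into, and it treats a row as table-like when it occupies two or more columns. The function name `is_form_style`, that row representation, and the two-column rule are hypothetical choices for illustration, not the converter's actual code.

```python
def is_form_style(rows, min_table_ratio=0.2):
    """Decide whether a page looks form-style.

    rows: one list of column indices per text row (hypothetical representation).
    min_table_ratio: minimum share of table-like rows (20%, from the criteria above).
    """
    if not rows:
        return False
    # Assumption: a row is table-like when its words land in 2+ distinct columns.
    table_like = sum(1 for cols in rows if len(set(cols)) >= 2)
    return table_like / len(rows) >= min_table_ratio


# Two of five rows span several columns, so the page passes the 20% threshold.
print(is_form_style([[0, 1], [0, 2], [0], [0], [0]]))
```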

2. Borderless Tables

For tables without visible borders:

| Product | Quantity | Price |
|---------|----------|-------|
| Widget  | 10       | $5.00 |
| Gadget  | 5        | $10.00|

Detection Criteria:

3-10 consistent columns
At least 3 rows
Short cell content (not prose paragraphs)
Words align to column positions
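As a rough illustration of these criteria, the check below validates a candidate table that has already been split into cells. The function name and the 40-character cell limit are assumptions made for this sketch; the source only says that cells must be short.

```python
def looks_like_borderless_table(rows, min_cols=3, max_cols=10,
                                min_rows=3, max_cell_chars=40):
    """rows: list of rows, each a list of cell strings."""
    if len(rows) < min_rows:
        return False
    # Every row must have the same number of columns ("consistent columns").
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        return False
    n_cols = widths.pop()
    if not (min_cols <= n_cols <= max_cols):
        return False
    # Reject prose: every cell must be short (the limit is an assumed value).
    return all(len(cell) <= max_cell_chars for row in rows for cell in row)


candidate = [
    ["Product", "Quantity", "Price"],
    ["Widget", "10", "$5.00"],
    ["Gadget", "5", "$10.00"],
]
print(looks_like_borderless_table(candidate))
```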

3. Plain Text

For traditional paragraph-based documents, uses pdfminer.six for superior text spacing and line breaks.

Advanced Features

MasterFormat Support

The PDF converter includes special handling for MasterFormat-style numbered lists.

Input PDF:

.1
The intent of this Request for Proposal...
.2  
Available information relative to...

Output Markdown:

.1 The intent of this Request for Proposal...
.2 Available information relative to...

Partial numbering patterns (.1, .2, etc.) are automatically merged with following text.

Adaptive Column Detection

The converter uses statistical analysis to determine optimal column boundaries:

Calculates gaps between text positions
Uses 70th percentile of gaps as clustering threshold
Adapts to page width and content density
Prevents false positives in dense text

Tolerance Range: 25-50 pixels (adaptive)
Max Columns: Scales with page width (standard: 15-20)
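A minimal sketch of the percentile-based threshold might look like the following. It assumes the 25-50 tolerance range acts as a clamp on the 70th-percentile gap, and it uses simple left-to-right clustering. Both function names and the clustering rule are hypothetical; the converter's real logic may differ.

```python
import statistics


def adaptive_tolerance(x_positions, lo=25.0, hi=50.0):
    """Return the 70th percentile of gaps between X-positions, clamped to [lo, hi]."""
    xs = sorted(set(x_positions))
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    if len(gaps) < 2:
        return lo  # too little data to estimate a percentile
    # quantiles(n=10) yields the nine deciles; index 6 is the 70th percentile.
    p70 = statistics.quantiles(gaps, n=10, method="inclusive")[6]
    return max(lo, min(hi, p70))


def cluster_columns(x_positions, tolerance):
    """Group sorted X-positions; a gap larger than tolerance starts a new column."""
    clusters = []
    for x in sorted(x_positions):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # Represent each column by its left-most X-position.
    return [cluster[0] for cluster in clusters]


xs = [50, 52, 51, 200, 203, 350, 349]
tol = adaptive_tolerance(xs)
print(tol, cluster_columns(xs, tol))
```

Clamping keeps a few very wide gaps from merging distinct columns, and keeps very small gaps in dense text from splitting one column into many.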

Hybrid Page Processing

The converter tracks extraction success per page:

# Pseudo-code of the logic
plain_pages = 0
form_pages = 0

for page in pdf.pages:
    form_content = extract_form_content(page)

    if form_content is None:
        # Not a form-style page: use pdfplumber's basic extraction
        plain_pages += 1
        text = page.extract_text()
    else:
        # Successfully extracted as form/table
        form_pages += 1
        text = form_content

# If most pages are plain text, re-extract the whole document with pdfminer
if plain_pages > form_pages:
    use_pdfminer_for_all()

Implementation Details

Source Location

packages/markitdown/src/markitdown/converters/_pdf_converter.py

Converter Class

Class Name: PdfConverter
Accepted Extensions: .pdf
MIME Types: application/pdf, application/x-pdf

Key Functions

`_extract_form_content_from_words(page)`

Extracts form-style content by analyzing word positions:

Groups words by Y-position (rows)
Identifies column boundaries through X-position clustering
Distinguishes between table rows and paragraphs
Returns None if page is not form-style

Algorithm:

Extract all words with positions
Group words by Y-coordinate (rows)
Cluster X-positions to find columns
Classify rows as table/paragraph/list
Detect table regions (consecutive table rows)
Format tables with proper column alignment
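The first steps of this algorithm can be sketched as follows. The word dictionaries use the `text`, `x0`, and `top` keys that pdfplumber's `extract_words` returns. The helper names, the 3-point row tolerance, and the nearest-column-to-the-left rule are assumptions for illustration only.

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Group words whose 'top' is within y_tolerance of the row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right.
    return [sorted(row, key=lambda w: w["x0"]) for row in rows]


def assign_to_columns(row, column_starts):
    """Place each word in the right-most column that starts at or before it."""
    cells = [""] * len(column_starts)
    for word in row:
        idx = max(
            (i for i, start in enumerate(column_starts) if start <= word["x0"] + 1),
            default=0,
        )
        cells[idx] = (cells[idx] + " " + word["text"]).strip()
    return cells


words = [
    {"text": "Customer", "x0": 50, "top": 100.0},
    {"text": "Name", "x0": 95, "top": 100.4},
    {"text": "John", "x0": 200, "top": 101.0},
    {"text": "Date", "x0": 50, "top": 120.0},
    {"text": "2024-02-15", "x0": 200, "top": 120.5},
]
for row in group_words_into_rows(words):
    print("| " + " | ".join(assign_to_columns(row, [50, 200])) + " |")
```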

`_extract_tables_from_words(page)`

Extracts borderless tables:

Identifies column starts across all rows
Requires 3-10 columns
Validates cell content length (not long prose)
Returns empty list if no tables found

`_merge_partial_numbering_lines(text)`

Post-processes text to merge MasterFormat-style numbering:

Matches pattern: ^\.\d+$
Merges with following non-empty line
Preserves other content unchanged
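The merge step can be approximated with the sketch below. It applies the `^\.\d+$` pattern to each stripped line and joins a match with the next non-empty line. The function name `merge_partial_numbering` is hypothetical, and the handling of edge cases (such as a marker on the last line) is an assumption rather than the converter's documented behavior.

```python
import re

PARTIAL_NUMBER = re.compile(r"^\.\d+$")


def merge_partial_numbering(text):
    """Join lines like '.1' with the next non-empty line; leave others unchanged."""
    lines = text.split("\n")
    merged = []
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if PARTIAL_NUMBER.match(stripped):
            # Skip blank lines to find the text that belongs to this number.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                merged.append(f"{stripped} {lines[j].strip()}")
                i = j + 1
                continue
        merged.append(lines[i])
        i += 1
    return "\n".join(merged)


sample = ".1\nThe intent of this Request for Proposal...\n.2  \nAvailable information relative to..."
print(merge_partial_numbering(sample))
```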

Dependencies Used

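import io

import pdfplumber

# pdfplumber.open accepts a path or a binary file-like object
with open("document.pdf", "rb") as f:
    pdf_bytes = io.BytesIO(f.read())

with pdfplumber.open(pdf_bytes) as pdf:
    for page in pdf.pages:
        # Word-level positions drive form and table detection
        words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
        # Basic text extraction, used for non-form pages
        text = page.extract_text()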

Examples

Converting a Research Paper

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")

# Output will use pdfminer for clean paragraph extraction
print(result.markdown)

Converting an Invoice

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("invoice.pdf")

# Output will detect form structure and create tables:
# | Item | Quantity | Price | Total |
# |------|----------|-------|-------|
# | ... | ... | ... | ... |
print(result.markdown)

Converting a Form

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("application_form.pdf")

# Output preserves field/value structure:
# | Field Name | Value |
# |------------|-------|
# | Name | John Doe |
# | Address | 123 Main St |
print(result.markdown)

Error Handling

from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown()
try:
    result = md.convert("document.pdf")
except MissingDependencyException:
    print("Install PDF dependencies: pip install markitdown[pdf]")
except Exception as e:
    print(f"Conversion error: {e}")

The converter includes multiple fallback layers:

Try form-style extraction with pdfplumber
Fall back to pdfplumber basic extraction
Fall back to pdfminer.six if pdfplumber fails
Return empty string if all methods fail
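The layered fallback can be pictured as a chain of extractor callables tried in order. The sketch below is generic: `extract_with_fallbacks` and the stand-in extractors are hypothetical, and treating an empty result as a failure is an assumption rather than confirmed converter behavior.

```python
def extract_with_fallbacks(source, extractors):
    """Try each extractor in order; return the first non-empty text, else ''."""
    for extract in extractors:
        try:
            text = extract(source)
        except Exception:
            continue  # this layer failed, so move on to the next one
        if text and text.strip():
            return text
    return ""


# Stand-ins for the real extraction layers (illustrative only).
def form_extractor(source):
    raise ValueError("not a form-style document")


def basic_extractor(source):
    return ""  # simulates a layer that finds no text


def plain_text_extractor(source):
    return f"text from {source}"


print(extract_with_fallbacks(
    "document.pdf", [form_extractor, basic_extractor, plain_text_extractor]
))
```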

Performance Considerations

Memory: Loads entire PDF into memory (BytesIO)
Speed: Borderless table detection is computationally intensive
Large Files: May be slow on PDFs with many pages

Optimization Tips:

For text-only documents, consider extracting the text directly with pdfminer.six, which skips table detection
For known table-heavy documents, the form detection provides best results
Processing time scales linearly with page count

Limitations

No OCR: Cannot extract text from scanned/image PDFs (use image converter with LLM instead)
No Image Extraction: Images are not extracted or converted
No Metadata: PDF metadata (author, title, etc.) is not extracted
Complex Layouts: Multi-column newspaper-style layouts may not be perfectly preserved
