pdftotext command-line tool.
Quick Start
Poppler Parser (Default)
The default PDF parser uses Poppler’s pdftotext command-line utility. It’s fast, reliable, and handles most PDFs well.
Installation
- macOS
- Ubuntu/Debian
- Fedora
- Docker
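The install commands for each platform are not shown here, so the following is a minimal sketch that prints the usual one for each. The package names (`poppler` on Homebrew, `poppler-utils` on Debian-based and Fedora systems) are assumptions based on common packaging, the Docker line assumes a Debian-based image, and the helper name `poppler_install_cmd` is hypothetical.

```shell
# Print the usual Poppler install command for a platform.
# Package names are assumptions based on common distro packaging.
poppler_install_cmd() {
  case "$1" in
    macos)         echo "brew install poppler" ;;
    ubuntu|debian) echo "sudo apt-get install -y poppler-utils" ;;
    fedora)        echo "sudo dnf install -y poppler-utils" ;;
    # Dockerfile instruction, assuming a Debian-based base image.
    docker)        echo "RUN apt-get update && apt-get install -y poppler-utils" ;;
    *)             echo "unknown platform: $1" >&2; return 1 ;;
  esac
}
```

For example, `poppler_install_cmd fedora` prints the dnf command, which you can then run yourself.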
Verify Installation
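One way to verify the installation is to check that `pdftotext` is on your `PATH` and print its version banner. The function name `check_pdftotext` is hypothetical; this is a sketch, not the check Arcana itself performs.

```shell
# Return 0 and print the version banner if pdftotext is on PATH, else return 1.
check_pdftotext() {
  if command -v pdftotext >/dev/null 2>&1; then
    # pdftotext writes its version banner to stderr, so merge it into stdout.
    pdftotext -v 2>&1 | head -n 1
  else
    echo "pdftotext not found on PATH" >&2
    return 1
  fi
}
```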
Configuration
Options
| Option | Default | Description |
|---|---|---|
| :layout | true | Preserve original text layout (columns, spacing) |
Usage
When to Use Layout Preservation
- Layout: true
- Layout: false
Layout: true is best for:
- Academic papers
- Multi-column documents
- Tables and structured data
- Documents where spatial layout matters
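To see what layout preservation changes, you can call `pdftotext` directly. The sketch below assumes the `:layout` option corresponds to pdftotext's `-layout` flag, which this page does not confirm; the function name `extract_text` is hypothetical.

```shell
# Extract text from a PDF to stdout.
# $1: path to the PDF
# $2: "true" to preserve layout (default), anything else to disable it
extract_text() {
  pdf=$1
  layout=${2:-true}
  if [ "$layout" = "true" ]; then
    # -layout keeps the physical arrangement of the page (columns, spacing).
    pdftotext -layout "$pdf" -
  else
    # Without -layout, pdftotext undoes the layout and emits reading order.
    pdftotext "$pdf" -
  fi
}
```

Running both variants on the same multi-column paper and comparing the output is a quick way to decide which setting fits your documents.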
Custom PDF Parser
Implement the Arcana.FileParser.PDF behaviour for alternative PDF parsing solutions.
Implementation Examples
- Apache PDFBox
- Cloud API
- Python PyPDF2
- Rustler NIF
Using PDFBox via Java/Rustler:
Configuration:
Binary Content Support
Some parsers can process PDF binary content directly (useful for file uploads):
Parser Configuration
Global Configuration
Per-Call Override
Advanced Parsing
OCR for Scanned PDFs
Handle scanned PDFs with OCR:
Extracting Tables
Extract structured data from PDF tables:
Metadata Extraction
Extract PDF metadata along with text:
Testing PDF Parsers
Error Handling
Best Practices
- Start with Poppler - Works well for most PDFs
- Add OCR for scanned docs - Fallback when text extraction fails
- Set timeouts - Large PDFs can take time to parse
- Validate input - Check file exists and is actually a PDF
- Handle errors gracefully - Log failures but don’t crash
- Test with real PDFs - Different PDF generators produce different output
- Consider layout preservation - Enable for multi-column, disable for books
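The "validate input" and "handle errors gracefully" practices can be sketched at the shell level as follows. The function names `is_pdf` and `parse_pdf` are hypothetical, and the check relies on a well-formed PDF starting with the signature `%PDF-`.

```shell
# Return 0 if $1 is a readable regular file that starts with the PDF signature.
is_pdf() {
  [ -f "$1" ] && [ -r "$1" ] || return 1
  # A well-formed PDF begins with the five bytes "%PDF-".
  [ "$(head -c 5 "$1")" = "%PDF-" ]
}

# Extract text, logging failures to stderr instead of aborting the caller.
parse_pdf() {
  if ! is_pdf "$1"; then
    echo "skipping $1: not a readable PDF" >&2
    return 1
  fi
  pdftotext -layout "$1" - || { echo "pdftotext failed on $1" >&2; return 1; }
}
```

A batch job can call `parse_pdf` in a loop and keep going when a single file fails, because every failure is reported through the return status rather than a crash.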
Troubleshooting
pdftotext not found
Install Poppler for your platform (see Installation), then verify that pdftotext is available (see Verify Installation).
Extracted text is garbled
Try disabling layout preservation, or use a different parser for problematic PDFs.
Scanned PDFs return empty text
Use OCR to extract the text. This requires Tesseract.
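A minimal OCR fallback can be sketched with Poppler's `pdftoppm` and the `tesseract` CLI. Tesseract is usually installed with `brew install tesseract` on macOS or `sudo apt-get install -y tesseract-ocr` on Debian/Ubuntu (package names are assumptions). The function names `ocr_pdf` and `extract_or_ocr`, the 300 DPI setting, and the "whitespace only means scanned" heuristic are illustrative choices, not Arcana's implementation.

```shell
# OCR a scanned PDF: rasterize each page, then run Tesseract on every image.
# $1: path to the PDF; $2: Tesseract language code (default: eng)
ocr_pdf() {
  pdf=$1
  lang=${2:-eng}
  workdir=$(mktemp -d) || return 1
  # pdftoppm ships with Poppler; 300 DPI is a common choice for OCR input.
  if pdftoppm -r 300 -png "$pdf" "$workdir/page"; then
    for img in "$workdir"/page-*.png; do
      [ -e "$img" ] || continue   # no pages were rendered
      # "stdout" as the output base makes tesseract print the recognized text.
      tesseract "$img" stdout -l "$lang" 2>/dev/null
    done
    status=0
  else
    status=1
  fi
  rm -rf "$workdir"
  return "$status"
}

# Try plain text extraction first; fall back to OCR if nothing but whitespace
# comes back, which is typical for scanned PDFs.
extract_or_ocr() {
  text=$(pdftotext -layout "$1" - 2>/dev/null) || text=""
  if [ -n "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
    printf '%s\n' "$text"
  else
    ocr_pdf "$1"
  fi
}
```

OCR is much slower than direct extraction, so trying `pdftotext` first keeps the common case fast.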
PDF parsing is slow
Implement async parsing with timeouts, or use a faster parser such as a Rust-based solution.
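The timeout half of that advice can be sketched at the command level. This assumes the coreutils `timeout` command is available (on macOS it is typically installed as `gtimeout` through Homebrew's coreutils); the function name `extract_with_timeout` is hypothetical, and running the call asynchronously is left to the calling application.

```shell
# Run pdftotext with a wall-clock limit.
# $1: limit in seconds; $2: path to the PDF
extract_with_timeout() {
  limit=$1
  pdf=$2
  status=0
  timeout "$limit" pdftotext -layout "$pdf" - || status=$?
  if [ "$status" -eq 124 ]; then
    # coreutils timeout exits with 124 when the limit is hit.
    echo "pdftotext timed out after ${limit}s on $pdf" >&2
  fi
  return "$status"
}
```

A caller can treat exit status 124 as "too slow" and route that file to a different parser or a background queue.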
Next Steps
- Chunkers - Configure text splitting after parsing
- LLM Integration - Set up LLMs for question answering