PDF Connector
The PDF connector extracts text from PDF documents. It supports local files, with glob pattern matching, and remote URLs.
Import
Two Variants
The package provides two PDF connectors:
- pdf(pattern) - Glob pattern matching for multiple PDFs
- pdfFile(source) - Single PDF from a file path or URL
PDF Pattern Matching
Ingest multiple PDFs using glob patterns.
Basic Usage
Pattern Examples
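The patterns below are illustrative and the paths are hypothetical. To show how such patterns select files, here is a minimal glob-to-regex sketch. `globToRegExp` is a hypothetical helper that handles only `*`, `**` and `?`; it is not the matcher the connector uses.

```typescript
// Hypothetical helper: converts a small subset of glob syntax to a RegExp.
// Supports `**/` (any number of directories), `*` and `?` (within one segment).
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        if (glob.charAt(i + 2) === "/") {
          source += "(?:.*/)?"; // `**/` matches zero or more directories
          i += 3;
        } else {
          source += ".*"; // bare `**` matches anything, including `/`
          i += 2;
        }
      } else {
        source += "[^/]*"; // `*` stays inside one path segment
        i += 1;
      }
    } else if (ch === "?") {
      source += "[^/]"; // `?` matches one non-separator character
      i += 1;
    } else {
      source += ch.replace(/[.+^${}()|[\]\\]/g, "\\$&"); // escape regex specials
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Illustrative patterns (paths are made up for this example).
const allPdfsUnderDocs = globToRegExp("docs/**/*.pdf");
const topLevelReports = globToRegExp("reports/*.pdf");

console.log(allPdfsUnderDocs.test("docs/guides/setup.pdf")); // true
console.log(topLevelReports.test("reports/2024/q1.pdf")); // false
```

A recursive pattern such as `docs/**/*.pdf` reaches into subdirectories, while `reports/*.pdf` stays at one level, which is usually the safer default for large trees.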
Source ID
Excluded Directories
These directories are automatically excluded:
- **/node_modules/**
- **/.git/**
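The connector applies these exclusions internally, so no configuration is needed. For readers who want to mirror the same rule in their own tooling, a minimal sketch follows. `isExcludedPath` is a hypothetical helper, not part of the package, and it assumes exclusion means "any path segment equals node_modules or .git".

```typescript
// Hypothetical helper: true when a path passes through an excluded directory.
const EXCLUDED_DIRS = new Set(["node_modules", ".git"]);

function isExcludedPath(filePath: string): boolean {
  // Normalize Windows separators, then compare whole segments only,
  // so names like "git-guide.pdf" are not excluded by accident.
  const segments = filePath.replace(/\\/g, "/").split("/");
  return segments.some((segment) => EXCLUDED_DIRS.has(segment));
}

const candidates = [
  "docs/manual.pdf",
  "node_modules/pkg/readme.pdf",
  "project/.git/objects/blob.pdf",
];
console.log(candidates.filter((p) => !isExcludedPath(p))); // [ 'docs/manual.pdf' ]
```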
Single PDF File
Ingest a single PDF from a file path or URL.
Local File
Remote URL
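Because pdfFile(source) accepts either a file path or a URL, the two cases have to be told apart somewhere. How the connector does this is not documented here; the sketch below shows one plausible approach, assuming remote sources always start with http:// or https://. `isRemoteSource` is a hypothetical name.

```typescript
// Hypothetical helper: treats anything starting with http:// or https://
// as a remote source and everything else as a local file path.
function isRemoteSource(source: string): boolean {
  return /^https?:\/\//i.test(source);
}

console.log(isRemoteSource("https://example.com/whitepaper.pdf")); // true
console.log(isRemoteSource("./docs/whitepaper.pdf")); // false
```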
Source ID
Text Extraction
Both connectors use the unpdf library for text extraction.
Merged Pages
Pages are automatically merged into a single text document.
Document Format
Extracted text is ingested as-is.
Examples
Research Papers
User Manual
Remote PDF
Multiple PDFs
Performance
File Validation
Only .pdf files are processed.
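A minimal sketch of an extension check of this kind is shown below. `filterPdfPaths` is a hypothetical helper, and the case-insensitive comparison is an assumption: the connector may or may not accept an uppercase .PDF extension.

```typescript
// Hypothetical helper: keeps only paths whose extension is .pdf.
// Assumption: the comparison is case-insensitive.
function isPdfPath(filePath: string): boolean {
  return filePath.toLowerCase().endsWith(".pdf");
}

function filterPdfPaths(paths: string[]): string[] {
  return paths.filter(isPdfPath);
}

console.log(filterPdfPaths(["a.pdf", "notes.txt", "scan.PDF", "a.pdf.bak"]));
// [ 'a.pdf', 'scan.PDF' ]
```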
Memory Usage
PDFs are loaded into memory for processing.
Network Requests
Remote PDFs are downloaded completely.
Error Handling
Invalid PDFs
HTTP Errors
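The connector's exact error messages are not reproduced here. As a rough illustration of the pattern, the sketch below downloads a remote PDF in full and fails on any non-2xx status. `downloadPdf`, `ensureOk`, the minimal `FetchLike` type and the error message format are all assumptions made for this example.

```typescript
// Minimal response shape, declared locally so the sketch has no dependencies.
interface HttpResponseLike {
  ok: boolean;
  status: number;
  statusText: string;
  arrayBuffer(): Promise<ArrayBuffer>;
}

type FetchLike = (url: string) => Promise<HttpResponseLike>;

// Throws a descriptive error for any non-2xx response.
function ensureOk(
  res: Pick<HttpResponseLike, "ok" | "status" | "statusText">,
  url: string,
): void {
  if (!res.ok) {
    throw new Error(`Failed to download ${url}: ${res.status} ${res.statusText}`);
  }
}

// Downloads the whole body into memory before any parsing starts.
async function downloadPdf(url: string, fetchImpl: FetchLike): Promise<Uint8Array> {
  const res = await fetchImpl(url);
  ensureOk(res, url);
  return new Uint8Array(await res.arrayBuffer());
}
```

In Node 18+ or a browser, the global fetch can be passed as fetchImpl. Wrapping the call in try/catch lets an ingestion run skip one bad URL instead of aborting.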
File Not Found
Document IDs
Pattern Matching
Document IDs are file paths.
Single File
Document ID is the source.
Remote URL
Document ID is the URL.
Extraction Quality
Text extraction quality depends on the PDF.
Good Quality
- Text-based PDFs (searchable)
- Well-structured documents
- Standard fonts
Poor Quality
- Scanned images (requires OCR, not supported)
- Complex layouts
- Heavy graphics
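Scanned PDFs usually yield little or no text, so a cheap check on the extracted text can flag them before ingestion. The sketch below is a heuristic, not a feature of the connector: `looksScanned` is a hypothetical helper, and the threshold of 20 non-whitespace characters per page is an arbitrary assumption to tune per corpus.

```typescript
// Hypothetical heuristic: flags a PDF as probably scanned when the extracted
// text averages fewer than `minCharsPerPage` non-whitespace characters per page.
function looksScanned(text: string, totalPages: number, minCharsPerPage = 20): boolean {
  const visibleChars = text.replace(/\s/g, "").length;
  const pages = Math.max(totalPages, 1); // avoid division by zero
  return visibleChars / pages < minCharsPerPage;
}

console.log(looksScanned("", 12)); // true
console.log(looksScanned("Quarterly revenue grew across all regions.", 1)); // false
```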
No OCR Support
The connector does not perform OCR on scanned PDFs. Only text-based PDFs are supported.
Chunking
PDF text is chunked using the default text splitter.
Best Practices
Validate PDFs
Ensure PDFs are text-based, not scanned images.
Use Specific Patterns
Be specific to avoid processing unnecessary files.
Limitations
No OCR
Scanned PDFs require OCR, which is not supported.
Memory Usage
Large PDFs are loaded entirely into memory.
Layout Preservation
Complex layouts may not extract well. Text order may be incorrect.
Images and Graphics
Images are ignored. Only text is extracted.
Comparison
| Feature | pdf(pattern) | pdfFile(source) |
|---|---|---|
| Multiple files | Yes | No |
| Glob patterns | Yes | No |
| Local files | Yes | Yes |
| Remote URLs | No | Yes |
| Excluded dirs | Yes | No |
| Use case | Batch processing | Single document |
Next Steps
- Local Files: Work with local files
- Linear Connector: Ingest Linear issues
- Ingestion: Learn about ingestion