Overview
This loader uses the LangChain PDFLoader to parse PDF files and extract their text content. It supports both single and multiple file uploads, with options to process PDFs page-by-page or as a single document.
Configuration
The PDF file(s) to load. Supports both file upload and file storage references.
File Type: .pdf
Optional text splitter to chunk the extracted text into smaller pieces. Useful for processing large PDFs with LLMs that have token limits.
Controls how the PDF is processed: either “One document per page” or “One document per file”.
Advanced Parameters
Enable legacy PDF.js build for compatibility with older or problematic PDF files.
Use this option if you encounter errors with certain PDF files that don’t parse correctly with the standard build.
Additional metadata to attach to extracted documents.
Comma-separated list of default metadata keys to exclude from the output.
Example: pdf.version, pdf.info.Creator
Special value: Use * to omit all default metadata and only include your custom metadata.
Output
The PDF loader provides two output types:
- Document
- Text
Returns an array of document objects with metadata and page content.
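As a rough illustration of the Document output, the sketch below models a document object as page content plus metadata. The `PdfDocument` interface and the `documentsToText` helper are hypothetical names for this example, and joining page contents with newlines is an assumption about how a plain-text output could be derived, not a statement of the loader's actual behavior.

```typescript
// Hypothetical shape of one document object: extracted text plus metadata.
interface PdfDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Assumption: a text output can be derived by joining page contents
// with newlines. The real separator may differ.
function documentsToText(docs: PdfDocument[]): string {
  return docs.map((d) => d.pageContent).join("\n");
}

const exampleDocs: PdfDocument[] = [
  { pageContent: "First page text", metadata: { loc: { pageNumber: 1 } } },
  { pageContent: "Second page text", metadata: { loc: { pageNumber: 2 } } },
];

const exampleText = documentsToText(exampleDocs);
```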
Usage Examples
Basic PDF Loading
Configure Processing Mode
Choose between “One document per page” or “One document per file” based on your needs.
With Text Splitting
When using text splitters with PDFs, consider setting the chunk size based on your LLM’s context window and the complexity of your PDF content.
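To make the chunk-size trade-off concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is not the splitter the loader uses; `splitText` and `chunkSizeForTokens` are hypothetical helpers, and the figure of roughly 4 characters per token is a common rule of thumb assumed here for English text.

```typescript
// Split text into chunks of at most `chunkSize` characters, where each chunk
// repeats the last `overlap` characters of the previous one.
function splitText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in [0, chunkSize)");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// Rough sizing heuristic. Assumption: about 4 characters per token.
function chunkSizeForTokens(maxTokens: number, charsPerToken = 4): number {
  return Math.floor(maxTokens * charsPerToken);
}
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed or send to the model.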
Processing Multiple PDFs
You can upload multiple PDF files at once. Each file will be processed according to the selected usage mode:
- Per Page Mode: If you upload 3 PDFs with 5 pages each, you’ll get 15 documents (one per page)
- Per File Mode: If you upload 3 PDFs, you’ll get 3 documents (one per file)
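The document counts above can be reproduced with a small model of the two usage modes. This is a sketch under stated assumptions: `ParsedPdf`, `Doc`, `UsageMode`, and `toDocuments` are hypothetical names, the `source` and `totalPages` metadata keys are illustrative, and the blank-line separator used in per-file mode is a guess.

```typescript
// Hypothetical shapes for illustration; not the loader's actual types.
type UsageMode = "perPage" | "perFile";

interface ParsedPdf {
  fileName: string;
  pages: string[]; // extracted text, one entry per page
}

interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Turn parsed PDFs into documents according to the usage mode.
function toDocuments(files: ParsedPdf[], mode: UsageMode): Doc[] {
  const docs: Doc[] = [];
  for (const file of files) {
    if (mode === "perPage") {
      // One document per page, with a 1-indexed page number.
      file.pages.forEach((text, i) => {
        docs.push({
          pageContent: text,
          metadata: { source: file.fileName, loc: { pageNumber: i + 1 } },
        });
      });
    } else {
      // One document per file: join all page text (separator is an assumption).
      docs.push({
        pageContent: file.pages.join("\n\n"),
        metadata: { source: file.fileName, totalPages: file.pages.length },
      });
    }
  }
  return docs;
}

// Three PDFs with five pages each.
const samplePdfs: ParsedPdf[] = ["a.pdf", "b.pdf", "c.pdf"].map((fileName) => ({
  fileName,
  pages: ["1", "2", "3", "4", "5"].map((n) => `${fileName} page ${n}`),
}));

const perPageCount = toDocuments(samplePdfs, "perPage").length; // 15
const perFileCount = toDocuments(samplePdfs, "perFile").length; // 3
```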
Adding Custom Metadata
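The interaction between additional metadata and omitted keys can be sketched as follows. This is an illustrative model, not the loader's implementation: `parseOmitKeys`, `omitPath`, and `buildMetadata` are hypothetical helpers, treating dotted keys such as pdf.info.Creator as nested paths is an assumption, and so is letting custom metadata win when a key collides with a default one.

```typescript
type Metadata = Record<string, unknown>;

// Parse the comma-separated key list, trimming whitespace around each key.
function parseOmitKeys(raw: string): string[] {
  return raw.split(",").map((k) => k.trim()).filter((k) => k.length > 0);
}

// Return a copy of `obj` without the value at the dotted `path`.
// Objects along the path are copied, so the input is not mutated.
function omitPath(obj: Metadata, path: string[]): Metadata {
  const copy: Metadata = { ...obj };
  if (path.length === 0) return copy;
  const head = path[0] as string;
  const rest = path.slice(1);
  if (!(head in copy)) return copy;
  if (rest.length === 0) {
    delete copy[head];
  } else {
    const child = copy[head];
    if (child !== null && typeof child === "object" && !Array.isArray(child)) {
      copy[head] = omitPath(child as Metadata, rest);
    }
  }
  return copy;
}

// Combine default and custom metadata, honoring the omit list.
function buildMetadata(defaults: Metadata, custom: Metadata, omitKeys: string): Metadata {
  const keys = parseOmitKeys(omitKeys);
  // "*" drops every default key and keeps only the custom metadata.
  if (keys.indexOf("*") !== -1) return { ...custom };
  let kept = defaults;
  for (const key of keys) kept = omitPath(kept, key.split("."));
  // Assumption: custom metadata overrides defaults on key collisions.
  return { ...kept, ...custom };
}
```

For example, calling `buildMetadata(defaults, { team: "legal" }, "pdf.version, pdf.info.Creator")` keeps every default key except those two paths and adds the `team` field.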
Common Use Cases
Document Q&A
Load PDFs into a vector store to enable question-answering over your documents
Research Papers
Extract and index academic papers for semantic search and retrieval
Manual Processing
Parse technical manuals and handbooks for support chatbots
Contract Analysis
Extract text from legal documents for analysis and comparison
Troubleshooting
PDF fails to parse
Try enabling the Legacy Build option in advanced parameters. Some PDFs use older formats or non-standard encodings that require the legacy PDF.js build.
Missing text from scanned PDFs
The PDF loader extracts text that is already embedded in the PDF. For scanned documents (images of text), you’ll need to use OCR (Optical Character Recognition) preprocessing before loading the PDF.
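One way to spot scanned pages before reaching for OCR is to look for documents whose extracted text is empty or nearly so. The sketch below assumes “One document per page” output with a loc.pageNumber metadata field. `PageDoc` and `pagesLikelyNeedingOcr` are hypothetical names, and the 20-character threshold is an arbitrary starting point.

```typescript
// Hypothetical per-page document shape, with the 1-indexed page number.
interface PageDoc {
  pageContent: string;
  metadata: { loc: { pageNumber: number } };
}

// Return the page numbers whose extracted text has fewer than `minChars`
// non-whitespace characters, which suggests an image-only (scanned) page.
function pagesLikelyNeedingOcr(docs: PageDoc[], minChars = 20): number[] {
  return docs
    .filter((d) => d.pageContent.replace(/\s+/g, "").length < minChars)
    .map((d) => d.metadata.loc.pageNumber);
}
```

If most pages of a file are flagged, the whole PDF is probably a scan and should go through OCR before loading.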
Memory errors with large PDFs
For very large PDF files:
- Use “One document per page” mode
- Add a text splitter to chunk the content
- Consider processing the PDF in batches if it has hundreds of pages
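Batching, as suggested in the last item, can be as simple as slicing the list of pages (or page numbers) into fixed-size groups and handling one group at a time. `toBatches` is a hypothetical helper for this sketch, and the batch size of 100 is only an example.

```typescript
// Split `items` into consecutive groups of at most `batchSize` elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Example: page numbers 1..250 processed 100 pages at a time.
const pageNumbers: number[] = [];
for (let p = 1; p <= 250; p++) pageNumbers.push(p);

const batchSizes = toBatches(pageNumbers, 100).map((b) => b.length); // [100, 100, 50]
```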
Incorrect page numbers in metadata
Page numbers in the loc.pageNumber metadata field are 1-indexed (starting from 1), matching the visual page numbers in PDF readers.
Best Practices
Related Resources
Vector Stores
Store PDF content for semantic search
Document Loaders
Explore other document loader types