The PDF Document Loader allows you to extract text content from PDF files and convert them into structured documents for use in your Flowise workflows.

Overview

This loader uses the LangChain PDFLoader to parse PDF files and extract their text content. It supports both single and multiple file uploads, with options to process PDFs page-by-page or as a single document.

Configuration

pdfFile
file
required
The PDF file(s) to load. Supports both file upload and file storage references.
File Type: .pdf
textSplitter
TextSplitter
Optional text splitter to chunk the extracted text into smaller pieces. Useful for processing large PDFs with LLMs that have token limits.
usage
options
default: "perPage"
Controls how the PDF is processed:
  • One document per page (perPage): each page of the PDF becomes its own document
  • One document per file (perFile): the entire PDF becomes a single document

Advanced Parameters

legacyBuild
boolean
Enable legacy PDF.js build for compatibility with older or problematic PDF files.
Use this option if you encounter errors with certain PDF files that don’t parse correctly with the standard build.
metadata
json
Additional metadata to attach to extracted documents.
{
  "source": "company_handbook",
  "category": "HR",
  "version": "2024"
}
omitMetadataKeys
string
Comma-separated list of default metadata keys to exclude from the output.
Example: pdf.version, pdf.info.Creator
Special value: Use * to omit all default metadata and only include your custom metadata.
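A minimal sketch of how this key filtering behaves (illustrative only, not Flowise's actual implementation — the function name and signature here are hypothetical):

```javascript
// Illustrative sketch: drop default metadata keys named in a comma-separated
// spec like "pdf.version, pdf.info.Creator", or "*" to drop all defaults.
function omitMetadataKeys(metadata, spec, customMetadata = {}) {
  if (spec.trim() === "*") {
    // "*" drops every default key; only custom metadata survives.
    return { ...customMetadata };
  }
  const keys = spec.split(",").map((k) => k.trim()).filter(Boolean);
  // Deep-clone so the original metadata object is untouched.
  const result = JSON.parse(JSON.stringify(metadata));
  for (const path of keys) {
    // Walk dotted paths like "pdf.info.Creator" and delete the leaf key.
    const parts = path.split(".");
    let node = result;
    for (let i = 0; i < parts.length - 1 && node; i++) {
      node = node[parts[i]];
    }
    if (node) delete node[parts[parts.length - 1]];
  }
  return { ...result, ...customMetadata };
}
```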

Output

The PDF loader provides two output types: Document, an array of document objects with metadata and page content, and Text, the combined text of all extracted documents. The Document output looks like this:
[
  {
    "pageContent": "This is the content of page 1...",
    "metadata": {
      "source": "blob",
      "pdf": {
        "version": "1.10.100",
        "info": {
          "PDFFormatVersion": "1.4",
          "IsAcroFormPresent": false,
          "IsXFAPresent": false,
          "Title": "Document Title"
        },
        "metadata": null,
        "totalPages": 10
      },
      "loc": {
        "pageNumber": 1
      }
    }
  },
  {
    "pageContent": "This is the content of page 2...",
    "metadata": {
      "source": "blob",
      "loc": {
        "pageNumber": 2
      }
    }
  }
]
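Downstream code can treat this output as a plain array. For example, looking up a page's text by its loc.pageNumber metadata (a sketch — this helper is hypothetical, not a Flowise API):

```javascript
// Sketch: look up one page's text from the loader's Document output.
function pageContentFor(documents, pageNumber) {
  const doc = documents.find((d) => d.metadata?.loc?.pageNumber === pageNumber);
  return doc ? doc.pageContent : null;
}
```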

Usage Examples

Basic PDF Loading

1. Add PDF Loader Node
Drag the PDF File node from the Document Loaders category onto your canvas.

2. Upload PDF
Click on the file input and select your PDF file(s) to upload.

3. Configure Processing Mode
Choose between “One document per page” or “One document per file” based on your needs.

4. Connect to Downstream Nodes
Connect the PDF loader output to vector stores, text splitters, or other processing nodes.

With Text Splitting

// Connect nodes in this order:
// 1. PDF File Loader
// 2. Recursive Character Text Splitter (chunk size: 1000)
// 3. Pinecone Vector Store
// 4. Conversational Retrieval Chain

// The text splitter will automatically chunk the PDF content
// into manageable pieces for embedding and retrieval
When using text splitters with PDFs, consider setting the chunk size based on your LLM’s context window and the complexity of your PDF content.
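A simplified sketch of what chunking with overlap does to page content (this is not the actual RecursiveCharacterTextSplitter, which also prefers splitting on paragraph and sentence boundaries; the function and defaults here are illustrative):

```javascript
// Illustrative fixed-size chunker with overlap between consecutive chunks.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, keeping `overlap` chars of context
  }
  return chunks;
}
```

Overlap keeps a sentence that straddles a chunk boundary visible in both chunks, which helps retrieval quality at the cost of some duplicated embedding.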

Processing Multiple PDFs

You can upload multiple PDF files at once. Each file will be processed according to the selected usage mode:
  • Per Page Mode: If you upload 3 PDFs with 5 pages each, you’ll get 15 documents (one per page)
  • Per File Mode: If you upload 3 PDFs, you’ll get 3 documents (one per file)
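The document counts above follow directly from the usage mode; a small sketch (pageCounts is a hypothetical array of page counts per uploaded file):

```javascript
// How many documents the loader will emit for a batch of PDFs.
function expectedDocumentCount(pageCounts, usage) {
  if (usage === "perFile") return pageCounts.length;         // one document per file
  return pageCounts.reduce((sum, pages) => sum + pages, 0);  // one document per page
}
```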

Adding Custom Metadata

{
  "department": "Engineering",
  "document_type": "Technical Specification",
  "reviewed_by": "John Doe",
  "review_date": "2024-01-15"
}
This metadata will be merged with the default PDF metadata for all extracted documents.
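Conceptually, the merge is a shallow spread of your custom keys over each document's default metadata (a sketch, assuming custom keys win on conflict — the exact conflict behavior is an assumption here):

```javascript
// Sketch: attach custom metadata to every extracted document.
// Assumption: custom keys override default keys on conflict.
function attachMetadata(documents, customMetadata) {
  return documents.map((doc) => ({
    ...doc,
    metadata: { ...doc.metadata, ...customMetadata },
  }));
}
```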

Common Use Cases

Document Q&A

Load PDFs into a vector store to enable question-answering over your documents

Research Papers

Extract and index academic papers for semantic search and retrieval

Manual Processing

Parse technical manuals and handbooks for support chatbots

Contract Analysis

Extract text from legal documents for analysis and comparison

Troubleshooting

Try enabling the Legacy Build option in advanced parameters. Some PDFs use older formats or non-standard encodings that require the legacy PDF.js build.
The PDF loader extracts text that is already embedded in the PDF. For scanned documents (images of text), you’ll need to use OCR (Optical Character Recognition) preprocessing before loading the PDF.
For very large PDF files:
  1. Use “One document per page” mode
  2. Add a text splitter to chunk the content
  3. Consider processing the PDF in batches if it has hundreds of pages
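For the batching step, one simple approach is to split the per-page documents into fixed-size batches before embedding (a sketch; the batch size of 50 is an arbitrary choice, not a Flowise default):

```javascript
// Split an array of per-page documents into batches for processing.
function batchDocuments(documents, batchSize = 50) {
  const batches = [];
  for (let i = 0; i < documents.length; i += batchSize) {
    batches.push(documents.slice(i, i + batchSize));
  }
  return batches;
}
```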
Page numbers in the loc.pageNumber metadata field are 1-indexed (starting from 1), matching the visual page numbers in PDF readers.

Best Practices

Performance Considerations
  • Large PDFs (>100 pages) can take significant time to process
  • Processing per-page creates more documents but preserves page context
  • Always use text splitters when feeding content to LLMs to stay within token limits
Optimization Tips
  • Use per-page mode when page boundaries are meaningful (e.g., books, reports)
  • Use per-file mode when document flow across pages is important (e.g., contracts)
  • Add source tracking metadata to help identify content origin during retrieval

Vector Stores

Store PDF content for semantic search

Document Loaders

Explore other document loader types
