PDF Document Loader

The PDF Document Loader extracts text content from PDF files and converts it into structured documents for use in your Flowise workflows.

Overview

This loader uses the LangChain PDFLoader to parse PDF files and extract their text content. It supports both single and multiple file uploads, with options to process PDFs page-by-page or as a single document.

Configuration

pdfFile (file, required)

The PDF file(s) to load. Supports both file upload and file storage references.

File type: .pdf

textSplitter (TextSplitter)

Optional text splitter to chunk the extracted text into smaller pieces. Useful for processing large PDFs with LLMs that have token limits.

usage (options, default: "perPage")

Controls how the PDF is processed:

One document per page (perPage): Creates a separate document for each page in the PDF. Recommended for most use cases as it preserves page-level metadata.
One document per file (perFile): Creates a single document containing all pages. Useful when you want to maintain document context across pages.
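The difference between the two modes can be sketched in plain TypeScript. This is an illustration only, not the Flowise or LangChain implementation: the `PageDoc` type, both function names, and the blank-line separator used in per-file mode are assumptions.

```typescript
// Hypothetical document shape for illustration; not the Flowise or LangChain type.
interface PageDoc {
  pageContent: string;
  metadata: { loc?: { pageNumber: number } };
}

// perPage: one document per page, each tagged with its 1-indexed page number.
function splitPerPage(pageTexts: string[]): PageDoc[] {
  return pageTexts.map((text, i) => ({
    pageContent: text,
    metadata: { loc: { pageNumber: i + 1 } },
  }));
}

// perFile: a single document holding every page; page-level metadata is lost.
// Assumption: pages are joined with a blank line.
function joinPerFile(pageTexts: string[]): PageDoc[] {
  return [{ pageContent: pageTexts.join("\n\n"), metadata: {} }];
}

const samplePages = ["Page one text", "Page two text", "Page three text"];
console.log(splitPerPage(samplePages).length); // 3
console.log(joinPerFile(samplePages).length); // 1
```

Per-page mode keeps a page number on every document, which is what makes page-level citations possible during retrieval.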

Advanced Parameters

legacyBuild (boolean)

Enable legacy PDF.js build for compatibility with older or problematic PDF files.

Use this option if you encounter errors with certain PDF files that don’t parse correctly with the standard build.

metadata (json)

Additional metadata to attach to extracted documents.

{
  "source": "company_handbook",
  "category": "HR",
  "version": "2024"
}

omitMetadataKeys (string)

Comma-separated list of default metadata keys to exclude from the output.

Example: pdf.version, pdf.info.Creator

Special value: Use * to omit all default metadata and only include your custom metadata.
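To make the semantics concrete, here is a minimal sketch of how a comma-separated key list and the `*` value could be applied. It assumes that dotted keys such as `pdf.info.Creator` address nested fields and that omission applies only to the default metadata. The `Meta` type and the `omitPath` and `applyOmitKeys` functions are hypothetical names, not part of Flowise.

```typescript
type Meta = { [key: string]: unknown };

// Return a copy of obj with one dotted path (already split into parts) removed.
function omitPath(obj: Meta, path: string[]): Meta {
  const [head, ...rest] = path;
  if (head === undefined || !(head in obj)) return obj;
  const copy: Meta = { ...obj };
  if (rest.length === 0) {
    delete copy[head];
  } else {
    const child = copy[head];
    if (child !== null && typeof child === "object" && !Array.isArray(child)) {
      copy[head] = omitPath(child as Meta, rest);
    }
  }
  return copy;
}

// Assumption: "*" drops all default metadata; otherwise listed keys are removed
// from the defaults and the custom metadata is merged on top.
function applyOmitKeys(defaults: Meta, custom: Meta, omitKeys: string): Meta {
  if (omitKeys.trim() === "*") return { ...custom };
  const keys = omitKeys
    .split(",")
    .map((k) => k.trim())
    .filter((k) => k.length > 0);
  let kept = defaults;
  for (const key of keys) kept = omitPath(kept, key.split("."));
  return { ...kept, ...custom };
}

const defaultMeta: Meta = {
  source: "blob",
  pdf: { version: "1.10.100", info: { Creator: "Writer", Title: "Document Title" }, totalPages: 10 },
};
console.log(applyOmitKeys(defaultMeta, { category: "HR" }, "pdf.version, pdf.info.Creator"));
console.log(applyOmitKeys(defaultMeta, { category: "HR" }, "*"));
```

The helper copies objects along the removed path instead of mutating them, so the original metadata stays intact if the same defaults are reused for several documents.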

Output

The PDF loader provides two output types: Document and Text.

Document

Returns an array of document objects with metadata and page content.

[
  {
    "pageContent": "This is the content of page 1...",
    "metadata": {
      "source": "blob",
      "pdf": {
        "version": "1.10.100",
        "info": {
          "PDFFormatVersion": "1.4",
          "IsAcroFormPresent": false,
          "IsXFAPresent": false,
          "Title": "Document Title"
        },
        "metadata": null,
        "totalPages": 10
      },
      "loc": {
        "pageNumber": 1
      }
    }
  },
  {
    "pageContent": "This is the content of page 2...",
    "metadata": {
      "source": "blob",
      "loc": {
        "pageNumber": 2
      }
    }
  }
]

Text

Returns concatenated text from all pages as a single string.

This is the content of page 1...
This is the content of page 2...
This is the content of page 3...
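As a rough illustration of how the Text output relates to the Document output, the sketch below joins the `pageContent` of each document. The newline separator and the `toTextOutput` name are assumptions; the actual separator used by the node may differ.

```typescript
// Hypothetical minimal shape; only the field needed for the text output.
interface TextSourceDoc {
  pageContent: string;
}

// Assumption: page contents are joined with a single newline.
function toTextOutput(docs: TextSourceDoc[]): string {
  return docs.map((d) => d.pageContent).join("\n");
}

console.log(
  toTextOutput([
    { pageContent: "This is the content of page 1..." },
    { pageContent: "This is the content of page 2..." },
  ])
);
```

Note that the Text output carries no metadata, so page numbers and source information are only available through the Document output.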

Usage Examples

Basic PDF Loading

Add PDF Loader Node

Drag the PDF File node from the Document Loaders category onto your canvas.

Upload PDF

Click on the file input and select your PDF file(s) to upload.

Configure Processing Mode

Choose between “One document per page” or “One document per file” based on your needs.

Connect to Downstream Nodes

Connect the PDF loader output to vector stores, text splitters, or other processing nodes.

With Text Splitting

// Connect nodes in this order:
// 1. PDF File Loader
// 2. Recursive Character Text Splitter (chunk size: 1000)
// 3. Pinecone Vector Store
// 4. Conversational Retrieval Chain

// The text splitter will automatically chunk the PDF content
// into manageable pieces for embedding and retrieval

When using text splitters with PDFs, consider setting the chunk size based on your LLM’s context window and the complexity of your PDF content.
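To get a feel for how chunk size and overlap interact, here is a deliberately simple fixed-size character chunker. It is not the Recursive Character Text Splitter, which also tries to break on separators such as paragraphs and sentences; the `chunkText` function and its parameters are hypothetical.

```typescript
// Naive fixed-size chunking with overlap, measured in characters.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once this chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

A larger chunk size keeps more context together per embedding, while a smaller one gives more precise retrieval hits; the overlap reduces the chance that a sentence is cut in half at a chunk boundary.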

Processing Multiple PDFs

You can upload multiple PDF files at once. Each file will be processed according to the selected usage mode:

Per Page Mode: If you upload 3 PDFs with 5 pages each, you’ll get 15 documents (one per page)
Per File Mode: If you upload 3 PDFs, you’ll get 3 documents (one per file)
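The arithmetic above can be written as a small helper. The `expectedDocumentCount` function and `UsageMode` type are hypothetical names for illustration, and the count applies before any text splitter is attached.

```typescript
type UsageMode = "perPage" | "perFile";

// pageCounts holds the number of pages in each uploaded PDF.
function expectedDocumentCount(pageCounts: number[], mode: UsageMode): number {
  if (mode === "perFile") return pageCounts.length;
  return pageCounts.reduce((sum, n) => sum + n, 0);
}

console.log(expectedDocumentCount([5, 5, 5], "perPage")); // 15
console.log(expectedDocumentCount([5, 5, 5], "perFile")); // 3
```

If a text splitter is connected, the final number of documents depends on the chunk size and will usually be higher than either figure.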

Adding Custom Metadata

{
  "department": "Engineering",
  "document_type": "Technical Specification",
  "reviewed_by": "John Doe",
  "review_date": "2024-01-15"
}

This metadata will be merged with the default PDF metadata for all extracted documents.
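A minimal sketch of such a merge is shown below. It assumes a shallow merge in which custom keys win when they collide with a default key; the `LoadedDoc` type and `withCustomMetadata` function are hypothetical and not taken from the Flowise source.

```typescript
type DocMetadata = { [key: string]: unknown };

interface LoadedDoc {
  pageContent: string;
  metadata: DocMetadata;
}

// Shallow merge: every document receives the same custom fields.
// Assumption: on a key collision the custom value replaces the default.
function withCustomMetadata(docs: LoadedDoc[], custom: DocMetadata): LoadedDoc[] {
  return docs.map((doc) => ({
    pageContent: doc.pageContent,
    metadata: { ...doc.metadata, ...custom },
  }));
}

const merged = withCustomMetadata(
  [{ pageContent: "Page 1", metadata: { source: "blob", loc: { pageNumber: 1 } } }],
  { department: "Engineering", document_type: "Technical Specification" }
);
console.log(merged[0].metadata);
```

Because the merge is shallow in this sketch, a custom key such as `source` would replace the default `source` value entirely.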

Common Use Cases

Document Q&A

Load PDFs into a vector store to enable question-answering over your documents

Research Papers

Extract and index academic papers for semantic search and retrieval

Manual Processing

Parse technical manuals and handbooks for support chatbots

Contract Analysis

Extract text from legal documents for analysis and comparison

Troubleshooting

PDF fails to parse

Try enabling the Legacy Build option in advanced parameters. Some PDFs use older formats or non-standard encodings that require the legacy PDF.js build.

Missing text from scanned PDFs

The PDF loader extracts text that is already embedded in the PDF. For scanned documents (images of text), you’ll need to use OCR (Optical Character Recognition) preprocessing before loading the PDF.
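One way to spot scanned pages after loading is to look for documents whose extracted text is empty or nearly empty. The check below is a heuristic sketch: the `pagesLikelyNeedingOcr` function, the `ExtractedPage` type, and the 20-character threshold are all assumptions chosen for illustration.

```typescript
interface ExtractedPage {
  pageContent: string;
  metadata: { loc: { pageNumber: number } };
}

// Returns the page numbers whose text, ignoring whitespace, is shorter than minChars.
function pagesLikelyNeedingOcr(docs: ExtractedPage[], minChars: number = 20): number[] {
  return docs
    .filter((d) => d.pageContent.replace(/\s+/g, "").length < minChars)
    .map((d) => d.metadata.loc.pageNumber);
}

const scannedCheck = pagesLikelyNeedingOcr([
  { pageContent: "A full page of embedded text content.", metadata: { loc: { pageNumber: 1 } } },
  { pageContent: "  \n ", metadata: { loc: { pageNumber: 2 } } },
]);
console.log(scannedCheck); // [ 2 ]
```

Pages that are intentionally blank or contain only a short caption will also be flagged, so treat the result as a hint rather than a definitive test.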

Memory errors with large PDFs

For very large PDF files:

Use “One document per page” mode
Add a text splitter to chunk the content
Consider processing the PDF in batches if it has hundreds of pages
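Batching can be as simple as slicing the list of page documents into fixed-size groups and handling one group at a time. The generic `toBatches` helper below is a hypothetical sketch; how each batch is then embedded or stored depends on your downstream nodes.

```typescript
// Split a list into consecutive groups of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

console.log(toBatches([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

Processing one batch at a time keeps only a bounded number of documents in memory while they are being embedded or written to a vector store.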

Incorrect page numbers in metadata

Page numbers in the loc.pageNumber metadata field are 1-indexed (starting from 1), matching the visual page numbers in PDF readers.

Best Practices

Performance Considerations

Large PDFs (>100 pages) can take significant time to process
Processing per-page creates more documents but preserves page context
Always use text splitters when feeding content to LLMs to stay within token limits

Optimization Tips

Use per-page mode when page boundaries are meaningful (e.g., books, reports)
Use per-file mode when document flow across pages is important (e.g., contracts)
Add source tracking metadata to help identify content origin during retrieval

Related Resources

Vector Stores

Store PDF content for semantic search

Document Loaders

Explore other document loader types
