Document Loaders are essential components in Flowise that enable you to load and process data from various sources. They transform raw data into a structured format that can be used by LLMs and vector stores.

What are Document Loaders?

Document Loaders extract content from different file types and sources, converting them into Document objects that contain:
  • pageContent: The extracted text content
  • metadata: Information about the source (filename, page number, URL, etc.)
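In TypeScript terms, the shape of a Document object can be sketched as follows. This is a minimal illustration of the pageContent/metadata convention described above, not Flowise's internal class:

```typescript
// Minimal sketch of the Document shape produced by loaders (illustrative only).
interface Document {
  pageContent: string;               // the extracted text content
  metadata: Record<string, unknown>; // source info: filename, page, URL, etc.
}

// Example: one page extracted from a PDF.
const doc: Document = {
  pageContent: "Quarterly revenue grew 12%...",
  metadata: { source: "report.pdf", page: 3 },
};
```

Every loader, regardless of source type, produces an array of such objects, which is what makes them interchangeable inputs for text splitters and vector stores.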

Available Document Loaders

Flowise supports a wide variety of document loaders across different categories:

File-Based Loaders

PDF File

Load and extract text from PDF documents

Text File

Load text-based files including .txt, .md, .html, and code files

File Loader

Generic loader supporting multiple file types (PDF, DOCX, CSV, JSON, etc.)

CSV

Load and parse CSV files into documents

Web-Based Loaders

Cheerio Web Scraper

Scrape and extract content from web pages using CSS selectors

Puppeteer Web Scraper

Advanced web scraping with JavaScript execution support

Playwright

Cross-browser web scraping with full page rendering

FireCrawl

Scrape and crawl websites using the Firecrawl service

Cloud Storage Loaders

Google Drive

Load files from Google Drive

S3 File

Load files from Amazon S3 buckets

S3 Directory

Load multiple files from S3 directories

Productivity Tool Loaders

Notion

Load pages, databases, and folders from Notion

Confluence

Load content from Atlassian Confluence

Jira

Load issues and project data from Jira

Airtable

Load records from Airtable bases

Microsoft Office Loaders

Word

Load .doc and .docx files

Excel

Load spreadsheet data

PowerPoint

Load presentation content

Common Configuration Options

Most document loaders share these common configuration options:
textSplitter (TextSplitter, optional)
Text splitter used to chunk documents into smaller pieces for processing.

metadata (json, optional)
Additional metadata to attach to all extracted documents, for example:
{
  "source": "internal_docs",
  "department": "engineering"
}

omitMetadataKeys (string, optional)
Comma-separated list of metadata keys to exclude from the output. Use * to omit all default metadata keys except those specified in Additional Metadata. Example: key1, key2, key3.nestedKey1
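To make the omitMetadataKeys semantics concrete, here is a hypothetical helper. applyOmitKeys is not part of Flowise; it only models the top-level behavior, not nested keys such as key3.nestedKey1:

```typescript
// Hypothetical illustration of omitMetadataKeys semantics (not a Flowise API).
// "*" drops every default key and keeps only the additional metadata;
// otherwise the comma-separated keys are removed. Top-level keys only.
function applyOmitKeys(
  metadata: Record<string, unknown>,
  omit: string,
  additional: Record<string, unknown> = {}
): Record<string, unknown> {
  if (omit.trim() === "*") return { ...additional };
  const toOmit = omit.split(",").map((k) => k.trim());
  const kept = Object.fromEntries(
    Object.entries(metadata).filter(([key]) => !toOmit.includes(key))
  );
  return { ...kept, ...additional };
}
```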

Output Options

Document loaders typically provide two output types:
1. Document Output

Returns an array of document objects with metadata and pageContent. Use this when you need to preserve document structure and metadata.
[
  {
    "pageContent": "This is the text content...",
    "metadata": {
      "source": "document.pdf",
      "page": 1
    }
  }
]
2. Text Output

Returns concatenated text from all documents. Use this for simple text processing workflows.
This is the text content from page 1...
This is the text content from page 2...
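The Text Output option behaves like a simple concatenation over the document array; a sketch of that behavior:

```typescript
// Sketch of the Text Output behavior: join each document's pageContent.
function documentsToText(docs: { pageContent: string }[]): string {
  return docs.map((d) => d.pageContent).join("\n");
}
```

Note that metadata is discarded in this mode, so prefer Document Output whenever you need to trace a chunk back to its source file or page.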

Best Practices

Performance Tips
  • Use text splitters to chunk large documents for better LLM processing
  • Set appropriate limits when crawling websites to avoid long processing times
  • Use metadata to track document sources for better retrieval
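As an illustration of the chunking tip, a naive fixed-size splitter with overlap might look like the sketch below. Real text splitters in Flowise also respect separators and token counts; this character-based version only shows the basic idea:

```typescript
// Naive character-based splitter with overlap, for illustration only.
function splitText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  // Each step advances by (chunkSize - overlap) so adjacent chunks share
  // `overlap` characters, preserving context across chunk boundaries.
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}
```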
Important Considerations
  • Large files may take significant time to process
  • Some loaders require API keys or authentication credentials
  • Web scraping should respect robots.txt and website terms of service

Using Document Loaders in Workflows

Document loaders are typically used in these scenarios:
  1. RAG (Retrieval Augmented Generation): Load documents to create a knowledge base
  2. Data Ingestion: Import data from various sources into vector stores
  3. Content Processing: Extract and transform content for analysis
  4. Knowledge Base Creation: Build searchable document repositories

Next Steps

PDF Loader

Learn how to load PDF documents

Web Scraper

Extract content from websites

Vector Stores

Store and search documents
