Document Loaders are essential components in Flowise that enable you to load and process data from various sources. They transform raw data into a structured format that can be used by LLMs and vector stores.

What are Document Loaders?

Document Loaders extract content from different file types and sources, converting them into Document objects that contain:
  • pageContent: The extracted text content
  • metadata: Information about the source (filename, page number, URL, etc.)
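In TypeScript terms, the shape of a Document object can be sketched as follows. This is a minimal illustration of the pageContent/metadata convention described above, not Flowise's internal class:

```typescript
// Minimal sketch of the Document shape produced by loaders (illustrative only).
interface Document {
  pageContent: string;               // the extracted text content
  metadata: Record<string, unknown>; // source info: filename, page, URL, etc.
}

// Example: one page extracted from a PDF.
const doc: Document = {
  pageContent: "Quarterly revenue grew 12%...",
  metadata: { source: "report.pdf", page: 3 },
};
```

Every loader, regardless of source type, produces an array of such objects, which is what makes them interchangeable inputs for text splitters and vector stores.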

Available Document Loaders

Flowise supports a wide variety of document loaders across different categories:

File-Based Loaders

PDF File

Load and extract text from PDF documents

Text File

Load text-based files including .txt, .md, .html, and code files

File Loader

Generic loader supporting multiple file types (PDF, DOCX, CSV, JSON, etc.)

CSV

Load and parse CSV files into documents

Web-Based Loaders

Cheerio Web Scraper

Scrape and extract content from web pages using CSS selectors

Puppeteer Web Scraper

Advanced web scraping with JavaScript execution support

Playwright

Cross-browser web scraping with full page rendering

FireCrawl

Scrape and crawl websites using the Firecrawl service

Cloud Storage Loaders

Google Drive

Load files from Google Drive

S3 File

Load files from Amazon S3 buckets

S3 Directory

Load multiple files from S3 directories

Productivity Tool Loaders

Notion

Load pages, databases, and folders from Notion

Confluence

Load content from Atlassian Confluence

Jira

Load issues and project data from Jira

Airtable

Load records from Airtable bases

Microsoft Office Loaders

Word

Load .doc and .docx files

Excel

Load spreadsheet data

PowerPoint

Load presentation content

Common Configuration Options

Most document loaders share these common configuration options:
textSplitter (TextSplitter, optional)
Text splitter used to chunk documents into smaller pieces for processing.

metadata (json, optional)
Additional metadata to attach to all extracted documents, for example:
{
  "source": "internal_docs",
  "department": "engineering"
}

omitMetadataKeys (string, optional)
Comma-separated list of metadata keys to exclude from the output. Use * to omit all default metadata keys except those specified in Additional Metadata. Example: key1, key2, key3.nestedKey1
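To make the omitMetadataKeys semantics concrete, here is a hypothetical helper. applyOmitKeys is not part of Flowise; it only models the top-level behavior, not nested keys such as key3.nestedKey1:

```typescript
// Hypothetical illustration of omitMetadataKeys semantics (not a Flowise API).
// "*" drops every default key and keeps only the additional metadata;
// otherwise the comma-separated keys are removed. Top-level keys only.
function applyOmitKeys(
  metadata: Record<string, unknown>,
  omit: string,
  additional: Record<string, unknown> = {}
): Record<string, unknown> {
  if (omit.trim() === "*") return { ...additional };
  const toOmit = omit.split(",").map((k) => k.trim());
  const kept = Object.fromEntries(
    Object.entries(metadata).filter(([key]) => !toOmit.includes(key))
  );
  return { ...kept, ...additional };
}
```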

Output Options

Document loaders typically provide two output types:
1. Document Output

Returns an array of document objects with metadata and pageContent. Use this when you need to preserve document structure and metadata.
[
  {
    "pageContent": "This is the text content...",
    "metadata": {
      "source": "document.pdf",
      "page": 1
    }
  }
]
2. Text Output

Returns concatenated text from all documents. Use this for simple text processing workflows.
This is the text content from page 1...
This is the text content from page 2...
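The Text Output option behaves like a simple concatenation over the document array; a sketch of that behavior:

```typescript
// Sketch of the Text Output behavior: join each document's pageContent.
function documentsToText(docs: { pageContent: string }[]): string {
  return docs.map((d) => d.pageContent).join("\n");
}
```

Note that metadata is discarded in this mode, so prefer Document Output whenever you need to trace a chunk back to its source file or page.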

Best Practices

Performance Tips
  • Use text splitters to chunk large documents for better LLM processing
  • Set appropriate limits when crawling websites to avoid long processing times
  • Use metadata to track document sources for better retrieval
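As an illustration of the chunking tip, a naive fixed-size splitter with overlap might look like the sketch below. Real text splitters in Flowise also respect separators and token counts; this character-based version only shows the basic idea:

```typescript
// Naive character-based splitter with overlap, for illustration only.
function splitText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  // Each step advances by (chunkSize - overlap) so adjacent chunks share
  // `overlap` characters, preserving context across chunk boundaries.
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}
```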
Important Considerations
  • Large files may take significant time to process
  • Some loaders require API keys or authentication credentials
  • Web scraping should respect robots.txt and website terms of service

Using Document Loaders in Workflows

Document loaders are typically used in these scenarios:
  1. RAG (Retrieval Augmented Generation): Load documents to create a knowledge base
  2. Data Ingestion: Import data from various sources into vector stores
  3. Content Processing: Extract and transform content for analysis
  4. Knowledge Base Creation: Build searchable document repositories

Next Steps

PDF Loader

Learn how to load PDF documents

Web Scraper

Extract content from websites

Vector Stores

Store and search documents
