Document Loaders

Flowise supports loading documents from a wide variety of sources. Extract text from files, scrape websites, connect to databases, and more.

File Formats

PDF

Extract text from PDF documents

Microsoft Word

Load .docx and .doc files

Microsoft Excel

Parse Excel spreadsheets

Microsoft PowerPoint

Extract text from presentations

CSV

Load CSV data files

JSON

Parse JSON files

JSON Lines

Load JSONL format files

Text

Plain text files

EPUB

Load EPUB ebooks

DOCX

Microsoft Word documents

Web Scraping

Cheerio

Fast HTML scraping

Puppeteer

Headless Chrome scraping

Playwright

Headless browser scraping across Chromium, Firefox, and WebKit

FireCrawl

AI-powered web scraping

Spider

Web crawling service

Cloud Storage

AWS S3

Load files from S3 buckets

Google Drive

Access Google Drive files

Azure Blob Storage

Load from Azure storage

Databases & SaaS

Notion

Load Notion pages and databases

Airtable

Import Airtable data

Confluence

Load Confluence pages

Google Sheets

Access Google Sheets data

GitHub

Load GitHub repositories

GitBook

Import GitBook documentation

Jira

Load Jira issues

Figma

Extract Figma designs

Search & APIs

Brave Search

Search the web with Brave

SerpAPI

Google search results

SearchAPI

Multi-engine search API

API Loader

Load from REST APIs

Special Loaders

Folder

Load all files from a directory

File

Generic file loader

Unstructured

Universal document loader

Plain Text

Direct text input

Vector Store to Document

Load from existing vector stores

Document Store

Load from document storage

Custom Loader

Build custom loaders

Apify Website Crawler

Apify web crawler

Oxylabs

Web scraping service

Configuration Examples

PDF File

// PDF loader configuration
{
  pdfFile: "FILE-STORAGE::document.pdf",
  usage: "perPage", // or "perFile"
  legacyBuild: false,
  metadata: {
    source: "user-manual",
    version: "2.0"
  },
  omitMetadataKeys: "pdf.version, pdf.info"
}

Usage Modes:

perPage - One document per page
perFile - One document for entire PDF
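
To make the two usage modes concrete, here is a minimal TypeScript sketch of the grouping they describe. `PdfPage`, `LoadedDoc`, and `groupPages` are hypothetical names used only for illustration, not Flowise or LangChain APIs, and joining pages with a blank line is an assumption.

```typescript
// Hypothetical types for illustration; not part of Flowise.
type PdfPage = { pageNumber: number; text: string };
type LoadedDoc = { pageContent: string; metadata: Record<string, unknown> };

// perPage: one document per page. perFile: one document for the whole PDF.
function groupPages(pages: PdfPage[], usage: "perPage" | "perFile"): LoadedDoc[] {
  if (usage === "perFile") {
    // Assumption: pages are joined with a blank line between them.
    const fullText = pages.map((p) => p.text).join("\n\n");
    return [{ pageContent: fullText, metadata: { totalPages: pages.length } }];
  }
  return pages.map((p) => ({ pageContent: p.text, metadata: { page: p.pageNumber } }));
}

const samplePages: PdfPage[] = [
  { pageNumber: 1, text: "Intro" },
  { pageNumber: 2, text: "Details" },
];
const perPageDocs = groupPages(samplePages, "perPage"); // two documents
const perFileDocs = groupPages(samplePages, "perFile"); // one document
```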

Code Example:

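// Adapted from Pdf.ts
const loader = new PDFLoader(new Blob([buffer]), {
  // Split into one document per page unless usage is 'perFile'
  splitPages: usage !== 'perFile',
  // Select the pdf.js build; the legacy build can help with problematic PDFs
  pdfjs: () => legacyBuild
    ? import('pdfjs-dist/legacy/build/pdf.js')
    : import('pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js')
})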

let docs = await loader.load()
if (textSplitter) {
  docs = await textSplitter.splitDocuments(docs)
}

Cheerio Web Scraper

// Cheerio configuration
{
  url: "https://docs.flowiseai.com",
  relativeLinksMethod: "webCrawl",
  limit: 10,
  selector: ".main-content", // CSS selector
  metadata: {
    type: "documentation"
  }
}

Features:

Web crawling for relative links
XML sitemap scraping
CSS selector filtering
Automatic link discovery

Code Example:

// From Cheerio.ts
const loader = new CheerioWebBaseLoader(url, {
  selector: '.main-content' // Optional CSS selector
})

let docs = await loader.load()
if (textSplitter) {
  docs = await textSplitter.splitDocuments(docs)
}

Puppeteer Web Scraper

// Puppeteer configuration
{
  url: "https://example.com",
  waitForSelector: ".content-loaded",
  relativeLinksMethod: "webCrawl",
  limit: 20,
  metadata: {
    scraped_at: "2024-01-15"
  }
}

Use Cases:

JavaScript-heavy websites
Dynamic content loading
Pages requiring authentication
Complex interactions

Notion

// Notion Page loader
{
  notionPageId: "abc123...",
  notionApiKey: "secret_...",
  metadata: {
    workspace: "engineering"
  }
}

// Notion Database loader
{
  notionDatabaseId: "def456...",
  notionApiKey: "secret_..."
}

Setup Steps:

Create Notion integration at notion.so/my-integrations
Get API key (Internal Integration Token)
Share page/database with integration
Copy page/database ID from URL

AWS S3

// S3 File loader
{
  s3Bucket: "my-documents",
  s3Key: "docs/manual.pdf",
  s3Region: "us-east-1",
  // Optional: Use IAM role or provide credentials
}

// S3 Directory loader
{
  s3Bucket: "my-documents",
  s3Prefix: "docs/",
  s3Region: "us-east-1"
}

Credential Setup:

AWS Access Key ID
AWS Secret Access Key
AWS Region

Google Drive

// Google Drive loader
{
  googleDriveId: "abc123...", // File or folder ID
  recursive: true, // Load subfolders
  metadata: {
    source: "google-drive"
  }
}

Setup Steps:

Create project in Google Cloud Console
Enable Google Drive API
Create OAuth 2.0 credentials
Get file/folder ID from Drive URL

Microsoft Excel

// Excel loader configuration
{
  excelFile: "FILE-STORAGE::data.xlsx",
  sheetName: "Sheet1", // Optional, loads all if not specified
  metadata: {
    source: "financial-data"
  }
}

CSV

// CSV loader configuration
{
  csvFile: "FILE-STORAGE::data.csv",
  columnMapping: {
    "text": "description",
    "metadata.category": "type"
  },
  metadata: {
    format: "csv"
  }
}
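
One way to read the `columnMapping` above is that each key names a target field and each value names a CSV column. The sketch below assumes exactly that; the interpretation, along with the names `CsvRow`, `MappedDoc`, and `mapRow`, is an assumption for illustration and not confirmed loader behavior.

```typescript
type CsvRow = Record<string, string>;
type MappedDoc = { pageContent: string; metadata: Record<string, string> };

// Assumption: keys are targets ("text" or "metadata.<field>"), values are CSV column names.
function mapRow(row: CsvRow, columnMapping: Record<string, string>): MappedDoc {
  const doc: MappedDoc = { pageContent: "", metadata: {} };
  for (const [target, column] of Object.entries(columnMapping)) {
    const value = row[column] ?? ""; // a missing column becomes an empty string
    if (target === "text") {
      doc.pageContent = value;
    } else if (target.startsWith("metadata.")) {
      doc.metadata[target.slice("metadata.".length)] = value;
    }
  }
  return doc;
}

const mappedDoc = mapRow(
  { description: "Quarterly report", type: "finance" },
  { text: "description", "metadata.category": "type" },
);
```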

JSON

// JSON loader configuration
{
  jsonFile: "FILE-STORAGE::config.json",
  pointers: ["/data/items"], // JSON Pointer paths (RFC 6901)
  metadata: {
    type: "config"
  }
}
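
Values such as `/data/items` follow JSON Pointer syntax (RFC 6901). The sketch below shows how such a pointer resolves against parsed JSON; `resolvePointer` is a hypothetical helper, and how the loader turns the resolved value into documents is not shown here.

```typescript
// Resolve an RFC 6901 JSON Pointer such as "/data/items" against a parsed JSON value.
function resolvePointer(root: unknown, pointer: string): unknown {
  if (pointer === "") return root; // the empty pointer refers to the whole document
  if (!pointer.startsWith("/")) throw new Error(`Invalid JSON Pointer: ${pointer}`);
  let current: unknown = root;
  for (const rawToken of pointer.slice(1).split("/")) {
    // Unescape "~1" to "/" first, then "~0" to "~", in the order the RFC specifies.
    const token = rawToken.replace(/~1/g, "/").replace(/~0/g, "~");
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[token];
  }
  return current;
}

const parsedConfig = { data: { items: ["alpha", "beta"] } };
const pointerItems = resolvePointer(parsedConfig, "/data/items");
```

Array elements can be addressed the same way, for example `/data/items/1`.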

Folder Loader

// Folder loader configuration
{
  folderPath: "/path/to/documents",
  recursive: true,
  fileTypes: [".pdf", ".txt", ".md"],
  metadata: {
    batch: "import-2024"
  }
}
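
The interaction between `recursive` and `fileTypes` can be sketched as a simple filter over relative paths. `shouldLoad` is a hypothetical helper, and the case-insensitive extension match is an assumption rather than documented behavior.

```typescript
// Decide whether a path (relative to the folder root) would be picked up.
function shouldLoad(relativePath: string, fileTypes: string[], recursive: boolean): boolean {
  const inSubfolder = relativePath.includes("/");
  if (inSubfolder && !recursive) return false; // subfolders are skipped unless recursive
  const lower = relativePath.toLowerCase();
  // Assumption: extensions are compared case-insensitively.
  return fileTypes.some((ext) => lower.endsWith(ext.toLowerCase()));
}

const folderListing = ["guide.pdf", "notes.txt", "img/logo.png", "archive/old.md"];
const wantedTypes = [".pdf", ".txt", ".md"];
const selectedFiles = folderListing.filter((p) => shouldLoad(p, wantedTypes, true));
const topLevelOnly = folderListing.filter((p) => shouldLoad(p, wantedTypes, false));
```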

API Loader

// API loader configuration
{
  apiUrl: "https://api.example.com/data",
  method: "GET",
  headers: {
    "Authorization": "Bearer token",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({ query: "documents" }), // Request body; typically sent with POST rather than GET
  textPath: "data.content" // Extract text from response
}
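
A dot-separated `textPath` such as `data.content` can be read as a walk through the parsed response. The following is a minimal sketch under that assumption; `extractByPath` is a hypothetical helper, not the loader's own code.

```typescript
// Follow a dot-separated path such as "data.content" through a parsed JSON response.
function extractByPath(response: unknown, textPath: string): unknown {
  let current: unknown = response;
  for (const key of textPath.split(".")) {
    // Stop early if the path runs through a missing or non-object value.
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

const apiResponse = { data: { content: "Hello from the API", id: 7 } };
const extractedText = extractByPath(apiResponse, "data.content");
```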

FireCrawl

// FireCrawl configuration
{
  url: "https://example.com",
  apiKey: "fc-...",
  mode: "scrape", // or "crawl"
  limit: 100,
  metadata: {
    crawled_by: "firecrawl"
  }
}

Modes:

scrape - Single page
crawl - Multiple pages with depth

GitHub

// GitHub loader configuration
{
  repoLink: "https://github.com/FlowiseAI/Flowise",
  branch: "main",
  accessToken: "ghp_...", // Required for private repos; optional for public ones
  recursive: true,
  ignorePaths: ["node_modules", ".git"]
}

Confluence

// Confluence loader
{
  confluenceUrl: "https://company.atlassian.net",
  spaceKey: "ENG",
  username: "<account-email>",
  accessToken: "...",
  limit: 25
}

Airtable

// Airtable loader
{
  baseId: "app...",
  tableName: "Documents",
  apiKey: "key...",
  view: "All Records"
}

Advanced Features

Text Splitting

All loaders support text splitting:

// Connect a Text Splitter node
{
  textSplitter: textSplitterNode,
  // Documents automatically split after loading
}

Metadata Enrichment

Add custom metadata to all documents:

{
  metadata: {
    source: "documentation",
    version: "2.0",
    department: "engineering",
    indexed_at: "2024-01-15"
  }
}
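
Conceptually, enrichment merges the custom object into each document's existing metadata. The sketch below assumes custom keys win when a key appears in both; `EnrichableDoc` and `enrich` are hypothetical names for illustration.

```typescript
type EnrichableDoc = { pageContent: string; metadata: Record<string, unknown> };

// Merge custom metadata into every document.
// Assumption: custom keys override loader-provided keys on conflict.
function enrich(docs: EnrichableDoc[], custom: Record<string, unknown>): EnrichableDoc[] {
  return docs.map((doc) => ({ ...doc, metadata: { ...doc.metadata, ...custom } }));
}

const enrichedDocs = enrich(
  [{ pageContent: "Install guide", metadata: { source: "upload", page: 1 } }],
  { source: "documentation", version: "2.0" },
);
```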

Metadata Filtering

Omit default metadata keys:

// Omit specific default keys
{
  omitMetadataKeys: "pdf.version, loc.lines"
}

// Or omit all default keys, keeping only custom metadata
{
  omitMetadataKeys: "*"
}
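
The filtering rule can be sketched as follows: a comma-separated list removes the named (possibly nested) keys, while `*` keeps only the custom metadata. `omitPath` and `filterMetadata` are hypothetical helpers written for this illustration, and the treatment of dotted keys as nested paths is an assumption.

```typescript
type MetaMap = Record<string, unknown>;

// Remove one dotted path (e.g. "loc.lines") from a metadata object without mutating the input.
function omitPath(meta: MetaMap, path: string): MetaMap {
  const [head, ...rest] = path.split(".");
  if (head === undefined || !(head in meta)) return meta;
  const copy: MetaMap = { ...meta };
  const child = copy[head];
  if (rest.length === 0) {
    delete copy[head];
  } else if (child !== null && typeof child === "object" && !Array.isArray(child)) {
    copy[head] = omitPath(child as MetaMap, rest.join("."));
  }
  return copy;
}

// Assumption: "*" drops every default key and keeps only the custom metadata.
function filterMetadata(defaults: MetaMap, custom: MetaMap, omitMetadataKeys: string): MetaMap {
  if (omitMetadataKeys.trim() === "*") return { ...custom };
  const keys = omitMetadataKeys.split(",").map((k) => k.trim()).filter((k) => k.length > 0);
  return keys.reduce<MetaMap>((acc, key) => omitPath(acc, key), { ...defaults, ...custom });
}

const defaultMeta: MetaMap = {
  source: "a.pdf",
  pdf: { version: "1.7", totalPages: 3 },
  loc: { lines: { from: 1, to: 9 }, pageNumber: 1 },
};
const customMeta: MetaMap = { department: "engineering" };
const filteredSome = filterMetadata(defaultMeta, customMeta, "pdf.version, loc.lines");
const filteredAll = filterMetadata(defaultMeta, customMeta, "*");
```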

Multiple Files

Load multiple files at once:

{
  pdfFile: "FILE-STORAGE::[\"doc1.pdf\",\"doc2.pdf\",\"doc3.pdf\"]"
}
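
The value above is a `FILE-STORAGE::` prefix followed by either a single file name or a JSON array of names. A minimal sketch of parsing that format is shown below; `parseFileReference` is a hypothetical helper for illustration.

```typescript
const FILE_STORAGE_PREFIX = "FILE-STORAGE::";

// Turn a stored file reference into a list of file names.
function parseFileReference(reference: string): string[] {
  if (!reference.startsWith(FILE_STORAGE_PREFIX)) {
    throw new Error(`Expected a ${FILE_STORAGE_PREFIX} reference`);
  }
  const payload = reference.slice(FILE_STORAGE_PREFIX.length);
  // A JSON array means several files; anything else is treated as a single file name.
  if (payload.startsWith("[") && payload.endsWith("]")) {
    const parsed: unknown = JSON.parse(payload);
    if (Array.isArray(parsed) && parsed.every((item) => typeof item === "string")) {
      return parsed as string[];
    }
    throw new Error("Expected a JSON array of file names");
  }
  return [payload];
}

const multipleFiles = parseFileReference('FILE-STORAGE::["doc1.pdf","doc2.pdf","doc3.pdf"]');
const singleFile = parseFileReference("FILE-STORAGE::document.pdf");
```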

Output Formats

Choose output format:

// Output as Document objects (default)
output: "document"

// Output as concatenated text
output: "text"
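
The difference between the two formats can be sketched as below. `OutputDoc` and `formatOutput` are hypothetical names, and the newline separator for the `text` format is an assumption.

```typescript
type OutputDoc = { pageContent: string; metadata: Record<string, unknown> };

// "document" returns the documents unchanged; "text" concatenates their contents.
function formatOutput(docs: OutputDoc[], output: "document" | "text"): OutputDoc[] | string {
  if (output === "document") return docs;
  // Assumption: contents are joined with a newline, and metadata is dropped.
  return docs.map((doc) => doc.pageContent).join("\n");
}

const outputDocs: OutputDoc[] = [
  { pageContent: "First chunk", metadata: { page: 1 } },
  { pageContent: "Second chunk", metadata: { page: 2 } },
];
const asText = formatOutput(outputDocs, "text");
const asDocuments = formatOutput(outputDocs, "document");
```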

Relative Links Crawling

Web loaders can crawl related pages:

{
  url: "https://docs.example.com",
  relativeLinksMethod: "webCrawl", // or "scrapeXMLSitemap"
  limit: 50 // Max pages to crawl (0 = unlimited)
}
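
The effect of `limit` on a crawl can be sketched as a bounded breadth-first traversal. In the sketch below, `getLinks` is a hypothetical stand-in for fetching a page and extracting its links, and restricting the crawl to the starting site's origin is an assumption, not documented behavior.

```typescript
// Breadth-first crawl that stops after `limit` pages (0 means no limit).
function crawl(startUrl: string, limit: number, getLinks: (url: string) => string[]): string[] {
  const startOrigin = new URL(startUrl).origin;
  const visited = new Set<string>([startUrl]);
  const queue: string[] = [startUrl];
  const pages: string[] = [];
  while (queue.length > 0 && (limit === 0 || pages.length < limit)) {
    const url = queue.shift() as string;
    pages.push(url);
    for (const link of getLinks(url)) {
      const absolute = new URL(link, url).href; // resolve relative links against the current page
      // Assumption: only links on the starting site's origin are followed.
      if (new URL(absolute).origin === startOrigin && !visited.has(absolute)) {
        visited.add(absolute);
        queue.push(absolute);
      }
    }
  }
  return pages;
}

// In-memory link graph standing in for real pages.
const siteLinks: Record<string, string[]> = {
  "https://docs.example.com/": ["/guide", "/api", "https://other.example.org/"],
  "https://docs.example.com/guide": ["/api", "/faq"],
};
const crawledPages = crawl("https://docs.example.com/", 3, (url) => siteLinks[url] ?? []);
const crawledAll = crawl("https://docs.example.com/", 0, (url) => siteLinks[url] ?? []);
```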

Unstructured API

Load any document type:

// Unstructured configuration
{
  file: "FILE-STORAGE::document.pdf",
  apiKey: "...",
  strategy: "hi_res", // or "fast", "ocr_only"
  // Supports: PDF, DOCX, PPTX, images, and more
}

Strategies:

hi_res - Best quality, slower
fast - Quick extraction
ocr_only - Image OCR

Best Practices

File Size Limits

// For large PDFs, use perFile mode
{
  usage: "perFile",
  // Then split with text splitter
}

Web Scraping

// Use appropriate scraper:
// - Cheerio: Fast, static content
// - Puppeteer: JavaScript-heavy sites
// - Playwright: Modern complex sites
// - FireCrawl: AI-powered, best quality

Rate Limiting

// Limit crawled pages
{
  limit: 10, // Don't crawl entire site
  timeout: 30000 // 30 second timeout
}

Error Handling

Loaders gracefully handle errors:

// Invalid URLs or files are skipped
// Check logs for details:
// "Error loading document: ..."

Troubleshooting

PDF Issues

// Try legacy build for problematic PDFs
{
  legacyBuild: true
}

Web Scraping Fails

// Use Puppeteer for JavaScript sites
// Add wait selector:
{
  waitForSelector: ".content-loaded"
}

File Not Found

Error: FILE-STORAGE:: file not found

Upload the file through the Flowise UI first, then reference it by name.

Connection Timeout

{
  timeout: 60000 // Increase timeout
}

Memory Issues

// Split large documents
{
  textSplitter: textSplitterNode,
  chunkSize: 1000,
  chunkOverlap: 200
}
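
To show what `chunkSize` and `chunkOverlap` mean, here is a simplified character-based sketch. Real text splitters usually break on separators such as paragraphs or sentences; `splitWithOverlap` is a hypothetical helper that only illustrates the two parameters.

```typescript
// Split text into chunks of at most `chunkSize` characters, where each chunk
// starts `chunkSize - chunkOverlap` characters after the previous one.
function splitWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // the last chunk reached the end
  }
  return chunks;
}

const overlapChunks = splitWithOverlap("abcdefghij", 4, 2);
```

With the settings in the configuration above, consecutive chunks would share 200 characters.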

Performance Tips

Use the appropriate loader - Cheerio is fastest for static sites
Limit crawling - Set a page limit instead of crawling the whole site
Cache results - Don’t re-scrape unchanged content
Split documents - Break large documents into chunks
Filter content - Use CSS selectors to extract only the relevant parts
Batch operations - Load multiple files at once

Next Steps

Text Splitters

Split documents into chunks

Embeddings

Generate embeddings from loaded documents

Vector Stores

Store loaded documents in vector databases
