Flowise supports loading documents from a wide variety of sources. Extract text from files, scrape websites, connect to databases, and more.

File Formats

PDF

Extract text from PDF documents

Microsoft Word

Load .docx and .doc files

Microsoft Excel

Parse Excel spreadsheets

Microsoft PowerPoint

Extract text from presentations

CSV

Load CSV data files

JSON

Parse JSON files

JSON Lines

Load JSONL format files

Text

Plain text files

EPUB

Load EPUB ebooks

Web Scraping

Cheerio

Fast HTML scraping

Puppeteer

Headless Chrome scraping

Playwright

Modern web scraping

FireCrawl

AI-powered web scraping

Spider

Web crawling service

Cloud Storage

AWS S3

Load files from S3 buckets

Google Drive

Access Google Drive files

Azure Blob Storage

Load from Azure storage

Databases & SaaS

Notion

Load Notion pages and databases

Airtable

Import Airtable data

Confluence

Load Confluence pages

Google Sheets

Access Google Sheets data

GitHub

Load GitHub repositories

GitBook

Import GitBook documentation

Jira

Load Jira issues

Figma

Extract Figma designs

Search & APIs

Brave Search

Search the web with Brave

SerpAPI

Google search results

SearchAPI

Multi-engine search API

API Loader

Load from REST APIs

Special Loaders

Folder

Load all files from a directory

File

Generic file loader

Unstructured

Universal document loader

Plain Text

Direct text input

Vector Store to Document

Load from existing vector stores

Document Store

Load from document storage

Custom Loader

Build custom loaders

Apify Website Crawler

Apify web crawler

Oxylabs

Web scraping service

Configuration Examples

PDF File

// PDF loader configuration
{
  pdfFile: "FILE-STORAGE::document.pdf",
  usage: "perPage", // or "perFile"
  legacyBuild: false,
  metadata: {
    source: "user-manual",
    version: "2.0"
  },
  omitMetadataKeys: "pdf.version, pdf.info"
}
Usage Modes:
  • perPage - One document per page
  • perFile - One document for entire PDF
Code Example:
// From Pdf.ts
const loader = new PDFLoader(new Blob([buffer]), {
  splitPages: usage !== 'perFile',
  pdfjs: () => legacyBuild 
    ? import('pdfjs-dist/legacy/build/pdf.js')
    : import('pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js')
})

let docs = await loader.load()
if (textSplitter) {
  docs = await textSplitter.splitDocuments(docs)
}

Cheerio Web Scraper

// Cheerio configuration
{
  url: "https://docs.flowiseai.com",
  relativeLinksMethod: "webCrawl",
  limit: 10,
  selector: ".main-content", // CSS selector
  metadata: {
    type: "documentation"
  }
}
Features:
  • Web crawling for relative links
  • XML sitemap scraping
  • CSS selector filtering
  • Automatic link discovery
Code Example:
// From Cheerio.ts
const loader = new CheerioWebBaseLoader(url, {
  selector: '.main-content' // Optional CSS selector
})

let docs = await loader.load()
if (textSplitter) {
  docs = await textSplitter.splitDocuments(docs)
}

Puppeteer Web Scraper

// Puppeteer configuration
{
  url: "https://example.com",
  waitForSelector: ".content-loaded",
  relativeLinksMethod: "webCrawl",
  limit: 20,
  metadata: {
    scraped_at: "2024-01-15"
  }
}
Use Cases:
  • JavaScript-heavy websites
  • Dynamic content loading
  • Pages requiring authentication
  • Complex interactions

Notion

// Notion Page loader
{
  notionPageId: "abc123...",
  notionApiKey: "secret_...",
  metadata: {
    workspace: "engineering"
  }
}

// Notion Database loader
{
  notionDatabaseId: "def456...",
  notionApiKey: "secret_..."
}
Setup Steps:
  1. Create Notion integration at notion.so/my-integrations
  2. Get API key (Internal Integration Token)
  3. Share page/database with integration
  4. Copy page/database ID from URL

AWS S3

// S3 File loader
{
  s3Bucket: "my-documents",
  s3Key: "docs/manual.pdf",
  s3Region: "us-east-1",
  // Optional: Use IAM role or provide credentials
}

// S3 Directory loader
{
  s3Bucket: "my-documents",
  s3Prefix: "docs/",
  s3Region: "us-east-1"
}
Credential Setup:
  • AWS Access Key ID
  • AWS Secret Access Key
  • AWS Region

Google Drive

// Google Drive loader
{
  googleDriveId: "abc123...", // File or folder ID
  recursive: true, // Load subfolders
  metadata: {
    source: "google-drive"
  }
}
Setup Steps:
  1. Create project in Google Cloud Console
  2. Enable Google Drive API
  3. Create OAuth 2.0 credentials
  4. Get file/folder ID from Drive URL
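The ID in step 4 is the long token embedded in the Drive URL. As an illustrative sketch (not Flowise code), it can be extracted from the two common URL shapes like this:

```javascript
// Illustrative only: pull the file or folder ID out of a Google Drive URL.
// Handles https://drive.google.com/file/d/<ID>/view and
// https://drive.google.com/drive/folders/<ID> style links.
function driveIdFromUrl(url) {
  const m = url.match(/\/(?:file\/d|folders)\/([A-Za-z0-9_-]+)/)
  return m ? m[1] : null
}

const fileId = driveIdFromUrl('https://drive.google.com/file/d/1AbC_dEf-123/view')
// fileId === '1AbC_dEf-123'
```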

Microsoft Excel

// Excel loader configuration
{
  excelFile: "FILE-STORAGE::data.xlsx",
  sheetName: "Sheet1", // Optional, loads all if not specified
  metadata: {
    source: "financial-data"
  }
}

CSV

// CSV loader configuration
{
  csvFile: "FILE-STORAGE::data.csv",
  columnMapping: {
    "text": "description",
    "metadata.category": "type"
  },
  metadata: {
    format: "csv"
  }
}
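The columnMapping above routes one CSV column into the document text and others into metadata. A minimal sketch of that mapping logic (mapRow and the field names are illustrative, not the Flowise implementation):

```javascript
// Illustrative sketch: apply a columnMapping to one parsed CSV row.
// "text" selects the column used as page content; "metadata.<key>"
// copies a column into the resulting document's metadata.
function mapRow(row, columnMapping) {
  const mapped = { pageContent: '', metadata: {} }
  for (const [target, column] of Object.entries(columnMapping)) {
    if (target === 'text') {
      mapped.pageContent = row[column]
    } else if (target.startsWith('metadata.')) {
      mapped.metadata[target.slice('metadata.'.length)] = row[column]
    }
  }
  return mapped
}

const doc = mapRow(
  { description: 'A red bicycle', type: 'vehicle' },
  { text: 'description', 'metadata.category': 'type' }
)
// doc.pageContent === 'A red bicycle', doc.metadata.category === 'vehicle'
```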

JSON

// JSON loader configuration
{
  jsonFile: "FILE-STORAGE::config.json",
  pointers: ["/data/items"], // JSON Pointer paths (RFC 6901)
  metadata: {
    type: "config"
  }
}

Folder Loader

// Folder loader configuration
{
  folderPath: "/path/to/documents",
  recursive: true,
  fileTypes: [".pdf", ".txt", ".md"],
  metadata: {
    batch: "import-2024"
  }
}

API Loader

// API loader configuration
{
  apiUrl: "https://api.example.com/data",
  method: "GET",
  headers: {
    "Authorization": "Bearer token",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({ query: "documents" }),
  textPath: "data.content" // Extract text from response
}
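The textPath option walks the JSON response to the field that holds the document text. The traversal can be sketched as a simple dot-path lookup (illustrative only, not the loader's internals):

```javascript
// Illustrative sketch: resolve a dot-separated path like "data.content"
// against a parsed JSON API response, returning undefined if any
// segment is missing rather than throwing.
function extractText(response, textPath) {
  return textPath
    .split('.')
    .reduce((node, key) => (node == null ? undefined : node[key]), response)
}

const response = { data: { content: 'Hello from the API' } }
const text = extractText(response, 'data.content')
// text === 'Hello from the API'
```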

FireCrawl

// FireCrawl configuration
{
  url: "https://example.com",
  apiKey: "fc-...",
  mode: "scrape", // or "crawl"
  limit: 100,
  metadata: {
    crawled_by: "firecrawl"
  }
}
Modes:
  • scrape - Single page
  • crawl - Multiple pages with depth

GitHub

// GitHub loader configuration
{
  repoLink: "https://github.com/FlowiseAI/Flowise",
  branch: "main",
  accessToken: "ghp_...", // Optional for private repos
  recursive: true,
  ignorePaths: ["node_modules", ".git"]
}

Confluence

// Confluence loader
{
  confluenceUrl: "https://company.atlassian.net",
  spaceKey: "ENG",
  username: "[email protected]",
  accessToken: "...",
  limit: 25
}

Airtable

// Airtable loader
{
  baseId: "app...",
  tableName: "Documents",
  apiKey: "key...",
  view: "All Records"
}

Advanced Features

Text Splitting

All loaders support text splitting:
// Connect a Text Splitter node
{
  textSplitter: textSplitterNode,
  // Documents automatically split after loading
}

Metadata Enrichment

Add custom metadata to all documents:
{
  metadata: {
    source: "documentation",
    version: "2.0",
    department: "engineering",
    indexed_at: "2024-01-15"
  }
}

Metadata Filtering

Omit default metadata keys:
{
  omitMetadataKeys: "pdf.version, loc.lines",
  // Or omit all except custom:
  omitMetadataKeys: "*"
}
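The filtering behaves like a key blocklist, with "*" dropping every default key so only user-supplied custom metadata survives. A minimal sketch of that behavior (illustrative only, not the loader's implementation):

```javascript
// Illustrative sketch: drop the listed default metadata keys, or with
// "*" drop everything except the user's custom metadata.
function filterMetadata(metadata, omitMetadataKeys, customMetadata = {}) {
  if (omitMetadataKeys.trim() === '*') return { ...customMetadata }
  const omit = omitMetadataKeys.split(',').map((k) => k.trim())
  return Object.fromEntries(
    Object.entries(metadata).filter(([key]) => !omit.includes(key))
  )
}

const cleaned = filterMetadata(
  { 'pdf.version': '1.7', 'loc.lines': 12, source: 'manual' },
  'pdf.version, loc.lines'
)
// cleaned is { source: 'manual' }
```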

Multiple Files

Load multiple files at once:
{
  pdfFile: "FILE-STORAGE::[\"doc1.pdf\",\"doc2.pdf\",\"doc3.pdf\"]"
}
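The value after the FILE-STORAGE:: prefix is either a single file name or a JSON array of names, so normalizing it to a list can be sketched like this (illustrative only):

```javascript
// Illustrative sketch: strip the FILE-STORAGE:: prefix and normalize
// the remainder (single name or JSON array) to an array of file names.
function fileNames(value) {
  const raw = value.replace('FILE-STORAGE::', '')
  return raw.startsWith('[') ? JSON.parse(raw) : [raw]
}

const many = fileNames('FILE-STORAGE::["doc1.pdf","doc2.pdf","doc3.pdf"]')
// many is ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']
const one = fileNames('FILE-STORAGE::document.pdf')
// one is ['document.pdf']
```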

Output Formats

Choose output format:
// Output as Document objects (default)
output: "document"

// Output as concatenated text
output: "text"

Web Crawling

Web loaders can crawl related pages:
{
  url: "https://docs.example.com",
  relativeLinksMethod: "webCrawl", // or "scrapeXMLSitemap"
  limit: 50 // Max pages to crawl (0 = unlimited)
}

Unstructured API

Load any document type:
// Unstructured configuration
{
  file: "FILE-STORAGE::document.pdf",
  apiKey: "...",
  strategy: "hi_res", // or "fast", "ocr_only"
  // Supports: PDF, DOCX, PPTX, images, and more
}
Strategies:
  • hi_res - Best quality, slower
  • fast - Quick extraction
  • ocr_only - Image OCR

Best Practices

File Size Limits

// For large PDFs, use perFile mode
{
  usage: "perFile",
  // Then split with text splitter
}

Web Scraping

// Use appropriate scraper:
// - Cheerio: Fast, static content
// - Puppeteer: JavaScript-heavy sites
// - Playwright: Modern complex sites
// - FireCrawl: AI-powered, best quality

Rate Limiting

// Limit crawled pages
{
  limit: 10, // Don't crawl entire site
  timeout: 30000 // 30 second timeout
}

Error Handling

Loaders gracefully handle errors:
// Invalid URLs or files are skipped
// Check logs for details:
// "Error loading document: ..."

Troubleshooting

PDF Issues

// Try legacy build for problematic PDFs
{
  legacyBuild: true
}

Web Scraping Fails

// Use Puppeteer for JavaScript sites
// Add wait selector:
{
  waitForSelector: ".content-loaded"
}

File Not Found

Error: FILE-STORAGE:: file not found
Upload the file through the Flowise UI first.

Connection Timeout

{
  timeout: 60000 // Increase timeout
}

Memory Issues

// Split large documents
{
  textSplitter: textSplitterNode,
  chunkSize: 1000,
  chunkOverlap: 200
}
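The effect of chunkSize and chunkOverlap can be sketched with a plain character-based splitter. Real text splitters prefer natural boundaries like sentences and separators, so this is illustrative only, but the size/overlap arithmetic is the same idea (assumes chunkOverlap < chunkSize):

```javascript
// Illustrative character-based splitter: each chunk starts
// (chunkSize - chunkOverlap) characters after the previous one,
// so adjacent chunks share chunkOverlap characters.
function splitText(text, chunkSize, chunkOverlap) {
  const chunks = []
  const step = chunkSize - chunkOverlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length) break
  }
  return chunks
}

const chunks = splitText('a'.repeat(2500), 1000, 200)
// three chunks; adjacent chunks share 200 characters
```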

Performance Tips

  1. Use appropriate loader - Cheerio is fastest for static sites
  2. Limit crawling - Set reasonable limits
  3. Cache results - Don’t re-scrape unchanged content
  4. Split documents - Break large documents into chunks
  5. Filter content - Use CSS selectors to extract relevant parts
  6. Batch operations - Load multiple files efficiently

Next Steps

Text Splitters

Split documents into chunks

Embeddings

Generate embeddings from loaded documents

Vector Stores

Store loaded documents in vector databases
