File Formats
PDF
Extract text from PDF documents
Microsoft Word
Load .docx and .doc files
Microsoft Excel
Parse Excel spreadsheets
Microsoft PowerPoint
Extract text from presentations
CSV
Load CSV data files
JSON
Parse JSON files
JSON Lines
Load JSONL format files
Text
Plain text files
EPUB
Load EPUB ebooks
DOCX
Microsoft Word documents
Web Scraping
Cheerio
Fast HTML scraping
Puppeteer
Headless Chrome scraping
Playwright
Modern web scraping
FireCrawl
AI-powered web scraping
Spider
Web crawling service
Cloud Storage
AWS S3
Load files from S3 buckets
Google Drive
Access Google Drive files
Azure Blob Storage
Load from Azure storage
Databases & SaaS
Notion
Load Notion pages and databases
Airtable
Import Airtable data
Confluence
Load Confluence pages
Google Sheets
Access Google Sheets data
GitHub
Load GitHub repositories
GitBook
Import GitBook documentation
Jira
Load Jira issues
Figma
Extract Figma designs
Search & APIs
Brave Search
Search the web with Brave
SerpAPI
Google search results
SearchAPI
Multi-engine search API
API Loader
Load from REST APIs
Special Loaders
Folder
Load all files from a directory
File
Generic file loader
Unstructured
Universal document loader
Plain Text
Direct text input
Vector Store to Document
Load from existing vector stores
Document Store
Load from document storage
Custom Loader
Build custom loaders
Apify Website Crawler
Apify web crawler
Oxylabs
Web scraping service
Configuration Examples
PDF File
- perPage - One document per page
- perFile - One document for entire PDF
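The difference between the two PDF modes can be sketched as follows. This is a minimal illustration, not the loader's actual implementation: the `Document` class, the `pages_to_documents` function, and the metadata keys are hypothetical names, and the page texts are assumed to have been extracted already.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container: extracted text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(pages: list[str], source: str, usage: str = "perPage") -> list[Document]:
    """Turn already-extracted page texts into documents.

    usage="perPage" -> one Document per page, with the page number in metadata
    usage="perFile" -> a single Document holding the whole PDF
    """
    if usage == "perPage":
        return [
            Document(text=page, metadata={"source": source, "page": number})
            for number, page in enumerate(pages, start=1)
        ]
    if usage == "perFile":
        merged = "\n\n".join(pages)
        return [Document(text=merged, metadata={"source": source, "pages": len(pages)})]
    raise ValueError(f"unknown usage mode: {usage!r}")
```

Per-page output keeps page numbers available for citations, while per-file output avoids breaking sentences that run across a page boundary.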
Cheerio Web Scraper
- Web crawling for relative links
- XML sitemap scraping
- CSS selector filtering
- Automatic link discovery
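The relative-link crawling and link discovery listed above can be illustrated with a small sketch. Cheerio itself is a JavaScript library; the Python code below only shows the general idea using the standard library, and `LinkCollector`, `discover_links`, and the `limit` parameter are hypothetical names rather than loader options.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(html: str, base_url: str, limit: int = 10) -> list[str]:
    """Resolve relative links against base_url and keep same-host pages only."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    found: list[str] = []
    for href in collector.hrefs:
        # Make the link absolute and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc != host:
            continue  # skip external sites and non-HTTP links such as mailto:
        if absolute not in found:
            found.append(absolute)
        if len(found) >= limit:
            break
    return found
```

Capping the number of discovered links is what keeps a crawl of a large site bounded.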
Puppeteer Web Scraper
- JavaScript-heavy websites
- Dynamic content loading
- Pages requiring authentication
- Complex interactions
Notion
- Create Notion integration at notion.so/my-integrations
- Get API key (Internal Integration Token)
- Share page/database with integration
- Copy page/database ID from URL
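For the last step, the ID is the 32-character hexadecimal string at the end of the URL path. A minimal sketch of extracting it is shown below; `notion_id_from_url` is a hypothetical helper and the URLs used with it are made-up examples.

```python
import re
import uuid
from urllib.parse import urlparse

# The path of a Notion page or database URL ends with 32 hexadecimal characters.
_NOTION_ID = re.compile(r"[0-9a-f]{32}$", re.IGNORECASE)


def notion_id_from_url(url: str) -> str:
    """Return the page/database ID from a Notion URL in dashed UUID form."""
    # urlparse drops the query string (for example "?v=..." on database views).
    path = urlparse(url).path.rstrip("/")
    # Removing dashes handles both "Title-<id>" slugs and already-dashed IDs.
    match = _NOTION_ID.search(path.replace("-", ""))
    if match is None:
        raise ValueError(f"no Notion ID found in {url!r}")
    return str(uuid.UUID(hex=match.group(0)))
```

The dashed form is returned because it is unambiguous; the undashed 32-character form identifies the same page or database.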
AWS S3
- AWS Access Key ID
- AWS Secret Access Key
- AWS Region
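One way to keep these three values out of source code is to read them from the environment and fail early when one is missing. The sketch below is an assumption about how that could look, not the loader's configuration code: `S3Credentials` is a hypothetical class, and `AWS_REGION` is assumed as the region variable (some tools read `AWS_DEFAULT_REGION` instead).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class S3Credentials:
    """Hypothetical holder for the three settings listed above."""
    access_key_id: str
    secret_access_key: str
    region: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "S3Credentials":
        # Map each field to the environment variable it is read from.
        required = {
            "access_key_id": "AWS_ACCESS_KEY_ID",
            "secret_access_key": "AWS_SECRET_ACCESS_KEY",
            "region": "AWS_REGION",  # assumption: may be AWS_DEFAULT_REGION in your setup
        }
        missing = [name for name in required.values() if not env.get(name)]
        if missing:
            raise ValueError("missing S3 settings: " + ", ".join(missing))
        return cls(**{attr: env[name] for attr, name in required.items()})
```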
Google Drive
- Create project in Google Cloud Console
- Enable Google Drive API
- Create OAuth 2.0 credentials
- Get file/folder ID from Drive URL
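For the last step, the ID sits either in the URL path (after `/file/d/` or `/folders/`) or in an `id` query parameter. A minimal sketch of pulling it out follows; `drive_id_from_url` is a hypothetical helper, the IDs in the usage are made up, and the set of URL shapes covered is an assumption that may not be exhaustive.

```python
import re
from urllib.parse import parse_qs, urlparse

# Path shapes such as /file/d/<id>/view, /drive/folders/<id>, /document/d/<id>/edit
_PATH_ID = re.compile(
    r"/(?:file/d|folders|document/d|spreadsheets/d|presentation/d)/([A-Za-z0-9_-]+)"
)


def drive_id_from_url(url: str) -> str:
    """Return the file or folder ID embedded in a Google Drive URL."""
    parsed = urlparse(url)
    match = _PATH_ID.search(parsed.path)
    if match:
        return match.group(1)
    # Older share links carry the ID as a query parameter: .../open?id=<id>
    ids = parse_qs(parsed.query).get("id")
    if ids:
        return ids[0]
    raise ValueError(f"no Drive ID found in {url!r}")
```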
Microsoft Excel
CSV
JSON
Folder Loader
API Loader
FireCrawl
- scrape - Single page
- crawl - Multiple pages with depth
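The difference between the two modes can be sketched as a depth-limited traversal. This is not FireCrawl's API: `collect_urls`, `get_links`, and `max_depth` are hypothetical names, and link extraction is passed in as a function so the example runs without network access.

```python
from collections import deque
from typing import Callable


def collect_urls(
    start_url: str,
    mode: str,
    get_links: Callable[[str], list[str]],
    max_depth: int = 1,
) -> list[str]:
    """Return the URLs a loader would visit in "scrape" or "crawl" mode."""
    if mode == "scrape":
        return [start_url]  # single page, links are never followed
    if mode != "crawl":
        raise ValueError(f"unknown mode: {mode!r}")

    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:  # breadth-first, so pages nearest the start come first
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order
```

A usage example with an in-memory site map standing in for real pages:

```python
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
urls = collect_urls("/", "crawl", lambda url: site.get(url, []), max_depth=1)
```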
GitHub
Confluence
Airtable
Advanced Features
Text Splitting
All loaders support text splitting.
Metadata Enrichment
Add custom metadata to all documents.
Metadata Filtering
Omit default metadata keys.
Multiple Files
Load multiple files at once.
Output Formats
Choose output format.
Relative Links Crawling
Web loaders can crawl related pages.
Unstructured API
Load any document type:
- hi_res - Best quality, slower
- fast - Quick extraction
- ocr_only - Image OCR
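The text splitting, metadata enrichment, and metadata filtering features in this section can be sketched together as follows. This is a minimal illustration under stated assumptions rather than the loaders' real code: `split_text`, `build_documents`, and the parameters `chunk_size`, `chunk_overlap`, `extra_metadata`, and `omit_keys` are hypothetical names, and splitting is done on raw character counts.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def build_documents(text, base_metadata, extra_metadata=None, omit_keys=(), **split_options):
    """Split text, then enrich and filter the metadata of every chunk."""
    documents = []
    for index, chunk in enumerate(split_text(text, **split_options)):
        # Enrichment: custom keys are merged over the loader's default keys.
        metadata = {**base_metadata, **(extra_metadata or {}), "chunk": index}
        # Filtering: drop any keys the caller does not want to keep.
        for key in omit_keys:
            metadata.pop(key, None)
        documents.append({"text": chunk, "metadata": metadata})
    return documents
```

The overlap repeats the tail of each chunk at the head of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.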
Best Practices
File Size Limits
Web Scraping
Rate Limiting
Error Handling
Loaders gracefully handle errors.
Troubleshooting
PDF Issues
Web Scraping Fails
File Not Found
Connection Timeout
Memory Issues
Performance Tips
- Use appropriate loader - Cheerio is fastest for static sites
- Limit crawling - Set reasonable limits
- Cache results - Don’t re-scrape unchanged content
- Split documents - Break large documents into chunks
- Filter content - Use CSS selectors to extract relevant parts
- Batch operations - Load multiple files efficiently
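The caching tip above can be sketched as a small time-based cache placed in front of whatever function fetches a page. `CachedFetcher`, `ttl_seconds`, and the injected `fetch` and `clock` callables are hypothetical names chosen for this illustration; a real setup might instead rely on HTTP validators such as `ETag`.

```python
import time
from typing import Callable


class CachedFetcher:
    """Re-uses a fetched page until it is older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}  # url -> (fetched_at, content)

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the network call
        content = self._fetch(url)
        self._entries[url] = (now, content)
        return content
```

Injecting the clock keeps the class testable without waiting for real time to pass.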
Next Steps
Text Splitters
Split documents into chunks
Embeddings
Generate embeddings from loaded documents
Vector Stores
Store loaded documents in vector databases
