Document objects. Each document contains pageContent (the text) and metadata (information about the document).
Installation
File System Loaders
Load documents from local files or file systems.
CSV Loader
Load CSV files with each row as a document. Module: @langchain/community/document_loaders/fs/csv
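To make "each row as a document" concrete, here is a minimal sketch that does not use the library itself. `rowsToDocuments` and the `Doc` type are hypothetical stand-ins, the comma split is deliberately naive (no quoted fields), and the real loader's parsing and output format may differ.

```typescript
// Hypothetical stand-in for the library's document shape.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Turn CSV text into one Doc per data row.
// Assumption: naive comma splitting, no quoted or escaped fields.
function rowsToDocuments(csv: string, source: string): Doc[] {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const headerLine = lines[0] ?? "";
  const headers = headerLine.split(",").map((h) => h.trim());
  return lines.slice(1).map((line, index) => {
    const cells = line.split(",").map((c) => c.trim());
    // Render the row as "column: value" pairs, one per line.
    const pageContent = headers
      .map((header, i) => `${header}: ${cells[i] ?? ""}`)
      .join("\n");
    return { pageContent, metadata: { source, line: index + 1 } };
  });
}

const csvDocs = rowsToDocuments(
  "name,role\nAda,engineer\nLin,designer",
  "team.csv"
);
console.log(csvDocs.length);
```

Each data row becomes its own document, with the row number kept in metadata so a result can be traced back to the file.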
PDF Loader
Load PDF files and extract text content. Module: @langchain/community/document_loaders/fs/pdf
Requires: pdf-parse
DOCX Loader
Load Microsoft Word documents. Module: @langchain/community/document_loaders/fs/docx
Requires: mammoth
EPUB Loader
Load EPUB ebook files. Module: @langchain/community/document_loaders/fs/epub
Requires: epub2
PPTX Loader
Load PowerPoint presentations. Module: @langchain/community/document_loaders/fs/pptx
Requires: officeparser
SRT Loader
Load subtitle files (SRT format). Module: @langchain/community/document_loaders/fs/srt
Requires: srt-parser-2
Notion Loader
Load exported Notion pages from the file system. Module: @langchain/community/document_loaders/fs/notion
Obsidian Loader
Load Obsidian markdown vaults. Module: @langchain/community/document_loaders/fs/obsidian
ChatGPT Loader
Load ChatGPT conversation exports. Module: @langchain/community/document_loaders/fs/chatgpt
OpenAI Whisper Audio Loader
Transcribe audio files using OpenAI’s Whisper API. Module: @langchain/community/document_loaders/fs/openai_whisper_audio
Requires: OpenAI API key
Unstructured Loader
Load various file types using the Unstructured API. Module: @langchain/community/document_loaders/fs/unstructured
Web Loaders
Load documents from web pages and online sources.
Cheerio Web Loader
Load and parse HTML from URLs using Cheerio. Module: @langchain/community/document_loaders/web/cheerio
Requires: cheerio
Puppeteer Loader
Load web pages using Puppeteer (headless Chrome). Module: @langchain/community/document_loaders/web/puppeteer
Requires: puppeteer
Playwright Loader
Load web pages using Playwright. Module: @langchain/community/document_loaders/web/playwright
Requires: playwright
FireCrawl Loader
Load web pages using the FireCrawl API for advanced web scraping. Module: @langchain/community/document_loaders/web/firecrawl
Requires: @mendable/firecrawl-js
Recursive URL Loader
Recursively load pages from a website. Module: @langchain/community/document_loaders/web/recursive_url
Sitemap Loader
Load pages from a sitemap.xml. Module: @langchain/community/document_loaders/web/sitemap
HTML Loader
Load HTML content with customizable parsing. Module: @langchain/community/document_loaders/web/html
Requires: html-to-text
Web PDF Loader
Load PDFs from URLs. Module: @langchain/community/document_loaders/web/pdf
Cloud Storage Loaders
Load documents from cloud storage services.
S3 Loader
Load files from Amazon S3. Module: @langchain/community/document_loaders/web/s3
Requires: @aws-sdk/client-s3
Azure Blob Storage Loader
Load files from Azure Blob Storage. Modules: @langchain/community/document_loaders/web/azure_blob_storage_file, @langchain/community/document_loaders/web/azure_blob_storage_container
Requires: @azure/storage-blob
Google Cloud Storage Loader
Load files from Google Cloud Storage. Module: @langchain/community/document_loaders/web/google_cloud_storage
Requires: @google-cloud/storage
API & Service Loaders
Load documents from various APIs and services.
GitHub Loader
Load files from GitHub repositories. Module: @langchain/community/document_loaders/web/github
Notion API Loader
Load pages from Notion using the official API. Module: @langchain/community/document_loaders/web/notionapi
Requires: @notionhq/client, notion-to-md
Confluence Loader
Load pages from Atlassian Confluence. Module: @langchain/community/document_loaders/web/confluence
Jira Loader
Load issues from Atlassian Jira. Module: @langchain/community/document_loaders/web/jira
GitBook Loader
Load documentation from GitBook. Module: @langchain/community/document_loaders/web/gitbook
Figma Loader
Load designs from Figma. Module: @langchain/community/document_loaders/web/figma
Airtable Loader
Load data from Airtable bases. Module: @langchain/community/document_loaders/web/airtable
YouTube Loader
Load transcripts from YouTube videos. Module: @langchain/community/document_loaders/web/youtube
Requires: youtubei.js
Apify Dataset Loader
Load data from Apify datasets. Module: @langchain/community/document_loaders/web/apify_dataset
Requires: apify-client
AssemblyAI Loader
Transcribe audio and video using AssemblyAI. Module: @langchain/community/document_loaders/web/assemblyai
Requires: assemblyai
Sonix Audio Loader
Transcribe audio using Sonix. Module: @langchain/community/document_loaders/web/sonix_audio
Requires: sonix-speech-recognition
Browserbase Loader
Load web pages using Browserbase headless browsers. Module: @langchain/community/document_loaders/web/browserbase
Requires: @browserbasehq/sdk
Spider Loader
Load web content using the Spider API. Module: @langchain/community/document_loaders/web/spider
Requires: @spider-cloud/spider-client
Database Loaders
Couchbase Loader
Load documents from Couchbase. Module: @langchain/community/document_loaders/web/couchbase
Requires: couchbase
Search API Loaders
SerpAPI Loader
Load search results from SerpAPI. Module: @langchain/community/document_loaders/web/serpapi
SearchAPI Loader
Load search results from SearchAPI. Module: @langchain/community/document_loaders/web/searchapi
Hacker News Loader
Load stories and comments from Hacker News. Module: @langchain/community/document_loaders/web/hn
Specialized Loaders
College Confidential Loader
Load content from College Confidential forums. Module: @langchain/community/document_loaders/web/college_confidential
IMSDb Loader
Load movie scripts from IMSDb. Module: @langchain/community/document_loaders/web/imsdb
Taskade Loader
Load tasks and projects from Taskade. Module: @langchain/community/document_loaders/web/taskade
Sort.xyz Blockchain Loader
Load blockchain data from Sort.xyz. Module: @langchain/community/document_loaders/web/sort_xyz_blockchain
Usage Patterns
Basic Loading
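A minimal sketch of the basic pattern: construct a loader, then await `load()` to get an array of documents. `InMemoryTextLoader` and the `Doc` type are hypothetical stand-ins so the example runs without files or peer dependencies. With the real library you would import a loader class, for example `CSVLoader` from `@langchain/community/document_loaders/fs/csv`, and call `load()` in the same way.

```typescript
// Hypothetical stand-in for the library's document shape.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical loader: exposes an async load() like a real loader,
// but reads from a string instead of a file or URL.
class InMemoryTextLoader {
  private text: string;
  private source: string;

  constructor(text: string, source: string) {
    this.text = text;
    this.source = source;
  }

  async load(): Promise<Doc[]> {
    return [{ pageContent: this.text, metadata: { source: this.source } }];
  }
}

async function basicLoad(): Promise<Doc[]> {
  const loader = new InMemoryTextLoader("Hello, loaders.", "memory://greeting");
  const docs = await loader.load();
  for (const doc of docs) {
    // Each document carries its text plus metadata about where it came from.
    console.log(doc.metadata.source, doc.pageContent.length);
  }
  return docs;
}

basicLoad().then((docs) => console.log(`loaded ${docs.length} document(s)`));
```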
Lazy Loading
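As a sketch of the idea, a lazy loader can expose an async generator so documents are produced one at a time instead of being collected into an array first. `LazyLineLoader` and the `Doc` type are hypothetical stand-ins. The method is named `loadLazy()` to match the name this page uses; whether a given loader offers it should be checked against that loader's API.

```typescript
// Hypothetical stand-in for the library's document shape.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical loader that yields one document per input line.
class LazyLineLoader {
  private lines: string[];

  constructor(lines: string[]) {
    this.lines = lines;
  }

  // Nothing is produced until the caller asks for the next document.
  async *loadLazy(): AsyncGenerator<Doc> {
    for (let i = 0; i < this.lines.length; i++) {
      yield {
        pageContent: this.lines[i] ?? "",
        metadata: { loc: { line: i + 1 } },
      };
    }
  }
}

// Consume lazily and stop as soon as `limit` documents have been taken.
async function takeFirst(loader: LazyLineLoader, limit: number): Promise<Doc[]> {
  const taken: Doc[] = [];
  if (limit <= 0) return taken;
  for await (const doc of loader.loadLazy()) {
    taken.push(doc);
    if (taken.length >= limit) break;
  }
  return taken;
}

takeFirst(new LazyLineLoader(["first", "second", "third"]), 2).then((docs) =>
  console.log(docs.map((doc) => doc.pageContent))
);
```

Because the loop breaks early, the third line is never turned into a document, which is the memory benefit lazy loading is meant to provide.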
Many loaders support lazy loading for large datasets.
Custom Processing
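One way to customize processing is to subclass a loader and post-process what `load()` returns. The sketch below assumes a hypothetical `PlainTextLoader` base class standing in for any library loader; `CleanedTextLoader` and the `charCount` metadata key are likewise illustrative names, not part of the library.

```typescript
// Hypothetical stand-in for the library's document shape.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical base loader standing in for any library loader.
class PlainTextLoader {
  protected text: string;
  protected source: string;

  constructor(text: string, source: string) {
    this.text = text;
    this.source = source;
  }

  async load(): Promise<Doc[]> {
    return [{ pageContent: this.text, metadata: { source: this.source } }];
  }
}

// Subclass that collapses whitespace and records a character count.
class CleanedTextLoader extends PlainTextLoader {
  override async load(): Promise<Doc[]> {
    const docs = await super.load();
    return docs.map((doc) => {
      const pageContent = doc.pageContent.replace(/\s+/g, " ").trim();
      return {
        pageContent,
        // Keep the original metadata and add the derived field.
        metadata: { ...doc.metadata, charCount: pageContent.length },
      };
    });
  }
}

new CleanedTextLoader("  Hello   world \n", "memory://note")
  .load()
  .then((docs) => console.log(docs[0]));
```

Overriding `load()` keeps the customization in one place, so callers use the subclass exactly as they would the original loader.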
Extend loaders to customize document processing.
Document Structure
All loaders return documents with this structure: pageContent (the text) and metadata. Metadata fields can include:
- source: File path or URL
- loc: Location information (page, line, etc.)
- title: Document title
- author: Document author
- createdAt: Creation timestamp
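The structure described here can be written down as a TypeScript type. This is a sketch under assumptions: the names `LoadedDocument`, `DocMetadata`, and `describeOrigin` are hypothetical, every metadata field is modeled as optional because not every loader will set all of them, and the field types (for example `createdAt` as a string) are guesses rather than the library's definitions.

```typescript
// Assumed shape of the metadata fields listed in this section.
interface DocMetadata {
  source?: string; // file path or URL
  loc?: { page?: number; line?: number }; // assumption: numeric positions
  title?: string;
  author?: string;
  createdAt?: string; // assumption: an ISO 8601 timestamp string
  [key: string]: unknown; // loaders may add fields of their own
}

interface LoadedDocument {
  pageContent: string; // the text
  metadata: DocMetadata; // information about the document
}

const exampleDoc: LoadedDocument = {
  pageContent: "Chapter 1. Getting started.",
  metadata: {
    source: "docs/guide.pdf",
    loc: { page: 3 },
    title: "Guide",
  },
};

// Build a short human-readable origin string from whatever metadata exists.
function describeOrigin(doc: LoadedDocument): string {
  const where = doc.metadata.source ?? "unknown source";
  const page = doc.metadata.loc?.page;
  return page === undefined ? where : `${where} (page ${page})`;
}

console.log(describeOrigin(exampleDoc));
```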
Best Practices
- Install only what you need - Install peer dependencies for the loaders you use
- Handle errors - Wrap loader calls in try-catch blocks
- Use lazy loading - For large datasets, use loadLazy() to avoid memory issues
- Clean metadata - Remove sensitive information from metadata before storing
- Chunk large documents - Use text splitters after loading large documents
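Two of these practices, handling errors and cleaning metadata, can be sketched as small helpers. Everything here is illustrative: `LoaderLike`, `safeLoad`, and `cleanMetadata` are hypothetical names, and the list of blocked keys is an assumption you would replace with your own.

```typescript
// Hypothetical stand-in for the library's document shape.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical loader shape: anything with an async load().
interface LoaderLike {
  load(): Promise<Doc[]>;
}

// Wrap a loader call so one failing source does not stop the whole pipeline.
async function safeLoad(loader: LoaderLike, label: string): Promise<Doc[]> {
  try {
    return await loader.load();
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.warn(`Skipping ${label}: ${message}`);
    return [];
  }
}

// Return a copy of the document without the metadata keys that
// should not be stored (the caller decides which keys those are).
function cleanMetadata(doc: Doc, blockedKeys: string[]): Doc {
  const metadata: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(doc.metadata)) {
    if (!blockedKeys.includes(key)) metadata[key] = value;
  }
  return { pageContent: doc.pageContent, metadata };
}
```

Returning an empty array on failure is one design choice among several; rethrowing with added context is equally reasonable when a missing source should stop the run.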
Next Steps
- Text Splitters - Split documents into chunks
- Vector Stores - Store loaded documents
- Retrievers - Retrieve relevant documents
