Overview
This loader uses the LangChain CheerioWebBaseLoader to fetch and parse HTML content from web pages. It supports single-page scraping, web crawling, and sitemap-based extraction.
What is Cheerio?
Cheerio is a fast, flexible HTML parsing library for Node.js. It implements a subset of jQuery’s API, making it easy to traverse and manipulate HTML documents with familiar CSS selectors.
Configuration
Basic Parameters
The URL of the webpage to scrape. Must be a valid HTTP or HTTPS URL.
Example: https://docs.example.com/getting-started
Optional text splitter to chunk the extracted content into smaller pieces.
Advanced Parameters
Method to retrieve and process multiple related pages:
- Web Crawl
- Scrape XML Sitemap
Crawls relative links found in the HTML content of the specified URL.
- Follows <a href> tags
- Stays within the same domain
- Respects the specified limit
Maximum number of pages to scrape when using relative links method.
- Set to 0 to scrape all discovered links (use with caution)
- Default is 10 pages
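The crawl rules above (follow links, stay on the same domain, respect the limit, treat 0 as "no cap") can be sketched as a small filter. This is only an illustration of the described behavior, not the loader's actual implementation: `selectCrawlTargets` and its parameters are hypothetical names, and resolving each href against the base URL before comparing hostnames is an assumption.

```typescript
// Hypothetical helper, not part of the loader's API.
// Resolves hrefs against the base URL, keeps same-domain HTTP(S) links,
// removes duplicates, and stops at `limit` (0 means no cap).
function selectCrawlTargets(baseUrl: string, hrefs: string[], limit: number): string[] {
  const base = new URL(baseUrl);
  const seen = new Set<string>();

  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, base);
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue;
    if (resolved.hostname !== base.hostname) continue; // stay within the same domain

    resolved.hash = ""; // "page" and "page#section" count as one page
    seen.add(resolved.href);

    if (limit > 0 && seen.size >= limit) break; // respect the specified limit
  }
  return Array.from(seen);
}
```

With a limit of 0 the loop never stops early, which is why that setting should be used with caution on large sites.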
CSS selector to extract specific content from the page.
Examples:
- article: Extract content within <article> tags
- .content: Extract elements with class “content”
- #main-content: Extract element with ID “main-content”
- div.post-body p: Extract paragraphs within div.post-body
If not specified, the entire page body will be extracted.
Additional metadata to attach to all extracted documents.
Comma-separated list of metadata keys to exclude from the output.
Use * to omit all default metadata keys except those in Additional Metadata.
Output
The web scraper provides two output formats:
- Document
- Text
Returns an array of document objects with metadata and page content.
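To make the interaction between Additional Metadata, the omit list, and the Document output concrete, here is a minimal sketch. The `pageContent` and `metadata` field names follow LangChain's Document shape; `buildMetadata` is a hypothetical helper, and the assumption that listed keys are removed from the merged metadata (while * keeps only Additional Metadata) is an interpretation of the descriptions above, not confirmed behavior.

```typescript
type Metadata = Record<string, unknown>;

interface ScrapedDocument {
  pageContent: string;
  metadata: Metadata;
}

// Hypothetical helper: combines default and additional metadata, then applies
// the comma-separated omit list. "*" drops every default key.
function buildMetadata(defaults: Metadata, additional: Metadata, omitKeys: string): Metadata {
  const omit = omitKeys
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);

  if (omit.indexOf("*") !== -1) {
    return { ...additional }; // keep only the Additional Metadata entries
  }

  const merged: Metadata = { ...defaults, ...additional };
  for (const key of omit) {
    delete merged[key];
  }
  return merged;
}

// Example document with the default "title" key omitted.
const exampleDocument: ScrapedDocument = {
  pageContent: "Getting started ...",
  metadata: buildMetadata(
    { source: "https://docs.example.com/getting-started", title: "Getting Started" },
    { team: "docs" },
    "title",
  ),
};
```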
Usage Examples
Single Page Scraping
Crawling Multiple Pages
- Start at the base URL
- Discover up to 25 relative links
- Extract only content within the .markdown-body class from each page
Using XML Sitemap
Extracting Specific Content
- Blog Posts
- Documentation
- Product Descriptions
- News Articles
Common Use Cases
Documentation Indexing
Scrape and index entire documentation sites for AI-powered search
Content Monitoring
Periodically scrape pages to monitor content changes
Knowledge Base Creation
Build knowledge bases from web content for RAG applications
Competitive Analysis
Extract competitor information for analysis (respect robots.txt)
Limitations
Troubleshooting
Empty or missing content
Possible causes:
- The page uses JavaScript to render content (use Puppeteer loader instead)
- CSS selector is incorrect or too specific
- Website blocks scrapers (check user agent requirements)
Solutions:
- Inspect the page HTML to verify the selector
- Remove the selector to get all content first
- Try the Puppeteer Web Scraper for JS-rendered pages
No relative links found
Possible causes:
- The page at the URL contains no links to other pages
- Links use absolute URLs to different domains
- Sitemap URL is incorrect
Solutions:
- Verify the page contains <a> tags with relative hrefs
- If using sitemap method, ensure the URL points to a valid XML sitemap
- Check browser DevTools to see the page structure
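One quick way to check that a sitemap URL really returns a usable XML sitemap is to look for its <loc> entries. The sketch below does this with a regular expression on already-fetched XML text. It is a simplification for illustration: `extractSitemapUrls` is a hypothetical name, the loader's own parsing may differ, and a real XML parser would be needed to handle CDATA sections, escaped characters such as &amp;, and nested sitemap indexes.

```typescript
// Hypothetical helper: pulls page URLs out of sitemap XML text.
// Returns at most `limit` URLs (0 means no cap).
function extractSitemapUrls(xml: string, limit: number): string[] {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const urls: string[] = [];

  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
    if (limit > 0 && urls.length >= limit) break;
  }
  return urls;
}
```

An empty result for a document that was expected to be a sitemap suggests the URL points to an HTML page or an error response instead.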
Scraping takes too long
Solutions:
- Reduce the limit parameter to fewer pages
- Use XML sitemap method instead of web crawl
- Add more specific CSS selectors to reduce content size
- Implement caching for frequently accessed pages
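The caching suggestion can take many forms. As one possible sketch, the in-memory cache below stores scraped content per URL and expires entries after a fixed time-to-live. `PageCache`, its TTL policy, and the injectable clock are all illustrative assumptions; the scraper itself is not documented here as providing a cache.

```typescript
// Hypothetical in-memory cache keyed by URL with a fixed time-to-live.
class PageCache {
  private entries = new Map<string, { content: string; storedAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns cached content, or undefined when missing or expired.
  get(url: string): string | undefined {
    const entry = this.entries.get(url);
    if (entry === undefined) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(url); // expired: drop it so the caller re-scrapes
      return undefined;
    }
    return entry.content;
  }

  set(url: string, content: string): void {
    this.entries.set(url, { content, storedAt: this.now() });
  }
}
```

A caller would check `get(url)` before scraping and call `set(url, content)` afterwards, so repeated requests for the same page within the TTL skip the network entirely.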
Invalid URL error
Ensure your URL:
- Starts with http:// or https://
- Is properly formatted with no spaces
- Points to an accessible webpage
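The first two checks in this list can be automated before a URL is handed to the scraper. The sketch below uses the standard URL class; `isScrapableUrl` is a hypothetical name, and it does not test whether the page is actually reachable, which would require a network request.

```typescript
// Hypothetical pre-flight check: scheme, whitespace, and basic parseability.
function isScrapableUrl(input: string): boolean {
  if (!/^https?:\/\//i.test(input)) return false; // must start with http:// or https://
  if (/\s/.test(input)) return false; // no spaces or other whitespace

  try {
    const parsed = new URL(input);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false; // the URL constructor throws on malformed input
  }
}
```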
Best Practices
Web Scraping Ethics
- Respect robots.txt: Check the website’s robots.txt file before scraping
- Rate Limiting: Use reasonable limits to avoid overwhelming servers
- Terms of Service: Ensure scraping complies with the website’s ToS
- Attribution: Keep source metadata when using scraped content
Comparison with Other Web Scrapers
| Feature | Cheerio | Puppeteer | Playwright |
|---|---|---|---|
| Speed | ⚡ Fastest | 🐢 Slower | 🐢 Slower |
| JavaScript Support | ❌ No | ✅ Yes | ✅ Yes |
| Resource Usage | 💚 Low | 🔴 High | 🔴 High |
| Best For | Static HTML | Dynamic sites | Cross-browser testing |
Related Resources
Puppeteer Scraper
For JavaScript-rendered pages
Vector Stores
Store scraped content for retrieval
Document Loaders
Explore other loader types