The Cheerio Web Scraper allows you to extract content from web pages using CSS selectors. It’s a fast and efficient way to load data from websites into your Flowise workflows.

Overview

This loader uses the LangChain CheerioWebBaseLoader to fetch and parse HTML content from web pages. It supports single page scraping, web crawling, and sitemap-based extraction.
What is Cheerio? Cheerio is a fast, flexible HTML parsing library for Node.js. It implements a subset of jQuery’s API, making it easy to traverse and manipulate HTML documents with familiar CSS selectors.

Configuration

Basic Parameters

url
string
required
The URL of the webpage to scrape. Must be a valid HTTP or HTTPS URL. Example: https://docs.example.com/getting-started
textSplitter
TextSplitter
Optional text splitter to chunk the extracted content into smaller pieces.

Advanced Parameters

relativeLinksMethod
string
Method used to retrieve and process multiple related pages. Supported values are webCrawl and scrapeXMLSitemap (see Usage Examples below).

The webCrawl option crawls relative links found in the HTML content of the specified URL:
  • Follows <a href> tags
  • Stays within the same domain
  • Respects the specified limit
limit
number
default: 10
Maximum number of pages to scrape when using relative links method.
  • Set to 0 to scrape all discovered links (use with caution)
  • Default is 10 pages
Retrieving all links might take a long time, and all links will be re-scraped if the flow’s state changes (e.g., a different URL or chunk size).
selector
string
CSS selector to extract specific content from the page. Examples:
  • article - Extract content within <article> tags
  • .content - Extract elements with class “content”
  • #main-content - Extract element with ID “main-content”
  • div.post-body p - Extract paragraphs within div.post-body
If not specified, the entire page body will be extracted.
metadata
json
Additional metadata to attach to all extracted documents.
{
  "source_type": "documentation",
  "scrape_date": "2024-01-15",
  "category": "tutorials"
}
omitMetadataKeys
string
Comma-separated list of metadata keys to exclude from the output. Use * to omit all default metadata keys except those in Additional Metadata.
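How an omit list could be applied to a document’s metadata can be sketched as follows; applyOmitKeys is a hypothetical helper for illustration, not Flowise’s internal implementation:

```typescript
// Illustrative sketch: apply a comma-separated omit list (with "*" wildcard)
// to default metadata, then layer the additional metadata on top.
function applyOmitKeys(
  metadata: Record<string, unknown>,
  omitMetadataKeys: string,
  additionalMetadata: Record<string, unknown> = {},
): Record<string, unknown> {
  const omit = omitMetadataKeys.split(",").map((k) => k.trim()).filter(Boolean);
  if (omit.includes("*")) {
    // "*" drops every default key, keeping only the additional metadata.
    return { ...additionalMetadata };
  }
  const result: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(metadata)) {
    if (!omit.includes(key)) result[key] = value;
  }
  return { ...result, ...additionalMetadata };
}

const cleaned = applyOmitKeys(
  { source: "https://docs.example.com", loc: { lines: { from: 1, to: 1 } } },
  "loc",
  { category: "tutorials" },
);
```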

Output

The web scraper provides two output formats:
Document: an array of document objects with metadata and page content.
Text: the extracted page content concatenated into a single string.
[
  {
    "pageContent": "Getting Started\n\nWelcome to our documentation...",
    "metadata": {
      "source": "https://docs.example.com/getting-started",
      "loc": {
        "lines": {
          "from": 1,
          "to": 1
        }
      }
    }
  }
]
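The document shape shown above can be written as a TypeScript interface; the field names come from the example output, and the index signature allows any extra keys added via Additional Metadata:

```typescript
// Shape of each document the loader emits (field names taken from the example above).
interface ScrapedDocument {
  pageContent: string;
  metadata: {
    source: string;
    loc?: { lines: { from: number; to: number } };
    [key: string]: unknown; // room for additional metadata keys
  };
}

const doc: ScrapedDocument = {
  pageContent: "Getting Started\n\nWelcome to our documentation...",
  metadata: { source: "https://docs.example.com/getting-started" },
};
```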

Usage Examples

Single Page Scraping

1. Add Web Scraper Node

Drag the Cheerio Web Scraper node onto your canvas.

2. Enter URL

Provide the URL of the page you want to scrape.
https://flowise.ai/docs/getting-started

3. Optional: Add CSS Selector

If you only need specific content, add a CSS selector:
article.documentation

4. Connect to Processing Nodes

Connect the output to vector stores, LLMs, or other nodes.

Crawling Multiple Pages

{
  "url": "https://docs.example.com",
  "relativeLinksMethod": "webCrawl",
  "limit": 25,
  "selector": ".markdown-body"
}
This configuration will:
  1. Start at the base URL
  2. Discover up to 25 relative links
  3. Extract only content within .markdown-body class from each page
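The crawl behavior described above (resolve relative hrefs, stay on the same domain, respect the limit) can be sketched with the built-in URL class. sameDomainLinks is an illustrative helper, not Flowise’s actual crawler:

```typescript
// Sketch of the same-domain link filter a web crawl applies to discovered hrefs.
function sameDomainLinks(baseUrl: string, hrefs: string[], limit: number): string[] {
  const base = new URL(baseUrl);
  const seen = new Set<string>();
  for (const href of hrefs) {
    let resolved: URL | null = null;
    try {
      resolved = new URL(href, base); // resolves relative hrefs against the base
    } catch {
      // skip malformed hrefs
    }
    if (!resolved) continue;
    if (resolved.hostname !== base.hostname) continue; // stay within the same domain
    seen.add(resolved.href);
    if (limit > 0 && seen.size >= limit) break; // respect the limit (0 = unlimited)
  }
  return [...seen];
}

const links = sameDomainLinks("https://docs.example.com/start", [
  "/guide",                       // relative: resolved and kept
  "https://docs.example.com/api", // same domain: kept
  "https://other.com/x",          // different domain: dropped
], 10);
```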

Using XML Sitemap

{
  "url": "https://docs.example.com/sitemap.xml",
  "relativeLinksMethod": "scrapeXMLSitemap",
  "limit": 50
}
Using an XML sitemap is typically faster and more reliable than web crawling, especially for large documentation sites.
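What sitemap scraping does boils down to collecting the <loc> entries from the sitemap XML. A real loader would use an XML parser; this regex-based extractSitemapUrls is only an illustration:

```typescript
// Rough sketch: pull page URLs out of an XML sitemap's <loc> entries,
// stopping at the configured limit.
function extractSitemapUrls(sitemapXml: string, limit = 50): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*(.*?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(sitemapXml)) !== null && urls.length < limit) {
    urls.push(match[1]);
  }
  return urls;
}

const sitemap = `<?xml version="1.0"?>
<urlset>
  <url><loc>https://docs.example.com/getting-started</loc></url>
  <url><loc>https://docs.example.com/api</loc></url>
</urlset>`;
const pages = extractSitemapUrls(sitemap);
```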

Extracting Specific Content

article.blog-post
Extracts content from blog post articles only.

Common Use Cases

Documentation Indexing

Scrape and index entire documentation sites for AI-powered search

Content Monitoring

Periodically scrape pages to monitor content changes

Knowledge Base Creation

Build knowledge bases from web content for RAG applications

Competitive Analysis

Extract competitor information for analysis (respect robots.txt)

Limitations

Important Limitations
  • No JavaScript Execution: Cheerio only parses static HTML. For JavaScript-heavy sites, use Puppeteer or Playwright loaders instead.
  • PDF Files: The loader skips PDF URLs automatically. Use the PDF loader for PDF content.
  • Authentication: Does not support pages requiring login. Use authenticated API loaders for protected content.

Troubleshooting

No content extracted

Possible causes:
  1. The page uses JavaScript to render content (use the Puppeteer loader instead)
  2. The CSS selector is incorrect or too specific
  3. The website blocks scrapers (check user agent requirements)
Solutions:
  • Inspect the page HTML to verify the selector
  • Remove the selector to get all content first
  • Try the Puppeteer Web Scraper for JS-rendered pages

Slow scraping or timeouts

Solutions:
  • Reduce the limit parameter to fewer pages
  • Use the XML sitemap method instead of web crawl
  • Add more specific CSS selectors to reduce content size
  • Implement caching for frequently accessed pages

Invalid URL errors

Ensure your URL:
  • Starts with http:// or https://
  • Is properly formatted with no spaces
  • Points to an accessible webpage
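The URL checklist above can be encoded with the built-in URL class; isValidScrapeUrl is an illustrative helper for pre-flight validation:

```typescript
// Quick URL sanity check: http(s) scheme, no spaces, parseable.
function isValidScrapeUrl(input: string): boolean {
  if (/\s/.test(input)) return false; // no spaces allowed
  try {
    const url = new URL(input);
    return url.protocol === "http:" || url.protocol === "https:"; // http(s) only
  } catch {
    return false; // not a parseable URL at all
  }
}
```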

Best Practices

Web Scraping Ethics
  1. Respect robots.txt: Check website’s robots.txt file before scraping
  2. Rate Limiting: Use reasonable limits to avoid overwhelming servers
  3. Terms of Service: Ensure scraping complies with website ToS
  4. Attribution: Keep source metadata when using scraped content
Optimization Tips
  • Use specific CSS selectors to reduce noise and improve quality
  • Combine with text splitters for better chunking
  • Add descriptive metadata to improve retrieval accuracy
  • Cache frequently accessed pages to reduce API calls

Comparison with Other Web Scrapers

Feature              Cheerio        Puppeteer        Playwright
Speed                ⚡ Fastest      🐢 Slower        🐢 Slower
JavaScript Support   ❌ No           ✅ Yes           ✅ Yes
Resource Usage       💚 Low          🔴 High          🔴 High
Best For             Static HTML    Dynamic sites    Cross-browser testing

Related

  • Puppeteer Scraper: for JavaScript-rendered pages
  • Vector Stores: store scraped content for retrieval
  • Document Loaders: explore other loader types
