Web Scraper (Cheerio)

The Cheerio Web Scraper allows you to extract content from web pages using CSS selectors. It’s a fast and efficient way to load data from websites into your Flowise workflows.

Overview

This loader uses the LangChain CheerioWebBaseLoader to fetch and parse HTML content from web pages. It supports single page scraping, web crawling, and sitemap-based extraction.

What is Cheerio? Cheerio is a fast, flexible HTML parsing library for Node.js. It implements a subset of jQuery’s API, making it easy to traverse and manipulate HTML documents with familiar CSS selectors.

Configuration

Basic Parameters

url

string

required

The URL of the webpage to scrape. Must be a valid HTTP or HTTPS URL. Example: https://docs.example.com/getting-started

textSplitter

TextSplitter

Optional text splitter to chunk the extracted content into smaller pieces.

Advanced Parameters

relativeLinksMethod

options

Method to retrieve and process multiple related pages:

Web Crawl

Crawls relative links found in the HTML content of the specified URL.

Follows <a href> tags
Stays within the same domain
Respects the specified limit

Scrape XML Sitemap

Extracts URLs from an XML sitemap.

Faster than web crawling
Requires a valid sitemap URL
Ideal for structured documentation sites

Example: https://docs.example.com/sitemap.xml
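As a rough illustration of the sitemap method, the sketch below pulls `<loc>` entries out of sitemap XML. The function name `extractSitemapUrls`, the regex-based parsing, and the treatment of a `limit` of 0 as "no limit" are assumptions made for this example; they are not taken from the loader's source.

```typescript
// Simplified sketch: read <loc> entries from sitemap XML with a regular
// expression. A production loader would use a real XML parser and would
// also decode XML entities such as &amp;, which this sketch does not.
function extractSitemapUrls(xml: string, limit: number): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    // Assumption: a limit of 0 (or less) means "return every URL".
    if (limit > 0 && urls.length >= limit) break;
    urls.push(match[1]);
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/getting-started</loc></url>
  <url><loc>https://docs.example.com/installation</loc></url>
  <url><loc>https://docs.example.com/usage</loc></url>
</urlset>`;

// Keep only the first two URLs listed in the sitemap.
console.log(extractSitemapUrls(sampleSitemap, 2));
```

Because every page URL is listed up front, no page has to be downloaded just to discover links, which is one reason the sitemap method tends to be faster than crawling.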

limit

number

default: 10

Maximum number of pages to scrape when using relative links method.

Set to 0 to scrape all discovered links (use with caution)
Default is 10 pages

Retrieving all links might take a long time, and all links will be re-scraped if the flow’s state changes (e.g., different URL, chunk size, etc.).
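The interaction between the Web Crawl method and `limit` can be sketched as follows. This is a minimal illustration, not the loader's implementation: `collectSameOriginLinks` and `tryResolve` are hypothetical helpers, the regex-based `href` extraction stands in for a real HTML parser, and "same domain" is approximated here by comparing URL origins.

```typescript
// Resolve an href against the page URL; returns null for malformed values.
function tryResolve(href: string, pageUrl: string): URL | null {
  try {
    return new URL(href, pageUrl);
  } catch {
    return null;
  }
}

// Collect unique same-origin links from a page's HTML, capped by `limit`.
// Assumption: a limit of 0 means "no cap", mirroring the parameter above.
function collectSameOriginLinks(html: string, pageUrl: string, limit: number): string[] {
  const origin = new URL(pageUrl).origin;
  const found = new Set<string>();
  const hrefPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    if (limit > 0 && found.size >= limit) break;
    const resolved = tryResolve(match[1], pageUrl);
    if (resolved === null) continue;          // skip malformed hrefs
    if (resolved.origin !== origin) continue; // skip other sites, mailto:, etc.
    resolved.hash = "";                       // treat #fragments as the same page
    found.add(resolved.href);
  }
  return Array.from(found);
}

const sampleHtml = `
  <a href="/guide/install">Install</a>
  <a href="tutorials/first-flow#step-2">First flow</a>
  <a href="https://other.example.org/page">External</a>
  <a href="/guide/install">Install again</a>
  <a href="/api">API</a>`;

console.log(collectSameOriginLinks(sampleHtml, "https://docs.example.com/docs/", 10));
```

Deduplicating with a `Set` keeps a page that is linked several times from counting against the limit more than once.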

selector

string

CSS selector to extract specific content from the page. Examples:

article - Extract content within <article> tags
.content - Extract elements with class “content”
#main-content - Extract element with ID “main-content”
div.post-body p - Extract paragraphs within div.post-body

If not specified, the entire page body will be extracted.

metadata

json

Additional metadata to attach to all extracted documents.

{
  "source_type": "documentation",
  "scrape_date": "2024-01-15",
  "category": "tutorials"
}

omitMetadataKeys

string

Comma-separated list of metadata keys to exclude from the output. Use * to omit all default metadata keys except those in Additional Metadata.
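One way to read the interaction between `metadata` and `omitMetadataKeys` is sketched below. The helper `buildMetadata` is hypothetical, and the merge order (additional metadata overriding defaults, then named keys being removed from the merged result) is an assumption based on the description above rather than a statement about the loader's code.

```typescript
type Metadata = Record<string, unknown>;

// Merge default and additional metadata, then drop the requested keys.
// "*" discards every default key and keeps only the additional metadata.
function buildMetadata(defaults: Metadata, additional: Metadata, omitMetadataKeys: string): Metadata {
  const omit = omitMetadataKeys
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);

  if (omit.indexOf("*") !== -1) {
    return { ...additional };
  }

  const merged: Metadata = { ...defaults, ...additional };
  for (const key of omit) {
    delete merged[key];
  }
  return merged;
}

const defaultMetadata: Metadata = {
  source: "https://docs.example.com/getting-started",
  loc: { lines: { from: 1, to: 1 } },
};
const additionalMetadata: Metadata = { category: "tutorials" };

// Drop only the "loc" key; "source" and "category" remain.
console.log(buildMetadata(defaultMetadata, additionalMetadata, "loc"));
// Drop all default keys; only "category" remains.
console.log(buildMetadata(defaultMetadata, additionalMetadata, "*"));
```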

Output

The web scraper provides two output formats:

Document

Returns an array of document objects with metadata and page content.

[
  {
    "pageContent": "Getting Started\n\nWelcome to our documentation...",
    "metadata": {
      "source": "https://docs.example.com/getting-started",
      "loc": {
        "lines": {
          "from": 1,
          "to": 1
        }
      }
    }
  }
]

Text

Returns concatenated text from all scraped pages.

Getting Started

Welcome to our documentation...

Installation Guide

Follow these steps to install...
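The relationship between the two output formats can be sketched as a simple join over `pageContent`. The `ScrapedDocument` interface and `toTextOutput` function are illustrative names, and the blank-line separator is an assumption chosen to match the sample output above.

```typescript
interface ScrapedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Concatenate the page content of every document into one string.
// Assumption: pages are separated by a blank line; metadata is dropped.
function toTextOutput(docs: ScrapedDocument[], separator: string = "\n\n"): string {
  return docs.map((doc) => doc.pageContent).join(separator);
}

const scrapedDocs: ScrapedDocument[] = [
  {
    pageContent: "Getting Started\n\nWelcome to our documentation...",
    metadata: { source: "https://docs.example.com/getting-started" },
  },
  {
    pageContent: "Installation Guide\n\nFollow these steps to install...",
    metadata: { source: "https://docs.example.com/installation" },
  },
];

console.log(toTextOutput(scrapedDocs));
```

The Text output discards per-page metadata, so the Document output is usually the better choice when downstream nodes need the `source` URL for attribution or filtering.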

Usage Examples

Single Page Scraping

Add Web Scraper Node

Drag the Cheerio Web Scraper node onto your canvas.

Enter URL

Provide the URL of the page you want to scrape.

https://flowise.ai/docs/getting-started

Optional: Add CSS Selector

If you only need specific content, add a CSS selector:

article.documentation

Connect to Processing Nodes

Connect the output to vector stores, LLMs, or other nodes.

Crawling Multiple Pages

{
  "url": "https://docs.example.com",
  "relativeLinksMethod": "webCrawl",
  "limit": 25,
  "selector": ".markdown-body"
}

This configuration will:

Start at the base URL
Discover up to 25 relative links
Extract only the content inside elements with the .markdown-body class on each page

Using XML Sitemap

{
  "url": "https://docs.example.com/sitemap.xml",
  "relativeLinksMethod": "scrapeXMLSitemap",
  "limit": 50
}

Using an XML sitemap is typically faster and more reliable than web crawling, especially for large documentation sites.

Extracting Specific Content

Blog Posts

article.blog-post

Extracts content from blog post articles only.

Documentation

.docs-content

Extracts documentation content sections.

Product Descriptions

div.product-description

Extracts product information from e-commerce pages.

News Articles

.article-body p

Extracts paragraph content from news articles.

Common Use Cases

Documentation Indexing

Scrape and index entire documentation sites for AI-powered search

Content Monitoring

Periodically scrape pages to monitor content changes

Knowledge Base Creation

Build knowledge bases from web content for RAG applications

Competitive Analysis

Extract competitor information for analysis (respect robots.txt)

Limitations

Important Limitations

No JavaScript Execution: Cheerio only parses static HTML. For JavaScript-heavy sites, use Puppeteer or Playwright loaders instead.
PDF Files: The loader skips PDF URLs automatically. Use the PDF loader for PDF content.
Authentication: Does not support pages requiring login. Use authenticated API loaders for protected content.

Troubleshooting

Empty or missing content

Possible causes:

The page uses JavaScript to render content (use Puppeteer loader instead)
CSS selector is incorrect or too specific
Website blocks scrapers (check user agent requirements)

Solutions:

Inspect the page HTML to verify selector
Remove the selector to get all content first
Try the Puppeteer Web Scraper for JS-rendered pages

No relative links found

Possible causes:

URL doesn’t contain links to other pages
Links use absolute URLs to different domains
Sitemap URL is incorrect

Solutions:

Verify the page contains <a> tags with relative hrefs
If using sitemap method, ensure the URL points to a valid XML sitemap
Check browser DevTools to see the page structure

Scraping takes too long

Solutions:

Reduce the limit parameter to fewer pages
Use XML sitemap method instead of web crawl
Add more specific CSS selectors to reduce content size
Implement caching for frequently accessed pages
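The caching suggestion could be approached with a small in-memory wrapper around whatever function performs the scrape. This is a hedged sketch: `createCachedLoader`, the time-to-live policy, and the stand-in loader are all hypothetical and are not part of Flowise or the Cheerio loader.

```typescript
// Wrap a loader so repeated requests for the same URL within `ttlMs`
// reuse the earlier result. `now` is injectable to make expiry testable.
// Note: if `load` returns a Promise, the Promise itself is cached,
// including a rejected one.
function createCachedLoader<T>(
  load: (url: string) => T,
  ttlMs: number,
  now: () => number = Date.now
): (url: string) => T {
  const cache = new Map<string, { value: T; expiresAt: number }>();
  return (url: string): T => {
    const currentTime = now();
    const hit = cache.get(url);
    if (hit !== undefined && hit.expiresAt > currentTime) {
      return hit.value; // still fresh, skip the scrape
    }
    const value = load(url);
    cache.set(url, { value, expiresAt: currentTime + ttlMs });
    return value;
  };
}

// Stand-in loader that counts how often a "scrape" actually happens.
let fetchCount = 0;
let fakeNow = 0;
const cachedLoad = createCachedLoader(
  (url: string) => {
    fetchCount += 1;
    return "content of " + url;
  },
  60000,
  () => fakeNow
);

cachedLoad("https://docs.example.com/getting-started");
cachedLoad("https://docs.example.com/getting-started"); // served from cache
console.log(fetchCount);
```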

Invalid URL error

Ensure your URL:

Starts with http:// or https://
Is properly formatted with no spaces
Points to an accessible webpage
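The first two checks can be run before a URL is ever handed to the scraper. The sketch below uses the standard `URL` class; `validateScrapeUrl` and its result shape are hypothetical names for illustration. Whether the page is actually reachable can only be confirmed by requesting it, so that check is not covered here.

```typescript
type UrlCheck = { ok: true; url: string } | { ok: false; reason: string };

// Validate that a string is a well-formed http(s) URL with no whitespace.
function validateScrapeUrl(input: string): UrlCheck {
  // The URL parser would silently trim or percent-encode spaces,
  // so reject whitespace explicitly first.
  if (/\s/.test(input)) {
    return { ok: false, reason: "URL must not contain spaces" };
  }
  try {
    const parsed = new URL(input);
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      return { ok: false, reason: "URL must start with http:// or https://" };
    }
    return { ok: true, url: parsed.href };
  } catch {
    return { ok: false, reason: "URL is not well formed" };
  }
}

console.log(validateScrapeUrl("https://docs.example.com/getting-started"));
console.log(validateScrapeUrl("docs.example.com/getting-started"));
console.log(validateScrapeUrl("https://docs.example.com/my page"));
```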

Best Practices

Web Scraping Ethics

Respect robots.txt: Check website’s robots.txt file before scraping
Rate Limiting: Use reasonable limits to avoid overwhelming servers
Terms of Service: Ensure scraping complies with website ToS
Attribution: Keep source metadata when using scraped content

Optimization Tips

Use specific CSS selectors to reduce noise and improve quality
Combine with text splitters for better chunking
Add descriptive metadata to improve retrieval accuracy
Cache frequently accessed pages to avoid re-scraping them on every run

Comparison with Other Web Scrapers

Feature	Cheerio	Puppeteer	Playwright
Speed	⚡ Fastest	🐢 Slower	🐢 Slower
JavaScript Support	❌ No	✅ Yes	✅ Yes
Resource Usage	💚 Low	🔴 High	🔴 High
Best For	Static HTML	Dynamic sites	Cross-browser testing

Related Resources

Puppeteer Scraper

For JavaScript-rendered pages

Vector Stores

Store scraped content for retrieval

Document Loaders

Explore other loader types
