Overview
The webScraper tool uses SerpApi’s headless browser to fetch the full HTML content of a webpage and extract its main textual content. It is particularly effective for modern, JavaScript-heavy sites that require browser rendering.
Function Signature
Configuration
Environment Variables
SERPAPI_API_KEY
Your SerpApi API key. Obtain one from SerpApi.
Error thrown if missing:
SERPAPI_API_KEY is not configured for the web scraper. Please add it to your .env file.
Input Schema
The tool accepts input conforming to WebScraperInputSchema:
Parameters
The URL of the webpage to scrape. Must be a valid URL format.
Validation: Must pass z.string().url() validation.
Output Schema
Returns a string with the extracted content:
Response
The extracted and cleaned textual content from the webpage.
Maximum length: 15,000 characters (content is truncated to prevent oversized AI context).
Cleaning process:
- Removes <script>, <style>, <nav>, <footer>, <header>, <aside>, <form>, <button> elements
- Removes elements with role="navigation", role="banner", role="contentinfo"
- Prioritizes main content areas: <main>, <article>, #content, #main, .post, .entry-content, .article-body
- Collapses multiple whitespace characters into single spaces
- Trims leading and trailing whitespace
Implementation Details
Content Extraction Strategy
- HTML Fetching: Uses SerpApi’s getHtml() method to render and retrieve the full HTML
- DOM Parsing: Parses the HTML using JSDOM
- Element Removal: Strips navigation, scripts, styles, and other non-content elements
- Main Content Detection: Attempts to locate main content using common selectors: <main>, <article>, #content, #main, .post, .entry-content, .article-body
- Fallback: Uses <body> if no main content area is found
- Text Extraction: Extracts textContent from the selected element
- Cleaning: Normalizes whitespace and trims the result
- Truncation: Limits output to 15,000 characters
Error Handling
The tool throws errors in the following cases:
Example Usage
Use Cases
- Extracting article content for analysis
- Reading blog posts or news articles
- Scraping documentation pages
- Accessing content from JavaScript-heavy sites that require rendering
- Processing user-provided URLs for argument analysis
Source Code Location
src/ai/tools/web-scraper.ts:20-75