Overview

The webScraper tool uses SerpApi’s headless browser to fetch the full HTML of a webpage, then extracts its main textual content. It is particularly effective for modern, JavaScript-heavy sites that require browser rendering.

Function Signature

export const webScraper = ai.defineTool(
  {
    name: 'webScraper',
    description: 'Uses a headless browser via SerpApi to fetch the full HTML content of a given URL, then returns its main textual content. Use this to read the content of an article or webpage provided by the user, especially for modern, JS-heavy sites.',
    inputSchema: WebScraperInputSchema,
    outputSchema: WebScraperOutputSchema,
  },
  async (input) => { ... }
);

Configuration

Environment Variables

SERPAPI_API_KEY
string
required
Your SerpApi API key. Obtain one from SerpApi.

Error thrown if missing: SERPAPI_API_KEY is not configured for the web scraper. Please add it to your .env file.
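The key check can be sketched as follows. This is a minimal illustration, assuming the key is read from process.env; the helper name requireSerpApiKey is hypothetical, not part of the tool's actual code.

```typescript
// Hypothetical helper: read the SerpApi key from the environment,
// throwing the documented error message when it is missing.
function requireSerpApiKey(): string {
  const key = process.env.SERPAPI_API_KEY;
  if (!key) {
    throw new Error(
      'SERPAPI_API_KEY is not configured for the web scraper. Please add it to your .env file.'
    );
  }
  return key;
}
```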

Input Schema

The tool accepts input conforming to WebScraperInputSchema:
const WebScraperInputSchema = z.object({
  url: z.string().url().describe('The URL of the webpage to scrape.'),
});

Parameters

url
string
required
The URL of the webpage to scrape. Must be a valid URL.

Validation: Must pass z.string().url() validation.

Output Schema

Returns a string with the extracted content:
const WebScraperOutputSchema = z.string().describe(
  'The extracted textual content of the webpage.'
);

Response

content
string
The extracted and cleaned textual content from the webpage.

Maximum length: 15,000 characters (content is truncated to prevent oversized AI context).

Cleaning process:
  • Removes <script>, <style>, <nav>, <footer>, <header>, <aside>, <form>, <button> elements
  • Removes elements with role="navigation", role="banner", role="contentinfo"
  • Prioritizes main content areas: <main>, <article>, #content, #main, .post, .entry-content, .article-body
  • Collapses multiple whitespace characters into single spaces
  • Trims leading and trailing whitespace

Implementation Details

Content Extraction Strategy

  1. HTML Fetching: Uses SerpApi’s getHtml() method to render and retrieve full HTML
  2. DOM Parsing: Parses HTML using JSDOM
  3. Element Removal: Strips navigation, scripts, styles, and other non-content elements
  4. Main Content Detection: Attempts to locate main content using common selectors:
    • <main>
    • <article>
    • #content
    • #main
    • .post
    • .entry-content
    • .article-body
  5. Fallback: Uses <body> if no main content area is found
  6. Text Extraction: Extracts textContent from the selected element
  7. Cleaning: Normalizes whitespace and trims the result
  8. Truncation: Limits output to 15,000 characters

Error Handling

errors
Error
The tool throws an Error in at least the following cases:
  • Missing API key: SERPAPI_API_KEY is not configured for the web scraper. Please add it to your .env file.
  • Invalid input: the url parameter fails z.string().url() validation.

Example Usage

import { webScraper } from '@/ai/tools/web-scraper';

// Scrape a webpage
const content = await webScraper({ 
  url: 'https://example.com/article' 
});

console.log(content);
// "This is the main article content extracted from the page..."
// (up to 15,000 characters)

Use Cases

  • Extracting article content for analysis
  • Reading blog posts or news articles
  • Scraping documentation pages
  • Accessing content from JavaScript-heavy sites that require rendering
  • Processing user-provided URLs for argument analysis

Source Code Location

src/ai/tools/web-scraper.ts:20-75