
Overview

The webScraper tool fetches and parses webpage content using the Firecrawl API. It extracts clean, readable text in Markdown format, optimized for AI analysis. Firecrawl handles JavaScript rendering, anti-bot measures, and content extraction automatically.

Function Signatures

As Genkit Tool

export const webScraper = ai.defineTool(
  {
    name: 'webScraper',
    description: 'Fetches the full content of a given URL using Firecrawl. Use this to read the content of an article or webpage.',
    inputSchema: WebScraperInputSchema,
    outputSchema: WebScraperOutputSchema,
  },
  async (input) => { ... }
)

As Standalone Function

export async function batchScrapeParallel(
  urls: string[]
): Promise<{ url: string; content: string; source: string }[]>
Source: src/ai/tools/web-scraper.ts:89

Input Schema

url
string
required
The URL of the webpage to scrape. Must be a valid URL format.

Input Type

const WebScraperInputSchema = z.object({
  url: z.string().url().describe('The URL of the webpage to scrape.'),
});

Output Schema

Single Scrape (Tool)

content
string
required
The extracted textual content of the webpage in Markdown format (up to 20,000 characters).
const WebScraperOutputSchema = z.string().describe('The extracted textual content of the webpage.');

Batch Scrape (Function)

results
ScrapeResult[]
required
Array of scrape results. Each result contains:
  • url: The scraped URL
  • content: Extracted Markdown content (up to 20,000 chars)
  • source: Always “Firecrawl”
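
The result shape can be expressed as a TypeScript interface. The name ScrapeResult is used here for illustration; the actual exports in web-scraper.ts may differ:

```typescript
// Shape of each element returned by batchScrapeParallel.
// The name `ScrapeResult` is illustrative; the module may not export it.
interface ScrapeResult {
  url: string;      // the URL that was scraped
  content: string;  // extracted Markdown, truncated to 20,000 characters
  source: string;   // always "Firecrawl" in the current implementation
}

const example: ScrapeResult = {
  url: 'https://example.com/article',
  content: '# Example Article\n\nBody text...',
  source: 'Firecrawl',
};
```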

Batch Scraping

The batchScrapeParallel function scrapes multiple URLs simultaneously for maximum performance:
const urls = [
  'https://example.com/article1',
  'https://example.com/article2',
  'https://example.com/article3',
];

const results = await batchScrapeParallel(urls);

console.log(`Scraped ${results.length}/${urls.length} URLs successfully`);

results.forEach(result => {
  console.log(`${result.url}: ${result.content.length} chars`);
});

Parallel Processing

  • Scrapes all URLs simultaneously using Promise.all
  • Filters out failed scrapes automatically
  • Returns only successful results
  • Logs progress and success rate
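
The parallel pattern above can be sketched as follows. Here scrapeOne is a stand-in for the real per-URL Firecrawl call, and the actual implementation in web-scraper.ts may differ in detail:

```typescript
type ScrapeResult = { url: string; content: string; source: string };

// Sketch of the parallel batch pattern: scrape everything at once,
// then drop failures. `scrapeOne` is a placeholder for the real call.
async function batchScrapeSketch(
  urls: string[],
  scrapeOne: (url: string) => Promise<string | null>
): Promise<ScrapeResult[]> {
  // Scrape every URL simultaneously; a failure resolves to null instead of throwing.
  const settled = await Promise.all(
    urls.map(async (url): Promise<ScrapeResult | null> => {
      try {
        const content = await scrapeOne(url);
        return content ? { url, content, source: 'Firecrawl' } : null;
      } catch {
        return null; // one failed URL never aborts the whole batch
      }
    })
  );

  // Keep only successful results and report the success rate.
  const results = settled.filter((r): r is ScrapeResult => r !== null);
  console.log(`Scraped ${results.length}/${urls.length} URLs successfully`);
  return results;
}
```

Because failures resolve to null rather than rejecting, Promise.all never short-circuits and the caller always gets whatever succeeded.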

How It Works

  1. API Request: Sends scrape request to Firecrawl with Markdown format
  2. Content Extraction: Firecrawl renders JavaScript and extracts clean text
  3. Validation: Checks content length (must be > 100 characters)
  4. Truncation: Limits content to 20,000 characters for context window
  5. Return: Returns Markdown-formatted text
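
The five steps above can be sketched in TypeScript. The request body mirrors Firecrawl's v1 scrape API as described in this doc, but treat this as an illustration rather than the exact source code:

```typescript
const MAX_CONTENT = 20_000; // truncation limit for the context window
const MIN_CONTENT = 100;    // validation threshold

// Steps 3-4: validate length (must be > 100 chars), then truncate.
function validateAndTruncate(markdown: string | undefined): string | null {
  if (!markdown || markdown.length <= MIN_CONTENT) return null;
  return markdown.slice(0, MAX_CONTENT);
}

// Steps 1-2 and 5: send the scrape request and return Markdown text.
async function scrapeSketch(url: string): Promise<string | null> {
  const res = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, formats: ['markdown'] }),
    signal: AbortSignal.timeout(30_000), // 30-second timeout
  });
  if (!res.ok) throw new Error(`Firecrawl request failed: ${res.status}`);
  const json = await res.json();
  return validateAndTruncate(json?.data?.markdown);
}
```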

Example Usage

Single Scrape (Tool)

import { webScraper } from '@/ai/tools/web-scraper';

const content = await webScraper({ 
  url: 'https://example.com/article' 
});

console.log(content); // Markdown text

Single Scrape (Direct)

import { scrapeWithFirecrawl } from '@/ai/tools/web-scraper';

const content = await scrapeWithFirecrawl('https://example.com/article');

if (content) {
  console.log(`Scraped ${content.length} characters`);
} else {
  console.log('Scrape failed');
}

Batch Scrape

import { batchScrapeParallel } from '@/ai/tools/web-scraper';

const urls = [
  'https://bbc.com/news/article1',
  'https://reuters.com/article2',
  'https://nytimes.com/article3',
];

const results = await batchScrapeParallel(urls);

With AI Flow

import { ai } from '@/ai/genkit';
import { webScraper } from '@/ai/tools/web-scraper';

const response = await ai.generate({
  prompt: 'Summarize the article at https://example.com/article',
  tools: [webScraper],
});

Configuration

Firecrawl API Settings:
  • Formats: Markdown
  • Timeout: 30 seconds
  • Max Content: 20,000 characters
  • Min Content: 100 characters (validation)
  • Endpoint: https://api.firecrawl.dev/v1/scrape
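
These settings could be captured as module constants, along these lines (the names are illustrative, not necessarily those used in web-scraper.ts):

```typescript
// Illustrative constants for the Firecrawl settings listed above.
const FIRECRAWL_ENDPOINT = 'https://api.firecrawl.dev/v1/scrape';
const SCRAPE_FORMATS = ['markdown'] as const;
const SCRAPE_TIMEOUT_MS = 30_000;  // 30-second timeout
const MAX_CONTENT_CHARS = 20_000;  // truncation limit
const MIN_CONTENT_CHARS = 100;     // validation threshold
```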

Environment Variables

FIRECRAWL_API_KEY
string
required
Your Firecrawl API key. Get one at firecrawl.dev.
FIRECRAWL_API_KEY=your_api_key_here

Error Handling

Single Scrape

  • The webScraper tool throws an error if the scrape request fails
  • The direct scrapeWithFirecrawl function returns null if content is missing or too short
  • Logs warnings for debugging

Batch Scrape

  • Silently skips failed URLs
  • Returns only successful scrapes
  • Logs success/failure counts
  • Never throws (returns empty array if all fail)

Performance

  • Single Scrape: 1-5 seconds depending on page
  • Batch Scrape: Runs in parallel, so total time is roughly that of the slowest URL
  • Success Rate: ~90-95% on standard news sites
  • Content Quality: High - JavaScript rendered, clean extraction

Use Cases

  • Argument Analysis: Extract article content for blueprint generation
  • Research: Gather information from multiple sources
  • Fact Checking: Retrieve full context of claims
  • Content Summarization: Get clean text for AI summarization
  • Source Verification: Read original sources cited in arguments
