
Overview

The webScraper tool fetches and parses webpage content using the Firecrawl API. It extracts clean, readable text in Markdown format, optimized for AI analysis. Firecrawl handles JavaScript rendering, anti-bot measures, and content extraction automatically.

Function Signatures

As Genkit Tool

export const webScraper = ai.defineTool(
  {
    name: 'webScraper',
    description: 'Fetches the full content of a given URL using Firecrawl. Use this to read the content of an article or webpage.',
    inputSchema: WebScraperInputSchema,
    outputSchema: WebScraperOutputSchema,
  },
  async (input) => { ... }
)

As Standalone Function

export async function batchScrapeParallel(
  urls: string[]
): Promise<{ url: string; content: string; source: string }[]>
Source: src/ai/tools/web-scraper.ts:89

Input Schema

url
string
required
The URL of the webpage to scrape. Must be a valid URL format.

Input Type

const WebScraperInputSchema = z.object({
  url: z.string().url().describe('The URL of the webpage to scrape.'),
});

Output Schema

Single Scrape (Tool)

content
string
required
The extracted textual content of the webpage in Markdown format (up to 20,000 characters).
const WebScraperOutputSchema = z.string().describe('The extracted textual content of the webpage.');

Batch Scrape (Function)

results
ScrapeResult[]
required
Array of scrape results. Each result contains:
  • url: The scraped URL
  • content: Extracted Markdown content (up to 20,000 chars)
  • source: Always “Firecrawl”
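
The result shape can be expressed as a TypeScript interface. The name ScrapeResult is used here for illustration; the actual exports in web-scraper.ts may differ:

```typescript
// Shape of each element returned by batchScrapeParallel.
// The name `ScrapeResult` is illustrative; the module may not export it.
interface ScrapeResult {
  url: string;      // the URL that was scraped
  content: string;  // extracted Markdown, truncated to 20,000 characters
  source: string;   // always "Firecrawl" in the current implementation
}

const example: ScrapeResult = {
  url: 'https://example.com/article',
  content: '# Example Article\n\nBody text...',
  source: 'Firecrawl',
};
```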

Batch Scraping

The batchScrapeParallel function scrapes multiple URLs simultaneously for maximum performance:
const urls = [
  'https://example.com/article1',
  'https://example.com/article2',
  'https://example.com/article3',
];

const results = await batchScrapeParallel(urls);

console.log(`Scraped ${results.length}/${urls.length} URLs successfully`);

results.forEach(result => {
  console.log(`${result.url}: ${result.content.length} chars`);
});

Parallel Processing

  • Scrapes all URLs simultaneously using Promise.all
  • Filters out failed scrapes automatically
  • Returns only successful results
  • Logs progress and success rate
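
The parallel pattern above can be sketched as follows. Here scrapeOne is a stand-in for the real per-URL Firecrawl call, and the actual implementation in web-scraper.ts may differ in detail:

```typescript
type ScrapeResult = { url: string; content: string; source: string };

// Sketch of the parallel batch pattern: scrape everything at once,
// then drop failures. `scrapeOne` is a placeholder for the real call.
async function batchScrapeSketch(
  urls: string[],
  scrapeOne: (url: string) => Promise<string | null>
): Promise<ScrapeResult[]> {
  // Scrape every URL simultaneously; a failure resolves to null instead of throwing.
  const settled = await Promise.all(
    urls.map(async (url): Promise<ScrapeResult | null> => {
      try {
        const content = await scrapeOne(url);
        return content ? { url, content, source: 'Firecrawl' } : null;
      } catch {
        return null; // one failed URL never aborts the whole batch
      }
    })
  );

  // Keep only successful results and report the success rate.
  const results = settled.filter((r): r is ScrapeResult => r !== null);
  console.log(`Scraped ${results.length}/${urls.length} URLs successfully`);
  return results;
}
```

Because failures resolve to null rather than rejecting, Promise.all never short-circuits and the caller always gets whatever succeeded.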

How It Works

  1. API Request: Sends scrape request to Firecrawl with Markdown format
  2. Content Extraction: Firecrawl renders JavaScript and extracts clean text
  3. Validation: Checks content length (must be > 100 characters)
  4. Truncation: Limits content to 20,000 characters for context window
  5. Return: Returns Markdown-formatted text
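
The five steps above can be sketched in TypeScript. The request body mirrors Firecrawl's v1 scrape API as described in this doc, but treat this as an illustration rather than the exact source code:

```typescript
const MAX_CONTENT = 20_000; // truncation limit for the context window
const MIN_CONTENT = 100;    // validation threshold

// Steps 3-4: validate length (must be > 100 chars), then truncate.
function validateAndTruncate(markdown: string | undefined): string | null {
  if (!markdown || markdown.length <= MIN_CONTENT) return null;
  return markdown.slice(0, MAX_CONTENT);
}

// Steps 1-2 and 5: send the scrape request and return Markdown text.
async function scrapeSketch(url: string): Promise<string | null> {
  const res = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, formats: ['markdown'] }),
    signal: AbortSignal.timeout(30_000), // 30-second timeout
  });
  if (!res.ok) throw new Error(`Firecrawl request failed: ${res.status}`);
  const json = await res.json();
  return validateAndTruncate(json?.data?.markdown);
}
```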

Example Usage

Single Scrape (Tool)

import { webScraper } from '@/ai/tools/web-scraper';

const content = await webScraper({ 
  url: 'https://example.com/article' 
});

console.log(content); // Markdown text

Single Scrape (Direct)

import { scrapeWithFirecrawl } from '@/ai/tools/web-scraper';

const content = await scrapeWithFirecrawl('https://example.com/article');

if (content) {
  console.log(`Scraped ${content.length} characters`);
} else {
  console.log('Scrape failed');
}

Batch Scrape

import { batchScrapeParallel } from '@/ai/tools/web-scraper';

const urls = [
  'https://bbc.com/news/article1',
  'https://reuters.com/article2',
  'https://nytimes.com/article3',
];

const results = await batchScrapeParallel(urls);

With AI Flow

import { ai } from '@/ai/genkit';
import { webScraper } from '@/ai/tools/web-scraper';

const response = await ai.generate({
  prompt: 'Summarize the article at https://example.com/article',
  tools: [webScraper],
});

Configuration

Firecrawl API Settings:
  • Formats: Markdown
  • Timeout: 30 seconds
  • Max Content: 20,000 characters
  • Min Content: 100 characters (validation)
  • Endpoint: https://api.firecrawl.dev/v1/scrape
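
These settings could be captured as module constants, along these lines (the names are illustrative, not necessarily those used in web-scraper.ts):

```typescript
// Illustrative constants for the Firecrawl settings listed above.
const FIRECRAWL_ENDPOINT = 'https://api.firecrawl.dev/v1/scrape';
const SCRAPE_FORMATS = ['markdown'] as const;
const SCRAPE_TIMEOUT_MS = 30_000;  // 30-second timeout
const MAX_CONTENT_CHARS = 20_000;  // truncation limit
const MIN_CONTENT_CHARS = 100;     // validation threshold
```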

Environment Variables

FIRECRAWL_API_KEY
string
required
Your Firecrawl API key. Get one at firecrawl.dev.
FIRECRAWL_API_KEY=your_api_key_here

Error Handling

Single Scrape

  • The webScraper tool throws an error if the scrape request fails
  • The direct scrapeWithFirecrawl function returns null if content is missing or too short
  • Logs warnings for debugging

Batch Scrape

  • Silently skips failed URLs
  • Returns only successful scrapes
  • Logs success/failure counts
  • Never throws (returns empty array if all fail)

Performance

  • Single Scrape: 1-5 seconds depending on page
  • Batch Scrape: Runs in parallel, so total time is roughly that of the slowest URL
  • Success Rate: ~90-95% on standard news sites
  • Content Quality: High - JavaScript rendered, clean extraction

Use Cases

  • Argument Analysis: Extract article content for blueprint generation
  • Research: Gather information from multiple sources
  • Fact Checking: Retrieve full context of claims
  • Content Summarization: Get clean text for AI summarization
  • Source Verification: Read original sources cited in arguments
