
Overview

Meridian’s web integration tools enable AI agents to access external information through Firecrawl. These tools allow agents to search the web, scrape web pages, and extract structured data—essential for enriching database analysis with real-time external context.
Web tools require a Firecrawl API key configured via the FIRECRAWL_API_KEY environment variable.

Core Functions

firecrawlSearch

Search the web for information using Firecrawl.
  • query (string, required): The search query to look up on the web.
  • maxResults (number, default 10): Maximum number of results to return. Default is 10, maximum is 20.

Response

{
  success: boolean
  query: string
  sources: Array<{
    title: string
    url: string
    content: string  // Description/snippet
  }>
  sourceCount: number
  results: any[]  // Full Firecrawl results
  error?: string
}

Example Usage

const searchResults = await firecrawlSearch.handler(ctx, {
  query: 'best practices for database indexing',
  maxResults: 5
})

if (searchResults.success) {
  console.log(`Found ${searchResults.sourceCount} sources:`)
  searchResults.sources.forEach(source => {
    console.log(`${source.title} - ${source.url}`)
    console.log(source.content)
  })
}

scrapeWebPage

Scrape and extract content from a web page using Firecrawl.
  • url (string, required): The URL of the web page to scrape.
  • includeMarkdown (boolean, default true): Whether to include markdown-formatted content.

Response

{
  success: boolean
  url: string
  title: string
  content: string  // Markdown or HTML content
  markdown: string
  description: string
  links: string[]
  contentLength: number
  error?: string
}
The agent is instructed to extract insights from the markdown content and provide concise summaries rather than returning raw markdown directly to users.

Example Usage

const pageContent = await scrapeWebPage.handler(ctx, {
  url: 'https://example.com/article',
  includeMarkdown: true
})

if (pageContent.success) {
  console.log(`Title: ${pageContent.title}`)
  console.log(`Description: ${pageContent.description}`)
  console.log(`Content length: ${pageContent.contentLength} characters`)
  console.log(`Found ${pageContent.links.length} links`)
}

extractWebPage

Extract structured data from one or more web pages using Firecrawl.
  • urls (string[], required): Array of URLs to extract data from.
  • prompt (string, required): A prompt describing what data to extract from the web pages.
  • schema (object, optional): JSON schema defining the structure of the data to extract.

Response

{
  success: boolean
  urls: string[]
  data: any[]  // Extracted structured data
  extractedCount: number
  error?: string
}

Example Usage

const extractedData = await extractWebPage.handler(ctx, {
  urls: [
    'https://store.example.com/product/1',
    'https://store.example.com/product/2'
  ],
  prompt: 'Extract product name, price, and availability',
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
      available: { type: 'boolean' }
    }
  }
})

if (extractedData.success) {
  console.log(`Extracted data from ${extractedData.extractedCount} pages:`)
  console.log(JSON.stringify(extractedData.data, null, 2))
}

Implementation Details

Firecrawl SDK Integration

Web tools use the Firecrawl JavaScript SDK (from table_agent.ts:150-186):
import Firecrawl from '@mendable/firecrawl-js'

export const scrapeWebPageAction = action({
  args: { url: v.string(), includeMarkdown: v.optional(v.boolean()) },
  handler: async (_, { url, includeMarkdown = true }) => {
    const apiKey = process.env.FIRECRAWL_API_KEY
    if (!apiKey) {
      throw new Error('FIRECRAWL_API_KEY not configured')
    }

    try {
      const firecrawl = new Firecrawl({ apiKey })
      const result = await firecrawl.scrape(url, {
        formats: includeMarkdown ? ['markdown', 'html'] : ['html'],
        onlyMainContent: true,
      })

      return {
        success: true,
        url,
        title: result.metadata?.title || '',
        markdown: result.markdown || '',
        html: result.html || '',
        content: result.markdown || result.html || '',
        description: result.metadata?.description || '',
        links: result.links || [],
      }
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error ? error.message : 'Unknown error'
      }
    }
  },
})

Tool Definitions

From agent_tools.ts:463-510:
export const firecrawlSearch = createTool({
  description:
    'Search the web for information using Firecrawl. Use when you need current information, facts, or context not in the database.',
  args: z.object({
    query: z.string().describe('The search query'),
    maxResults: z.number().optional().default(10),
  }),
  handler: async (ctx, args) => {
    try {
      const maxResults = Math.min(Math.max(args.maxResults || 10, 1), 20)
      const result = await ctx.runAction(
        api.table_agent.performFirecrawlSearch,
        { query: args.query, maxResults }
      )
      return truncateToolResponse(result)
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error 
          ? error.message 
          : 'Web search failed. Make sure FIRECRAWL_API_KEY is configured.'
      }
    }
  },
})

Search Implementation

From table_agent.ts:112-148:
export const performFirecrawlSearch = action({
  args: { query: v.string(), maxResults: v.optional(v.number()) },
  handler: async (_, { query, maxResults = 10 }) => {
    const apiKey = process.env.FIRECRAWL_API_KEY
    if (!apiKey) {
      throw new Error('FIRECRAWL_API_KEY not configured')
    }

    try {
      const firecrawl = new Firecrawl({ apiKey })
      const result = await firecrawl.search(query, {
        limit: Math.min(maxResults, 20),
      })

      return {
        success: true,
        query,
        results: result.web || [],
        sources: result.web?.map((r: any) => ({
          title: r.title || '',
          url: r.url || '',
          content: r.description || '',
        })) || [],
      }
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error ? error.message : 'Unknown error',
      }
    }
  },
})

Content Extraction

From table_agent.ts:188-227:
export const extractWebPageAction = action({
  args: {
    urls: v.array(v.string()),
    prompt: v.string(),
    schema: v.optional(v.any()),
  },
  handler: async (_, { urls, prompt, schema }) => {
    const apiKey = process.env.FIRECRAWL_API_KEY
    if (!apiKey) {
      throw new Error('FIRECRAWL_API_KEY not configured')
    }

    try {
      const firecrawl = new Firecrawl({ apiKey })
      const result = await firecrawl.extract({
        urls,
        prompt,
        schema: schema || undefined,
      })

      return {
        success: true,
        urls,
        data: result.data
          ? Array.isArray(result.data) ? result.data : [result.data]
          : [],
      }
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error ? error.message : 'Unknown error',
      }
    }
  },
})

Response Truncation

Web tool responses are truncated to optimize token usage:
  • Content fields: Maximum 5,000 characters (markdown/HTML)
  • Description fields: Maximum 2,000 characters
  • Sources/results arrays: Maximum 10 items
  • Links array: Maximum 10 items
From agent_tools.ts:26-48:
function truncateToolResponse(response: any): any {
  const truncated = { ...response }
  
  if (typeof truncated.content === 'string') {
    truncated.content = truncateString(truncated.content, MAX_CONTENT_LENGTH)
  }
  if (typeof truncated.markdown === 'string') {
    truncated.markdown = truncateString(truncated.markdown, MAX_CONTENT_LENGTH)
  }
  if (Array.isArray(truncated.sources)) {
    truncated.sources = truncateArray(truncated.sources, MAX_ARRAY_ITEMS)
  }
  // ...
  return truncated
}
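The helper functions and constants referenced in the excerpt are not shown. A minimal sketch consistent with the limits listed above could look like the following; the names match the excerpt, but the truncation marker and exact behavior are assumptions:

```typescript
// Limits from the truncation rules listed above
const MAX_CONTENT_LENGTH = 5000
const MAX_ARRAY_ITEMS = 10

// Cut a string at maxLength, appending a marker so the agent
// knows the content was shortened (marker text is an assumption)
function truncateString(value: string, maxLength: number): string {
  if (value.length <= maxLength) return value
  return value.slice(0, maxLength) + ' [truncated]'
}

// Keep only the first maxItems entries of an array
function truncateArray<T>(items: T[], maxItems: number): T[] {
  return items.length <= maxItems ? items : items.slice(0, maxItems)
}
```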

Usage in Agent Workflows

Both Query and Analysis agents have access to web tools:
const analysis_agent = new Agent(components.agent, {
  name: 'analysis_agent',
  languageModel: model,
  instructions: `
You are an assistant that explores and analyzes databases and can search the web.

Use the available tools to:
- Query and inspect DuckDB tables
- Search or extract info from the web and URLs
- Visualize or analyze data
`,
  tools: {
    queryDuckDB,
    getTableSchema,
    createChart,
    firecrawlSearch,    // ← Web search
    scrapeWebPage,      // ← Web scraping
    extractWebPage,     // ← Structured extraction
    // ...
  },
})

Example Agent Workflow

User: “Compare our sales data with industry benchmarks”
  1. Agent calls queryDuckDB to get internal sales data
  2. Agent calls firecrawlSearch('retail industry benchmarks 2024')
  3. Agent identifies relevant articles from search results
  4. Agent calls scrapeWebPage(article_url) for each relevant source
  5. Agent extracts benchmark data from scraped content
  6. Agent compares internal data with industry benchmarks
  7. Agent presents insights to user
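The orchestration above can be sketched as a plain function with the tool calls injected, which keeps the control flow testable. The tool signatures here are simplified assumptions, not the real agent API:

```typescript
// Simplified tool signatures (assumptions for illustration)
type Tools = {
  queryDuckDB: (sql: string) => Promise<{ rows: Array<Record<string, unknown>> }>
  firecrawlSearch: (query: string) => Promise<{ success: boolean; sources: Array<{ url: string }> }>
  scrapeWebPage: (url: string) => Promise<{ success: boolean; content: string }>
}

// Steps 1-5 of the workflow: internal query, web search,
// then scraping each source, skipping pages that fail
async function compareWithBenchmarks(tools: Tools) {
  const internal = await tools.queryDuckDB('SELECT SUM(amount) AS total FROM sales')
  const search = await tools.firecrawlSearch('retail industry benchmarks 2024')
  if (!search.success) return { internal, external: [] as string[] }
  const pages = await Promise.all(search.sources.map(s => tools.scrapeWebPage(s.url)))
  const external = pages.filter(p => p.success).map(p => p.content)
  return { internal, external }
}
```

The comparison and presentation steps (6-7) are left to the agent's reasoning rather than code.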

Use Cases

Data Enrichment

Augment database records with external information from web sources.

Competitive Analysis

Gather competitor data and market information for comparison.

Real-time Context

Access current events, news, and trends to contextualize data analysis.

Validation

Verify database information against authoritative web sources.

Error Handling

Missing API key:
{
  "success": false,
  "error": "Firecrawl API key not configured. Set FIRECRAWL_API_KEY environment variable."
}
Solution: Configure FIRECRAWL_API_KEY in environment variables.

Rate limit exceeded:
{
  "success": false,
  "error": "Rate limit exceeded. Please try again later."
}
Solution: Reduce maxResults or wait before retrying.

Invalid URL:
{
  "success": false,
  "error": "Invalid URL format"
}
Solution: Ensure the URL includes a protocol (http:// or https://).

Page not accessible:
{
  "success": false,
  "error": "Failed to access page: 404 Not Found"
}
Solution: Verify the URL is accessible and not behind authentication.
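On the caller side, a pattern that checks the success field and retries once on rate limiting might look like this sketch; the error strings match the examples above, but the backoff delay is an assumption:

```typescript
type SearchResult = { success: boolean; sources?: unknown[]; error?: string }

// Retry once on rate-limit errors; pass other failures through
// with an empty sources array as a fallback
async function searchWithFallback(
  search: (query: string) => Promise<SearchResult>,
  query: string
): Promise<SearchResult> {
  const first = await search(query)
  if (first.success) return first
  if (first.error?.includes('Rate limit')) {
    await new Promise(resolve => setTimeout(resolve, 1000)) // brief backoff
    return search(query)
  }
  return { success: false, sources: [], error: first.error }
}
```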

Firecrawl Configuration

Environment Setup

Add to your .env file:
FIRECRAWL_API_KEY=your_api_key_here
Get an API key from Firecrawl.

Scraping Options

From the implementation, scraping uses:
const result = await firecrawl.scrape(url, {
  formats: ['markdown', 'html'],  // Get both formats
  onlyMainContent: true,          // Skip navigation, footers, etc.
})
Benefits:
  • onlyMainContent: Removes boilerplate, focuses on article/page content
  • markdown: Clean, structured content easy for LLMs to process
  • html: Preserves formatting when markdown conversion loses structure

Best Practices

1. Use specific search queries. More specific queries yield better, more relevant results.
2. Limit results appropriately. Start with fewer results (5-10) to reduce API usage and latency.
3. Prefer structured extraction. Use extractWebPage with schemas when you need specific data fields.
4. Cache scraped content. Store frequently accessed web content in your database to reduce API calls.
5. Handle errors gracefully. Always check the success field and provide fallbacks for failed requests.
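Practice 4 can be sketched with a simple in-memory wrapper around a scrape function. A real deployment would persist to the database instead, and the one-hour TTL here is an assumption:

```typescript
type CachedPage = { content: string; fetchedAt: number }

// Wrap a scrape function so repeat requests for the same URL
// within the TTL are served from memory instead of the API
function makeCachedScraper(
  scrape: (url: string) => Promise<string>,
  ttlMs = 60 * 60 * 1000 // re-fetch after one hour
) {
  const cache = new Map<string, CachedPage>()
  return async (url: string): Promise<string> => {
    const hit = cache.get(url)
    if (hit && Date.now() - hit.fetchedAt < ttlMs) return hit.content
    const content = await scrape(url)
    cache.set(url, { content, fetchedAt: Date.now() })
    return content
  }
}
```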

Security Considerations

  • Never expose your Firecrawl API key in client-side code
  • Validate and sanitize URLs before scraping
  • Be mindful of website terms of service
  • Implement rate limiting to avoid excessive API usage
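URL validation before scraping (second bullet above) can be done with the standard WHATWG URL parser; the private-network blocklist in this sketch is illustrative, not exhaustive:

```typescript
// Accept only absolute http(s) URLs and reject obvious
// internal targets to reduce SSRF risk
function isScrapableUrl(raw: string): boolean {
  let url: URL
  try {
    url = new URL(raw)
  } catch {
    return false // not a valid absolute URL
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false
  const host = url.hostname
  if (host === 'localhost' || host === '127.0.0.1' ||
      host.startsWith('10.') || host.startsWith('192.168.')) {
    return false
  }
  return true
}
```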

Query Tool

Query database to complement web data

Insights Tool

Generate insights combining database and web data

Using AI Agents

Learn how agents leverage web tools

Firecrawl Resources

Official Firecrawl documentation and guides