
Overview

The Web Scraper is the second step in the email generation pipeline. It uses the Exa Search API to find information about the recipient through a dual-query strategy: one query for background and biography, another for publications and achievements.

Purpose:
  • Fetch comprehensive information about the recipient
  • Summarize background and professional details
  • Gather publication and achievement data
  • Provide cited sources for the Email Composer
Timing: ~5.3 seconds (varies by content length)
Search API: Exa Search (AI-native search with citations)
Summarization: Claude Haiku 4.5 (via Exa’s built-in AI features)

Input Schema

The Web Scraper requires these fields from Step 1 (Template Parser):
search_terms
string[]
required
Search terms extracted by the Template Parser
Constraints:
  • Must not be empty (Step 1 must complete first)
  • Used to build targeted queries
template_type
enum
required
Template classification from Step 1
Values:
  • RESEARCH - Focus on academic publications
  • BOOK - Focus on authored books
  • GENERAL - Broad professional information
recipient_name
string
required
Name of the email recipient
Usage: Included in both the background and publications queries
recipient_interest
string
required
Recipient’s research area or topic
Usage: Helps narrow search results to relevant content
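Taken together, the required inputs can be sketched as a small validated structure. This is a hypothetical `ScraperInput` for illustration only; in the real project these fields live on `PipelineData`:

```python
from dataclasses import dataclass
from enum import Enum


class TemplateType(Enum):
    RESEARCH = "RESEARCH"
    BOOK = "BOOK"
    GENERAL = "GENERAL"


@dataclass
class ScraperInput:
    """Hypothetical container for the fields the Web Scraper reads from PipelineData."""
    search_terms: list[str]
    template_type: TemplateType
    recipient_name: str
    recipient_interest: str

    def validate(self) -> None:
        # Step 1 (Template Parser) must have produced at least one search term.
        if not self.search_terms:
            raise ValueError("search_terms is empty -- run the Template Parser first")
        if not self.recipient_name.strip():
            raise ValueError("recipient_name is required")
```

The validation mirrors the documented constraint that Step 1 must complete before this step runs.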

Output Schema

The Web Scraper updates PipelineData with:
scraped_content
string
Combined summary of background and publications information
Format:
[Background summary]

[Publications summary]

**SOURCES:**
- Source Title 1: https://url1.com
- Source Title 2: https://url2.com
Max Length: 3,000 characters (enforced for Email Composer context window)
scraped_urls
string[]
URLs of all sources cited in the summary
Example:
[
  "https://university.edu/faculty/jane-smith",
  "https://scholar.google.com/citations?user=abc123",
  "https://research-lab.org/team/smith"
]
scraped_page_contents
object
Mapping of URLs to their full text content (for debugging)
Structure:
{
  "https://url1.com": "Full text content...",
  "https://url2.com": "Full text content..."
}
scraping_metadata
object
Metadata about the scraping operation:
  • source - Always "exa_dual" (dual-query strategy)
  • success - Boolean indicating if information was found
  • citation_count - Number of sources cited
  • background_length - Character count of background answer
  • publications_length - Character count of publications answer
  • combined_length - Character count of combined summary
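Assuming the metadata is derived directly from the two answers and the citation list, it could be assembled like this (a sketch based on the field names above; the actual population happens in main.py):

```python
def build_scraping_metadata(background: str, publications: str,
                            combined: str, citation_count: int) -> dict:
    """Sketch: assemble scraping_metadata from the dual-query results."""
    return {
        "source": "exa_dual",                 # always the dual-query strategy
        "success": bool(combined.strip()),    # information was found
        "citation_count": citation_count,
        "background_length": len(background),
        "publications_length": len(publications),
        "combined_length": len(combined),
    }
```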

Implementation Details

Dual-Query Strategy

The scraper executes two parallel Exa queries for comprehensive coverage:
def _build_queries(self, pipeline_data: PipelineData) -> tuple[str, str]:
    """Build background and publications queries."""
    background = build_background_query(
        recipient_name=pipeline_data.recipient_name,
        recipient_interest=pipeline_data.recipient_interest
    )
    publications = build_publications_query(
        recipient_name=pipeline_data.recipient_name,
        recipient_interest=pipeline_data.recipient_interest,
        template_type=pipeline_data.template_type
    )
    return background, publications
Source: pipeline/steps/web_scraper/main.py:30-41

Query 1: Background

Focuses on:
  • Current position and affiliation
  • Educational background
  • Research interests and expertise
  • Lab or group affiliations
Example Query:
Dr. Jane Smith machine learning healthcare - current position, 
research interests, university affiliation

Query 2: Publications

Focuses on (varies by template_type):
  • RESEARCH: Recent papers, top publications, Google Scholar profile
  • BOOK: Authored books, published works, writing credits
  • GENERAL: Professional achievements, notable projects
Example Query (RESEARCH):
Dr. Jane Smith machine learning healthcare - recent publications, 
research papers, Google Scholar
Source: pipeline/steps/web_scraper/prompts.py
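The two example queries suggest templates along these lines. This is a hypothetical reconstruction of the helpers in prompts.py, not the actual implementation:

```python
def build_background_query(recipient_name: str, recipient_interest: str) -> str:
    """Hypothetical reconstruction of the background query template."""
    return (f"{recipient_name} {recipient_interest} - current position, "
            f"research interests, university affiliation")


def build_publications_query(recipient_name: str, recipient_interest: str,
                             template_type: str) -> str:
    """Hypothetical reconstruction: only the focus clause varies by template type."""
    focus = {
        "RESEARCH": "recent publications, research papers, Google Scholar",
        "BOOK": "authored books, published works, writing credits",
        "GENERAL": "professional achievements, notable projects",
    }[template_type]
    return f"{recipient_name} {recipient_interest} - {focus}"
```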

Exa Search Integration

The step uses Exa’s native AI-powered search and summarization:
result = await self.exa_client.dual_answer(
    background_query=background_query,
    publications_query=publications_query,
    timeout=45.0
)
Exa Features Used:
  • AI Search - Semantic search beyond keyword matching
  • Auto-summarization - Built-in AI answers from search results
  • Citations - Automatic source attribution
  • Content Extraction - Full page text for each result
Source: pipeline/steps/web_scraper/main.py:64-68
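Internally, `dual_answer` presumably issues both queries concurrently under a single shared timeout. A minimal sketch with asyncio, where the `search_one` callable stands in for the real Exa request:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class QueryAnswer:
    """Stand-in for one Exa answer: summary text plus cited URLs."""
    answer: str
    urls: list[str]


async def dual_answer(background_query: str, publications_query: str,
                      search_one, timeout: float = 45.0):
    """Run both queries concurrently; the timeout covers the whole pair."""
    background, publications = await asyncio.wait_for(
        asyncio.gather(search_one(background_query),
                       search_one(publications_query)),
        timeout=timeout,
    )
    return background, publications
```

Running the two queries with `asyncio.gather` rather than sequentially is what keeps the step's wall-clock cost close to a single Exa round trip.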

Result Formatting

The dual query results are combined with source attribution:
def _format_result(self, result: DualQueryResult) -> str:
    """Format combined answer with sources."""
    formatted = result.combined_answer
    
    if result.all_citations:
        sources = [f"- {c.title or 'Source'}: {c.url}" for c in result.all_citations]
        formatted += "\n\n**SOURCES:**\n" + "\n".join(sources)
    
    return formatted
Source: pipeline/steps/web_scraper/main.py:43-51

Execution Flow

  1. Validate Input - Check that Step 1 completed successfully
  2. Build Queries - Create background and publications queries
  3. Execute Dual Search - Send both queries to Exa API (parallel)
  4. Check Results - Handle empty results gracefully
  5. Format Output - Combine summaries with source citations
  6. Update Pipeline Data - Store content, URLs, and metadata
  7. Return Success - With citation count and content length
async def _execute_step(self, pipeline_data: PipelineData) -> StepResult:
    background_query, publications_query = self._build_queries(pipeline_data)
    
    logfire.info(
        "Executing dual queries",
        recipient=pipeline_data.recipient_name,
        template_type=pipeline_data.template_type.value
    )
    
    result = await self.exa_client.dual_answer(
        background_query=background_query,
        publications_query=publications_query,
        timeout=45.0
    )
    
    # Handle empty results
    if not result.combined_answer or result.combined_answer == "No information found.":
        pipeline_data.scraped_content = "No information found for this professor."
        return StepResult(
            success=True,
            step_name=self.step_name,
            warnings=["No information found in Exa search"]
        )
    
    # Format and store results
    formatted = self._format_result(result)
    pipeline_data.scraped_content = formatted[:3000]  # Enforce max length
    pipeline_data.scraped_urls = [c.url for c in result.all_citations]
    
    return StepResult(success=True, step_name=self.step_name)
Source: pipeline/steps/web_scraper/main.py:53-122

Error Handling

Non-Fatal Errors (Pipeline Continues)

These errors are logged but don’t stop the pipeline:
  • No Results Found - Sets scraped_content to fallback message
  • Empty Answer - Pipeline continues with minimal information
When no information is found:
pipeline_data.scraped_content = "No information found for this professor."
pipeline_data.scraped_urls = []
pipeline_data.scraping_metadata = {
    "source": "exa_dual",
    "success": False,
    "citation_count": 0
}
Source: pipeline/steps/web_scraper/main.py:72-80

Fatal Errors (Pipeline Stops)

These errors will halt the pipeline:
  • Timeout Error - Exa API doesn’t respond within 45 seconds
  • Connection Error - Unable to connect to Exa API
  • API Error - Exa returns error response
All errors are wrapped in ExternalAPIError with detailed messages.
except TimeoutError as e:
    logfire.error("Exa timeout", error=str(e))
    raise ExternalAPIError(f"Exa search timed out: {e}")
except ConnectionError as e:
    logfire.error("Exa connection error", error=str(e))
    raise ExternalAPIError(f"Failed to connect to Exa API: {e}")
except Exception as e:
    logfire.error("Exa failed", error_type=type(e).__name__)
    raise ExternalAPIError(f"Exa search failed: {e}")
Source: pipeline/steps/web_scraper/main.py:124-132

Logging & Observability

The step emits detailed logs to Logfire:
logfire.info(
    "Dual search complete",
    bg_len=len(result.background.answer),
    pub_len=len(result.publications.answer),
    citations=len(result.all_citations),
    background_summary=result.background.answer,
    publications_summary=result.publications.answer,
    combined_summary=result.combined_answer
)
Source: pipeline/steps/web_scraper/main.py:104-112

Tracked Metrics

  • Background answer length
  • Publications answer length
  • Combined answer length
  • Citation count
  • URLs found
  • Execution duration
  • Success/failure status

Configuration

The step is configured via environment variables:
# Exa API credentials
EXA_API_KEY=your_api_key_here

# Timeout for Exa API calls (seconds)
EXA_TIMEOUT=45.0
Source: config/settings.py
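A minimal sketch of how these variables might be read; the real project loads them through config/settings.py, whose exact mechanism is not shown here:

```python
import os


def load_exa_settings() -> tuple[str, float]:
    """Read Exa configuration from the environment, with the documented default timeout."""
    api_key = os.environ["EXA_API_KEY"]                     # fail fast if the key is unset
    timeout = float(os.environ.get("EXA_TIMEOUT", "45.0"))  # documented default: 45 seconds
    return api_key, timeout
```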

Performance Characteristics

Average Execution Time: ~5.3 seconds
Breakdown:
  • Exa API dual query: ~3.5s
  • Result processing: ~0.3s
  • Network latency: ~1.5s
Variance Factors:
  • Number of sources found (more citations = longer processing)
  • Content length (longer pages = more extraction time)
  • Network conditions

Next Steps

After the Web Scraper completes:
  1. ArXiv Helper uses template_type to decide if academic papers are needed
  2. Email Composer uses scraped_content to personalize the email
  3. Final email cites sources from scraped_urls

Previous: Template Parser

How search terms and template type are determined

Next: ArXiv Helper

Conditional academic paper fetching for RESEARCH templates
