Overview
The Web Scraper is the second step in the email generation pipeline. It uses the Exa Search API to find information about the recipient through a dual-query strategy: one query for background and biography, one for publications and achievements.
Purpose:
- Fetch comprehensive information about the recipient
- Summarize background and professional details
- Gather publication and achievement data
- Provide cited sources for the Email Composer
Search API: Exa Search (AI-native search with citations)
Summarization: Claude Haiku 4.5 (via Exa’s built-in AI features)
Input Schema
The Web Scraper requires these fields from Step 1 (Template Parser):

Search terms
Search terms extracted by the Template Parser.
Constraints:
- Must not be empty (Step 1 must complete first)
- Used to build targeted queries

Template type
Template classification from Step 1.
Values:
- RESEARCH - Focus on academic publications
- BOOK - Focus on authored books
- GENERAL - Broad professional information

Recipient name
Name of the email recipient.
Usage: Included in both background and publications queries

Topic
Recipient's research area or topic.
Usage: Helps narrow search results to relevant content
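The input requirements above can be sketched as a small validation step. This is illustrative only: the field names (`search_terms`, `template_type`, `recipient_name`, `topic`) and the `PipelineData` shape are assumptions, not the pipeline's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineData:
    # Hypothetical subset of the pipeline's shared data object.
    search_terms: list[str] = field(default_factory=list)
    template_type: str = "GENERAL"
    recipient_name: str = ""
    topic: str = ""


def validate_input(data: PipelineData) -> None:
    """Reject runs where Step 1 (Template Parser) did not complete."""
    if not data.search_terms:
        raise ValueError("search_terms is empty - run the Template Parser first")
    if data.template_type not in {"RESEARCH", "BOOK", "GENERAL"}:
        raise ValueError(f"unknown template_type: {data.template_type!r}")
```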
Output Schema
The Web Scraper updates PipelineData with:

scraped_content
Combined summary of background and publications information.
Max Length: 3,000 characters (enforced for the Email Composer context window)

scraped_urls
URLs of all sources cited in the summary.

Raw content
Mapping of URLs to their full text content (for debugging).

Scrape metadata
Metadata about the scraping operation:
- source - Always "exa_dual" (dual-query strategy)
- success - Boolean indicating whether information was found
- citation_count - Number of sources cited
- background_length - Character count of the background answer
- publications_length - Character count of the publications answer
- combined_length - Character count of the combined summary
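The metadata fields listed above can be assembled from the two query answers. A minimal sketch, assuming the answers and cited URLs are already available; the function name is hypothetical.

```python
def build_scrape_metadata(background: str, publications: str,
                          combined: str, urls: list[str]) -> dict:
    # Mirrors the scrape metadata fields described above.
    return {
        "source": "exa_dual",               # dual-query strategy
        "success": bool(combined.strip()),  # was any information found?
        "citation_count": len(urls),
        "background_length": len(background),
        "publications_length": len(publications),
        "combined_length": len(combined),
    }
```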
Implementation Details
Dual-Query Strategy
The scraper executes two parallel Exa queries for comprehensive coverage:
pipeline/steps/web_scraper/main.py:30-41
Query 1: Background
Focuses on:
- Current position and affiliation
- Educational background
- Research interests and expertise
- Lab or group affiliations
Query 2: Publications
Focuses on (varies by template_type):
- RESEARCH: Recent papers, top publications, Google Scholar profile
- BOOK: Authored books, published works, writing credits
- GENERAL: Professional achievements, notable projects
pipeline/steps/web_scraper/prompts.py
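The two query builders above can be sketched as follows. The query wording here is illustrative; the real prompt strings live in pipeline/steps/web_scraper/prompts.py.

```python
# Publication focus varies by template_type, as described above.
PUBLICATION_FOCUS = {
    "RESEARCH": "recent papers, top publications, Google Scholar profile",
    "BOOK": "authored books, published works, writing credits",
    "GENERAL": "professional achievements, notable projects",
}


def build_queries(name: str, topic: str, template_type: str) -> tuple[str, str]:
    # Query 1: background - position, education, interests, lab.
    background = (
        f"{name} {topic}: current position, affiliation, education, "
        "research interests, lab or group"
    )
    # Query 2: publications - focus depends on the template type.
    publications = f"{name} {topic}: {PUBLICATION_FOCUS[template_type]}"
    return background, publications
```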
Exa Search Integration
The step uses Exa’s native AI-powered search and summarization:
- AI Search - Semantic search beyond keyword matching
- Auto-summarization - Built-in AI answers from search results
- Citations - Automatic source attribution
- Content Extraction - Full page text for each result
pipeline/steps/web_scraper/main.py:64-68
Result Formatting
The dual query results are combined with source attribution:
pipeline/steps/web_scraper/main.py:43-51
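The combination step can be sketched as below: merge both answers, append a source list, and enforce the 3,000-character cap from the output schema. Section labels and the function name are illustrative, not taken from the implementation.

```python
MAX_CONTENT_LENGTH = 3_000  # enforced for the Email Composer context window


def format_results(background: str, publications: str, urls: list[str]) -> str:
    # Combine both answers with source attribution, then truncate.
    sources = "\n".join(f"- {u}" for u in urls)
    combined = (
        f"Background:\n{background}\n\n"
        f"Publications:\n{publications}\n\n"
        f"Sources:\n{sources}"
    )
    return combined[:MAX_CONTENT_LENGTH]
```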
Execution Flow
- Validate Input - Check that Step 1 completed successfully
- Build Queries - Create background and publications queries
- Execute Dual Search - Send both queries to Exa API (parallel)
- Check Results - Handle empty results gracefully
- Format Output - Combine summaries with source citations
- Update Pipeline Data - Store content, URLs, and metadata
- Return Success - With citation count and content length
pipeline/steps/web_scraper/main.py:53-122
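Step 3 of the flow (parallel dual search) can be sketched with a thread pool, which suits this I/O-bound API call. `exa_search` is a stand-in for the real Exa API client, not its actual interface.

```python
from concurrent.futures import ThreadPoolExecutor


def exa_search(query: str) -> str:
    # Placeholder: the real step calls the Exa Search API here.
    return f"answer for: {query}"


def dual_search(background_q: str, publications_q: str) -> tuple[str, str]:
    # Submit both queries at once; result() blocks until each completes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        bg_future = pool.submit(exa_search, background_q)
        pub_future = pool.submit(exa_search, publications_q)
        return bg_future.result(), pub_future.result()
```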
Error Handling
Non-Fatal Errors (Pipeline Continues)
These errors are logged but don’t stop the pipeline:
- No Results Found - Sets scraped_content to a fallback message
- Empty Answer - Pipeline continues with minimal information
pipeline/steps/web_scraper/main.py:72-80
Fatal Errors (Pipeline Stops)
pipeline/steps/web_scraper/main.py:124-132
Logging & Observability
The step emits detailed logs to Logfire:
pipeline/steps/web_scraper/main.py:104-112
Tracked Metrics
- Background answer length
- Publications answer length
- Combined answer length
- Citation count
- URLs found
- Execution duration
- Success/failure status
Configuration
The step is configured via environment variables:
config/settings.py
Performance Characteristics
Average Execution Time: 5.3 seconds
Breakdown:
- Exa API dual query: ~3.5s
- Result processing: ~0.3s
- Network latency: ~1.5s
Factors affecting duration:
- Number of sources found (more citations = longer processing)
- Content length (longer pages = more extraction time)
- Network conditions
Next Steps
After the Web Scraper completes:
- ArXiv Helper uses template_type to decide if academic papers are needed
- Email Composer uses scraped_content to personalize the email
- The final email cites sources from scraped_urls
Previous: Template Parser - How search terms and template type are determined
Next: ArXiv Helper - Conditional academic paper fetching for RESEARCH templates
