ArXiv Helper

Overview

The ArXiv Helper is the third step in the email generation pipeline. It conditionally fetches academic papers from ArXiv, but only if the template_type is RESEARCH. For BOOK or GENERAL templates, this step is skipped entirely. Purpose:

Fetch recent academic papers authored by the recipient
Enrich email context with specific research titles and URLs
Enable personalized references to the recipient’s work

Timing: ~0.8 seconds (when executed), ~0.01 seconds (when skipped)
API: ArXiv API (free, no authentication required)
Conditional: Only runs if template_type == RESEARCH

Input Schema

The ArXiv Helper requires these fields from Step 1 (Template Parser):

template_type

enum

required

Template classification from Step 1Conditional Logic:

RESEARCH → Fetch papers from ArXiv
BOOK → Skip (no papers needed)
GENERAL → Skip (no papers needed)

recipient_name

string

required

Name of the email recipientUsage: Used as author name in ArXiv search queryExample: "Jane Smith" → ArXiv query: au:"Jane Smith"

recipient_interest

string

required

Recipient’s research area or topicUsage: Validated to ensure template context is complete

Output Schema

The ArXiv Helper updates PipelineData with:

arxiv_papers

object[]

Array of ArXiv papers (up to 5 most recent)Structure:

[
  {
    "title": "Deep Learning for Medical Image Segmentation",
    "arxiv_url": "https://arxiv.org/abs/2301.12345",
    "year": "2023",
    "summary": "Abstract text...",
    "authors": ["Jane Smith", "John Doe"]
  }
]

Empty Array When:

template_type != RESEARCH (skipped)
No papers found on ArXiv
ArXiv API error (non-fatal)

enrichment_metadata

object

Metadata about the enrichment operation:When Skipped:

{
  "skipped": true,
  "reason": "template_type is BOOK, not RESEARCH"
}

When Executed:

{
  "skipped": false,
  "papers_found": 3,
  "search_query": "Jane Smith"
}

On Error:

{
  "error": "ArXiv API timeout",
  "papers_found": 0
}

Implementation Details

Conditional Execution

The step first checks template type before searching:

async def _execute_step(self, pipeline_data: PipelineData) -> StepResult:
    # Step 1: Check template type
    if pipeline_data.template_type != TemplateType.RESEARCH:
        logfire.info(
            "Skipping ArXiv search - not RESEARCH template",
            template_type=pipeline_data.template_type.value
        )
        
        # Set empty results
        pipeline_data.arxiv_papers = []
        pipeline_data.enrichment_metadata = {
            "skipped": True,
            "reason": f"template_type is {pipeline_data.template_type.value}, not RESEARCH"
        }
        
        return StepResult(
            success=True,
            step_name=self.step_name,
            metadata={"skipped": True}
        )

Source: pipeline/steps/arxiv_helper/main.py:33-55

ArXiv Search

When template_type == RESEARCH, the step searches by author name:

papers = await search_arxiv(
    author_name=pipeline_data.recipient_name
)

Search Strategy:

Query: au:"[recipient_name]" (exact author name match)
Sort: By submission date (most recent first)
Max results: 5 papers
Timeout: 10 seconds

Source: pipeline/steps/arxiv_helper/main.py:63-65

Paper Data Model

Each paper is serialized with:

class ArxivPaper:
    title: str          # Paper title
    arxiv_url: str      # https://arxiv.org/abs/[id]
    year: str           # Publication year
    summary: str        # Abstract (truncated to 500 chars)
    authors: List[str]  # List of author names
    
    def to_dict(self) -> dict:
        """Serialize for PipelineData storage."""
        return {
            "title": self.title,
            "arxiv_url": self.arxiv_url,
            "year": self.year,
            "summary": self.summary[:500],  # Truncate for context window
            "authors": self.authors
        }

Source: pipeline/steps/arxiv_helper/utils.py

Execution Flow

When Template Type is RESEARCH

Check Template Type - Confirm template_type == RESEARCH
Search ArXiv - Query by author name
Handle Empty Results - Set empty array if no papers found
Serialize Papers - Convert to dict format for PipelineData
Update Pipeline Data - Store papers and metadata
Return Success - With paper count and titles

# Search ArXiv
papers = await search_arxiv(
    author_name=pipeline_data.recipient_name
)

if not papers:
    logfire.warning(
        "No papers found on ArXiv",
        recipient_name=pipeline_data.recipient_name
    )
    pipeline_data.arxiv_papers = []
    return StepResult(
        success=True,
        warnings=["No papers found on ArXiv"]
    )

# Update PipelineData
pipeline_data.arxiv_papers = [
    paper.to_dict() for paper in papers
]

logfire.info(
    "ArXiv papers found",
    total_papers=len(papers),
    paper_titles=[p.title for p in papers[:3]]
)

Source: pipeline/steps/arxiv_helper/main.py:57-111

When Template Type is NOT RESEARCH

Check Template Type - Detect BOOK or GENERAL
Log Skip Reason - Record why search was skipped
Set Empty Results - arxiv_papers = []
Update Metadata - Mark as skipped with reason
Return Success - With skip metadata

Error Handling

Non-Fatal Errors (Pipeline Continues)

ArXiv errors are always non-fatal - the pipeline continues with empty results:

No Papers Found - Common for non-academics or recent PhDs
API Timeout - ArXiv API slow or unresponsive
Network Errors - Connection issues
Parsing Errors - Malformed ArXiv XML responses

Philosophy: Better to send an email without papers than to fail entirely.

except Exception as e:
    logfire.error(
        "Error in ArXiv helper",
        error=str(e),
        error_type=type(e).__name__
    )
    
    # ArXiv errors are not fatal - continue pipeline
    pipeline_data.arxiv_papers = []
    pipeline_data.enrichment_metadata = {
        "error": str(e),
        "papers_found": 0
    }
    
    return StepResult(
        success=True,  # Still success!
        step_name=self.step_name,
        warnings=[f"ArXiv search failed: {str(e)}"]
    )

Source: pipeline/steps/arxiv_helper/main.py:113-132

Logging & Observability

The step emits conditional logs based on execution path:

When Skipped

logfire.info(
    "Skipping ArXiv search - not RESEARCH template",
    template_type=pipeline_data.template_type.value
)

When Executed

logfire.info(
    "Searching ArXiv for papers",
    recipient_name=pipeline_data.recipient_name
)

logfire.info(
    "ArXiv papers found",
    total_papers=len(papers),
    paper_titles=[p.title for p in papers[:3]]
)

On Error

logfire.error(
    "Error in ArXiv helper",
    error=str(e),
    error_type=type(e).__name__
)

Source: pipeline/steps/arxiv_helper/main.py:36-90

Performance Characteristics

Execution Time:

When Skipped: ~0.01 seconds (conditional check only)
When Executed: ~0.8 seconds average
- ArXiv API query: ~0.6s
- XML parsing: ~0.1s
- Data serialization: ~0.1s

Variance Factors:

ArXiv API response time (varies by load)
Number of papers returned (more papers = longer parsing)
Network latency

ArXiv API Details

API Endpoint

GET https://export.arxiv.org/api/query

Query Parameters

params = {
    "search_query": f'au:"{author_name}"',
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "max_results": 5
}

Response Format

ArXiv returns XML (Atom feed format):

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Paper Title</title>
    <id>http://arxiv.org/abs/2301.12345</id>
    <published>2023-01-15T12:00:00Z</published>
    <summary>Abstract text...</summary>
    <author>
      <name>Jane Smith</name>
    </author>
  </entry>
</feed>

Source: pipeline/steps/arxiv_helper/utils.py

Configuration

No API key required (ArXiv is free and open). Configuration via environment variables:

# ArXiv API timeout (seconds)
ARXIV_TIMEOUT=10.0

# Max papers to fetch
ARXIV_MAX_RESULTS=5

Source: config/settings.py

Next Steps

After the ArXiv Helper completes:

Email Composer uses arxiv_papers to reference specific research in the email
Papers are cited in the final email (if available)
Metadata is stored in database for analytics

Previous: Web Scraper

How background information is fetched and summarized

Next: Email Composer

Final email generation and database write

Overview

Steps

Advanced

Overview

Input Schema

Output Schema

Implementation Details

Conditional Execution

ArXiv Search

Paper Data Model

Execution Flow

When Template Type is RESEARCH

When Template Type is NOT RESEARCH

Error Handling

Non-Fatal Errors (Pipeline Continues)

Logging & Observability

When Skipped

When Executed

On Error

Performance Characteristics

ArXiv API Details

API Endpoint

Query Parameters

Response Format

Configuration

Next Steps

Previous: Web Scraper

Next: Email Composer

Build docs developers (and LLMs) love

Overview

Steps

Advanced

​Overview

​Input Schema

​Output Schema

​Implementation Details

​Conditional Execution

​ArXiv Search

​Paper Data Model

​Execution Flow

​When Template Type is RESEARCH

​When Template Type is NOT RESEARCH

​Error Handling

​Non-Fatal Errors (Pipeline Continues)

​Logging & Observability

​When Skipped

​When Executed

​On Error

​Performance Characteristics

​ArXiv API Details

​API Endpoint

​Query Parameters

​Response Format

​Configuration

​Next Steps

Previous: Web Scraper

Next: Email Composer

Build docs developers (and LLMs) love

Overview

Input Schema

Output Schema

Implementation Details

Conditional Execution

ArXiv Search

Paper Data Model

Execution Flow

When Template Type is RESEARCH

When Template Type is NOT RESEARCH

Error Handling

Non-Fatal Errors (Pipeline Continues)

Logging & Observability

When Skipped

When Executed

On Error

Performance Characteristics

ArXiv API Details

API Endpoint

Query Parameters

Response Format

Configuration

Next Steps