Pipeline Flow
Template Parser
Analyze the email template and extract search parameters
- Identifies placeholders like {{name}}, {{research}}
- Classifies template type (RESEARCH, BOOK, GENERAL)
- Generates search queries for web scraping
- Execution time: ~1.2 seconds
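The parsing step above can be sketched as follows. This is a minimal illustration, not the real parser: the function names and the keyword-based classification rule are assumptions; in the actual pipeline the classification is an LLM call.

```python
import re

# Matches placeholders of the form {{name}} (illustrative sketch).
PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")

def extract_placeholders(template: str) -> list[str]:
    """Return placeholder names found in an email template."""
    return PLACEHOLDER_RE.findall(template)

def classify_template(placeholders: list[str]) -> str:
    """Map placeholders to a coarse template type (hypothetical rule)."""
    if "research" in placeholders:
        return "RESEARCH"
    if "book" in placeholders:
        return "BOOK"
    return "GENERAL"

template = "Hi {{name}}, I enjoyed your paper on {{research}}."
found = extract_placeholders(template)  # -> ['name', 'research']
kind = classify_template(found)         # -> 'RESEARCH'
```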
Web Scraper
Fetch and summarize relevant information about the recipient
- Google Custom Search API for URLs
- Playwright headless browser for content extraction
- Two-tier summarization for large content (>30K chars)
- Execution time: ~5.3 seconds (varies by website complexity)
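The two-tier summarization mentioned above can be sketched as the control flow below. The `summarize` stub stands in for an LLM summarization call; the chunk size matches the >30K-char threshold from the list, but everything else is illustrative.

```python
CHUNK_SIZE = 30_000  # threshold above which two-tier summarization kicks in

def summarize(text: str, max_chars: int = 500) -> str:
    """Placeholder for an LLM summarization call."""
    return text[:max_chars]

def summarize_content(content: str) -> str:
    """Single-pass summary for small pages; chunk-then-merge for large ones."""
    if len(content) <= CHUNK_SIZE:
        return summarize(content)  # tier 1 only
    chunks = [content[i:i + CHUNK_SIZE]
              for i in range(0, len(content), CHUNK_SIZE)]
    partials = [summarize(c) for c in chunks]  # tier 1: per-chunk summaries
    return summarize("\n".join(partials))      # tier 2: summary of summaries
```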
ArXiv Enricher
Conditionally fetch academic papers (only if template_type == RESEARCH)
- Queries ArXiv API for relevant publications
- Four-factor relevance scoring
- Returns top 5 most relevant papers
- Execution time: ~0.8 seconds (when executed)
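The four-factor scoring and top-5 selection might look like the sketch below. The factors and weights here are assumptions for illustration; the doc does not specify the real scoring formula.

```python
def relevance_score(paper: dict, query_terms: set[str]) -> float:
    """Hypothetical four-factor score: title hits, abstract hits,
    recency, and a capped citation weight."""
    title_hits = sum(t in paper["title"].lower() for t in query_terms)
    abstract_hits = sum(t in paper["abstract"].lower() for t in query_terms)
    recency = 1.0 if paper["year"] >= 2022 else 0.5
    citation_weight = min(paper.get("citations", 0) / 100, 1.0)
    return 2.0 * title_hits + abstract_hits + recency + citation_weight

def top_papers(papers: list[dict], query_terms: set[str], k: int = 5) -> list[dict]:
    """Return the k highest-scoring papers."""
    return sorted(papers, key=lambda p: relevance_score(p, query_terms),
                  reverse=True)[:k]
```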
Typical Execution Time
Average: ~10.4 seconds (varies by template complexity and web scraping)
Execution Time Breakdown
| Step | Average Time | Variance | Primary Bottleneck |
|---|---|---|---|
| Template Parser | 1.2s | Low | LLM API call (Claude Haiku) |
| Web Scraper | 5.3s | High | Playwright rendering + JavaScript execution |
| ArXiv Enricher | 0.8s | Low | ArXiv API response time |
| Email Composer | 3.1s | Medium | LLM generation + quality validation |
| Total | 10.4s | Medium | Network latency + LLM processing |
- Web Scraper: Highly dependent on website complexity, JavaScript load time, and number of pages
- Email Composer: Validation retries can add 2-6 seconds if first attempt fails quality checks
Design Goals
Stateless In-Memory Processing
All pipeline state lives in a single PipelineData object passed through each step. There are no intermediate database writes; only the final email is persisted.
pipeline/models/core.py
- Performance: No I/O between steps, all operations in RAM
- Simplicity: Only 1 database write per pipeline execution
- Scalability: Workers scale horizontally with no database bottleneck
- Observability: Logfire captures full execution history without DB writes
Observable by Default
Every pipeline step is automatically instrumented with Logfire spans for distributed tracing:
pipeline/core/runner.py
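The runner's instrumentation pattern can be sketched as below. A stdlib context manager stands in for the real `logfire.span(...)` wrapper so the control flow is visible without the Logfire dependency; the function names are illustrative.

```python
import time
from contextlib import contextmanager

@contextmanager
def traced_step(name: str):
    """Stand-in for a Logfire span: times the step and reports it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{name}: {time.perf_counter() - start:.3f}s")

def run_pipeline(steps, data):
    """Run each step inside its own span, threading state through."""
    for step in steps:
        with traced_step(step.__name__):
            data = step(data)
    return data
```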
Resilient Error Handling
The pipeline distinguishes between fatal and non-fatal errors:
Fatal Errors (stop pipeline):
- Template Parser fails (can't proceed without search terms)
- Email Composer database write fails
- Invalid input data (Pydantic validation)
Non-Fatal Errors (pipeline continues):
- Some URLs fail to scrape (continue with successful ones)
- ArXiv API timeout (continue without papers)
- Email validation warnings (still persist email)
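The fatal/non-fatal distinction can be sketched as a small wrapper around each step. The exception class and function names are hypothetical; the point is that non-fatal failures return the existing state so later steps still run.

```python
class FatalPipelineError(Exception):
    """Raised for errors that must stop the pipeline,
    e.g. the Template Parser failing."""

def run_step(step, data, fatal: bool):
    """Run one step; swallow non-fatal failures, propagate fatal ones."""
    try:
        return step(data)
    except Exception as exc:
        if fatal:
            raise FatalPipelineError(str(exc)) from exc
        print(f"non-fatal: {step.__name__} failed ({exc}); continuing")
        return data  # keep whatever state we already have
```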
Key Features
Cost-Effective
Average cost per email: $0.027
Uses Claude Haiku for extraction tasks, Sonnet only for final composition
Anti-Hallucination
Multi-source verification in web scraping
Facts must appear in multiple pages or be marked as [UNCERTAIN]
Quality Assurance
Three-attempt validation system
Ensures recipient name appears, no unfilled placeholders, mentions research
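The three-attempt validation loop might look like the sketch below. The check names are illustrative and cover only two of the listed checks (recipient name present, no unfilled placeholders); `compose` stands in for the LLM composition call.

```python
def validate_email(email: str, recipient: str) -> list[str]:
    """Return a list of quality issues (empty means the draft passes)."""
    issues = []
    if recipient not in email:
        issues.append("recipient name missing")
    if "{{" in email:
        issues.append("unfilled placeholder")
    return issues

def compose_with_retries(compose, recipient: str, max_attempts: int = 3) -> str:
    """Regenerate up to max_attempts times; persist the last draft
    (with warnings) even if it never fully passes."""
    email = ""
    for attempt in range(max_attempts):
        email = compose(attempt)
        if not validate_email(email, recipient):
            return email
    return email  # still persisted, per the non-fatal error policy
```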
Memory Efficient
Runs on 512MB RAM
Smart chunking and sequential processing to stay within memory limits
Pipeline Execution Diagram
Next Steps
Architecture
Learn about the BasePipelineStep pattern, PipelineRunner orchestration, and stateless design
Data Flow
Understand how PipelineData flows through steps and what each step reads/writes
