The PipelineData dataclass is the single source of truth during pipeline execution. Each step reads the outputs of previous steps and writes to its own specific fields.
PipelineData Structure
The full dataclass, with field-level documentation, is defined in pipeline/models/core.py.
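A minimal sketch of what such a dataclass might look like. The field names follow the step documentation below, but the types, defaults, and the `add_error` helper are assumptions, not the actual definition in pipeline/models/core.py:

```python
from dataclasses import dataclass, field


@dataclass
class PipelineData:
    """Single source of truth passed between pipeline steps (sketch)."""
    # Initial inputs
    email_template: str = ""
    recipient_name: str = ""
    recipient_interest: str = ""
    # Step 1: Template Parser outputs
    search_terms: list = field(default_factory=list)
    template_type: str = ""  # RESEARCH | BOOK | GENERAL
    template_analysis: dict = field(default_factory=dict)
    # Step 2: Web Scraper outputs
    scraped_content: str = ""
    scraped_urls: list = field(default_factory=list)
    scraped_page_contents: dict = field(default_factory=dict)
    scraping_metadata: dict = field(default_factory=dict)
    # Step 3: ArXiv Enricher outputs
    arxiv_papers: list = field(default_factory=list)
    enrichment_metadata: dict = field(default_factory=dict)
    # Step 4: Email Composer outputs
    final_email: str = ""
    composition_metadata: dict = field(default_factory=dict)
    is_confident: bool = False
    metadata: dict = field(default_factory=dict)
    # Non-fatal errors collected along the way
    errors: list = field(default_factory=list)

    def add_error(self, step: str, message: str) -> None:
        """Record a non-fatal error without stopping the pipeline."""
        self.errors.append({"step": step, "message": message})


data = PipelineData(recipient_name="Dr. Lee")
```

Using `field(default_factory=...)` for the mutable fields avoids sharing one list or dict across instances.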
Data Flow by Step
Step 1: Template Parser
Reads:
- email_template - Template with placeholders
- recipient_name - Recipient's full name
- recipient_interest - Research area/interest

Writes:
- search_terms - List of search queries for web scraping
- template_type - RESEARCH | BOOK | GENERAL classification
- template_analysis - Detailed parsing results (placeholders, tone, etc.)

Validates:
- Template type must be set (RESEARCH, BOOK, or GENERAL)
- At least 1 search term extracted
- Template analysis contains placeholder information
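The Step 1 validation rules above could be expressed as a small checker. This is a sketch under assumptions; the function name and the shape of the data object are hypothetical:

```python
from types import SimpleNamespace


def validate_template_parser_output(data) -> list:
    """Return a list of validation failures for Step 1 (sketch)."""
    failures = []
    if data.template_type not in ("RESEARCH", "BOOK", "GENERAL"):
        failures.append("template_type must be RESEARCH, BOOK, or GENERAL")
    if len(data.search_terms) < 1:
        failures.append("at least 1 search term required")
    if "placeholders" not in data.template_analysis:
        failures.append("template_analysis missing placeholder information")
    return failures


# Example: a well-formed Step 1 result passes all three checks.
ok = SimpleNamespace(
    template_type="RESEARCH",
    search_terms=["quantum machine learning"],
    template_analysis={"placeholders": ["{{name}}"]},
)
failures = validate_template_parser_output(ok)
```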
Step 2: Web Scraper
Reads:
- search_terms - Queries generated by Template Parser
- recipient_name - For relevance filtering
- recipient_interest - For relevance filtering

Writes:
- scraped_content - Summarized content (max 3000 chars)
- scraped_urls - List of successfully scraped URLs
- scraped_page_contents - Raw content per URL (for debugging)
- scraping_metadata - Stats about scraping success/failure

Validates:
- Scraped content not empty (at least 100 chars)
- At least 1 URL successfully scraped
- Content sanitized (no HTML tags, scripts, or excessive whitespace)
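The sanitization rule above (no HTML tags, scripts, or excessive whitespace, capped at 3000 chars) might look like this. A sketch only; the actual scraper's cleaning logic may differ:

```python
import re


def sanitize_scraped_content(raw: str, max_chars: int = 3000) -> str:
    """Strip script blocks, HTML tags, and excess whitespace (sketch)."""
    # Remove <script>...</script> blocks entirely, including their contents.
    text = re.sub(r"<script\b[^>]*>.*?</script>", " ", raw,
                  flags=re.DOTALL | re.IGNORECASE)
    # Remove any remaining HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Enforce the 3000-char summary cap.
    return text[:max_chars]


clean = sanitize_scraped_content(
    "<p>Dr. Lee   studies <b>quantum</b> ML.</p><script>track()</script>"
)
```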
Step 3: ArXiv Enricher
Reads:
- template_type - Only runs if RESEARCH type
- recipient_name - For ArXiv author search
- recipient_interest - For keyword-based search
- scraped_content - For additional context

Writes:
- arxiv_papers - List of relevant papers (top 5)
- enrichment_metadata - Stats about paper fetching

Validates:
- If RESEARCH template and papers found, returns at least 1 paper
- Each paper must have title, abstract, url, and authors fields
- ArXiv API failure is non-fatal (pipeline continues without papers)
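The non-fatal failure behavior can be sketched as follows. The function and the `fetch_papers` callable are hypothetical; only the skip-if-not-RESEARCH and record-and-continue semantics come from the documentation:

```python
from types import SimpleNamespace


def enrich_with_arxiv(data, fetch_papers):
    """Run ArXiv enrichment; API failures are non-fatal (sketch)."""
    if data.template_type != "RESEARCH":
        return data  # enrichment only applies to RESEARCH templates
    try:
        papers = fetch_papers(data.recipient_name, data.recipient_interest)
        data.arxiv_papers = papers[:5]  # keep the top 5
    except Exception as exc:
        # Record the failure and continue; Step 4 composes without papers.
        data.add_error("arxiv_enricher", f"ArXiv API failed: {exc}")
        data.arxiv_papers = []
    return data


# Simulate an ArXiv outage: the pipeline records the error and moves on.
data = SimpleNamespace(
    template_type="RESEARCH",
    recipient_name="Dr. Lee",
    recipient_interest="quantum ML",
    arxiv_papers=[],
    errors=[],
)
data.add_error = lambda step, msg: data.errors.append(
    {"step": step, "message": msg}
)


def failing_fetch(name, interest):
    raise TimeoutError("arxiv.org unreachable")


result = enrich_with_arxiv(data, failing_fetch)
```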
Step 4: Email Composer
Reads (ALL previous step outputs):
- email_template - Original template
- recipient_name - Recipient's name
- recipient_interest - Research interest
- scraped_content - Web scraping summary
- arxiv_papers - Academic papers (if available)
- template_type - Template classification
- template_analysis - Placeholder information

Writes:
- final_email - Generated email text
- composition_metadata - Generation stats
- is_confident - Quality confidence flag
- metadata - Final metadata for the database (includes email_id)

Validates:
- Final email not empty (at least 50 chars)
- Recipient name appears in the email
- No unfilled placeholders (double braces, brackets, [INSERT, etc.)
- If RESEARCH template, at least one paper mentioned (warning if not)
- email_id set in metadata after database write
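The unfilled-placeholder check could be implemented with a few regex patterns. The pattern list is an assumption based on the markers named above (double braces, brackets, [INSERT):

```python
import re

# Patterns suggesting an unfilled placeholder (assumed list, per the checks above).
PLACEHOLDER_PATTERNS = [
    r"\{\{.*?\}\}",       # double braces: {{name}}
    r"\[INSERT[^\]]*\]",  # [INSERT ...] markers
    r"\[[A-Z_ ]+\]",      # bracketed ALL-CAPS tokens: [PAPER TITLE]
]


def has_unfilled_placeholders(email: str) -> bool:
    """True if any placeholder pattern survives in the composed email."""
    return any(re.search(p, email) for p in PLACEHOLDER_PATTERNS)


flag = has_unfilled_placeholders("Dear {{name}}, I read your work on [TOPIC].")
```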
Metadata Collection
Metadata is collected throughout the pipeline and stored in the final database record:
Step Timings
Automatically collected by the BasePipelineStep.execute() method. Used for:
- Performance monitoring
- Identifying slow steps
- Capacity planning
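One way execute() might record per-step timings. The class structure, method names, and metadata key are assumptions; only the fact that execute() collects timings automatically comes from the documentation:

```python
import time
from types import SimpleNamespace


class BasePipelineStep:
    """Sketch: execute() wraps run() and records the elapsed time."""
    name = "base"

    def run(self, data):
        raise NotImplementedError

    def execute(self, data):
        start = time.perf_counter()
        result = self.run(data)
        elapsed = time.perf_counter() - start
        # Store under a hypothetical "step_timings" key in the metadata dict.
        result.metadata.setdefault("step_timings", {})[self.name] = elapsed
        return result


class TemplateParser(BasePipelineStep):
    name = "template_parser"

    def run(self, data):
        return data  # real parsing omitted


data = SimpleNamespace(metadata={})
data = TemplateParser().execute(data)
```

Because timing lives in the base class, each concrete step only implements run() and gets monitoring for free.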
Error Tracking
Non-fatal errors are recorded via pipeline_data.add_error(). Used for:
- Debugging quality issues
- Identifying flaky external APIs
- User support (explaining why an email may be generic)
Content Sources
Tracks where information came from. Used for:
- Citation verification
- Quality auditing
- User transparency (showing sources)
Generation Context
High-level execution metadata. Used for:
- Cost tracking
- A/B testing different models
- Quality correlation analysis
Data Flow Diagram
Common Data Flow Patterns
Next Steps
- Step 1: Template Parser - Learn how templates are analyzed and search terms are extracted
- Step 2: Web Scraper - Understand the two-tier summarization and anti-hallucination safeguards
- Step 3: ArXiv Enricher - See how academic papers are fetched and ranked by relevance
- Step 4: Email Composer - Explore the three-attempt validation system and quality checks
