Overview
The `fetch-twitter` function collects social media posts related to a topic using multiple data sources with automatic fallback. Despite its name, it implements a multi-source strategy:
- Primary: Scrape.do (X/Twitter + Reddit)
- Secondary: Parallel AI Social Search
- Tertiary: YouTube API
- Final Fallback: Algorithmic post generation
All retrieved posts are stored in the `posts` table with platform attribution.
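The fallback chain above can be sketched as follows. This is a minimal illustration, not the function's actual code: the `Post` shape, the helper structure, and the fall-through-on-empty behavior are assumptions based on the behavior this page describes.

```typescript
// Hypothetical post shape; the real `posts` row has more columns.
interface Post {
  platform: string;
  content: string;
}

// Try each source in order, falling through when one throws or
// returns zero posts, mirroring the chain documented above.
async function fetchWithFallback(
  sources: Array<{ name: string; fn: () => Promise<Post[]> }>,
): Promise<{ posts: Post[]; info: string }> {
  for (const { name, fn } of sources) {
    try {
      const posts = await fn();
      if (posts.length > 0) return { posts, info: name };
    } catch {
      // Source failed (blocked, quota, missing token, ...); try the next one.
    }
  }
  // All real sources failed; the algorithmic generator would run here.
  return { posts: [], info: "none" };
}
```

The `info` value carried back to the caller records which source actually produced the posts.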
Endpoint
This function is designed to be called internally by the `analyze-topic` orchestrator, not directly by client applications.

Request
UUID of the topic to fetch posts for
Example Request
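The example body is not reproduced on this page. An illustrative sketch, assuming the topic UUID is passed as a field named `topic_id` (the field name is an assumption; only its meaning is documented):

```json
{
  "topic_id": "3f2b1c9e-8a47-4d6b-9b1a-0c5d2e7f4a18"
}
```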
Response

| Field | Description |
|---|---|
| `success` | Whether data was successfully fetched and stored |
| `fetched` | Number of posts retrieved from the source |
| `inserted` | Number of posts successfully inserted into the database (deduplication may cause this to be lower than `fetched`) |
| `info` | Information about the data source used |
| `scrape_status` | Status of the Scrape.do operation: one of `"ok"`, `"blocked"`, `"quota"`, `"no_token"`, `"error"`, `"idle"` |
Success Response (Scrape.do)
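The response body is not reproduced here. An illustrative sketch using the documented fields; all values, including the `info` string format, are hypothetical:

```json
{
  "success": true,
  "fetched": 24,
  "inserted": 21,
  "info": "scrape.do (twitter+reddit)",
  "scrape_status": "ok"
}
```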
Success Response (YouTube Fallback)
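An illustrative sketch of a YouTube-fallback response, assuming `scrape_status` records why Scrape.do was abandoned; values are hypothetical:

```json
{
  "success": true,
  "fetched": 15,
  "inserted": 14,
  "info": "youtube",
  "scrape_status": "blocked"
}
```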
Success Response (Algorithmic Fallback)
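An illustrative sketch of an algorithmic-fallback response; the `fetched` count of 10 matches the fixed yield documented below, and the other values are hypothetical:

```json
{
  "success": true,
  "fetched": 10,
  "inserted": 10,
  "info": "simulated",
  "scrape_status": "error"
}
```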
Error Response
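An illustrative sketch; the error payload is not documented on this page, so the `error` field name and message are assumptions:

```json
{
  "success": false,
  "error": "topic not found"
}
```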
Data Sources
1. Scrape.do (Primary)
Platforms: X/Twitter and Reddit
Method: Parallel HTML scraping of both platforms
Requirements: `SCRAPE_DO_TOKEN` environment variable
Configuration:
X/Twitter:
- Primary: `article` elements with `data-testid="tweet"`
- Fallback: `span` elements with `lang="en"`
Reddit:
- Primary: Reddit JSON API via Scrape.do
- Extracts title + selftext from search results
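The X/Twitter selectors can be illustrated with a small extraction sketch. This uses a regex approximation for brevity; the real function presumably parses the HTML properly, and nothing here is its actual code:

```typescript
// Regex approximation of the documented selectors, for illustration only.
function extractTweetTexts(html: string): string[] {
  // Primary: <article data-testid="tweet"> blocks.
  const articles = Array.from(
    html.matchAll(/<article[^>]*data-testid="tweet"[^>]*>([\s\S]*?)<\/article>/g),
    (m) => m[1],
  );
  // Fallback: <span lang="en"> elements when no tweet articles match.
  const blocks = articles.length > 0
    ? articles
    : Array.from(
        html.matchAll(/<span[^>]*lang="en"[^>]*>([\s\S]*?)<\/span>/g),
        (m) => m[1],
      );
  // Strip any remaining tags and collapse whitespace.
  return blocks
    .map((b) => b.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim())
    .filter((t) => t.length > 0);
}
```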
2. Parallel AI (Secondary)
Method: AI-powered web search and content extraction
Requirements: `PARALLEL_API_KEY` environment variable
Activation: Only if Scrape.do fails or returns 0 posts
Configuration:
3. YouTube API (Tertiary)
Method: Official YouTube Data API v3
Requirements: `YOUTUBE_API_KEY` environment variable
Activation: Only if Scrape.do and Parallel AI both fail
Search Parameters:
- `type`: video
- `maxResults`: 15
- `part`: snippet
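These parameters map onto the public YouTube Data API v3 `search` endpoint. A sketch of building the request URL; the helper name and the topic-to-query mapping are assumptions:

```typescript
// Build a YouTube Data API v3 search URL with the parameters listed
// above. The query string is assumed to be the topic text.
function buildYouTubeSearchUrl(topic: string, apiKey: string): string {
  const params = new URLSearchParams({
    part: "snippet",
    type: "video",
    maxResults: "15",
    q: topic,
    key: apiKey,
  });
  return `https://www.googleapis.com/youtube/v3/search?${params}`;
}

// Usage (network call omitted here):
// const res = await fetch(buildYouTubeSearchUrl(topic, YOUTUBE_API_KEY));
```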
4. Algorithmic Generation (Final Fallback)
Activation: If all other sources fail
Method: Template-based synthetic posts, labeled `"simulated"`
Typical Yield: Exactly 10 posts
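The template mechanism can be sketched as below. The template strings are invented for illustration (the real templates are not listed on this page); only the fixed yield of 10 and the `"simulated"` label come from the description above:

```typescript
// Invented templates for illustration only.
const TEMPLATES = [
  "Really curious where {topic} is heading.",
  "Hot take: {topic} matters more than people realize.",
  "Anyone else following the {topic} discussion?",
  "Mixed feelings about {topic}, honestly.",
  "The {topic} news this week has been wild.",
];

// Produce exactly 10 synthetic posts tagged "simulated".
function generateSimulatedPosts(topic: string): Array<{ platform: string; content: string }> {
  return Array.from({ length: 10 }, (_, i) => ({
    platform: "simulated",
    content: TEMPLATES[i % TEMPLATES.length].replace("{topic}", topic),
  }));
}
```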
Post Schema
All posts are inserted into the `posts` table:
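The column list is not reproduced on this page. A hedged sketch of the row shape, inferring only what this page mentions (topic linkage, platform attribution, post content); every column name here is a guess, and the real table likely has more columns:

```typescript
// Assumed shape of a `posts` row; names are illustrative guesses.
interface PostRow {
  topic_id: string; // UUID of the analyzed topic (assumed name)
  platform: string; // e.g. "twitter" | "reddit" | "youtube" | "simulated"
  content: string;  // the post text
}

const example: PostRow = {
  topic_id: "00000000-0000-0000-0000-000000000000",
  platform: "twitter",
  content: "Example post text",
};
```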
Deduplication
Posts are upserted using a composite unique constraint.

Error Handling
Scrape Status Codes
| Status | HTTP Code | Meaning |
|---|---|---|
| `ok` | 200 | Successfully scraped data |
| `blocked` | 403/407 | Scrape.do detected bot behavior or login wall |
| `quota` | 402/429 | Scrape.do API quota exceeded |
| `no_token` | - | `SCRAPE_DO_TOKEN` environment variable not set |
| `error` | Other | Network error or unexpected response |
| `idle` | - | Initial state before scraping attempt |
Login Wall Detection
X/Twitter login walls are detected by checking the returned HTML for login-wall indicators. When one is found, `scrape_status` is set to `"blocked"` and fallback sources are used.
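The concrete markers checked are not listed on this page. A sketch under that caveat, with invented indicator strings:

```typescript
// Invented indicator strings; the real checks are not documented here.
const LOGIN_WALL_MARKERS = ["Log in", "Sign up", "login-prompt"];

function detectLoginWall(html: string): boolean {
  return LOGIN_WALL_MARKERS.some((marker) => html.includes(marker));
}

// When a wall is detected, the function reports scrape_status "blocked"
// so the orchestrator can move on to fallback sources.
function scrapeStatusFor(html: string): "ok" | "blocked" {
  return detectLoginWall(html) ? "blocked" : "ok";
}
```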
Performance
Typical execution time: 8-15 seconds
- Scrape.do (fast): ~8-12s
- Fallback to Parallel AI: ~12-18s
- Fallback to YouTube: ~6-10s
- Algorithmic generation: <1s
Rate Limits
Best Practices
- Monitor `scrape_status` in your orchestrator to track quota usage
- Set up all API keys (Scrape.do, Parallel, YouTube) for maximum reliability
- Use the `info` field to determine which data source was used
- Check `inserted` vs `fetched` to detect deduplication rates

Related Functions
- analyze-topic - Orchestrator that calls this function
- fetch-reddit - Reddit-specific scraper
- fetch-youtube - YouTube comment scraper
- analyze-sentiment - Analyzes collected posts