Overview

The fetch-twitter function collects social media posts related to a topic using multiple data sources with automatic fallback. Despite its name, it implements a sophisticated multi-source strategy:
  1. Primary: Scrape.do (X/Twitter + Reddit)
  2. Secondary: Parallel AI Social Search
  3. Tertiary: YouTube API
  4. Final Fallback: Algorithmic post generation
Posts are deduplicated and stored in the posts table with platform attribution.
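The fallback chain above can be sketched as a loop over source functions that advances whenever a source errors or yields nothing. This is an illustrative sketch of the strategy, not the function's actual internals:

```typescript
// Illustrative sketch of the multi-source fallback strategy.
// Names (FetchResult, Source, fetchWithFallback) are our own.
type FetchResult = { posts: string[]; info: string };
type Source = () => Promise<FetchResult>;

async function fetchWithFallback(sources: Source[]): Promise<FetchResult> {
  for (const source of sources) {
    try {
      const result = await source();
      // A source that returns zero posts triggers the next fallback.
      if (result.posts.length > 0) return result;
    } catch {
      // Any source error also triggers the next fallback.
    }
  }
  // The final fallback never fails and always yields posts.
  return { posts: ["synthetic post"], info: "Algorithmic Generation" };
}
```

The orchestrator would pass the sources in priority order (Scrape.do, Parallel AI, YouTube), with algorithmic generation as the guaranteed last resort.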

Endpoint

POST https://your-project.supabase.co/functions/v1/fetch-twitter
This function is designed to be called internally by the analyze-topic orchestrator, not directly by client applications.

Request

topic_id
string
required
UUID of the topic to fetch posts for

Example Request

curl -X POST https://your-project.supabase.co/functions/v1/fetch-twitter \
  -H "Authorization: Bearer YOUR_SERVICE_ROLE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"topic_id": "a3f5e8b1-4c2d-4e9f-8a1b-3c5d6e7f8a9b"}'

Response

success
boolean
required
Whether data was successfully fetched and stored
fetched
number
required
Number of posts retrieved from the source
inserted
number
required
Number of posts successfully inserted into the database (deduplication may cause this to be lower than fetched)
info
string
required
Information about the data source used
scrape_status
string
Status of the Scrape.do operation: "ok" | "blocked" | "quota" | "no_token" | "error" | "idle"
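Taken together, the fields above imply a response shape like the following. The interface itself is our own annotation of this page, not an export of the function:

```typescript
// Response shape implied by the documented fields.
type ScrapeStatus = "ok" | "blocked" | "quota" | "no_token" | "error" | "idle";

interface FetchTwitterResponse {
  success: boolean;             // whether data was fetched and stored
  fetched: number;              // posts retrieved from the source
  inserted: number;             // posts inserted (<= fetched after dedup)
  info: string;                 // which data source was used
  scrape_status?: ScrapeStatus; // Scrape.do operation status
}
```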

Success Response (Scrape.do)

{
  "success": true,
  "fetched": 30,
  "inserted": 28,
  "info": "Scrape.do (X: 18, Reddit: 12)",
  "scrape_status": "ok"
}

Success Response (YouTube Fallback)

{
  "success": true,
  "fetched": 15,
  "inserted": 15,
  "info": "YouTube Search API",
  "scrape_status": "quota"
}

Success Response (Algorithmic Fallback)

{
  "success": true,
  "fetched": 10,
  "inserted": 10,
  "info": "Algorithmic Generation",
  "scrape_status": "blocked"
}

Error Response

{
  "success": false,
  "error": "Topic not found"
}

Data Sources

1. Scrape.do (Primary)

Platforms: X/Twitter and Reddit
Method: Parallel HTML scraping of both platforms
Requirements: SCRAPE_DO_TOKEN environment variable
Configuration:
const scrapeUrl = buildScrapeDoUrl(token, targetUrl, {
  render: true,
  waitUntil: 'networkidle0',
  geoCode: 'us'
});
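buildScrapeDoUrl is a local helper; a minimal sketch follows, assuming Scrape.do's query-parameter API style (token, target URL, and rendering options as query parameters). Verify the parameter names against Scrape.do's own documentation before relying on this:

```typescript
// Hypothetical sketch of the buildScrapeDoUrl helper. The option names
// (render, waitUntil, geoCode) follow the call site above; this is an
// assumption about Scrape.do's API, not a confirmed implementation.
interface ScrapeOptions {
  render?: boolean;
  waitUntil?: string;
  geoCode?: string;
}

function buildScrapeDoUrl(
  token: string,
  targetUrl: string,
  opts: ScrapeOptions = {},
): string {
  const params = new URLSearchParams({
    token,
    url: targetUrl, // URLSearchParams handles percent-encoding
  });
  if (opts.render) params.set("render", "true");
  if (opts.waitUntil) params.set("waitUntil", opts.waitUntil);
  if (opts.geoCode) params.set("geoCode", opts.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}
```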
X/Twitter Parsing Strategy:
  • Primary: data-testid="tweet" article elements
  • Fallback: lang="en" span elements
Reddit Parsing Strategy:
  • Primary: Reddit JSON API via Scrape.do
  • Extracts title + selftext from search results
Typical Yield: 15-25 posts from X, 8-15 posts from Reddit
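The two-tier X/Twitter parsing strategy can be sketched as a regex pass over the rendered HTML. This is a simplification for illustration; the real parser may walk the DOM instead, and rendered markup is messier than these patterns assume:

```typescript
// Simplified sketch of the two-tier tweet extraction described above.
function extractTweets(html: string): string[] {
  // Primary: text inside data-testid="tweet" article elements.
  const articles = [...html.matchAll(
    /<article[^>]*data-testid="tweet"[^>]*>([\s\S]*?)<\/article>/g,
  )].map((m) => m[1].replace(/<[^>]+>/g, " ").trim());
  if (articles.length > 0) return articles;

  // Fallback: lang="en" span elements.
  return [...html.matchAll(/<span[^>]*lang="en"[^>]*>([\s\S]*?)<\/span>/g)]
    .map((m) => m[1].replace(/<[^>]+>/g, " ").trim())
    .filter((t) => t.length > 0);
}
```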

2. Parallel AI (Secondary)

Method: AI-powered web search and content extraction
Requirements: PARALLEL_API_KEY environment variable
Activation: Only if Scrape.do fails or returns 0 posts
Configuration:
const response = await fetch('https://api.parallel.ai/v1beta/search', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    // Auth header carrying PARALLEL_API_KEY; confirm the exact header
    // name against Parallel AI's documentation.
    'x-api-key': Deno.env.get('PARALLEL_API_KEY') ?? ''
  },
  body: JSON.stringify({
    objective: `Recent public opinions about "${topic.query}"`,
    max_results: 10
  })
});
Typical Yield: 5-10 posts from various web sources
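Search results then need to be mapped into post rows. A hypothetical mapping is sketched below; the result field names (url, title, excerpts) are assumptions about the search response shape, not confirmed by this page:

```typescript
// Hypothetical mapping of search results into post rows. The
// SearchResult fields are assumed, and the external_id/author
// formats are illustrative only.
interface SearchResult {
  url: string;
  title: string;
  excerpts?: string[];
}

function toPosts(topicId: string, results: SearchResult[]) {
  return results.map((r, i) => ({
    topic_id: topicId,
    platform: "web",
    external_id: `parallel_${i}_${r.url}`,
    author: new URL(r.url).hostname,      // attribute to the source site
    content: [r.title, ...(r.excerpts ?? [])].join(". ").slice(0, 1000),
    posted_at: new Date().toISOString(),
  }));
}
```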

3. YouTube API (Tertiary)

Method: Official YouTube Data API v3
Requirements: YOUTUBE_API_KEY environment variable
Activation: Only if Scrape.do and Parallel AI both fail
Search Parameters:
  • type: video
  • maxResults: 15
  • part: snippet
Data Structure: Video title + description combined as post text
Typical Yield: 10-15 posts
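With those parameters, the YouTube Data API v3 search request looks roughly like this. The endpoint and parameter names come from the official API; the surrounding code is a sketch, not this function's actual implementation:

```typescript
// Sketch of a YouTube Data API v3 search call using the documented
// parameters (part=snippet, type=video, maxResults=15).
function buildYouTubeSearchUrl(query: string, apiKey: string): string {
  const params = new URLSearchParams({
    part: "snippet",
    type: "video",
    maxResults: "15",
    q: query,
    key: apiKey,
  });
  return `https://www.googleapis.com/youtube/v3/search?${params}`;
}

async function fetchYouTubePosts(query: string, apiKey: string) {
  const res = await fetch(buildYouTubeSearchUrl(query, apiKey));
  const data = await res.json();
  // Title + description combined as the post text, as described above.
  return (data.items ?? []).map(
    (item: { snippet: { title: string; description: string } }) =>
      `${item.snippet.title} ${item.snippet.description}`.trim(),
  );
}
```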

4. Algorithmic Generation (Final Fallback)

Activation: If all other sources fail
Method: Template-based synthetic posts
Templates:
const templates = [
  `Huge buzz around ${topic.query} today!`,
  `People are really divided on the ${topic.query} situation.`,
  `The latest update for ${topic.query} is a total game changer.`,
  `Not impressed with ${topic.query} lately. Too much hype.`,
  `Why is nobody talking about ${topic.query}? This is massive.`
];
Platform: Marked as "simulated"
Typical Yield: Exactly 10 posts
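Cycling the five templates to produce exactly 10 posts can be sketched as follows. The author and external_id formats here are illustrative, not the function's actual values:

```typescript
// Sketch of the algorithmic fallback: cycle through the five templates
// until exactly 10 synthetic posts exist, tagged platform "simulated".
function generateSimulatedPosts(topicId: string, query: string) {
  const templates = [
    `Huge buzz around ${query} today!`,
    `People are really divided on the ${query} situation.`,
    `The latest update for ${query} is a total game changer.`,
    `Not impressed with ${query} lately. Too much hype.`,
    `Why is nobody talking about ${query}? This is massive.`,
  ];
  return Array.from({ length: 10 }, (_, i) => ({
    topic_id: topicId,
    platform: "simulated",
    external_id: `sim_${topicId}_${i}`, // illustrative ID format
    author: `@user_${i}`,               // illustrative author format
    content: templates[i % templates.length],
    posted_at: new Date().toISOString(),
  }));
}
```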

Post Schema

All posts are inserted into the posts table:
interface Post {
  topic_id: string;        // UUID reference to topics table
  platform: string;        // "x" | "reddit" | "youtube" | "web" | "simulated"
  external_id: string;     // Unique ID from source (e.g., "x_scrape_abc_0")
  author: string;          // Username prefixed with @ or u/
  content: string;         // Post text/comment body
  posted_at: string;       // ISO 8601 timestamp
}

Deduplication

Posts are upserted using a composite unique constraint:
await supabase.from('posts').upsert(postData, {
  onConflict: 'platform,external_id'
});
This ensures the same post is never inserted twice across multiple runs.

Error Handling

This function uses graceful degradation and returns HTTP 200 with success: false for most errors to allow the orchestrator to continue.

Scrape Status Codes

Status     HTTP Code   Meaning
ok         200         Successfully scraped data
blocked    403/407     Scrape.do detected bot behavior or login wall
quota      402/429     Scrape.do API quota exceeded
no_token   -           SCRAPE_DO_TOKEN environment variable not set
error      Other       Network error or unexpected response
idle       -           Initial state before scraping attempt

Login Wall Detection

X/Twitter login walls are detected by checking for:
const isLoginWall = 
  html.toLowerCase().includes('log in to x') && 
  !html.includes('data-testid="tweet"');
When detected, scrape_status is set to "blocked" and fallback sources are used.

Performance

Typical execution time: 8-15 seconds
  • Scrape.do (fast): ~8-12s
  • Fallback to Parallel AI: ~12-18s
  • Fallback to YouTube: ~6-10s
  • Algorithmic generation: <1s

Rate Limits

Scrape.do: 1,000-10,000 requests/month depending on plan
Parallel AI: Varies by subscription tier
YouTube API: 10,000 units/day (1 search = 100 units)

Best Practices

  • Monitor scrape_status in your orchestrator to track quota usage
  • Set up all API keys (Scrape.do, Parallel, YouTube) for maximum reliability
  • Use the info field to determine which data source was used
  • Check inserted vs fetched to detect deduplication rates
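These recommendations can be folded into a small orchestrator-side health check. The 0.5 deduplication threshold below is an arbitrary example, not a value from the function itself:

```typescript
// Illustrative orchestrator-side check over the documented response
// fields. Thresholds and warning strings are our own examples.
function assessFetchResult(r: {
  success: boolean;
  fetched: number;
  inserted: number;
  info: string;
  scrape_status?: string;
}): string[] {
  const warnings: string[] = [];
  if (!r.success) warnings.push("fetch failed entirely");
  if (r.scrape_status === "quota") warnings.push("Scrape.do quota exhausted");
  if (r.scrape_status === "no_token") warnings.push("SCRAPE_DO_TOKEN not configured");
  if (r.info === "Algorithmic Generation") warnings.push("serving synthetic posts only");
  if (r.fetched > 0 && r.inserted / r.fetched < 0.5) {
    warnings.push("high dedup rate: topic may be re-fetched too often");
  }
  return warnings;
}
```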