Overview

The fetch-twitter function collects social media posts related to a topic using multiple data sources with automatic fallback. Despite its name, it implements a sophisticated multi-source strategy:
  1. Primary: Scrape.do (X/Twitter + Reddit)
  2. Secondary: Parallel AI Social Search
  3. Tertiary: YouTube API
  4. Final Fallback: Algorithmic post generation
Posts are deduplicated and stored in the posts table with platform attribution.
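The fallback chain above can be sketched as a loop over source functions that advances whenever a source errors or yields nothing. This is an illustrative sketch of the strategy, not the function's actual internals:

```typescript
// Illustrative sketch of the multi-source fallback strategy.
// Names (FetchResult, Source, fetchWithFallback) are our own.
type FetchResult = { posts: string[]; info: string };
type Source = () => Promise<FetchResult>;

async function fetchWithFallback(sources: Source[]): Promise<FetchResult> {
  for (const source of sources) {
    try {
      const result = await source();
      // A source that returns zero posts triggers the next fallback.
      if (result.posts.length > 0) return result;
    } catch {
      // Any source error also triggers the next fallback.
    }
  }
  // The final fallback never fails and always yields posts.
  return { posts: ["synthetic post"], info: "Algorithmic Generation" };
}
```

The orchestrator would pass the sources in priority order (Scrape.do, Parallel AI, YouTube), with algorithmic generation as the guaranteed last resort.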

Endpoint

POST https://your-project.supabase.co/functions/v1/fetch-twitter
This function is designed to be called internally by the analyze-topic orchestrator, not directly by client applications.

Request

topic_id
string
required
UUID of the topic to fetch posts for

Example Request

curl -X POST https://your-project.supabase.co/functions/v1/fetch-twitter \
  -H "Authorization: Bearer YOUR_SERVICE_ROLE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"topic_id": "a3f5e8b1-4c2d-4e9f-8a1b-3c5d6e7f8a9b"}'

Response

success
boolean
required
Whether data was successfully fetched and stored
fetched
number
required
Number of posts retrieved from the source
inserted
number
required
Number of posts successfully inserted into the database (deduplication may cause this to be lower than fetched)
info
string
required
Information about the data source used
scrape_status
string
Status of the Scrape.do operation: "ok" | "blocked" | "quota" | "no_token" | "error" | "idle"
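Taken together, the fields above imply a response shape like the following. The interface itself is our own annotation of this page, not an export of the function:

```typescript
// Response shape implied by the documented fields.
type ScrapeStatus = "ok" | "blocked" | "quota" | "no_token" | "error" | "idle";

interface FetchTwitterResponse {
  success: boolean;             // whether data was fetched and stored
  fetched: number;              // posts retrieved from the source
  inserted: number;             // posts inserted (<= fetched after dedup)
  info: string;                 // which data source was used
  scrape_status?: ScrapeStatus; // Scrape.do operation status
}
```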

Success Response (Scrape.do)

{
  "success": true,
  "fetched": 30,
  "inserted": 28,
  "info": "Scrape.do (X: 18, Reddit: 12)",
  "scrape_status": "ok"
}

Success Response (YouTube Fallback)

{
  "success": true,
  "fetched": 15,
  "inserted": 15,
  "info": "YouTube Search API",
  "scrape_status": "quota"
}

Success Response (Algorithmic Fallback)

{
  "success": true,
  "fetched": 10,
  "inserted": 10,
  "info": "Algorithmic Generation",
  "scrape_status": "blocked"
}

Error Response

{
  "success": false,
  "error": "Topic not found"
}

Data Sources

1. Scrape.do (Primary)

Platforms: X/Twitter and Reddit
Method: Parallel HTML scraping of both platforms
Requirements: SCRAPE_DO_TOKEN environment variable
Configuration:
const scrapeUrl = buildScrapeDoUrl(token, targetUrl, {
  render: true,
  waitUntil: 'networkidle0',
  geoCode: 'us'
});
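buildScrapeDoUrl is a local helper; a minimal sketch follows, assuming Scrape.do's query-parameter API style (token, target URL, and rendering options as query parameters). Verify the parameter names against Scrape.do's own documentation before relying on this:

```typescript
// Hypothetical sketch of the buildScrapeDoUrl helper. The option names
// (render, waitUntil, geoCode) follow the call site above; this is an
// assumption about Scrape.do's API, not a confirmed implementation.
interface ScrapeOptions {
  render?: boolean;
  waitUntil?: string;
  geoCode?: string;
}

function buildScrapeDoUrl(
  token: string,
  targetUrl: string,
  opts: ScrapeOptions = {},
): string {
  const params = new URLSearchParams({
    token,
    url: targetUrl, // URLSearchParams handles percent-encoding
  });
  if (opts.render) params.set("render", "true");
  if (opts.waitUntil) params.set("waitUntil", opts.waitUntil);
  if (opts.geoCode) params.set("geoCode", opts.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}
```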
X/Twitter Parsing Strategy:
  • Primary: data-testid="tweet" article elements
  • Fallback: lang="en" span elements
Reddit Parsing Strategy:
  • Primary: Reddit JSON API via Scrape.do
  • Extracts title + selftext from search results
Typical Yield: 15-25 posts from X, 8-15 posts from Reddit
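The two-tier X/Twitter parsing strategy can be sketched as a regex pass over the rendered HTML. This is a simplification for illustration; the real parser may walk the DOM instead, and rendered markup is messier than these patterns assume:

```typescript
// Simplified sketch of the two-tier tweet extraction described above.
function extractTweets(html: string): string[] {
  // Primary: text inside data-testid="tweet" article elements.
  const articles = [...html.matchAll(
    /<article[^>]*data-testid="tweet"[^>]*>([\s\S]*?)<\/article>/g,
  )].map((m) => m[1].replace(/<[^>]+>/g, " ").trim());
  if (articles.length > 0) return articles;

  // Fallback: lang="en" span elements.
  return [...html.matchAll(/<span[^>]*lang="en"[^>]*>([\s\S]*?)<\/span>/g)]
    .map((m) => m[1].replace(/<[^>]+>/g, " ").trim())
    .filter((t) => t.length > 0);
}
```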

2. Parallel AI (Secondary)

Method: AI-powered web search and content extraction
Requirements: PARALLEL_API_KEY environment variable
Activation: Only if Scrape.do fails or returns 0 posts
Configuration:
const response = await fetch('https://api.parallel.ai/v1beta/search', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    // Auth header carrying PARALLEL_API_KEY; confirm the exact header
    // name against Parallel AI's documentation.
    'x-api-key': Deno.env.get('PARALLEL_API_KEY') ?? ''
  },
  body: JSON.stringify({
    objective: `Recent public opinions about "${topic.query}"`,
    max_results: 10
  })
});
Typical Yield: 5-10 posts from various web sources
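Search results then need to be mapped into post rows. A hypothetical mapping is sketched below; the result field names (url, title, excerpts) are assumptions about the search response shape, not confirmed by this page:

```typescript
// Hypothetical mapping of search results into post rows. The
// SearchResult fields are assumed, and the external_id/author
// formats are illustrative only.
interface SearchResult {
  url: string;
  title: string;
  excerpts?: string[];
}

function toPosts(topicId: string, results: SearchResult[]) {
  return results.map((r, i) => ({
    topic_id: topicId,
    platform: "web",
    external_id: `parallel_${i}_${r.url}`,
    author: new URL(r.url).hostname,      // attribute to the source site
    content: [r.title, ...(r.excerpts ?? [])].join(". ").slice(0, 1000),
    posted_at: new Date().toISOString(),
  }));
}
```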

3. YouTube API (Tertiary)

Method: Official YouTube Data API v3
Requirements: YOUTUBE_API_KEY environment variable
Activation: Only if Scrape.do and Parallel AI both fail
Search Parameters:
  • type: video
  • maxResults: 15
  • part: snippet
Data Structure: Video title + description combined as post text
Typical Yield: 10-15 posts
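With those parameters, the YouTube Data API v3 search request looks roughly like this. The endpoint and parameter names come from the official API; the surrounding code is a sketch, not this function's actual implementation:

```typescript
// Sketch of a YouTube Data API v3 search call using the documented
// parameters (part=snippet, type=video, maxResults=15).
function buildYouTubeSearchUrl(query: string, apiKey: string): string {
  const params = new URLSearchParams({
    part: "snippet",
    type: "video",
    maxResults: "15",
    q: query,
    key: apiKey,
  });
  return `https://www.googleapis.com/youtube/v3/search?${params}`;
}

async function fetchYouTubePosts(query: string, apiKey: string) {
  const res = await fetch(buildYouTubeSearchUrl(query, apiKey));
  const data = await res.json();
  // Title + description combined as the post text, as described above.
  return (data.items ?? []).map(
    (item: { snippet: { title: string; description: string } }) =>
      `${item.snippet.title} ${item.snippet.description}`.trim(),
  );
}
```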

4. Algorithmic Generation (Final Fallback)

Activation: If all other sources fail
Method: Template-based synthetic posts
Templates:
const templates = [
  `Huge buzz around ${topic.query} today!`,
  `People are really divided on the ${topic.query} situation.`,
  `The latest update for ${topic.query} is a total game changer.`,
  `Not impressed with ${topic.query} lately. Too much hype.`,
  `Why is nobody talking about ${topic.query}? This is massive.`
];
Platform: Marked as "simulated"
Typical Yield: Exactly 10 posts
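Cycling the five templates to produce exactly 10 posts can be sketched as follows. The author and external_id formats here are illustrative, not the function's actual values:

```typescript
// Sketch of the algorithmic fallback: cycle through the five templates
// until exactly 10 synthetic posts exist, tagged platform "simulated".
function generateSimulatedPosts(topicId: string, query: string) {
  const templates = [
    `Huge buzz around ${query} today!`,
    `People are really divided on the ${query} situation.`,
    `The latest update for ${query} is a total game changer.`,
    `Not impressed with ${query} lately. Too much hype.`,
    `Why is nobody talking about ${query}? This is massive.`,
  ];
  return Array.from({ length: 10 }, (_, i) => ({
    topic_id: topicId,
    platform: "simulated",
    external_id: `sim_${topicId}_${i}`, // illustrative ID format
    author: `@user_${i}`,               // illustrative author format
    content: templates[i % templates.length],
    posted_at: new Date().toISOString(),
  }));
}
```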

Post Schema

All posts are inserted into the posts table:
interface Post {
  topic_id: string;        // UUID reference to topics table
  platform: string;        // "x" | "reddit" | "youtube" | "web" | "simulated"
  external_id: string;     // Unique ID from source (e.g., "x_scrape_abc_0")
  author: string;          // Username prefixed with @ or u/
  content: string;         // Post text/comment body
  posted_at: string;       // ISO 8601 timestamp
}

Deduplication

Posts are upserted using a composite unique constraint:
await supabase.from('posts').upsert(postData, {
  onConflict: 'platform,external_id'
});
This ensures the same post is never inserted twice across multiple runs.

Error Handling

This function uses graceful degradation and returns HTTP 200 with success: false for most errors to allow the orchestrator to continue.

Scrape Status Codes

Status     HTTP Code   Meaning
ok         200         Successfully scraped data
blocked    403/407     Scrape.do detected bot behavior or login wall
quota      402/429     Scrape.do API quota exceeded
no_token   -           SCRAPE_DO_TOKEN environment variable not set
error      Other       Network error or unexpected response
idle       -           Initial state before scraping attempt

Login Wall Detection

X/Twitter login walls are detected by checking for:
const isLoginWall = 
  html.toLowerCase().includes('log in to x') && 
  !html.includes('data-testid="tweet"');
When detected, scrape_status is set to "blocked" and fallback sources are used.

Performance

Typical execution time: 8-15 seconds
  • Scrape.do (fast): ~8-12s
  • Fallback to Parallel AI: ~12-18s
  • Fallback to YouTube: ~6-10s
  • Algorithmic generation: <1s

Rate Limits

Scrape.do: 1,000-10,000 requests/month depending on plan
Parallel AI: Varies by subscription tier
YouTube API: 10,000 units/day (1 search = 100 units)

Best Practices

  • Monitor scrape_status in your orchestrator to track quota usage
  • Set up all API keys (Scrape.do, Parallel, YouTube) for maximum reliability
  • Use the info field to determine which data source was used
  • Check inserted vs fetched to detect deduplication rates
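These recommendations can be folded into a small orchestrator-side health check. The 0.5 deduplication threshold below is an arbitrary example, not a value from the function itself:

```typescript
// Illustrative orchestrator-side check over the documented response
// fields. Thresholds and warning strings are our own examples.
function assessFetchResult(r: {
  success: boolean;
  fetched: number;
  inserted: number;
  info: string;
  scrape_status?: string;
}): string[] {
  const warnings: string[] = [];
  if (!r.success) warnings.push("fetch failed entirely");
  if (r.scrape_status === "quota") warnings.push("Scrape.do quota exhausted");
  if (r.scrape_status === "no_token") warnings.push("SCRAPE_DO_TOKEN not configured");
  if (r.info === "Algorithmic Generation") warnings.push("serving synthetic posts only");
  if (r.fetched > 0 && r.inserted / r.fetched < 0.5) {
    warnings.push("high dedup rate: topic may be re-fetched too often");
  }
  return warnings;
}
```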