Scrape.do Provider
The Scrape.do Provider (scrapeDoProvider.ts) is a modular service that fetches live social media content from X (Twitter) and Reddit using Scrape.do's API. It supports JavaScript rendering, residential proxies, and full HTML access to JavaScript-heavy, client-side rendered pages.
Architecture
The provider is designed to be extensible. To add new platforms (e.g., HackerNews, LinkedIn), simply:
Create a new provider function (e.g., fetchHackerNewsPosts)
Ensure it accepts a token and options parameter
Return a ScrapeDoResult object
Add it to the fetchAllScrapeDoSources aggregator
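As an illustration of this extension pattern, a hypothetical fetchHackerNewsPosts provider might look like the sketch below. The Algolia-backed Hacker News search endpoint (hn.algolia.com/api/v1/search) and its response field names are assumptions, as are the inlined type stubs; in the real codebase you would import ScrapedPost and ScrapeDoResult from scrapeDoProvider.ts.

```typescript
// Inlined stubs of the provider's types (in the real codebase, import these
// from scrapeDoProvider.ts instead).
interface ScrapedPost {
  id: string;
  text: string;
  author: string;
  platform: "x" | "reddit" | "web";
  url: string;
  postedAt: string;
}
type ScrapeDoStatus = "success" | "partial" | "error";
interface ScrapeDoResult {
  posts: ScrapedPost[];
  source: string;
  status: ScrapeDoStatus;
  error?: string;
}

// Hypothetical provider: fetches Hacker News stories through Scrape.do.
// The hn.algolia.com endpoint and its response shape are assumptions.
export async function fetchHackerNewsPosts(
  query: string,
  token: string
): Promise<ScrapeDoResult> {
  const source = "HackerNews via Scrape.do";
  if (!token) {
    return { posts: [], source, status: "error", error: "VITE_SCRAPE_TOKEN not configured" };
  }
  try {
    const target = `https://hn.algolia.com/api/v1/search?query=${encodeURIComponent(query)}`;
    const apiUrl = `https://api.scrape.do?token=${token}&url=${encodeURIComponent(target)}`;
    const res = await fetch(apiUrl);
    if (!res.ok) {
      return { posts: [], source, status: "error", error: `HTTP ${res.status}` };
    }
    const data = (await res.json()) as { hits?: Array<Record<string, unknown>> };
    const posts: ScrapedPost[] = (data.hits ?? []).map((hit, i) => {
      const objectID = String(hit.objectID ?? i);
      return {
        id: `hn_${objectID}`,
        text: String(hit.title ?? ""),
        author: String(hit.author ?? ""),
        platform: "web" as const, // reuse the generic 'web' tag; no union change needed
        url: String(hit.url ?? `https://news.ycombinator.com/item?id=${objectID}`),
        postedAt: String(hit.created_at ?? new Date().toISOString()),
      };
    });
    return posts.length
      ? { posts, source, status: "success" }
      : { posts, source, status: "partial", error: "No stories parsed" };
  } catch (err) {
    return { posts: [], source, status: "error", error: String(err) };
  }
}
```

The aggregator would then add `"hackernews"` to its sources union and map it to this function alongside the X and Reddit fetchers.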
Core Types
ScrapedPost
Represents a single scraped post from any platform.

id (string, required): Unique identifier for the post (e.g., x_0, reddit_abc123)
text (string, required): The post content text (HTML entities decoded, tags stripped)
author (string, required): Username or handle (e.g., @user, u/redditor)
platform ('x' | 'reddit' | 'web', required): Source platform identifier
url (string, required): Link to the original post or search page
postedAt (string, required): ISO 8601 timestamp (e.g., 2026-03-12T14:30:00.000Z)
export interface ScrapedPost {
  id: string;
  text: string;
  author: string;
  platform: "x" | "reddit" | "web";
  url: string;
  postedAt: string;
}
ScrapeDoOptions
Configuration options for Scrape.do API requests.

render (boolean, optional): Enable JavaScript rendering (essential for X and Reddit)
super (boolean, optional): Use residential/mobile proxies to bypass datacenter detection
waitUntil ('networkidle0' | 'networkidle2' | 'load' | 'domcontentloaded', default: 'networkidle0'): Wait strategy before returning HTML
geoCode (string, optional): ISO country code for geo-targeted results (e.g., 'us', 'gb', 'in')
export interface ScrapeDoOptions {
  render?: boolean;
  super?: boolean;
  waitUntil?: "networkidle0" | "networkidle2" | "load" | "domcontentloaded";
  geoCode?: string;
}
ScrapeDoResult
Result object returned by all provider functions.

posts (ScrapedPost[], required): Array of successfully scraped posts
source (string, required): Human-readable label (e.g., 'X via Scrape.do', 'Reddit via Scrape.do')
status ('success' | 'partial' | 'error', required): success = posts retrieved successfully; partial = some data retrieved but incomplete; error = request failed
error (string, optional): Error message if status is error or partial
export type ScrapeDoStatus = "success" | "partial" | "error";

export interface ScrapeDoResult {
  posts: ScrapedPost[];
  source: string;
  status: ScrapeDoStatus;
  error?: string;
}
Helper Functions
buildApiUrl
Constructs the Scrape.do proxy URL for a given target URL and options.
export function buildApiUrl(
  token: string,
  targetUrl: string,
  options: ScrapeDoOptions = {}
): string
Parameters:
token (string): Scrape.do API token (from VITE_SCRAPE_TOKEN)
targetUrl (string): The URL to scrape (e.g., https://x.com/search?q=...)
options (ScrapeDoOptions): Optional configuration
Returns: Full Scrape.do API URL with query parameters
Example:
const apiUrl = buildApiUrl(
  "your-token",
  "https://x.com/search?q=AI%20regulation&f=live",
  { render: true, waitUntil: "networkidle0", geoCode: "us" }
);
// https://api.scrape.do?token=your-token&url=...&render=true&waitUntil=networkidle0&geoCode=us
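For intuition, a minimal sketch of such a URL builder (an illustration, not the provider's actual implementation) could look like this. It appends only the options that are set and URL-encodes the target so its own query string survives the proxy hop:

```typescript
interface ScrapeDoOptions {
  render?: boolean;
  super?: boolean;
  waitUntil?: "networkidle0" | "networkidle2" | "load" | "domcontentloaded";
  geoCode?: string;
}

// Sketch of a buildApiUrl-style helper. URLSearchParams handles the
// percent-encoding of the nested target URL.
function buildApiUrlSketch(
  token: string,
  targetUrl: string,
  options: ScrapeDoOptions = {}
): string {
  const params = new URLSearchParams({ token, url: targetUrl });
  if (options.render) params.set("render", "true");
  if (options.super) params.set("super", "true");
  if (options.waitUntil) params.set("waitUntil", options.waitUntil);
  if (options.geoCode) params.set("geoCode", options.geoCode);
  return `https://api.scrape.do?${params.toString()}`;
}
```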
decodeEntities
Decodes common HTML entities in scraped text.
export function decodeEntities(text: string): string
Parameters:
text (string): Raw text with HTML entities
Returns: Decoded text
Example:
const decoded = decodeEntities("Tech &amp; Innovation &lt;2026&gt;");
// "Tech & Innovation <2026>"
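A minimal lookup-table implementation of such a decoder (a sketch under the assumption that only a handful of common entities need handling, not the library's actual code) might be:

```typescript
// Sketch: decode a few common HTML entities via a lookup table.
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
};

function decodeEntitiesSketch(text: string): string {
  // Replace each known entity; leave unknown sequences untouched.
  return text.replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (m) => ENTITIES[m] ?? m);
}
```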
stripTags
Removes all HTML tags from a string and normalizes whitespace.
export function stripTags(html: string): string
Example:
const clean = stripTags("<p>Breaking: <strong>New policy</strong> announced</p>");
// "Breaking: New policy announced"
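A regex-based sketch of this tag stripper (an assumption; the real implementation may differ) shows the two-step idea, drop tags then collapse whitespace:

```typescript
// Sketch: replace tags with a space so adjacent words don't fuse together,
// then collapse whitespace runs and trim.
function stripTagsSketch(html: string): string {
  return html
    .replace(/<[^>]*>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}
```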
parseXHtml
Parses rendered X.com search HTML into ScrapedPost objects.
export function parseXHtml(html: string, query: string): ScrapedPost[]
Strategy:
Primary : Extract tweets from <article data-testid="tweet"> elements with <div data-testid="tweetText">
Fallback : Grab <span lang="en"> elements longer than 20 characters
Parameters:
html (string): Rendered HTML from X.com
query (string): Search query (for URL construction)
Returns: Array of up to 20 ScrapedPost objects
Example:
const posts = parseXHtml(htmlFromXSearch, "climate change");
console.log(posts[0]);
// {
//   id: "x_0",
//   text: "Urgent action needed on climate...",
//   author: "@climateactivist",
//   platform: "x",
//   url: "https://x.com/search?q=climate%20change&f=live",
//   postedAt: "2026-03-12T10:00:00.000Z"
// }
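The primary strategy can be approximated with a regex over the tweetText containers. This simplified sketch (not the provider's actual parser, which also extracts authors and applies the span fallback) pulls the inner text of each matching div:

```typescript
// Simplified sketch of the primary extraction strategy. A real parser would
// use a DOM, and this lazy regex would miss nested <div>s inside tweetText.
function parseXHtmlSketch(
  html: string,
  query: string
): Array<{ id: string; text: string; url: string }> {
  const out: Array<{ id: string; text: string; url: string }> = [];
  const re = /<div[^>]*data-testid="tweetText"[^>]*>([\s\S]*?)<\/div>/g;
  const searchUrl = `https://x.com/search?q=${encodeURIComponent(query)}&f=live`;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null && out.length < 20) {
    // Strip inner markup and normalize whitespace, like stripTags above.
    const text = m[1].replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
    if (text) out.push({ id: `x_${out.length}`, text, url: searchUrl });
  }
  return out;
}
```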
parseRedditJson
Parses Reddit’s JSON search API response into ScrapedPost objects.
export function parseRedditJson(data: unknown, query: string): ScrapedPost[]
Parameters:
data (unknown): Parsed JSON from reddit.com/search.json
query (string): Search query
Returns: Array of ScrapedPost objects
Example:
const json = await res.json();
const posts = parseRedditJson(json, "AI regulation");
console.log(posts[0]);
// {
//   id: "reddit_abc123",
//   text: "New AI regulation bill discussion...",
//   author: "u/policyexpert",
//   platform: "reddit",
//   url: "https://www.reddit.com/...",
//   postedAt: "2026-03-12T09:45:00.000Z"
// }
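Reddit's search.json returns a Listing whose children each wrap a post in a data object. A simplified sketch of the walk (field names assumed to match Reddit's public JSON shape; the real function also sets platform and validates more defensively):

```typescript
// Sketch: walk the Listing -> children -> data shape of reddit.com/search.json.
function parseRedditJsonSketch(
  data: unknown
): Array<{ id: string; text: string; author: string; url: string; postedAt: string }> {
  const children =
    (data as { data?: { children?: Array<{ data?: Record<string, unknown> }> } })
      ?.data?.children ?? [];
  return children
    .map((child) => child.data)
    .filter((d): d is Record<string, unknown> => !!d)
    .map((d) => ({
      id: `reddit_${d.id}`,
      // Title plus selftext, when the post has a body.
      text: String(d.title ?? "") + (d.selftext ? ` ${d.selftext}` : ""),
      author: `u/${d.author}`,
      url: `https://www.reddit.com${d.permalink}`,
      // created_utc is seconds since epoch.
      postedAt: new Date(Number(d.created_utc ?? 0) * 1000).toISOString(),
    }));
}
```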
Provider Functions
fetchXPosts
Fetches live X (Twitter) posts for a query via Scrape.do.
export async function fetchXPosts(
  query: string,
  token: string,
  options: ScrapeDoOptions = {}
): Promise<ScrapeDoResult>
Parameters:
query (string): Search term (e.g., "AI regulation")
token (string): Scrape.do API token
options (ScrapeDoOptions): Optional overrides
Returns: ScrapeDoResult with X posts
Default behavior:
JavaScript rendering enabled (render: true)
Waits for network idle (waitUntil: "networkidle0")
Targets live search results (&f=live)
Example:
const result = await fetchXPosts(
  "climate summit",
  process.env.VITE_SCRAPE_TOKEN,
  { geoCode: "us" }
);

if (result.status === "success") {
  console.log(`Fetched ${result.posts.length} X posts`);
  result.posts.forEach(post => console.log(post.text));
} else {
  console.error(result.error);
}
Error handling covers three failure modes: a missing token, HTTP errors, and no posts parsed. A missing token, for example, returns immediately:
if (!token) {
  return {
    posts: [],
    source: "X via Scrape.do",
    status: "error",
    error: "VITE_SCRAPE_TOKEN not configured"
  };
}
fetchRedditPosts
Fetches Reddit posts via Scrape.do using Reddit’s JSON API.
export async function fetchRedditPosts(
  query: string,
  token: string,
  options: ScrapeDoOptions = {}
): Promise<ScrapeDoResult>
Parameters:
query (string): Search term
token (string): Scrape.do API token
options (ScrapeDoOptions): Optional overrides
Returns: ScrapeDoResult with Reddit posts
Default behavior:
JavaScript rendering disabled (render: false) — uses JSON endpoint
Sorts by new (&sort=new)
Fetches up to 25 posts (&limit=25)
Reddit’s JSON endpoint is more reliable than HTML parsing, yet still benefits from Scrape.do’s residential proxies when Reddit blocks datacenter IPs.
Example:
const result = await fetchRedditPosts(
  "SaaS tools",
  process.env.VITE_SCRAPE_TOKEN
);

if (result.status === "success") {
  console.log(`Reddit posts: ${result.posts.length}`);
} else if (result.status === "partial") {
  console.warn(result.error); // e.g., "Reddit returned non-JSON (may require super=true)"
}
Error handling:
try {
  data = JSON.parse(text);
} catch {
  return {
    posts: [],
    source: "Reddit via Scrape.do",
    status: "partial",
    error: "Reddit returned non-JSON (may require super=true)"
  };
}
Aggregated Provider
fetchAllScrapeDoSources
Fetches from all requested sources in parallel and returns merged results.
export async function fetchAllScrapeDoSources(
  query: string,
  token: string,
  sources: Array<"x" | "reddit"> = ["x", "reddit"],
  options: ScrapeDoOptions = {}
): Promise<{ results: ScrapeDoResult[]; posts: ScrapedPost[] }>
Parameters:
query (string): Search term
token (string): Scrape.do API token
sources (Array): Platforms to query (default: ["x", "reddit"])
options (ScrapeDoOptions): Applied to all sources
Returns:
results (ScrapeDoResult[]): Per-source results with status/error info
posts (ScrapedPost[]): Flattened array of all posts from all sources
Example:
const { results, posts } = await fetchAllScrapeDoSources(
  "AI regulation",
  process.env.VITE_SCRAPE_TOKEN!,
  ["x", "reddit"],
  { geoCode: "us", super: true }
);

console.log(`Total posts: ${posts.length}`);

results.forEach(result => {
  if (result.status === "success") {
    console.log(`✓ ${result.source}: ${result.posts.length} posts`);
  } else {
    console.error(`✗ ${result.source}: ${result.error}`);
  }
});
Error resilience:
The function uses Promise.allSettled to ensure one source failure doesn’t break the entire request:
const settled = await Promise.allSettled(fetchers);

const results: ScrapeDoResult[] = settled.map((r, i) => {
  if (r.status === "fulfilled") return r.value;
  const label = sources[i] === "x" ? "X via Scrape.do" : "Reddit via Scrape.do";
  return {
    posts: [],
    source: label,
    status: "error" as const,
    error: String(r.reason)
  };
});
Usage in Application
From TopicDetail.tsx (lines 462-480):
// Fetch YouTube comments + Google News + Scrape.do (X & Reddit) in parallel
const [ytResult, headlinesResult, scrapeResult] = await Promise.allSettled([
  fetchYouTubeComments(topic.title),
  fetchNewsHeadlines(topic.title),
  fetchAllScrapeDoSources(topic.title, SCRAPE_TOKEN, ['x', 'reddit']),
]);

const { results: scrapeDoResults, posts: scrapedPosts } = scrapeResult.status === 'fulfilled'
  ? scrapeResult.value
  : { results: [], posts: [] };

console.log(`Data: ${ytCount} YT comments, ${rssHeadlines.length} headlines, ${scrapedPosts.length} scraped posts (X/Reddit)`);

// Notify UI of Scrape.do per-source status (for status chips / error display)
if (onScrapeDoResults && scrapeDoResults.length > 0) {
  onScrapeDoResults(scrapeDoResults);
}

// Step 2: Analyze everything together
const analysis = analyzeTopicFully(topic.title, rssHeadlines, comments, scrapedPosts, scrapeDoResults);
Configuration
Environment Variables
VITE_SCRAPE_TOKEN = your_scrape_do_api_token_here
Security Note: VITE_ prefixed variables are embedded in the client-side JS bundle and visible in browser DevTools. For production, move Scrape.do calls to Supabase Edge Functions and use server-side secrets.
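A hedged sketch of what that server-side move could look like. The request/response shape and the SCRAPE_TOKEN secret name are assumptions; in a deployed Supabase Edge Function the handler below would be registered via Deno.serve and the token read from Deno.env.get("SCRAPE_TOKEN"), while here it is a parameter so the logic is testable outside Deno:

```typescript
// Sketch of a server-side proxy for Scrape.do calls, so the token never
// reaches the browser bundle. In a real Edge Function:
//   Deno.serve((req) => handler(req, Deno.env.get("SCRAPE_TOKEN") ?? ""));
export async function handler(req: Request, token: string): Promise<Response> {
  if (!token) {
    return new Response(JSON.stringify({ error: "SCRAPE_TOKEN not configured" }), {
      status: 500,
      headers: { "Content-Type": "application/json" },
    });
  }
  const { query } = (await req.json()) as { query?: string };
  if (!query) {
    return new Response(JSON.stringify({ error: "query required" }), {
      status: 400,
      headers: { "Content-Type": "application/json" },
    });
  }
  // Build the Scrape.do URL server-side and relay the upstream response.
  const target = `https://x.com/search?q=${encodeURIComponent(query)}&f=live`;
  const apiUrl = `https://api.scrape.do?token=${token}&url=${encodeURIComponent(target)}&render=true`;
  const upstream = await fetch(apiUrl);
  return new Response(await upstream.text(), { status: upstream.status });
}
```

The client then calls this function's URL instead of api.scrape.do directly, and no VITE_ variable is needed.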
Best Practices
Rate Limiting
Scrape.do has rate limits based on your plan. Use Promise.allSettled to fetch multiple sources in parallel without blocking on failures:
const { results, posts } = await fetchAllScrapeDoSources(
  query,
  token,
  ["x", "reddit"]
);
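If you need to stay under a per-second cap, a fixed pause between sequential requests is a blunt but effective complement. This generic helper is an illustration, not part of the provider:

```typescript
// Sketch: run async tasks one at a time with a pause between them,
// trading latency for staying under a requests-per-second limit.
async function runThrottled<T>(
  tasks: Array<() => Promise<T>>,
  pauseMs: number
): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task());
    if (pauseMs > 0) await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return results;
}
```

For example, `runThrottled([() => fetchXPosts(q1, token), () => fetchXPosts(q2, token)], 1000)` spaces two X queries a second apart.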
Geo-Targeting
Use geoCode for region-specific results:
const usResults = await fetchXPosts("election", token, { geoCode: "us" });
const ukResults = await fetchXPosts("election", token, { geoCode: "gb" });
Proxy Escalation
If you encounter blocks, enable residential proxies:
const result = await fetchRedditPosts(query, token, { super: true });
Error Handling Pattern
const result = await fetchXPosts(query, token);

switch (result.status) {
  case "success":
    console.log(`✓ ${result.posts.length} posts`);
    break;
  case "partial":
    console.warn(`⚠ Partial data: ${result.error}`);
    // Still use result.posts if available
    break;
  case "error":
    console.error(`✗ Failed: ${result.error}`);
    // Fallback to cached data or show error to user
    break;
}