Why Scrape.do?
Scrape.do provides:

- JavaScript rendering with Puppeteer/Playwright for SPA-heavy sites like X.com
- Residential proxies to bypass datacenter IP blocks
- Geo-targeting for location-specific results
- Reliable infrastructure with automatic retries and scaling
- Pay-as-you-go pricing with free tier available
Getting Started
Create a Scrape.do account
- Go to scrape.do
- Click Sign Up or Start Free Trial
- Complete registration with email verification
Get your API token
- Log in to the Scrape.do dashboard
- Navigate to API Tokens
- Copy your default token or create a new one
- Note your credit balance and rate limits
Add token to environment
Add your token to both client and server environments: a client-side `.env` entry and a server-side Supabase secret.
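The exact variable names depend on the app's build setup; a sketch assuming a Vite-style client and the Supabase CLI (both names are assumptions):

```shell
# Client (.env): variable name is an assumption (Vite-style prefix)
VITE_SCRAPEDO_TOKEN=your_token_here

# Server (Supabase secret), set via the Supabase CLI; secret name is an assumption:
# supabase secrets set SCRAPEDO_TOKEN=your_token_here
```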
Architecture
SENTi-radar uses Scrape.do in two layers:

1. Client-Side Scraping
File: `src/services/scrapeDoProvider.ts`
2. Server-Side Scraping
File: `supabase/functions/fetch-twitter/index.ts`
The edge function fetches data in priority order:
1. Scrape.do (X + Reddit in parallel)
2. Parallel.ai (fallback social search)
3. YouTube Data API (video comments)
4. Algorithmic generation (guaranteed fallback)
Configuration Options
Scrape.do accepts these parameters via the `ScrapeDoOptions` interface:
`render`

Enable JavaScript rendering with a headless browser.

- X (Twitter): required (`true`), since X is a React SPA
- Reddit JSON API: not needed (`false`), since it is a direct JSON endpoint

Rendering consumes more credits. Disable it for simple HTML pages.
`super`

Use residential/mobile proxies instead of datacenter IPs.

- When to enable:
  - X.com blocks your requests (HTTP 403/407)
  - Empty results despite a valid query
  - Rate limiting or CAPTCHA challenges
- Trade-off: higher cost per request
`waitUntil`

Wait strategy before capturing HTML.

- `networkidle0`: wait until there are no network connections for 500ms (recommended for X)
- `networkidle2`: wait until there are ≤2 connections for 500ms
- `load`: wait for the `load` event
- `domcontentloaded`: wait for DOM ready (fastest)
`geoCode`

ISO country code for geo-targeted results. Examples:

- `us`: United States
- `gb`: United Kingdom
- `in`: India
- `br`: Brazil
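The parameters above can be sketched as a TypeScript interface. This is a sketch only: the project's actual `ScrapeDoOptions` in `src/services/scrapeDoProvider.ts` may differ, and the field names here are assumptions mirroring Scrape.do's query parameters.

```typescript
// Sketch of ScrapeDoOptions; field names mirror Scrape.do's query parameters
// and may differ from the project's actual interface.
interface ScrapeDoOptions {
  render?: boolean;   // JavaScript rendering with a headless browser
  super?: boolean;    // residential/mobile proxies instead of datacenter IPs
  waitUntil?: "networkidle0" | "networkidle2" | "load" | "domcontentloaded";
  geoCode?: string;   // ISO country code, e.g. "us"
}

// Typical options for scraping X: render the SPA and wait for network idle.
const xOptions: ScrapeDoOptions = { render: true, waitUntil: "networkidle0" };
```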
Platform-Specific Strategies
X (Twitter)
Target URL: an X search results page for the query (fetched with rendering enabled).

Parsing strategy:

- Primary: extract `<article data-testid="tweet">` elements
- Extract text: find `<div data-testid="tweetText">` within each article
- Extract author: find the `<span>` with `@username` in `data-testid="User-Name"`
- Fallback: search for `<span lang="en">` tags if no articles are found
See `src/services/scrapeDoProvider.ts:93-144`
Reddit

Target URL: Reddit's public search endpoint with the `.json` extension.

- Parse the JSON response directly
- Extract `data.children[].data` for post objects
- Combine `title` + `selftext` for the post content
- Use `created_utc` for the timestamp
See `src/services/scrapeDoProvider.ts:152-183`
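The mapping above can be sketched as follows. The field names come from Reddit's public JSON API; the `ScrapedPost` shape is simplified for illustration.

```typescript
// Simplified post shape, assumed for illustration.
interface ScrapedPost {
  author: string;
  content: string;
  timestamp: number; // Unix milliseconds
}

// Map Reddit's search.json response to posts, per the steps above.
function parseRedditJson(json: any): ScrapedPost[] {
  const children = json?.data?.children ?? [];
  return children.map((c: any) => ({
    author: c.data.author,
    // Combine title + selftext for the content
    content: [c.data.title, c.data.selftext].filter(Boolean).join("\n\n"),
    // created_utc is Unix seconds; convert to milliseconds
    timestamp: c.data.created_utc * 1000,
  }));
}
```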
Usage Examples
Basic Fetch
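A minimal sketch using the global `fetch` API; the query parameters `token`, `url`, and `render` follow Scrape.do's documented API.

```typescript
// Fetch a rendered page through the Scrape.do proxy and return its HTML.
async function basicFetch(token: string, targetUrl: string): Promise<string> {
  const api =
    `https://api.scrape.do/?token=${token}` +
    `&url=${encodeURIComponent(targetUrl)}&render=true`;
  const res = await fetch(api);
  if (!res.ok) throw new Error(`Scrape.do request failed: ${res.status}`);
  return res.text(); // rendered HTML of the target page
}
```

For Reddit's JSON endpoint, drop `render=true` and call `res.json()` instead.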
Fetch All Sources in Parallel
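A sketch of the parallel pattern with `Promise.allSettled()`, so one failing source does not discard the other. The fetcher signature is simplified from the helpers described in the API Reference.

```typescript
type Fetcher = (query: string, token: string) => Promise<{ posts: string[] }>;

// Run all fetchers in parallel and keep whatever succeeded.
async function fetchAllSources(
  query: string,
  token: string,
  fetchers: Fetcher[],
): Promise<string[]> {
  const settled = await Promise.allSettled(fetchers.map((f) => f(query, token)));
  return settled
    .filter((r): r is PromiseFulfilledResult<{ posts: string[] }> => r.status === "fulfilled")
    .flatMap((r) => r.value.posts);
}
```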
Advanced: Residential Proxies + Geo-Targeting
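A sketch of the hardened configuration combining both features; field names follow the Configuration Options above.

```typescript
// Residential proxies plus geo-targeting for a heavily protected target.
const advancedOptions = {
  render: true,              // X requires JavaScript rendering
  super: true,               // residential proxies; bypasses 403/407 blocks
  geoCode: "us",             // geo-target results to the United States
  waitUntil: "networkidle0", // let the SPA finish loading
};
// e.g. fetchXPosts("bitcoin", token, advancedOptions)
```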
Error Handling
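A sketch mapping the failure statuses covered in Troubleshooting (402, 429, 403/407) to actionable errors:

```typescript
// Translate Scrape.do HTTP errors into actionable messages.
async function fetchOrExplain(apiUrl: string): Promise<string> {
  const res = await fetch(apiUrl);
  if (res.ok) return res.text();
  switch (res.status) {
    case 402:
      throw new Error("Out of Scrape.do credits: top up in the dashboard");
    case 429:
      throw new Error("Rate limited: throttle requests and retry with backoff");
    case 403:
    case 407:
      throw new Error("Blocked by target: retry with super: true");
    default:
      throw new Error(`Scrape.do request failed: ${res.status}`);
  }
}
```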
Edge Function Implementation
The `fetch-twitter` edge function uses Scrape.do on the server side:
See `supabase/functions/fetch-twitter/index.ts:308-344`
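A condensed sketch of that server-side call. The secret name and the X search URL shape are assumptions; the real logic lives at the path above, where the token would come from `Deno.env.get()`.

```typescript
// Server-side fetch through Scrape.do; in the edge function the token would
// come from Deno.env.get("SCRAPEDO_TOKEN") (secret name is an assumption).
async function serverFetchX(query: string, token: string): Promise<string> {
  const target = `https://x.com/search?q=${encodeURIComponent(query)}`;
  const api =
    `https://api.scrape.do/?token=${token}` +
    `&url=${encodeURIComponent(target)}&render=true&waitUntil=networkidle0`;
  const res = await fetch(api);
  if (!res.ok) throw new Error(`Scrape.do ${res.status}`);
  return res.text();
}
```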
Cost Optimization
Credit Usage
- Standard request: 1-5 credits
- With rendering: 5-10 credits
- With super (residential): 25-50 credits
Check your Scrape.do dashboard for real-time credit usage and pricing.
Best Practices
1. Disable rendering when possible
   - Reddit JSON API: `{ render: false }`
   - Simple HTML pages: `{ render: false }`
2. Use `super` only when needed
   - Start with `super: false`
   - Enable only if you get blocks (403/407)
3. Cache results
   - Store posts in the Supabase database
   - Implement client-side caching
   - Avoid duplicate requests for the same query
4. Batch requests
   - Use `Promise.allSettled()` for parallel fetching
   - Fetch X and Reddit simultaneously
5. Set reasonable limits
   - X search: 15-25 posts
   - Reddit search: 25 posts (API limit)
Troubleshooting
Empty Results from X
Problem: `fetchXPosts()` returns 0 posts

Solutions:

1. Enable residential proxies (`super: true`)
2. Increase the wait time (e.g., `waitUntil: "networkidle0"`)
3. Check for a login wall:
   - X sometimes shows a "Log in to X" page
   - Residential proxies (`super: true`) usually bypass this
4. Verify the HTML structure:
   - X changes its HTML frequently
   - Check the `parseXHtml()` regex patterns
   - View the raw HTML response in the browser network tab
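The first two fixes, expressed as option tweaks (assuming the `fetchXPosts` options described in Configuration Options):

```typescript
// Fix 1: enable residential proxies.
const withProxies = { render: true, super: true };

// Fix 2: wait longer before capturing the HTML.
const withLongerWait = { render: true, waitUntil: "networkidle0" };

// e.g. fetchXPosts(query, token, withProxies)
```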
Reddit Returns HTML Instead of JSON
Problem: Reddit returns an HTML login page instead of JSON

Solutions:

1. Enable residential proxies (`super: true`)
2. Verify the URL:
   - Ensure the `.json` extension: `reddit.com/search.json`
   - Check the query encoding
HTTP 402 (Payment Required)
Problem: Scrape.do returns a 402 status

Solutions:

- Check your credit balance in the dashboard
- Add credits to your account
- Review monthly quota limits
HTTP 429 (Rate Limited)
Problem: Too many requests

Solutions:

- Implement request throttling
- Add delays between requests
- Upgrade to a higher rate-limit plan
- Use exponential backoff retry logic
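A sketch of exponential backoff around a fetch call; the delay values are illustrative.

```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Retry on 429 with exponentially growing delays: 1s, 2s, 4s, ...
async function fetchWithBackoff(apiUrl: string, maxRetries = 4): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(apiUrl);
    if (res.status !== 429) return res;
    if (attempt >= maxRetries) throw new Error("Rate limited after retries");
    await sleep(1000 * 2 ** attempt);
  }
}
```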
HTTP 403/407 (Blocked)
Problem: The target site is blocking requests

Solutions:

1. Enable residential proxies (`super: true`)
2. Add geo-targeting (e.g., `geoCode: "us"`)
3. Contact Scrape.do support if the issue persists
Extending to New Platforms
To add support for new social platforms (e.g., Hacker News, LinkedIn), implement a fetcher that follows the same pattern as `fetchXPosts` and `fetchRedditPosts`, then register it in `fetchAllScrapeDoSources`.

API Reference
`buildApiUrl(token, targetUrl, options)`

Builds the Scrape.do proxy URL.

Parameters:

- `token` (string): Scrape.do API token
- `targetUrl` (string): URL to scrape
- `options` (ScrapeDoOptions): configuration options

Returns: `string` (the full Scrape.do API URL)
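A sketch of what this helper plausibly does; the base URL and parameter names follow Scrape.do's documented API, but the real implementation is in `src/services/scrapeDoProvider.ts` and may differ.

```typescript
interface ScrapeDoOptions {
  render?: boolean;
  super?: boolean;
  waitUntil?: string;
  geoCode?: string;
}

// Assemble the Scrape.do proxy URL; the target URL is percent-encoded
// automatically by URLSearchParams.
function buildApiUrl(token: string, targetUrl: string, options: ScrapeDoOptions = {}): string {
  const params = new URLSearchParams({ token, url: targetUrl });
  if (options.render) params.set("render", "true");
  if (options.super) params.set("super", "true");
  if (options.waitUntil) params.set("waitUntil", options.waitUntil);
  if (options.geoCode) params.set("geoCode", options.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}
```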
`fetchXPosts(query, token, options)`

Fetch X (Twitter) posts.

Returns: `Promise<ScrapeDoResult>`
`fetchRedditPosts(query, token, options)`

Fetch Reddit posts.

Returns: `Promise<ScrapeDoResult>`
`fetchAllScrapeDoSources(query, token, sources, options)`

Fetch from multiple sources in parallel.

Returns: `Promise<{ results: ScrapeDoResult[], posts: ScrapedPost[] }>`
Next Steps
- Environment Variables: configure all API tokens
- API Keys: get additional API keys for fallback sources
- Scrape.do Docs: official Scrape.do documentation
- Scrape.do Dashboard: monitor usage and credits