SENTi-radar uses Scrape.do as its primary data provider for scraping JavaScript-heavy social media platforms like X (Twitter) and Reddit. This guide covers setup, configuration, and advanced usage.

Why Scrape.do?

Scrape.do provides:
  • JavaScript rendering with Puppeteer/Playwright for SPA-heavy sites like X.com
  • Residential proxies to bypass datacenter IP blocks
  • Geo-targeting for location-specific results
  • Reliable infrastructure with automatic retries and scaling
  • Pay-as-you-go pricing with free tier available

Getting Started

Step 1: Create a Scrape.do account

  1. Go to scrape.do
  2. Click Sign Up or Start Free Trial
  3. Complete registration with email verification
Step 2: Get your API token

  1. Log in to the Scrape.do dashboard
  2. Navigate to API Tokens
  3. Copy your default token or create a new one
  4. Note your credit balance and rate limits
Step 3: Add token to environment

Add your token to both client and server environments.

Client (.env):
VITE_SCRAPE_TOKEN=your-scrape-do-token-here
Server (Supabase secrets):
supabase secrets set SCRAPE_DO_TOKEN=your-scrape-do-token-here
Step 4: Verify integration

Start your development server and test scraping:
npm run dev
Create a new topic in the UI and watch the browser console for Scrape.do requests.

Architecture

SENTi-radar uses Scrape.do in two layers:

1. Client-Side Scraping

File: src/services/scrapeDoProvider.ts
import { fetchXPosts, fetchRedditPosts } from '@/services/scrapeDoProvider';

// Fetch X posts
const xResult = await fetchXPosts('climate change', token, {
  render: true,
  super: true,          // residential proxies
  waitUntil: 'networkidle0',
  geoCode: 'us',        // US-based results
});

// Fetch Reddit posts
const redditResult = await fetchRedditPosts('climate change', token, {
  render: false,        // Reddit JSON API doesn't need rendering
  super: true,
});

2. Server-Side Scraping

File: supabase/functions/fetch-twitter/index.ts

The edge function fetches data in priority order:
  1. Scrape.do (X + Reddit in parallel)
  2. Parallel.ai (fallback social search)
  3. YouTube Data API (video comments)
  4. Algorithmic generation (guaranteed fallback)
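The priority chain above can be sketched as a simple cascade. This is an illustration, not the edge function's actual code: each provider is a stand-in that returns an array of posts, and the first non-empty result wins.

```typescript
type Post = { text: string; source: string };
type Provider = { name: string; fetch: () => Promise<Post[]> };

// Try each provider in priority order; return the first non-empty result.
// The last provider (algorithmic generation) is expected to always succeed.
async function fetchWithFallback(
  providers: Provider[],
): Promise<{ posts: Post[]; usedProvider: string | null }> {
  for (const provider of providers) {
    try {
      const posts = await provider.fetch();
      if (posts.length > 0) return { posts, usedProvider: provider.name };
    } catch {
      // Provider failed (blocked, rate-limited, etc.); fall through to the next one.
    }
  }
  return { posts: [], usedProvider: null };
}
```

Because failures are swallowed per provider, a blocked Scrape.do request degrades gracefully to the next source instead of failing the whole topic fetch.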

Configuration Options

Scrape.do accepts these parameters via the ScrapeDoOptions interface:
render (boolean, default: true)
Enable JavaScript rendering using a headless browser.
  • X (Twitter): Required (true) — X is a React SPA
  • Reddit JSON API: Not needed (false) — direct JSON endpoint
Rendering consumes more credits. Disable for simple HTML pages.
super (boolean, default: false)
Use residential/mobile proxies instead of datacenter IPs.
  • When to enable:
    • X.com blocks your requests (HTTP 403/407)
    • Empty results despite valid query
    • Rate limiting or CAPTCHA challenges
  • Trade-off: Higher cost per request
// Enable residential proxies for X
const result = await fetchXPosts(query, token, { super: true });
waitUntil (string, default: "networkidle0")
Wait strategy before capturing HTML.
  • networkidle0: Wait until no network connections for 500ms (recommended for X)
  • networkidle2: Wait until ≤2 connections for 500ms
  • load: Wait for load event
  • domcontentloaded: Wait for DOM ready (fastest)
// Wait for full network idle (most reliable for X)
{ waitUntil: 'networkidle0' }
geoCode (string, default: none)
ISO country code for geo-targeted results. Examples:
  • us — United States
  • gb — United Kingdom
  • in — India
  • br — Brazil
// Get US-specific X results
const result = await fetchXPosts(query, token, { geoCode: 'us' });
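Taken together, these options map onto query parameters of the Scrape.do proxy endpoint. Below is a minimal sketch of a buildApiUrl-style helper; it mirrors the option names used in this guide, but the exact parameter names Scrape.do accepts should be confirmed against its official documentation, and the project's real implementation lives in src/services/scrapeDoProvider.ts.

```typescript
interface ScrapeDoOptions {
  render?: boolean;    // JavaScript rendering
  super?: boolean;     // residential/mobile proxies
  waitUntil?: string;  // e.g. "networkidle0"
  geoCode?: string;    // ISO country code, e.g. "us"
}

// Sketch: encode the target URL and map each option onto a query
// parameter of the Scrape.do proxy endpoint.
function buildApiUrl(token: string, targetUrl: string, options: ScrapeDoOptions = {}): string {
  const params = new URLSearchParams({ token, url: targetUrl });
  if (options.render !== undefined) params.set("render", String(options.render));
  if (options.super) params.set("super", "true");
  if (options.waitUntil) params.set("waitUntil", options.waitUntil);
  if (options.geoCode) params.set("geoCode", options.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}
```

URLSearchParams handles percent-encoding of the target URL, which matters because the X search URL itself contains ?, &, and = characters.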

Platform-Specific Strategies

X (Twitter)

Target URL:
https://x.com/search?q={query}&src=typed_query&f=live
Recommended settings:
{
  render: true,
  waitUntil: 'networkidle0',
  super: true,  // Enable if blocked
  geoCode: 'us'
}
Parsing strategy:
  1. Primary: Extract <article data-testid="tweet"> elements
  2. Extract text: Find <div data-testid="tweetText"> within each article
  3. Extract author: Find <span> with @username in data-testid="User-Name"
  4. Fallback: Search for <span lang="en"> tags if articles not found
Code reference: src/services/scrapeDoProvider.ts:93-144
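X's markup changes frequently, so the following is a deliberately simplified sketch of step 2 above, not the project's actual parseXHtml. A production parser also handles the author and fallback steps, and must cope with nested <div>s inside tweetText, which this lazy regex would truncate.

```typescript
// Simplified sketch: pull tweet text out of rendered X HTML via the
// data-testid="tweetText" marker, stripping nested inline tags.
function extractTweetTexts(html: string): string[] {
  const texts: string[] = [];
  const tweetText = /<div[^>]*data-testid="tweetText"[^>]*>([\s\S]*?)<\/div>/g;
  let match: RegExpExecArray | null;
  while ((match = tweetText.exec(html)) !== null) {
    const text = match[1].replace(/<[^>]+>/g, "").trim(); // drop nested spans/links
    if (text) texts.push(text);
  }
  return texts;
}
```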

Reddit

Target URL:
https://www.reddit.com/search.json?q={query}&sort=new&limit=25
Recommended settings:
{
  render: false,  // JSON endpoint doesn't need rendering
  super: true     // Enable if blocked
}
Parsing strategy:
  1. Parse JSON response directly
  2. Extract data.children[].data for post objects
  3. Combine title + selftext for content
  4. Use created_utc for timestamp
Code reference: src/services/scrapeDoProvider.ts:152-183
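The four parsing steps above can be sketched as follows. The ScrapedPost shape here is illustrative (the project's real type lives in scrapeDoProvider.ts), while the JSON field names match Reddit's public search.json payload.

```typescript
// Illustrative post shape; the project's actual ScrapedPost may differ.
interface ScrapedPost {
  content: string;
  author: string;
  createdUtc: number; // unix seconds, from created_utc
  source: string;
}

// Steps 1-4 above: walk data.children[].data, combine title + selftext,
// and keep created_utc as the timestamp.
function parseRedditJson(json: any): ScrapedPost[] {
  const children = json?.data?.children ?? [];
  return children
    .map((child: any) => child?.data)
    .filter((post: any) => post && post.title)
    .map((post: any) => ({
      content: [post.title, post.selftext].filter(Boolean).join("\n\n"),
      author: post.author ?? "unknown",
      createdUtc: post.created_utc ?? 0,
      source: "reddit",
    }));
}
```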

Usage Examples

Basic Fetch

import { fetchXPosts, fetchRedditPosts } from '@/services/scrapeDoProvider';

// Simple X fetch
const xResult = await fetchXPosts('AI safety', import.meta.env.VITE_SCRAPE_TOKEN);
console.log(`Found ${xResult.posts.length} X posts`);
console.log(`Status: ${xResult.status}`);

// Simple Reddit fetch
const redditResult = await fetchRedditPosts('AI safety', import.meta.env.VITE_SCRAPE_TOKEN);
console.log(`Found ${redditResult.posts.length} Reddit posts`);

Fetch All Sources in Parallel

import { fetchAllScrapeDoSources } from '@/services/scrapeDoProvider';

const { results, posts } = await fetchAllScrapeDoSources(
  'climate change',
  import.meta.env.VITE_SCRAPE_TOKEN,
  ['x', 'reddit'],  // sources to fetch
  { super: true, geoCode: 'us' }  // applied to all sources
);

console.log(`Total posts: ${posts.length}`);
results.forEach(r => {
  console.log(`${r.source}: ${r.status} (${r.posts.length} posts)`);
});

Advanced: Residential Proxies + Geo-Targeting

// Fetch US-based X posts with residential proxies
const result = await fetchXPosts(
  'US election 2024',
  token,
  {
    render: true,
    super: true,           // residential proxies
    waitUntil: 'networkidle0',
    geoCode: 'us'          // US geo-targeting
  }
);

if (result.status === 'success') {
  console.log(`Successfully fetched ${result.posts.length} US-based posts`);
} else {
  console.error(`Error: ${result.error}`);
}

Error Handling

const result = await fetchXPosts(query, token, { super: true });

switch (result.status) {
  case 'success':
    console.log(`✓ ${result.posts.length} posts fetched`);
    break;
  
  case 'partial':
    console.warn(`⚠ Partial results: ${result.error}`);
    // Still process result.posts (may have some data)
    break;
  
  case 'error':
    console.error(`✗ Failed: ${result.error}`);
    // Implement fallback logic
    break;
}

Edge Function Implementation

The fetch-twitter edge function uses Scrape.do on the server side:
// supabase/functions/fetch-twitter/index.ts (simplified)

const SCRAPE_DO_TOKEN = Deno.env.get("SCRAPE_DO_TOKEN") || "";

if (SCRAPE_DO_TOKEN) {
  const xUrl = `https://x.com/search?q=${encodeURIComponent(query)}&f=live`;
  const redditUrl = `https://www.reddit.com/search.json?q=${encodeURIComponent(query)}&sort=new&limit=25`;

  // Fetch X and Reddit in parallel
  const [xResult, redditResult] = await Promise.allSettled([
    fetch(buildScrapeDoUrl(SCRAPE_DO_TOKEN, xUrl, { 
      render: true, 
      waitUntil: "networkidle0" 
    })),
    fetch(buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl, { 
      render: false 
    }))
  ]);

  // Parse results...
}
Code reference: supabase/functions/fetch-twitter/index.ts:308-344
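Promise.allSettled never rejects, so the edge function must unwrap each result itself. A minimal sketch of that unwrapping (the helper name is illustrative, not from the codebase):

```typescript
// Unwrap Promise.allSettled results: keep fulfilled values, collect the
// reasons for rejected ones so a single blocked source never aborts the run.
function splitSettled<T>(results: PromiseSettledResult<T>[]): { values: T[]; errors: string[] } {
  const values: T[] = [];
  const errors: string[] = [];
  for (const result of results) {
    if (result.status === "fulfilled") values.push(result.value);
    else errors.push(String(result.reason));
  }
  return { values, errors };
}
```

This is why the X and Reddit fetches can run in parallel safely: a 403 from X still lets the Reddit result through.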

Cost Optimization

Credit Usage

  • Standard request: 1-5 credits
  • With rendering: 5-10 credits
  • With super (residential): 25-50 credits
Check your Scrape.do dashboard for real-time credit usage and pricing.

Best Practices

  1. Disable rendering when possible
    • Reddit JSON API: { render: false }
    • Simple HTML pages: { render: false }
  2. Use super only when needed
    • Start with super: false
    • Enable only if you get blocks (403/407)
  3. Cache results
    • Store posts in Supabase database
    • Implement client-side caching
    • Avoid duplicate requests for same query
  4. Batch requests
    • Use Promise.allSettled() for parallel fetching
    • Fetch X + Reddit simultaneously
  5. Set reasonable limits
    • X search: 15-25 posts
    • Reddit search: 25 posts (API limit)
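Practice 3 (cache results) can be as small as an in-memory TTL map keyed by source and query. This is a generic sketch, not the project's caching layer:

```typescript
// Minimal in-memory TTL cache for scrape results, keyed by e.g. "x:climate change"
// so repeated searches within the window don't burn Scrape.do credits.
class ScrapeCache<T> {
  private store = new Map<string, { value: T; expires: number }>();
  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.store.delete(key); // evict stale entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}
```

Usage: check the cache before calling fetchXPosts, and write the posts back on success, e.g. `const cached = cache.get(\`x:${query}\`)`.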

Troubleshooting

Empty Results from X

Problem: fetchXPosts() returns 0 posts

Solutions:
  1. Enable residential proxies:
    { super: true }
    
  2. Increase wait time:
    { waitUntil: 'networkidle0' }  // Most reliable
    
  3. Check for login wall:
    • X sometimes shows “Log in to X” page
    • Residential proxies (super: true) usually bypass this
  4. Verify HTML structure:
    • X changes their HTML frequently
    • Check parseXHtml() regex patterns
    • View raw HTML response in browser network tab

Reddit Returns HTML Instead of JSON

Problem: Reddit returns an HTML login page instead of JSON

Solutions:
  1. Enable residential proxies:
    { render: false, super: true }
    
  2. Verify URL:
    • Ensure .json extension: reddit.com/search.json
    • Check query encoding

HTTP 402 (Payment Required)

Problem: Scrape.do returns a 402 status

Solutions:
  1. Check credit balance in dashboard
  2. Add credits to your account
  3. Review monthly quota limits

HTTP 429 (Rate Limited)

Problem: Too many requests

Solutions:
  1. Implement request throttling
  2. Add delays between requests
  3. Upgrade to higher rate limit plan
  4. Use exponential backoff retry logic
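Steps 1, 2, and 4 above combine naturally into one helper: retry with doubling delays. A generic sketch (the sleep function is injectable so it can be tested without real waits):

```typescript
// Exponential backoff: retry a request with doubling delays,
// e.g. 500ms, 1s, 2s, before giving up.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(baseDelayMs * 2 ** attempt); // 500, 1000, 2000, ...
      }
    }
  }
  throw lastError;
}
```

Wrap any Scrape.do call, e.g. `await withBackoff(() => fetchXPosts(query, token))`, so transient 429s resolve without manual intervention.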

HTTP 403/407 (Blocked)

Problem: Target site is blocking requests

Solutions:
  1. Enable residential proxies:
    { super: true }
    
  2. Add geo-targeting:
    { super: true, geoCode: 'us' }
    
  3. Contact Scrape.do support if issue persists

Extending to New Platforms

To add support for new social platforms (e.g., Hacker News, LinkedIn):
Step 1: Create parser function

Add a parser in src/services/scrapeDoProvider.ts. The fetch function in the next step targets a JSON API, so the parser takes JSON rather than HTML:
export function parseHackerNewsJson(data: unknown, query: string): ScrapedPost[] {
  const posts: ScrapedPost[] = [];
  // Parse the JSON response to extract posts
  return posts;
}
Step 2: Create fetch function

export async function fetchHackerNewsPosts(
  query: string,
  token: string,
  options: ScrapeDoOptions = {}
): Promise<ScrapeDoResult> {
  const targetUrl = `https://hn.algolia.com/api/v1/search?query=${encodeURIComponent(query)}`;
  const apiUrl = buildApiUrl(token, targetUrl, { render: false, ...options });
  
  const res = await fetch(apiUrl);
  const data = await res.json();
  const posts = parseHackerNewsJson(data, query);
  
  return {
    posts,
    source: 'Hacker News via Scrape.do',
    status: posts.length > 0 ? 'success' : 'partial'
  };
}
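The parseHackerNewsJson helper called above can be fleshed out along these lines. This assumes Algolia's HN search response shape (a hits array with title, story_text, author, and created_at_i fields); verify the field names against the Algolia HN API docs, and note that ScrapedPost here is an illustrative stand-in for the project's type.

```typescript
// Illustrative post shape; the project's actual ScrapedPost may differ.
interface ScrapedPost {
  content: string;
  author: string;
  createdUtc: number; // unix seconds
  source: string;
}

// Possible parseHackerNewsJson: map Algolia search hits onto posts.
// The query parameter is unused here; it is kept to mirror the signature above.
function parseHackerNewsJson(json: any, query: string): ScrapedPost[] {
  const hits = json?.hits ?? [];
  return hits
    .filter((hit: any) => hit && (hit.title || hit.story_text))
    .map((hit: any) => ({
      content: [hit.title, hit.story_text].filter(Boolean).join("\n\n"),
      author: hit.author ?? "unknown",
      createdUtc: hit.created_at_i ?? 0,
      source: "hackernews",
    }));
}
```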
Step 3: Add to aggregator

Update fetchAllScrapeDoSources() to include the new source:
export async function fetchAllScrapeDoSources(
  query: string,
  token: string,
  sources: Array<'x' | 'reddit' | 'hackernews'> = ['x', 'reddit'],
  options: ScrapeDoOptions = {}
) {
  // Add new source to fetchers array
}
Step 4: Update UI

Add source chip in TopicDetail.tsx to display Hacker News status.

API Reference

buildApiUrl(token, targetUrl, options)

Builds the Scrape.do proxy URL. Parameters:
  • token (string): Scrape.do API token
  • targetUrl (string): URL to scrape
  • options (ScrapeDoOptions): Configuration options
Returns: string - Full Scrape.do API URL

fetchXPosts(query, token, options)

Fetch X (Twitter) posts. Returns: Promise<ScrapeDoResult>

fetchRedditPosts(query, token, options)

Fetch Reddit posts. Returns: Promise<ScrapeDoResult>

fetchAllScrapeDoSources(query, token, sources, options)

Fetch from multiple sources in parallel. Returns: Promise<{ results: ScrapeDoResult[], posts: ScrapedPost[] }>

Next Steps

Environment Variables

Configure all API tokens

API Keys

Get additional API keys for fallback sources

Scrape.do Docs

Official Scrape.do documentation

Scrape.do Dashboard

Monitor usage and credits
