Overview
The Reddit integration uses Scrape.do to access Reddit’s search API and HTML interface. Unlike X/Twitter, Reddit provides a .json endpoint that returns structured data without requiring JavaScript rendering, making it faster and more reliable.
Reddit data collection is implemented in supabase/functions/fetch-reddit/index.ts as a standalone edge function.
How It Works
Primary Strategy: Reddit JSON API
The main Twitter fetcher (fetch-twitter) includes Reddit scraping as part of its parallel fetch:
const redditUrl = `https://www.reddit.com/search.json?q=${encodeURIComponent(topic.query)}&sort=new&limit=25`;
const redditResult = await fetch(
  buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl, { render: false })
);
if (redditResult.ok) {
  const data = await redditResult.json();
  const redditPosts = parseRedditJson(data, topic.query);
  posts.push(...redditPosts);
}
Why render: false? Reddit’s JSON endpoint returns pure JSON without JavaScript. Disabling rendering reduces latency by ~60% and saves Scrape.do credits.
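The buildScrapeDoUrl helper is used throughout this page but defined elsewhere. As a rough sketch, it likely assembles the Scrape.do API URL with the target passed in the url query parameter; the exact shape here (parameter names token, url, render, super, geoCode) is an assumption based on the calls shown on this page:

```typescript
// Hypothetical sketch of buildScrapeDoUrl (the real helper may differ).
// Scrape.do takes the target URL url-encoded in the `url` parameter.
function buildScrapeDoUrl(
  token: string,
  targetUrl: string,
  opts: { render?: boolean; super?: boolean; geoCode?: string } = {},
): string {
  const params = new URLSearchParams({
    token,
    url: targetUrl,
    render: String(opts.render ?? true), // JS rendering on unless disabled
  });
  if (opts.super) params.set("super", "true");   // residential proxies
  if (opts.geoCode) params.set("geoCode", opts.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}
```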
Fallback: HTML Scraping
The standalone fetch-reddit function scrapes Reddit’s HTML when JSON parsing fails:
const redditUrl = `https://www.reddit.com/search/?q=${encodeURIComponent(topic.query)}&sort=new`;
const scrapeUrl = buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl);
const res = await fetch(scrapeUrl);
if (res.ok) {
  const html = await res.text();
  // Check for bot detection
  if (html.includes("Are you a human?") || html.length < 5000) {
    scrapeStatus = "blocked";
  } else {
    const sentences = parseRedditHtml(html);
    posts = sentences.map((text, i) => ({
      id: `reddit_scrape_${topic_id}_${i}`,
      text,
      author: "reddit_user",
      created_at: new Date().toISOString(),
    }));
  }
}
JSON Parser
The JSON parser extracts post titles and body text from Reddit’s API response:
function parseRedditJson(data: unknown, query: string): ScrapedPost[] {
  const posts: ScrapedPost[] = [];
  const record = data as Record<string, unknown>;
  const dataNode = record?.data as Record<string, unknown> | undefined;
  const children = (dataNode?.children as Array<Record<string, unknown>>) ?? [];
  for (const child of children) {
    const post = child?.data as Record<string, unknown> | undefined;
    if (!post) continue;
    const title = (post.title as string) ?? "";
    const selftext = (post.selftext as string) ?? "";
    const combined = [title, selftext].filter(Boolean).join(". ");
    const text = decodeEntities(combined.substring(0, 500));
    if (text.length > 10) {
      posts.push({
        id: `reddit_${post.id ?? posts.length}`,
        text,
        author: `u/${(post.author as string) ?? "redditor"}`,
        platform: "reddit",
        url: (post.url as string) ?? `https://www.reddit.com/search/?q=${encodeURIComponent(query)}`,
        postedAt: post.created_utc
          ? new Date((post.created_utc as number) * 1000).toISOString()
          : new Date().toISOString(),
      });
    }
  }
  return posts;
}
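The decodeEntities helper called above is not shown on this page. A minimal sketch of what it might do (the real implementation may cover more entities; Reddit's JSON escapes &, <, and > in title and selftext fields):

```typescript
// Sketch of an HTML-entity decoder (assumed implementation of decodeEntities).
function decodeEntities(text: string): string {
  const named: Record<string, string> = {
    "&amp;": "&",
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
  };
  return text
    // Numeric entities such as &#39; or &#8217;
    .replace(/&#(\d+);/g, (_, code) => String.fromCodePoint(Number(code)))
    // Common named entities
    .replace(/&amp;|&lt;|&gt;|&quot;/g, (m) => named[m]);
}
```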
JSON Response Structure
{
  "data": {
    "children": [
      {
        "data": {
          "id": "abc123",
          "title": "This is amazing!",
          "selftext": "Detailed post content here...",
          "author": "reddit_user",
          "url": "https://reddit.com/r/subreddit/comments/abc123",
          "created_utc": 1678901234,
          "subreddit": "technology",
          "score": 142,
          "num_comments": 23
        }
      }
    ]
  }
}
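The casts in parseRedditJson could be replaced by a minimal response type. This sketch covers only the fields the parser reads; Reddit's real payload carries many more:

```typescript
// Minimal shape for the fields parseRedditJson reads (sketch, not the full API).
interface RedditListing {
  data?: {
    children?: Array<{
      data?: {
        id?: string;
        title?: string;
        selftext?: string;
        author?: string;
        url?: string;
        created_utc?: number;
      };
    }>;
  };
}

// The structure above, typed against this interface:
const sample: RedditListing = {
  data: { children: [{ data: { id: "abc123", title: "This is amazing!" } }] },
};
```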
HTML Parser
When JSON parsing fails, the HTML parser targets Reddit’s web component structure:
function parseRedditHtml(html: string): string[] {
  const results: string[] = [];

  // Strategy 1: shreddit-post web-component attribute
  const shredditRe = /post-title="([^"]{20,300})"/gi;
  let match: RegExpExecArray | null;
  while ((match = shredditRe.exec(html)) !== null) {
    const title = decodeHtmlEntities(match[1]).trim();
    if (!results.includes(title)) results.push(title);
    if (results.length >= 20) break;
  }

  // Strategy 2: h3 headings (classic Reddit fallback)
  if (results.length < 3) {
    const h3Re = /<h3[^>]*>([\s\S]{20,300}?)<\/h3>/gi;
    while ((match = h3Re.exec(html)) !== null) {
      const text = stripHtml(match[1]).trim();
      if (text.length >= 20 && !results.includes(text)) results.push(text);
      if (results.length >= 20) break;
    }
  }

  // Strategy 3: paragraph snippets (post-body previews)
  if (results.length < 5) {
    const pRe = /<p[^>]*>([\s\S]{30,300}?)<\/p>/gi;
    while ((match = pRe.exec(html)) !== null) {
      const text = stripHtml(match[1]).trim();
      if (text.length >= 30 && !results.includes(text)) results.push(text);
      if (results.length >= 20) break;
    }
  }

  return results.slice(0, 20);
}
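The stripHtml and decodeHtmlEntities helpers referenced above are also defined elsewhere; minimal sketches of what they presumably do (assumed implementations, not the actual code):

```typescript
// Assumed implementation: decode numeric and common named entities.
function decodeHtmlEntities(text: string): string {
  return text
    .replace(/&#(\d+);/g, (_, code) => String.fromCodePoint(Number(code)))
    .replace(/&amp;/g, "&")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"');
}

// Assumed implementation: drop tags, then collapse leftover whitespace.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}
```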
Why multiple parsing strategies?
Reddit has multiple UI versions:
New Reddit uses <shreddit-post> web components
Classic Reddit uses <h3> tags for titles
Mobile Reddit uses <p> tags for body previews
The parser tries all strategies to maximize compatibility across versions.
Scrape.do Configuration
JSON Endpoint (Preferred)

buildScrapeDoUrl(
  SCRAPE_DO_TOKEN,
  "https://www.reddit.com/search.json?q=topic&sort=new&limit=25",
  { render: false } // No JavaScript needed for JSON
);

HTML Endpoint (Fallback)

buildScrapeDoUrl(
  SCRAPE_DO_TOKEN,
  "https://www.reddit.com/search/?q=topic&sort=new"
  // No options: the HTML search page requires JavaScript rendering
);
Error Detection
Reddit implements bot detection that returns HTTP 200 with challenge pages:
if (res.ok) {
  const html = await res.text();
  // Check for CAPTCHA or empty page
  if (html.includes("Are you a human?") ||
      (!html.toLowerCase().includes("reddit") && html.length < 5000)) {
    scrapeStatus = "blocked";
    console.warn("Scrape.do/Reddit: bot-check detected");
  } else {
    const sentences = parseRedditHtml(html);
    scrapeStatus = sentences.length > 0 ? "ok" : "blocked";
  }
}
Reddit’s “Are you a human?” page returns HTTP 200. Always inspect HTML content for challenge pages.
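The challenge-page check can be factored into a small predicate for reuse across fetchers; a sketch mirroring the conditions above:

```typescript
// Returns true if the HTML looks like a bot-check page rather than results.
// Thresholds mirror the inline check above.
function looksBlocked(html: string): boolean {
  if (html.includes("Are you a human?")) return true; // explicit CAPTCHA page
  // A tiny page that never mentions "reddit" is almost certainly a challenge
  return !html.toLowerCase().includes("reddit") && html.length < 5000;
}
```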
Database Persistence
let inserted = 0;
for (const post of posts) {
  const { error } = await supabase.from("posts").upsert(
    {
      topic_id,
      platform: "reddit",
      external_id: post.id,
      author: `@${post.author}`, // Normalize to @username format
      content: post.text,
      posted_at: post.created_at,
    },
    { onConflict: "platform,external_id" }
  );
  if (!error) inserted++;
}
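The per-post loop could also be collapsed into a single batched upsert, since supabase-js accepts an array of rows. A sketch under the assumption that the posts carry the same fields the loop reads:

```typescript
// Assumed post shape, matching the fields the loop above reads.
interface RedditPost {
  id: string;
  text: string;
  author: string;
  created_at: string;
}

// Map scraped posts to rows for a single batched upsert.
function toRows(topic_id: string, posts: RedditPost[]) {
  return posts.map((post) => ({
    topic_id,
    platform: "reddit",
    external_id: post.id,
    author: `@${post.author}`, // Normalize to @username format
    content: post.text,
    posted_at: post.created_at,
  }));
}

// const { error } = await supabase
//   .from("posts")
//   .upsert(toRows(topic_id, posts), { onConflict: "platform,external_id" });
```

One batched call saves a database round-trip per post, at the cost of the per-row inserted counter the loop maintains.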
Rate Limits
Reddit API Limits
Reddit’s official API limit is 60 requests/minute per IP. Scrape.do’s proxy rotation helps avoid this limit.
Scrape.do Limits
| Plan | Requests/Month | Cost per Request |
|---|---|---|
| Free | 1,000 | $0 |
| Starter | 100,000 | ~$0.001 |
| Pro | 1,000,000 | ~$0.0005 |
Reddit scraping without rendering (render: false) uses half the credits of rendered requests.
Response Format

A successful invocation returns a summary payload:

{
  "success": true,
  "fetched": 20,
  "inserted": 18,
  "info": "Scrape.do (Reddit)",
  "scrape_status": "ok"
}
Status Codes
| scrape_status | Meaning | Action |
|---|---|---|
| ok | Successfully scraped and parsed | Store data |
| blocked | Bot detection or CAPTCHA | Enable super: true |
| quota | Scrape.do quota exceeded | Wait or upgrade plan |
| no_token | Missing SCRAPE_DO_TOKEN | Set environment variable |
| error | Network/parsing error | Check logs |
Comparison: JSON vs HTML
| Aspect | JSON Endpoint | HTML Endpoint |
|---|---|---|
| Speed | ~500ms | ~2000ms |
| Reliability | 95%+ | 70% (depends on bot detection) |
| Data Quality | Full metadata (author, timestamps, URLs) | Title/text only |
| Scrape.do Credits | 0.5x | 1x |
| Rendering Required | No | Yes |
| Best For | Production | Fallback |
Always prefer the JSON endpoint unless Reddit blocks it. The HTML parser is a last resort.
Environment Setup
SCRAPE_DO_TOKEN=your_scrape_do_token_here
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your_service_key
Testing
supabase functions serve fetch-reddit --env-file .env
curl -X POST http://localhost:54321/functions/v1/fetch-reddit \
  -H "Authorization: Bearer ${SUPABASE_SERVICE_ROLE_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"topic_id": "your-topic-uuid"}'
Common Issues
Reddit returns non-JSON (scrape_status: error)
Cause: Reddit detected a datacenter IP and returned an HTML challenge page.
Solution: Enable residential proxies:

buildScrapeDoUrl(token, redditUrl, {
  render: false,
  super: true, // Enable residential proxies
});
Empty results despite 200 OK
Causes:
Query has no Reddit results (check reddit.com/search manually)
HTML structure changed (update parser regexes)
Shadow ban or rate limit
Debug: save the raw response and inspect it. buildScrapeDoUrl is a TypeScript helper, so in the shell build the Scrape.do URL by hand:

# Save the JSON response (target URL must be url-encoded)
curl "https://api.scrape.do/?token=$SCRAPE_DO_TOKEN&render=false&url=https%3A%2F%2Fwww.reddit.com%2Fsearch.json%3Fq%3Dtest" > reddit.json
jq '.data.children | length' reddit.json
Bot detection (Are you a human?)
Solution: Enable Scrape.do’s residential proxies:

{ super: true, geoCode: "us" }
This increases success rate from ~70% to ~95% but doubles credit cost.
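Since residential proxies double the credit cost, one option is to escalate only after a blocked result rather than on every request. A hedged sketch of that decision (option names follow the buildScrapeDoUrl calls on this page; the retry wiring is an assumption):

```typescript
// Scrape options as used with buildScrapeDoUrl elsewhere on this page.
type ScrapeOpts = { render: boolean; super?: boolean; geoCode?: string };

// Decide whether (and how) to retry after a scrape attempt.
// Returns the options for the retry, or null if no retry should happen.
function nextScrapeOptions(status: "ok" | "blocked", current: ScrapeOpts): ScrapeOpts | null {
  if (status === "ok") return null;   // success: nothing to retry
  if (current.super) return null;     // already on residential proxies: give up
  return { ...current, super: true, geoCode: "us" }; // escalate exactly once
}
```

This keeps the cheap datacenter path as the default and only pays the 2x credit cost for the queries that actually get blocked.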
Integration with the Twitter Fetcher

Reddit scraping is automatically included when fetching Twitter data:
// In supabase/functions/fetch-twitter/index.ts
const [xResult, redditResult] = await Promise.allSettled([
  fetch(buildScrapeDoUrl(token, xUrl, { render: true })),
  fetch(buildScrapeDoUrl(token, redditUrl, { render: false })),
]);

// Both sources are merged
posts.push(...xPosts, ...redditPosts);
This parallel approach reduces total latency from ~4s to ~2s compared to sequential fetching.
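Because Promise.allSettled never rejects, each result still has to be unwrapped before merging, so that one platform's failure does not discard the other's posts. A small generic helper (sketch) for that step:

```typescript
// Collect only the fulfilled values from a Promise.allSettled result,
// silently dropping rejected entries (e.g. a blocked X fetch).
function fulfilledValues<T>(results: PromiseSettledResult<T>[]): T[] {
  return results
    .filter((r): r is PromiseFulfilledResult<T> => r.status === "fulfilled")
    .map((r) => r.value);
}
```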
Next Steps
YouTube Integration: learn about YouTube comment collection
Sentiment Analysis: how Reddit posts are analyzed