
Overview

The Reddit integration uses Scrape.do to access Reddit’s search API and HTML interface. Unlike X/Twitter, Reddit provides a .json endpoint that returns structured data without requiring JavaScript rendering, making it faster and more reliable.
Reddit data collection is implemented in supabase/functions/fetch-reddit/index.ts as a standalone edge function.

How It Works

Primary Strategy: Reddit JSON API

The main Twitter fetcher (fetch-twitter) includes Reddit scraping as part of its parallel fetch:
const redditUrl = `https://www.reddit.com/search.json?q=${encodeURIComponent(topic.query)}&sort=new&limit=25`;

const redditResult = await fetch(
  buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl, { render: false })
);

if (redditResult.ok) {
  const data = await redditResult.json();
  const redditPosts = parseRedditJson(data, topic.query);
  posts.push(...redditPosts);
}
Why render: false? Reddit’s JSON endpoint returns pure JSON without JavaScript. Disabling rendering reduces latency by ~60% and saves Scrape.do credits.
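The buildScrapeDoUrl helper used throughout this page is not shown in this document. A minimal sketch follows, assuming Scrape.do's standard proxy-API shape (https://api.scrape.do/?token=...&url=...) with its render, super, and geoCode query parameters; the real helper in the repo may differ:

// Sketch of the buildScrapeDoUrl helper (an assumption, not the repo's code).
// Scrape.do proxies the target URL passed in the `url` query parameter.
interface ScrapeDoOptions {
  render?: boolean;   // JavaScript rendering (costs extra credits)
  super?: boolean;    // residential proxies
  geoCode?: string;   // e.g. "us"
}

function buildScrapeDoUrl(
  token: string,
  targetUrl: string,
  opts: ScrapeDoOptions = {}
): string {
  const params = new URLSearchParams({ token, url: targetUrl });
  if (opts.render !== undefined) params.set("render", String(opts.render));
  if (opts.super) params.set("super", "true");
  if (opts.geoCode) params.set("geoCode", opts.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}

URLSearchParams handles percent-encoding of the target URL, so the query string survives being nested inside another URL.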

Fallback: HTML Scraping

The standalone fetch-reddit function scrapes Reddit’s HTML when JSON parsing fails:
const redditUrl = `https://www.reddit.com/search/?q=${encodeURIComponent(topic.query)}&sort=new`;
const scrapeUrl = buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl);

const res = await fetch(scrapeUrl);
if (res.ok) {
  const html = await res.text();
  
  // Check for bot detection
  if (html.includes("Are you a human?") || html.length < 5000) {
    scrapeStatus = "blocked";
  } else {
    const sentences = parseRedditHtml(html);
    posts = sentences.map((text, i) => ({
      id: `reddit_scrape_${topic_id}_${i}`,
      text,
      author: "reddit_user",
      created_at: new Date().toISOString(),
    }));
  }
}

JSON Parser

The JSON parser extracts post titles and body text from Reddit’s API response:
function parseRedditJson(data: unknown, query: string): ScrapedPost[] {
  const posts: ScrapedPost[] = [];
  const record = data as Record<string, unknown>;
  const dataNode = record?.data as Record<string, unknown> | undefined;
  const children = (dataNode?.children as Array<Record<string, unknown>>) ?? [];

  for (const child of children) {
    const post = child?.data as Record<string, unknown> | undefined;
    if (!post) continue;
    
    const title = (post.title as string) ?? "";
    const selftext = (post.selftext as string) ?? "";
    const combined = [title, selftext].filter(Boolean).join(". ");
    const text = decodeEntities(combined.substring(0, 500));
    
    if (text.length > 10) {
      posts.push({
        id: `reddit_${post.id ?? posts.length}`,
        text,
        author: `u/${(post.author as string) ?? "redditor"}`,
        platform: "reddit",
        url: (post.url as string) ?? `https://www.reddit.com/search/?q=${encodeURIComponent(query)}`,
        postedAt: post.created_utc
          ? new Date((post.created_utc as number) * 1000).toISOString()
          : new Date().toISOString(),
      });
    }
  }

  return posts;
}

JSON Response Structure

{
  "data": {
    "children": [
      {
        "data": {
          "id": "abc123",
          "title": "This is amazing!",
          "selftext": "Detailed post content here...",
          "author": "reddit_user",
          "url": "https://reddit.com/r/subreddit/comments/abc123",
          "created_utc": 1678901234,
          "subreddit": "technology",
          "score": 142,
          "num_comments": 23
        }
      }
    ]
  }
}
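To make the nesting concrete, here is a small illustration (using a trimmed copy of the payload above) of the data.children[i].data path the parser walks:

// Illustration only: navigating Reddit's listing structure shown above.
const sample = {
  data: {
    children: [
      {
        data: {
          id: "abc123",
          title: "This is amazing!",
          selftext: "Detailed post content here...",
          author: "reddit_user",
          created_utc: 1678901234,
        },
      },
    ],
  },
};

// Each listing entry wraps the post under a second `data` key
const titles = sample.data.children.map((child) => child.data.title);
const firstAuthor = `u/${sample.data.children[0].data.author}`;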

HTML Parser

When JSON parsing fails, the HTML parser targets Reddit’s web component structure:
function parseRedditHtml(html: string): string[] {
  const results: string[] = [];

  // Strategy 1: shreddit-post web-component attribute
  const shredditRe = /post-title="([^"]{20,300})"/gi;
  let match: RegExpExecArray | null;

  while ((match = shredditRe.exec(html)) !== null) {
    const title = decodeHtmlEntities(match[1]).trim();
    if (!results.includes(title)) results.push(title);
    if (results.length >= 20) break;
  }

  // Strategy 2: h3 headings (classic Reddit fallback)
  if (results.length < 3) {
    const h3Re = /<h3[^>]*>([\s\S]{20,300}?)<\/h3>/gi;
    while ((match = h3Re.exec(html)) !== null) {
      const text = stripHtml(match[1]).trim();
      if (text.length >= 20 && !results.includes(text)) results.push(text);
      if (results.length >= 20) break;
    }
  }

  // Strategy 3: paragraph snippets (post-body previews)
  if (results.length < 5) {
    const pRe = /<p[^>]*>([\s\S]{30,300}?)<\/p>/gi;
    while ((match = pRe.exec(html)) !== null) {
      const text = stripHtml(match[1]).trim();
      if (text.length >= 30 && !results.includes(text)) results.push(text);
      if (results.length >= 20) break;
    }
  }

  return results.slice(0, 20);
}
Reddit has multiple UI versions:
  • New Reddit uses <shreddit-post> web components
  • Classic Reddit uses <h3> tags for titles
  • Mobile Reddit uses <p> tags for body previews
The parser tries all strategies to maximize compatibility across versions.
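The parsers above lean on small string helpers (decodeHtmlEntities, stripHtml, and the JSON parser's decodeEntities) that are not shown in this document. Minimal sketches under that assumption; the real implementations in the edge functions may handle more entities:

// Hypothetical sketches of the string helpers the parsers rely on.
function decodeHtmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#x27;|&#39;/g, "'")
    .replace(/&amp;/g, "&"); // decode &amp; last to avoid double-decoding
}

function stripHtml(html: string): string {
  // Drop tags, then collapse the whitespace the markup leaves behind
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}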

Scrape.do Configuration

buildScrapeDoUrl(SCRAPE_DO_TOKEN, 
  "https://www.reddit.com/search.json?q=topic&sort=new&limit=25",
  { 
    render: false  // No JavaScript needed for JSON
  }
);

Error Detection

Reddit implements bot detection that returns HTTP 200 with challenge pages:
if (res.ok) {
  const html = await res.text();
  
  // Check for CAPTCHA or empty page
  if (html.includes("Are you a human?") || 
      (!html.toLowerCase().includes("reddit") && html.length < 5000)) {
    scrapeStatus = "blocked";
    console.warn("Scrape.do/Reddit: bot-check detected");
  } else {
    const sentences = parseRedditHtml(html);
    scrapeStatus = sentences.length > 0 ? "ok" : "blocked";
  }
}
Reddit’s “Are you a human?” page returns HTTP 200. Always inspect HTML content for challenge pages.
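The inline check above can be extracted into a pure predicate, which makes the heuristic unit-testable in isolation (a sketch; the thresholds mirror the snippet above):

// Heuristic: challenge pages are short and rarely mention "reddit" in the body.
function looksBlocked(html: string): boolean {
  if (html.includes("Are you a human?")) return true;
  return !html.toLowerCase().includes("reddit") && html.length < 5000;
}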

Database Persistence

let inserted = 0;
for (const post of posts) {
  const { error } = await supabase.from("posts").upsert(
    {
      topic_id,
      platform: "reddit",
      external_id: post.id,
      author: `@${post.author}`,  // Normalize to @username format
      content: post.text,
      posted_at: post.created_at,
    },
    { onConflict: "platform,external_id" }
  );
  if (!error) inserted++;
}
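The loop above issues one round-trip per post; supabase-js upsert also accepts an array, so the rows can be built once and written in a single call. A sketch, assuming a post shape matching the fields used above:

// Sketch: build all rows first, then upsert them in one round-trip.
interface RedditPost {
  id: string;
  text: string;
  author: string;
  created_at: string;
}

function toRows(posts: RedditPost[], topicId: string) {
  return posts.map((post) => ({
    topic_id: topicId,
    platform: "reddit",
    external_id: post.id,
    author: `@${post.author}`,
    content: post.text,
    posted_at: post.created_at,
  }));
}

// const { error } = await supabase
//   .from("posts")
//   .upsert(toRows(posts, topic_id), { onConflict: "platform,external_id" });

The trade-off: a single batch call is faster, but one bad row fails the whole batch, whereas the per-row loop above skips bad rows individually.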

Rate Limits

Reddit API Limits

Reddit’s official API limit is 60 requests/minute per IP. Scrape.do’s proxy rotation helps avoid this limit.

Scrape.do Limits

Plan      Requests/Month   Cost per Request
Free      1,000            $0
Starter   100,000          ~$0.001
Pro       1,000,000        ~$0.0005
Reddit scraping without rendering (render: false) uses half the credits of rendered requests.

Response Format

{
  "success": true,
  "fetched": 20,
  "inserted": 18,
  "info": "Scrape.do (Reddit)",
  "scrape_status": "ok"
}

Status Codes

scrape_status   Meaning                           Action
ok              Successfully scraped and parsed   Store data
blocked         Bot detection or CAPTCHA          Enable super: true
quota           Scrape.do quota exceeded          Wait or upgrade plan
no_token        Missing SCRAPE_DO_TOKEN           Set environment variable
error           Network/parsing error             Check logs
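The "Action" column can be expressed as a small dispatch function for callers of the edge function (a sketch; the action names are assumptions, not part of the API):

type ScrapeStatus = "ok" | "blocked" | "quota" | "no_token" | "error";

// Map each scrape_status to the follow-up action from the table above.
function nextAction(status: ScrapeStatus): string {
  switch (status) {
    case "ok":       return "store";
    case "blocked":  return "retry-with-super";
    case "quota":    return "backoff-or-upgrade";
    case "no_token": return "set-env-var";
    case "error":    return "check-logs";
  }
}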

Comparison: JSON vs HTML

Aspect               JSON Endpoint                              HTML Endpoint
Speed                ~500ms                                     ~2000ms
Reliability          95%+                                       ~70% (depends on bot detection)
Data Quality         Full metadata (author, timestamps, URLs)   Title/text only
Scrape.do Credits    0.5x                                       1x
Rendering Required   No                                         Yes
Best For             Production                                 Fallback
Always prefer the JSON endpoint unless Reddit blocks it. The HTML parser is a last resort.

Environment Setup

SCRAPE_DO_TOKEN=your_scrape_do_token_here
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your_service_key

Testing

supabase functions serve fetch-reddit --env-file .env

curl -X POST http://localhost:54321/functions/v1/fetch-reddit \
  -H "Authorization: Bearer ${SUPABASE_SERVICE_ROLE_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"topic_id": "your-topic-uuid"}'

Common Issues

scrape_status is "blocked"

Cause: Reddit detected a datacenter IP and returned an HTML challenge page.
Solution:
buildScrapeDoUrl(token, redditUrl, { 
  render: false,
  super: true  // Enable residential proxies
});
Zero posts parsed

Causes:
  1. Query has no Reddit results (check reddit.com/search manually)
  2. HTML structure changed (update parser regexes)
  3. Shadow ban or rate limit
Debug:
# Save the raw response (substitute your token; Scrape.do proxies the target URL)
curl "https://api.scrape.do/?token=$SCRAPE_DO_TOKEN&url=https%3A%2F%2Freddit.com%2Fsearch.json%3Fq%3Dtest" -o reddit.json
jq '.data.children | length' reddit.json
Persistent blocking

Solution: Enable Scrape.do’s residential proxies:
{ super: true, geoCode: "us" }
This increases the success rate from ~70% to ~95% but doubles the credit cost.

Integration with Twitter Fetcher

Reddit scraping is automatically included when fetching Twitter data:
// In supabase/functions/fetch-twitter/index.ts
const [xResult, redditResult] = await Promise.allSettled([
  fetch(buildScrapeDoUrl(token, xUrl, { render: true })),
  fetch(buildScrapeDoUrl(token, redditUrl, { render: false }))
]);

// Both sources are merged
posts.push(...xPosts, ...redditPosts);
This parallel approach reduces total latency from ~4s to ~2s compared to sequential fetching.
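Promise.allSettled never rejects, so each entry must be unwrapped before use; a rejected Reddit fetch should not discard the X results. A sketch of safe unwrapping (the usage lines are illustrative, not the repo's code):

// Unwrap an allSettled result: rejected fetches become null instead of throwing.
function settledValue<T>(result: PromiseSettledResult<T>): T | null {
  return result.status === "fulfilled" ? result.value : null;
}

// const [xRes, redditRes] = (await Promise.allSettled([...])).map(settledValue);
// if (redditRes) { /* parse Reddit posts; otherwise keep the X results alone */ }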

Next Steps

YouTube Integration

Learn about YouTube comment collection

Sentiment Analysis

How Reddit posts are analyzed
