
Overview

The Reddit integration uses Scrape.do to access Reddit’s search API and HTML interface. Unlike X/Twitter, Reddit provides a .json endpoint that returns structured data without requiring JavaScript rendering, making it faster and more reliable.
Reddit data collection is implemented in supabase/functions/fetch-reddit/index.ts as a standalone edge function.

How It Works

Primary Strategy: Reddit JSON API

The main Twitter fetcher (fetch-twitter) includes Reddit scraping as part of its parallel fetch:
const redditUrl = `https://www.reddit.com/search.json?q=${encodeURIComponent(topic.query)}&sort=new&limit=25`;

const redditResult = await fetch(
  buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl, { render: false })
);

if (redditResult.ok) {
  const data = await redditResult.json();
  const redditPosts = parseRedditJson(data, topic.query);
  posts.push(...redditPosts);
}
Why render: false? Reddit’s JSON endpoint returns pure JSON without JavaScript. Disabling rendering reduces latency by ~60% and saves Scrape.do credits.
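The buildScrapeDoUrl helper used throughout this page is not shown in this document. A minimal sketch follows, assuming Scrape.do's standard proxy-API shape (https://api.scrape.do/?token=...&url=...) with its render, super, and geoCode query parameters; the real helper in the repo may differ:

// Sketch of the buildScrapeDoUrl helper (an assumption, not the repo's code).
// Scrape.do proxies the target URL passed in the `url` query parameter.
interface ScrapeDoOptions {
  render?: boolean;   // JavaScript rendering (costs extra credits)
  super?: boolean;    // residential proxies
  geoCode?: string;   // e.g. "us"
}

function buildScrapeDoUrl(
  token: string,
  targetUrl: string,
  opts: ScrapeDoOptions = {}
): string {
  const params = new URLSearchParams({ token, url: targetUrl });
  if (opts.render !== undefined) params.set("render", String(opts.render));
  if (opts.super) params.set("super", "true");
  if (opts.geoCode) params.set("geoCode", opts.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}

URLSearchParams handles percent-encoding of the target URL, so the query string survives being nested inside another URL.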

Fallback: HTML Scraping

The standalone fetch-reddit function scrapes Reddit’s HTML when JSON parsing fails:
const redditUrl = `https://www.reddit.com/search/?q=${encodeURIComponent(topic.query)}&sort=new`;
const scrapeUrl = buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl);

const res = await fetch(scrapeUrl);
if (res.ok) {
  const html = await res.text();
  
  // Check for bot detection
  if (html.includes("Are you a human?") || html.length < 5000) {
    scrapeStatus = "blocked";
  } else {
    const sentences = parseRedditHtml(html);
    posts = sentences.map((text, i) => ({
      id: `reddit_scrape_${topic_id}_${i}`,
      text,
      author: "reddit_user",
      created_at: new Date().toISOString(),
    }));
  }
}

JSON Parser

The JSON parser extracts post titles and body text from Reddit’s API response:
function parseRedditJson(data: unknown, query: string): ScrapedPost[] {
  const posts: ScrapedPost[] = [];
  const record = data as Record<string, unknown>;
  const dataNode = record?.data as Record<string, unknown> | undefined;
  const children = (dataNode?.children as Array<Record<string, unknown>>) ?? [];

  for (const child of children) {
    const post = child?.data as Record<string, unknown> | undefined;
    if (!post) continue;
    
    const title = (post.title as string) ?? "";
    const selftext = (post.selftext as string) ?? "";
    const combined = [title, selftext].filter(Boolean).join(". ");
    const text = decodeEntities(combined.substring(0, 500));
    
    if (text.length > 10) {
      posts.push({
        id: `reddit_${post.id ?? posts.length}`,
        text,
        author: `u/${(post.author as string) ?? "redditor"}`,
        platform: "reddit",
        url: (post.url as string) ?? `https://www.reddit.com/search/?q=${encodeURIComponent(query)}`,
        postedAt: post.created_utc
          ? new Date((post.created_utc as number) * 1000).toISOString()
          : new Date().toISOString(),
      });
    }
  }

  return posts;
}

JSON Response Structure

{
  "data": {
    "children": [
      {
        "data": {
          "id": "abc123",
          "title": "This is amazing!",
          "selftext": "Detailed post content here...",
          "author": "reddit_user",
          "url": "https://reddit.com/r/subreddit/comments/abc123",
          "created_utc": 1678901234,
          "subreddit": "technology",
          "score": 142,
          "num_comments": 23
        }
      }
    ]
  }
}
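To make the nesting concrete, here is a small illustration (using a trimmed copy of the payload above) of the data.children[i].data path the parser walks:

// Illustration only: navigating Reddit's listing structure shown above.
const sample = {
  data: {
    children: [
      {
        data: {
          id: "abc123",
          title: "This is amazing!",
          selftext: "Detailed post content here...",
          author: "reddit_user",
          created_utc: 1678901234,
        },
      },
    ],
  },
};

// Each listing entry wraps the post under a second `data` key
const titles = sample.data.children.map((child) => child.data.title);
const firstAuthor = `u/${sample.data.children[0].data.author}`;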

HTML Parser

When JSON parsing fails, the HTML parser targets Reddit’s web component structure:
function parseRedditHtml(html: string): string[] {
  const results: string[] = [];

  // Strategy 1: shreddit-post web-component attribute
  const shredditRe = /post-title="([^"]{20,300})"/gi;
  let match: RegExpExecArray | null;

  while ((match = shredditRe.exec(html)) !== null) {
    const title = decodeHtmlEntities(match[1]).trim();
    if (!results.includes(title)) results.push(title);
    if (results.length >= 20) break;
  }

  // Strategy 2: h3 headings (classic Reddit fallback)
  if (results.length < 3) {
    const h3Re = /<h3[^>]*>([\s\S]{20,300}?)<\/h3>/gi;
    while ((match = h3Re.exec(html)) !== null) {
      const text = stripHtml(match[1]).trim();
      if (text.length >= 20 && !results.includes(text)) results.push(text);
      if (results.length >= 20) break;
    }
  }

  // Strategy 3: paragraph snippets (post-body previews)
  if (results.length < 5) {
    const pRe = /<p[^>]*>([\s\S]{30,300}?)<\/p>/gi;
    while ((match = pRe.exec(html)) !== null) {
      const text = stripHtml(match[1]).trim();
      if (text.length >= 30 && !results.includes(text)) results.push(text);
      if (results.length >= 20) break;
    }
  }

  return results.slice(0, 20);
}
Reddit has multiple UI versions:
  • New Reddit uses <shreddit-post> web components
  • Classic Reddit uses <h3> tags for titles
  • Mobile Reddit uses <p> tags for body previews
The parser tries all strategies to maximize compatibility across versions.
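The parsers above lean on small string helpers (decodeHtmlEntities, stripHtml, and the JSON parser's decodeEntities) that are not shown in this document. Minimal sketches under that assumption; the real implementations in the edge functions may handle more entities:

// Hypothetical sketches of the string helpers the parsers rely on.
function decodeHtmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#x27;|&#39;/g, "'")
    .replace(/&amp;/g, "&"); // decode &amp; last to avoid double-decoding
}

function stripHtml(html: string): string {
  // Drop tags, then collapse the whitespace the markup leaves behind
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}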

Scrape.do Configuration

buildScrapeDoUrl(SCRAPE_DO_TOKEN, 
  "https://www.reddit.com/search.json?q=topic&sort=new&limit=25",
  { 
    render: false  // No JavaScript needed for JSON
  }
);

Error Detection

Reddit implements bot detection that returns HTTP 200 with challenge pages:
if (res.ok) {
  const html = await res.text();
  
  // Check for CAPTCHA or empty page
  if (html.includes("Are you a human?") || 
      (!html.toLowerCase().includes("reddit") && html.length < 5000)) {
    scrapeStatus = "blocked";
    console.warn("Scrape.do/Reddit: bot-check detected");
  } else {
    const sentences = parseRedditHtml(html);
    scrapeStatus = sentences.length > 0 ? "ok" : "blocked";
  }
}
Reddit’s “Are you a human?” page returns HTTP 200. Always inspect HTML content for challenge pages.
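The inline check above can be extracted into a pure predicate, which makes the heuristic unit-testable in isolation (a sketch; the thresholds mirror the snippet above):

// Heuristic: challenge pages are short and rarely mention "reddit" in the body.
function looksBlocked(html: string): boolean {
  if (html.includes("Are you a human?")) return true;
  return !html.toLowerCase().includes("reddit") && html.length < 5000;
}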

Database Persistence

let inserted = 0;
for (const post of posts) {
  const { error } = await supabase.from("posts").upsert(
    {
      topic_id,
      platform: "reddit",
      external_id: post.id,
      author: `@${post.author}`,  // Normalize to @username format
      content: post.text,
      posted_at: post.created_at,
    },
    { onConflict: "platform,external_id" }
  );
  if (!error) inserted++;
}
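The loop above issues one round-trip per post; supabase-js upsert also accepts an array, so the rows can be built once and written in a single call. A sketch, assuming a post shape matching the fields used above:

// Sketch: build all rows first, then upsert them in one round-trip.
interface RedditPost {
  id: string;
  text: string;
  author: string;
  created_at: string;
}

function toRows(posts: RedditPost[], topicId: string) {
  return posts.map((post) => ({
    topic_id: topicId,
    platform: "reddit",
    external_id: post.id,
    author: `@${post.author}`,
    content: post.text,
    posted_at: post.created_at,
  }));
}

// const { error } = await supabase
//   .from("posts")
//   .upsert(toRows(posts, topic_id), { onConflict: "platform,external_id" });

The trade-off: a single batch call is faster, but one bad row fails the whole batch, whereas the per-row loop above skips bad rows individually.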

Rate Limits

Reddit API Limits

Reddit’s official API limit is 60 requests/minute per IP. Scrape.do’s proxy rotation helps avoid this limit.

Scrape.do Limits

Plan      Requests/Month   Cost per Request
Free      1,000            $0
Starter   100,000          ~$0.001
Pro       1,000,000        ~$0.0005
Reddit scraping without rendering (render: false) uses half the credits of rendered requests.

Response Format

{
  "success": true,
  "fetched": 20,
  "inserted": 18,
  "info": "Scrape.do (Reddit)",
  "scrape_status": "ok"
}

Status Codes

scrape_status   Meaning                           Action
ok              Successfully scraped and parsed   Store data
blocked         Bot detection or CAPTCHA          Enable super: true
quota           Scrape.do quota exceeded          Wait or upgrade plan
no_token        Missing SCRAPE_DO_TOKEN           Set environment variable
error           Network/parsing error             Check logs
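The "Action" column can be expressed as a small dispatch function for callers of the edge function (a sketch; the action names are assumptions, not part of the API):

type ScrapeStatus = "ok" | "blocked" | "quota" | "no_token" | "error";

// Map each scrape_status to the follow-up action from the table above.
function nextAction(status: ScrapeStatus): string {
  switch (status) {
    case "ok":       return "store";
    case "blocked":  return "retry-with-super";
    case "quota":    return "backoff-or-upgrade";
    case "no_token": return "set-env-var";
    case "error":    return "check-logs";
  }
}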

Comparison: JSON vs HTML

Aspect               JSON Endpoint                              HTML Endpoint
Speed                ~500ms                                     ~2000ms
Reliability          95%+                                       ~70% (depends on bot detection)
Data Quality         Full metadata (author, timestamps, URLs)   Title/text only
Scrape.do Credits    0.5x                                       1x
Rendering Required   No                                         Yes
Best For             Production                                 Fallback
Always prefer the JSON endpoint unless Reddit blocks it. The HTML parser is a last resort.

Environment Setup

SCRAPE_DO_TOKEN=your_scrape_do_token_here
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your_service_key

Testing

supabase functions serve fetch-reddit --env-file .env

curl -X POST http://localhost:54321/functions/v1/fetch-reddit \
  -H "Authorization: Bearer ${SUPABASE_SERVICE_ROLE_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"topic_id": "your-topic-uuid"}'

Common Issues

scrape_status is "blocked"

Cause: Reddit detected a datacenter IP and returned an HTML challenge page.
Solution:
buildScrapeDoUrl(token, redditUrl, { 
  render: false,
  super: true  // Enable residential proxies
});
Zero posts parsed

Causes:
  1. Query has no Reddit results (check reddit.com/search manually)
  2. HTML structure changed (update parser regexes)
  3. Shadow ban or rate limit
Debug:
# Save the raw response (substitute your token; Scrape.do proxies the target URL)
curl "https://api.scrape.do/?token=$SCRAPE_DO_TOKEN&url=https%3A%2F%2Freddit.com%2Fsearch.json%3Fq%3Dtest" -o reddit.json
jq '.data.children | length' reddit.json
Persistent blocking

Solution: Enable Scrape.do’s residential proxies:
{ super: true, geoCode: "us" }
This increases the success rate from ~70% to ~95% but doubles the credit cost.

Integration with Twitter Fetcher

Reddit scraping is automatically included when fetching Twitter data:
// In supabase/functions/fetch-twitter/index.ts
const [xResult, redditResult] = await Promise.allSettled([
  fetch(buildScrapeDoUrl(token, xUrl, { render: true })),
  fetch(buildScrapeDoUrl(token, redditUrl, { render: false }))
]);

// Both sources are merged
posts.push(...xPosts, ...redditPosts);
This parallel approach reduces total latency from ~4s to ~2s compared to sequential fetching.
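Promise.allSettled never rejects, so each entry must be unwrapped before use; a rejected Reddit fetch should not discard the X results. A sketch of safe unwrapping (the usage lines are illustrative, not the repo's code):

// Unwrap an allSettled result: rejected fetches become null instead of throwing.
function settledValue<T>(result: PromiseSettledResult<T>): T | null {
  return result.status === "fulfilled" ? result.value : null;
}

// const [xRes, redditRes] = (await Promise.allSettled([...])).map(settledValue);
// if (redditRes) { /* parse Reddit posts; otherwise keep the X results alone */ }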

Next Steps

YouTube Integration

Learn about YouTube comment collection

Sentiment Analysis

How Reddit posts are analyzed
