
Overview

The fetch-reddit function collects Reddit posts and discussions related to a topic by scraping rendered HTML from Reddit search results. It uses multiple parsing strategies to extract post titles and body text from both new and old Reddit designs.

Endpoint

POST https://your-project.supabase.co/functions/v1/fetch-reddit
This function is designed to be called internally by the analyze-topic orchestrator.

Request

topic_id (string, required)
UUID of the topic to fetch Reddit posts for

Example Request

curl -X POST https://your-project.supabase.co/functions/v1/fetch-reddit \
  -H "Authorization: Bearer YOUR_SERVICE_ROLE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"topic_id": "a3f5e8b1-4c2d-4e9f-8a1b-3c5d6e7f8a9b"}'
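
For callers written in TypeScript (e.g. another edge function or a script), the same request can be built programmatically. This is an illustrative sketch, not code from the function itself; the project URL and key are placeholders:

```typescript
// TypeScript equivalent of the curl example above.
// FUNCTION_URL is a placeholder for your project's endpoint.
const FUNCTION_URL = "https://your-project.supabase.co/functions/v1/fetch-reddit";

function buildFetchRedditRequest(topicId: string, serviceRoleKey: string) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${serviceRoleKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ topic_id: topicId }),
  };
}

// Usage: const res = await fetch(FUNCTION_URL, buildFetchRedditRequest(id, key));
```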

Response

success (boolean, required)
Whether Reddit posts were successfully scraped and stored
fetched (number, required)
Number of posts extracted from Reddit HTML
inserted (number, required)
Number of posts successfully inserted into the database
info (string, required)
Description of the result (e.g., "Scrape.do (Reddit)", "Reddit unavailable: quota")
scrape_status (string, required)
Status code: "ok" | "blocked" | "quota" | "no_token" | "error"

Success Response

{
  "success": true,
  "fetched": 18,
  "inserted": 17,
  "info": "Scrape.do (Reddit)",
  "scrape_status": "ok"
}

Quota Exceeded Response

{
  "success": false,
  "fetched": 0,
  "inserted": 0,
  "info": "Reddit unavailable: quota",
  "scrape_status": "quota"
}

Bot Detection Response

{
  "success": false,
  "fetched": 0,
  "inserted": 0,
  "info": "Reddit unavailable: blocked",
  "scrape_status": "blocked"
}

Error Response

{
  "success": false,
  "error": "Topic not found"
}

Scraping Strategy

Target URL

Searches Reddit’s global search sorted by newest:
const redditUrl = `https://www.reddit.com/search/?q=${encodeURIComponent(topic.query)}&sort=new`;

Scrape.do Configuration

const scrapeUrl = buildScrapeDoUrl(SCRAPE_DO_TOKEN, redditUrl, {
  render: true,
  super: true,
  waitUntil: 'networkidle0',
  geoCode: 'us'
});
render (boolean, default: true)
Enables JavaScript rendering for dynamic content
super (boolean, default: true)
Uses premium proxy pool for better success rate
waitUntil (string, default: "networkidle0")
Waits for all network requests to finish before capturing HTML
geoCode (string, default: "us")
Routes request through US-based proxy servers
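
buildScrapeDoUrl is a shared helper whose implementation is not shown in this document. A minimal sketch, assuming it simply forwards the token, target URL, and options as query parameters to Scrape.do's standard API endpoint (api.scrape.do):

```typescript
// Hypothetical sketch of buildScrapeDoUrl -- the real helper lives in shared code.
// Scrape.do expects the token, target URL, and options as query parameters.
function buildScrapeDoUrl(
  token: string,
  targetUrl: string,
  opts: { render?: boolean; super?: boolean; waitUntil?: string; geoCode?: string } = {},
): string {
  const params = new URLSearchParams({ token, url: targetUrl });
  if (opts.render) params.set("render", "true");
  if (opts.super) params.set("super", "true");
  if (opts.waitUntil) params.set("waitUntil", opts.waitUntil);
  if (opts.geoCode) params.set("geoCode", opts.geoCode);
  return `https://api.scrape.do/?${params.toString()}`;
}
```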

HTML Parsing Strategies

The function uses 4 cascading strategies to handle different Reddit layouts:

Strategy 1: Web Component Attributes

Target: New Reddit design using <shreddit-post> custom elements
const shredditRe = /post-title="([^"]{20,300})"/gi;
Extracts: Post titles from the post-title attribute
Yield: 0-20 posts
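
Applied to the fetched HTML, Strategy 1 might look like the following sketch. extractShredditTitles and its entity decoding are illustrative, not the function's actual code:

```typescript
// Strategy 1: pull post titles from <shreddit-post post-title="..."> attributes.
const shredditRe = /post-title="([^"]{20,300})"/gi;

function extractShredditTitles(html: string): string[] {
  const titles: string[] = [];
  for (const m of html.matchAll(shredditRe)) {
    // Decode the most common HTML entities found in attribute values.
    const title = m[1]
      .replace(/&amp;/g, "&")
      .replace(/&#39;/g, "'")
      .replace(/&quot;/g, '"');
    titles.push(title.trim());
  }
  return titles.slice(0, 20); // cap at 20 posts, matching the documented yield
}
```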

Strategy 2: H3 Headings

Target: Classic Reddit or mobile fallback
const h3Re = /<h3[^>]*>([\s\S]{20,300}?)<\/h3>/gi;
Extracts: Post titles from <h3> heading tags
Activation: Only if Strategy 1 finds fewer than 3 posts
Yield: 0-20 posts

Strategy 3: Paragraph Snippets

Target: Post body previews and comments
const pRe = /<p[^>]*>([\s\S]{30,300}?)<\/p>/gi;
Extracts: Text from <p> elements (minimum 30 characters)
Activation: Only if Strategies 1+2 find fewer than 5 posts combined
Yield: 0-20 posts

Strategy 4: Generic Sentence Extraction

Target: Any text content in the page
Method: Strips all HTML and extracts sentences
const sentences = extractSentences(plainText, 30, 300);
Filters:
  • Minimum length: 30 characters
  • Maximum length: 300 characters
  • Excludes URLs, mentions, hashtags, timestamps
Activation: Only if Strategies 1-3 find fewer than 5 posts combined
Yield: 0-20 posts
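
extractSentences is a shared helper not defined in this document. A plausible sketch, assuming a naive sentence split followed by the filters listed above:

```typescript
// Hypothetical sketch of extractSentences -- the real helper is shared code.
// Splits plain text into sentences and keeps those passing the length and content filters.
function extractSentences(text: string, minLen: number, maxLen: number): string[] {
  return text
    .split(/(?<=[.!?])\s+/) // naive sentence-boundary split
    .map((s) => s.trim())
    .filter((s) => s.length >= minLen && s.length <= maxLen)
    // Drop sentences containing URLs, @mentions, #hashtags, or hh:mm timestamps.
    .filter((s) => !/https?:\/\/|@\w+|#\w+|\d{1,2}:\d{2}/.test(s))
    .slice(0, 20); // cap at 20, matching the documented yield
}
```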

Bot Detection

The function detects Reddit’s anti-bot measures:
if (
  html.includes('Are you a human?') || 
  (!html.toLowerCase().includes('reddit') && html.length < 5000)
) {
  scrapeStatus = 'blocked';
}

Post Schema

Extracted posts are stored in the posts table:
interface RedditPost {
  topic_id: string;           // UUID reference
  platform: 'reddit';         // Always "reddit"
  external_id: string;        // e.g., "reddit_scrape_abc_0"
  author: '@reddit_user';     // Generic placeholder
  content: string;            // Post title or body text
  posted_at: string;          // Current timestamp (ISO 8601)
}
Limitation: The current implementation does not extract individual post IDs, timestamps, or author usernames from the HTML. All posts are attributed to @reddit_user and timestamped at scrape time.
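
A sketch of how extracted text snippets could be mapped to rows matching this schema. toPostRows and the scrapeHash parameter are illustrative names; the "abc" segment in the example external_id stands in for a per-scrape hash:

```typescript
interface RedditPost {
  topic_id: string;
  platform: "reddit";
  external_id: string;
  author: "@reddit_user";
  content: string;
  posted_at: string;
}

// Map extracted snippets to post rows. The external_id scheme
// ("reddit_scrape_<hash>_<index>") follows the schema example above.
function toPostRows(topicId: string, texts: string[], scrapeHash: string): RedditPost[] {
  const now = new Date().toISOString(); // all rows share the scrape timestamp
  return texts.map((content, i) => ({
    topic_id: topicId,
    platform: "reddit" as const,
    external_id: `reddit_scrape_${scrapeHash}_${i}`,
    author: "@reddit_user" as const, // generic placeholder, per the limitation above
    content,
    posted_at: now,
  }));
}
```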

Deduplication

Posts are upserted on the (platform, external_id) composite key:
await supabase.from('posts').upsert(postData, {
  onConflict: 'platform,external_id'
});

Error Handling

Scrape Status Codes

| Status | HTTP Code | Meaning | Action |
|---|---|---|---|
| ok | 200 | Successfully scraped | Data inserted |
| blocked | 403/407 | Reddit bot detection triggered | Returns 0 posts |
| quota | 402/429 | Scrape.do API quota exceeded | Returns 0 posts |
| no_token | - | Missing SCRAPE_DO_TOKEN | Skips scraping |
| error | Other | Network/parsing error | Returns 0 posts |
Every outcome is returned with HTTP 200; failure statuses set success: false rather than a non-2xx code, so the orchestrator can continue gracefully.
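
Since every outcome arrives as HTTP 200, callers should branch on scrape_status rather than the HTTP status. An illustrative sketch of orchestrator-side handling (the status values match the table above; the specific reactions are hypothetical):

```typescript
type ScrapeStatus = "ok" | "blocked" | "quota" | "no_token" | "error";

interface FetchRedditResult {
  success: boolean;
  fetched?: number;
  inserted?: number;
  info?: string;
  scrape_status?: ScrapeStatus;
}

// Illustrative orchestrator-side handling: degrade gracefully instead of failing.
function handleRedditResult(res: FetchRedditResult): { proceed: boolean; warning?: string } {
  switch (res.scrape_status) {
    case "ok":
      return { proceed: true };
    case "quota":
      return { proceed: true, warning: "Scrape.do quota exhausted; skipping Reddit" };
    case "blocked":
      return { proceed: true, warning: "Reddit bot detection triggered; 0 posts" };
    case "no_token":
      return { proceed: true, warning: "SCRAPE_DO_TOKEN not set; Reddit skipped" };
    default:
      return { proceed: true, warning: res.info ?? "Reddit scrape failed" };
  }
}
```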

Performance

Typical execution time: 6-12 seconds
  • Fast path (cached/simple page): ~6-8s
  • Rendered JavaScript: ~10-12s
  • Bot detection retry: ~15-20s (usually fails)

Rate Limits

Scrape.do Rate Limits:
  • Free tier: ~1,000 requests/month
  • Starter: ~10,000 requests/month
  • Growth: ~50,000 requests/month
Each call to this function consumes 1 Scrape.do request.
Reddit Rate Limits (handled by Scrape.do):
  • Aggressive bot detection
  • IP-based throttling
  • CAPTCHA challenges

Environment Variables

SCRAPE_DO_TOKEN (string, required)
API token from the Scrape.do dashboard. Get one at https://scrape.do
SUPABASE_URL (string, required)
Auto-injected by Supabase
SUPABASE_SERVICE_ROLE_KEY (string, required)
Auto-injected by Supabase

Best Practices

  • Monitor scrape_status to detect quota exhaustion early
  • Combine with fetch-twitter for broader coverage (it also scrapes Reddit)
  • Expect blocked status frequently; Reddit's anti-scraping measures are aggressive
  • Consider the Parallel AI fallback in fetch-twitter as an alternative Reddit source

Limitations

Current Limitations:
  • No individual post IDs extracted
  • No author usernames extracted
  • No original timestamps extracted
  • All posts share generic metadata
  • No comment threads extracted (only top-level posts)
These limitations are due to Reddit’s dynamic rendering and frequent HTML structure changes.
