Overview
The `fetch-reddit` function collects Reddit posts and discussions related to a topic by scraping rendered HTML from Reddit search results. It uses multiple parsing strategies to extract post titles and body text from both the new and old Reddit designs.
Endpoint
This function is designed to be called internally by the `analyze-topic` orchestrator.
Request
UUID of the topic to fetch Reddit posts for
Example Request
Response
Whether Reddit posts were successfully scraped and stored
Number of posts extracted from Reddit HTML
Number of posts successfully inserted into database
Description of the result (e.g., “Scrape.do (Reddit)”, “Reddit unavailable: quota”)
Status code: `ok` | `blocked` | `quota` | `no_token` | `error`
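The response fields above can be summarized as a TypeScript type. A minimal sketch: `success` and `scrape_status` are named in this document, while `posts_found`, `posts_inserted`, and `source` are hypothetical names for the other documented fields.

```typescript
// Sketch of the response shape. `success` and `scrape_status` are named in
// this document; `posts_found`, `posts_inserted`, and `source` are
// assumed names for the remaining fields.
type ScrapeStatus = "ok" | "blocked" | "quota" | "no_token" | "error";

interface FetchRedditResponse {
  success: boolean;        // whether posts were scraped and stored
  posts_found: number;     // posts extracted from Reddit HTML
  posts_inserted: number;  // posts inserted into the database
  source: string;          // e.g. "Scrape.do (Reddit)" or "Reddit unavailable: quota"
  scrape_status: ScrapeStatus;
}

const example: FetchRedditResponse = {
  success: true,
  posts_found: 14,
  posts_inserted: 12,
  source: "Scrape.do (Reddit)",
  scrape_status: "ok",
};
```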
Success Response
Quota Exceeded Response
Bot Detection Response
Error Response
Scraping Strategy
Target URL
Searches Reddit’s global search sorted by newest.
Scrape.do Configuration
Enables JavaScript rendering for dynamic content
Uses premium proxy pool for better success rate
Waits for all network requests to finish before capturing HTML
Routes request through US-based proxy servers
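The configuration above translates into query parameters on the Scrape.do API URL. A sketch, assuming Scrape.do's documented parameter names (`render`, `super`, `waitUntil`, `geoCode`) and a plain Reddit search target; the exact search URL used by the function is not shown in this document.

```typescript
// Sketch: build a Scrape.do request URL matching the configuration above.
// Parameter names (render, super, waitUntil, geoCode) and the Reddit search
// URL shape are assumptions for illustration.
function buildScrapeUrl(token: string, topic: string): string {
  const target = `https://www.reddit.com/search/?q=${encodeURIComponent(topic)}&sort=new`;
  const params = new URLSearchParams({
    token,
    url: target,
    render: "true",            // JavaScript rendering for dynamic content
    super: "true",             // premium proxy pool
    waitUntil: "networkidle2", // wait for network requests to finish
    geoCode: "us",             // route through US-based proxies
  });
  return `https://api.scrape.do/?${params.toString()}`;
}
```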
HTML Parsing Strategies
The function uses four cascading strategies to handle different Reddit layouts.
Strategy 1: Web Component Attributes
Target: New Reddit design using `<shreddit-post>` custom elements
Extracts: post titles from the `post-title` attribute
Yield: 0-20 posts
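Strategy 1 can be sketched as a regex pass over the rendered HTML, pulling the `post-title` attribute out of each `<shreddit-post>` element; the actual function's parsing code is not shown in this document, so treat this as illustrative.

```typescript
// Sketch of Strategy 1: extract post titles from <shreddit-post> elements
// via their post-title attribute. Illustrative, not the function's code.
function extractShredditTitles(html: string, limit = 20): string[] {
  const titles: string[] = [];
  const re = /<shreddit-post\b[^>]*\bpost-title="([^"]+)"/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null && titles.length < limit) {
    titles.push(m[1]);
  }
  return titles;
}
```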
Strategy 2: H3 Headings
Target: `<h3>` heading tags on classic Reddit or the mobile fallback
Activation: Only if Strategy 1 finds fewer than 3 posts
Yield: 0-20 posts
Strategy 3: Paragraph Snippets
Target: Post body previews and comments in `<p>` elements (minimum 30 characters)
Activation: Only if Strategies 1+2 find fewer than 5 posts combined
Yield: 0-20 posts
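The activation thresholds above (Strategy 2 only below 3 posts, Strategy 3 only below 5 combined) can be sketched as a small cascade; the extractor signatures are illustrative assumptions.

```typescript
// Sketch of the cascade: run Strategy 2 only when Strategy 1 found fewer
// than 3 posts, and Strategy 3 only when Strategies 1+2 found fewer than 5.
// Extractor signatures are illustrative.
type Extractor = (html: string) => string[];

function cascade(html: string, s1: Extractor, s2: Extractor, s3: Extractor): string[] {
  let posts = s1(html);
  if (posts.length < 3) posts = posts.concat(s2(html));
  if (posts.length < 5) posts = posts.concat(s3(html));
  return posts.slice(0, 20); // each strategy yields at most 20 posts
}
```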
Strategy 4: Generic Sentence Extraction
Target: Any text content in the page
Method: Strips all HTML and extracts sentences
- Minimum length: 30 characters
- Maximum length: 300 characters
- Excludes URLs, mentions, hashtags, timestamps
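Strategy 4's rules (30-300 characters; no URLs, mentions, hashtags, or timestamps) can be sketched as a strip-and-filter pass. The sentence-splitting and exclusion patterns here are assumptions, not the function's actual regexes.

```typescript
// Sketch of Strategy 4: strip tags, split into sentences, and keep ones
// between 30 and 300 characters that contain no URLs, mentions, hashtags,
// or HH:MM timestamps. Patterns are illustrative.
function extractSentences(html: string, limit = 20): string[] {
  const text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ");
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length >= 30 && s.length <= 300)
    .filter((s) => !/https?:\/\/|@\w+|#\w+|\d{1,2}:\d{2}/.test(s))
    .slice(0, limit);
}
```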
Bot Detection
The function detects Reddit’s anti-bot measures.
Post Schema
Extracted posts are stored in the `posts` table.
Deduplication
Posts are upserted using the (`platform`, `external_id`) composite key.
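The effect of that composite-key upsert can be sketched in memory: a later post with the same (`platform`, `external_id`) pair replaces the earlier one instead of duplicating it. `platform` and `external_id` come from this document; the other field is illustrative.

```typescript
// In-memory sketch of the (platform, external_id) upsert semantics:
// last write wins per composite key. The `content` field is illustrative.
interface PostRow {
  platform: string;     // "reddit" for posts from this function
  external_id: string;  // stable ID derived from the scraped post
  content: string;
}

function upsertPosts(existing: PostRow[], incoming: PostRow[]): PostRow[] {
  const byKey = new Map<string, PostRow>();
  for (const p of [...existing, ...incoming]) {
    byKey.set(`${p.platform}:${p.external_id}`, p); // later rows replace earlier
  }
  return [...byKey.values()];
}
```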
Error Handling
Scrape Status Codes
| Status | HTTP Code | Meaning | Action |
|---|---|---|---|
| `ok` | 200 | Successfully scraped | Data inserted |
| `blocked` | 403/407 | Reddit bot detection triggered | Returns 0 posts |
| `quota` | 402/429 | Scrape.do API quota exceeded | Returns 0 posts |
| `no_token` | - | Missing `SCRAPE_DO_TOKEN` | Skips scraping |
| `error` | Other | Network/parsing error | Returns 0 posts |
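The table above amounts to a small mapping from the upstream HTTP status to a `scrape_status` code; a sketch of that mapping (the function's real signature is an assumption):

```typescript
// Sketch of mapping the upstream HTTP status to the scrape_status codes
// tabulated above. Signature is illustrative.
function toScrapeStatus(httpCode: number | null, hasToken: boolean): string {
  if (!hasToken) return "no_token";                           // SCRAPE_DO_TOKEN missing
  if (httpCode === 200) return "ok";                          // scraped successfully
  if (httpCode === 403 || httpCode === 407) return "blocked"; // bot detection
  if (httpCode === 402 || httpCode === 429) return "quota";   // quota exceeded
  return "error";                                             // network/parsing error
}
```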
All error statuses still return HTTP 200 with `success: false` to allow graceful orchestrator continuation.
Performance
Typical execution time: 6-12 seconds
- Fast path (cached/simple page): ~6-8s
- Rendered JavaScript: ~10-12s
- Bot detection retry: ~15-20s (usually fails)
Rate Limits
Reddit rate limits (handled by Scrape.do):
- Aggressive bot detection
- IP-based throttling
- CAPTCHA challenges
Environment Variables
`SCRAPE_DO_TOKEN` - API token from the Scrape.do dashboard. Get one at https://scrape.do
Auto-injected by Supabase
Auto-injected by Supabase
Best Practices
- Monitor `scrape_status` to detect quota exhaustion early
- Combine with `fetch-twitter` for broader coverage (it also scrapes Reddit)
- Expect `blocked` status frequently; Reddit has aggressive anti-scraping
- Consider the Parallel AI fallback in `fetch-twitter` as an alternative Reddit source
Limitations
These limitations are due to Reddit’s dynamic rendering and frequent HTML structure changes.
Related Functions
- `analyze-topic` - Orchestrator calling this function
- `fetch-twitter` - Also scrapes Reddit as a secondary source
- `analyze-sentiment` - Processes collected Reddit posts