Overview
This page documents the complete data flow through the llms.txt Generator system, from initial user request to final llms.txt file delivery.User-Initiated Crawl Flow
Step-by-Step Flow Diagram
Detailed Step Breakdown
User Input
User enters website URL and configuration:
url: Target website (e.g.,https://example.com)maxPages: Page limit (default: 50)descLength: Description length (default: 500)enableAutoUpdate: Schedule recrawls (default: false)recrawlIntervalMinutes: Recrawl frequency (default: 360)llmEnhance: AI optimization (default: false)useBrightdata: Use proxy for JS sites (default: false)
Authentication
Backend validates credentials:
- API Key: Query parameter
?api_key=xxx(main.py:98-101) - JWT Token: Alternative via
?token=xxx(main.py:92-96) - Rejection: Closes connection with code 1008 if invalid
Sitemap Detection
Crawler attempts to find sitemap (llm_crawler.py:30-38):
- Try
/sitemap.xml - Try
/sitemap_index.xml - Parse XML and extract URLs
- If found: Populate queue with sitemap URLs
- If not found: Use BFS crawl starting from homepage
Page Fetching
For each URL in queue (llm_crawler.py:66-95):Attempt 1: httpx (fast)
- HTTP GET request with 10s timeout
- Check if HTML has meaningful content
- Success: Use this HTML
- Only if httpx fails or no meaningful content
- Launch headless Chromium browser
- Wait for network idle
- Extract rendered HTML
- Only if enabled and previous attempts fail
- Use Brightdata’s scraping browser
- Bypass anti-bot protections
Content Extraction
BeautifulSoup parses HTML (text.py):Title Extraction:
- Try
<title>tag - Fallback to
<h1>tag - Fallback to “Untitled”
- Try
<meta name="description"> - Fallback to
<meta property="og:description"> - Fallback to first paragraph
- Extract visible text from body
- Remove script/style tags
- Clean whitespace
- Create snippet (truncate to desc_length)
- Extract all
<a href>links - Normalize URLs (relative → absolute)
- Filter to same domain
- Add to queue if not visited
llms.txt Generation
Format pages into llms.txt spec (formatter.py):Structure:Features:
- Hierarchical section grouping
- Clean URL generation (prefer .md links)
- Content truncation at semantic boundaries
- Metadata header with timestamp
LLM Enhancement (Optional)
If
llmEnhance: true (main.py:138-151):- Send llms.txt to OpenRouter API
- Use Grok 4.1-Fast model
- Prompt: Summarize, optimize, improve readability
- Replace original with enhanced version
- Stream enhancement status to user
Real-time Updates
Backend streams messages to frontend via WebSocket:Log Messages:Result:Public URL:Errors:
Scheduled Recrawl Flow
Automatic Update Diagram
Recrawl Logic Details
Site Selection Query
Site Selection Query
- They’ve been crawled before (
last_crawled_at IS NOT NULL) - The next scheduled crawl time has passed
- Limited to 100 sites per run
Change Detection
Change Detection
Error Handling
Error Handling
- Site unreachable: Mark as failed, retry next cycle
- Crawl timeout: Partial results saved if any pages succeeded
- R2 upload fails: Log error, keep old URL in database
- Database errors: Log and continue to next site
WebSocket Message Types
Message Format Reference
Data Storage Details
Supabase Schema
R2 Storage Structure
Each recrawl creates a new file with a timestamp. Old versions remain accessible unless manually deleted.
Performance Characteristics
Typical Crawl Times
- Small site (5-10 pages): 10-30 seconds
- Medium site (20-50 pages): 30-90 seconds
- Large site (50+ pages with sitemap): 1-3 minutes
- JS-heavy site (with Brightdata): 2-5 minutes
Bottlenecks
- Page Fetch Speed: Limited by target site’s response time
- Playwright Launch: ~2-3 seconds per browser instance
- LLM Enhancement: ~5-10 seconds per request
- R2 Upload: Negligible (under 1 second)
Optimization Strategies
- Sitemap first: Faster than BFS crawl
- httpx before Playwright: 10x faster for static sites
- Concurrent fetching: asyncio for parallel requests
- Skip unchanged sites: Hash comparison prevents redundant uploads
Next Steps
API Reference
Complete WebSocket API documentation
Deployment
Deploy your own instance