Overview
The llms.txt Generator uses an intelligent crawling system that automatically discovers and extracts content from websites. The crawler employs a breadth-first search (BFS) algorithm with sitemap support, configurable limits, and fallback mechanisms for JavaScript-heavy sites.
Key Features
Sitemap Discovery
Automatically detects and parses XML sitemaps for efficient crawling
BFS Traversal
Breadth-first search ensures comprehensive coverage of site structure
Smart Filtering
Skips irrelevant pages (admin, API, media files) automatically
Content Detection
Validates meaningful content and escalates to browser rendering when needed
How It Works
Sitemap Check
The crawler first attempts to locate a sitemap at /sitemap.xml or /sitemap_index.xml (see backend/crawler/llm_crawler.py). If found, URLs are extracted and filtered to match the base domain.
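The sitemap step can be sketched as follows. This is a minimal illustration, not the actual code from backend/crawler/llm_crawler.py; the function name is hypothetical, and it assumes the standard sitemap XML namespace:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard sitemap namespace used by /sitemap.xml and /sitemap_index.xml
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap_urls(xml_text: str, base_url: str) -> list[str]:
    """Extract <loc> entries from a sitemap and keep only same-domain URLs."""
    base_host = urlparse(base_url).netloc
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return [u for u in urls if urlparse(u).netloc == base_host]
```

A sitemap index would be handled by applying the same parsing recursively to each child sitemap's URL.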
BFS Crawling
Pages are visited in breadth-first order until limits are reached (backend/crawler/llm_crawler.py).
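The BFS loop described above can be sketched like this. It is a simplified stand-in for the real crawler: `fetch_links` is a hypothetical callable that returns the hrefs found on a page, and the same-domain filter mirrors the base-domain rule from the sitemap step:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def bfs_crawl(base_url: str, fetch_links, max_pages: int = 50) -> list[str]:
    """Visit pages breadth-first, staying on the base domain.

    fetch_links(url) -> iterable of hrefs found on that page.
    """
    base_host = urlparse(base_url).netloc
    visited: set[str] = set()
    order: list[str] = []
    queue = deque([base_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for href in fetch_links(url):
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == base_host and link not in visited:
                queue.append(link)
    return order
```

The `max_pages` guard is what stops the crawl once the configured page limit is reached.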
Smart Filtering
The crawler automatically skips irrelevant URLs using pattern matching (backend/crawler/scout.py).
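A pattern-based filter of this kind might look like the sketch below. The specific patterns here are illustrative assumptions; the real list lives in SKIP_PATTERNS in crawler/scout.py:

```python
import re

# Illustrative patterns only - the authoritative list is SKIP_PATTERNS
# in crawler/scout.py. Covers admin pages, API endpoints, and media files.
SKIP_PATTERNS = [
    r"/wp-admin/",
    r"/admin/",
    r"/api/",
    r"\.(png|jpe?g|gif|svg|pdf|zip|mp4)$",
]
_SKIP_RE = re.compile("|".join(SKIP_PATTERNS), re.IGNORECASE)

def should_skip(url: str) -> bool:
    """Return True if the URL matches any skip pattern."""
    return bool(_SKIP_RE.search(url))
```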
Why filter? Skipping administrative pages, media files, and API endpoints ensures the crawler focuses on human-readable content that’s relevant for LLM indexing.
Content Detection
The crawler validates that fetched pages contain meaningful content (backend/crawler/llm_crawler.py).
For JavaScript-heavy sites that return empty content, the crawler automatically escalates to Bright Data’s Scraping Browser with full browser rendering.
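One way to implement that validation is a simple visible-text heuristic, sketched below. The function name and the 200-character threshold are assumptions for illustration; a page failing the check would be the trigger for escalating to browser rendering:

```python
import re

def has_meaningful_content(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic check: strip tags and require a minimum amount of visible text.

    Pages below the threshold are treated as empty/blocked, signalling that
    the crawler should escalate to full browser rendering.
    """
    # Drop script/style bodies, then all remaining tags
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html,
                  flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) >= min_text_chars
```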
Configuration
Crawl behavior can be customized via WebSocket parameters:
- maxPages: Maximum number of pages to crawl before stopping
- Character limit for page descriptions/snippets
- useBrightdata: Enable Bright Data Scraping Browser for JavaScript-heavy sites
Example Usage
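A crawl request might be assembled and sent like this. The endpoint path and exact message shape are hypothetical; only the maxPages and useBrightdata parameter names come from this document:

```python
import json

# Hypothetical request payload - field names beyond maxPages and
# useBrightdata depend on your deployment.
crawl_request = {
    "url": "https://docs.example.com",
    "maxPages": 100,
    "useBrightdata": False,
}
message = json.dumps(crawl_request)
# message would then be sent over the WebSocket connection,
# e.g. via websockets' `await ws.send(message)`.
```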
Crawl State Management
The crawler maintains state to prevent duplicate visits (backend/crawler/state.py):
- visited: Set of already-crawled URLs (prevents loops)
- queue: Deque for BFS ordering (FIFO)
- base_url: Root domain for same-domain filtering
Performance Characteristics
Sitemap Mode
- Speed: Fast (parallel sitemap parsing)
- Coverage: Complete (all sitemap URLs)
- Best for: Large sites with comprehensive sitemaps
BFS Mode
- Speed: Moderate (sequential page crawling)
- Coverage: Limited by max_pages
- Best for: Small sites or those without sitemaps
Best Practices
Set appropriate page limits
Start with 50 pages for small sites, increase to 200-500 for larger documentation sites. Very high limits (>1000) may time out.
Use sitemap hints
If you know a site has a sitemap, the crawler will automatically find it. For sitemap indexes, all child sitemaps are parsed recursively.
Enable Bright Data for SPAs
Single-page applications (React, Vue, Angular) require JavaScript rendering. Set useBrightdata: true for these sites.
Monitor crawl logs
Watch for "✗ httpx returned empty/blocked content" messages - these indicate pages that need browser rendering.
Troubleshooting
Crawler only finds a few pages
Cause: Site may block the user agent or require JavaScript rendering.
Solution: Enable useBrightdata: true to use browser-based crawling.
Crawl times out before reaching maxPages
Cause: Each page takes time to fetch and parse. Very high page counts may exceed the WebSocket timeout.
Solution: Reduce maxPages or increase the timeout in your WebSocket client.
Wrong pages are being crawled
Cause: BFS discovers links from navigation/footer that aren't relevant.
Solution: The skip patterns filter most issues. For additional filtering, consider adjusting SKIP_PATTERNS in crawler/scout.py.
Next Steps
Real-time Streaming
Learn how crawl progress is streamed live via WebSocket
LLM Enhancement
Enhance crawled content with AI-powered optimization