Overview
The LLMCrawler class crawls websites starting from a base URL, extracting meaningful content from each page. It supports:
- Sitemap-based crawling (tries /sitemap.xml and /sitemap_index.xml)
- Breadth-first search (BFS) fallback when no sitemap exists
- Dual-fetching strategy: httpx → Bright Data Scraping Browser
- Content validation to detect blocked/empty responses
- Configurable page limits and description lengths
Classes
PageInfo
Data class representing extracted information from a single page. Fields:
- The full URL of the page
- Extracted page title (from <title> or <h1>)
- Page description (from meta tags or content)
- Truncated text snippet from page content
LLMCrawler
Main crawler class that orchestrates the crawling process.
Constructor
Parameters:
- The starting URL to crawl (will be normalized)
- Maximum number of pages to crawl
- Maximum length for generated snippets
- Async function for logging progress (receives string messages)
- API key for Bright Data Scraping Browser
- Whether to enable the Bright Data fallback
- Bright Data zone to use
- Optional password for Bright Data authentication
Methods
run()
Executes the crawl and returns extracted page information.
Returns: List of successfully crawled pages with extracted content
Behavior:
- Attempts to load a sitemap (/sitemap.xml or /sitemap_index.xml)
- If a sitemap exists, uses those URLs; otherwise performs a BFS crawl
- For each URL:
- First tries httpx (fast)
- Validates content for meaningful data
- Falls back to Bright Data if needed
- Extracts title, description, and text
- Discovers and queues new links
- Stops when max_pages is reached or the queue is exhausted
- Attempts up to max_pages * 3 URLs to handle failures
- Logs usage stats if Bright Data was used
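The loop described above can be sketched as follows. This is an illustrative stand-in, not the real implementation: normalize_url is a trivial placeholder, and link_graph substitutes for actual fetching and link extraction (a missing entry models a failed fetch).

```python
from collections import deque

def normalize_url(url):
    # Placeholder normalization; the real crawler's rules are richer.
    return url.rstrip("/")

def crawl(base_url, max_pages, link_graph):
    # link_graph: dict mapping a URL to the links found on that page.
    queue = deque([normalize_url(base_url)])
    visited = set()
    pages = []
    attempts = 0
    max_attempts = max_pages * 3  # tolerate failed fetches

    while queue and len(pages) < max_pages and attempts < max_attempts:
        url = queue.popleft()
        if url in visited:
            continue  # visited tracking prevents duplicate crawls
        visited.add(url)
        attempts += 1

        links = link_graph.get(url)
        if links is None:
            continue  # failed fetch: skip but keep crawling

        pages.append(url)
        for link in links:  # discover and queue new links (BFS order)
            link = normalize_url(link)
            if link not in visited:
                queue.append(link)
    return pages
```

Because the attempt cap is a multiple of max_pages rather than a hard equality, a few broken pages do not stop the crawl early.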
Usage Examples
Basic Crawl
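A hedged sketch of a minimal crawl. The import path llm_crawler and the keyword names base_url and max_pages are assumptions (the docs above do not name them), as are the PageInfo attribute names url and title; the try/except guard just lets the sketch degrade gracefully where the library is absent.

```python
import asyncio

# Assumption: module path and constructor keywords are illustrative.
try:
    from llm_crawler import LLMCrawler
except ImportError:
    LLMCrawler = None  # library not installed in this environment

async def main():
    if LLMCrawler is None:
        return []
    crawler = LLMCrawler(base_url="https://example.com", max_pages=10)
    pages = await crawler.run()  # list of PageInfo objects
    for page in pages:
        print(page.url, "-", page.title)  # assumed attribute names
    return pages

if __name__ == "__main__":
    asyncio.run(main())
```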
With Bright Data Fallback
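A sketch with the Tier 2 fallback enabled. The module path and the keyword names (use_bright_data, bright_data_api_key, bright_data_zone) are assumptions chosen to mirror the constructor parameters listed above, not confirmed API.

```python
import asyncio

# Assumption: module path and parameter names are illustrative.
try:
    from llm_crawler import LLMCrawler
except ImportError:
    LLMCrawler = None  # library not installed in this environment

async def main():
    if LLMCrawler is None:
        return []
    crawler = LLMCrawler(
        base_url="https://js-heavy-site.example",
        max_pages=20,
        use_bright_data=True,            # enable the Tier 2 fallback
        bright_data_api_key="YOUR_KEY",  # Scraping Browser credentials
        bright_data_zone="scraping_browser1",
    )
    return await crawler.run()

if __name__ == "__main__":
    asyncio.run(main())
```

With the fallback enabled, pages that httpx cannot fetch cleanly are retried through the Scraping Browser, and usage stats are logged at the end of the crawl.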
Sitemap-First Crawl
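Sitemap discovery needs no extra configuration: run() tries /sitemap.xml and /sitemap_index.xml before falling back to BFS. The sketch below only shows that pointing the crawler at a sitemap-enabled site is enough; the import path and keyword names remain assumptions.

```python
import asyncio

# Assumption: module path and constructor keywords are illustrative.
try:
    from llm_crawler import LLMCrawler
except ImportError:
    LLMCrawler = None  # library not installed in this environment

async def main():
    if LLMCrawler is None:
        return []
    # If https://blog.example.com/sitemap.xml exists, URLs come from the
    # sitemap rather than from link discovery.
    crawler = LLMCrawler(base_url="https://blog.example.com", max_pages=50)
    return await crawler.run()

if __name__ == "__main__":
    asyncio.run(main())
```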
Fetching Strategy
The crawler uses a two-tier fetching strategy:
Tier 1: httpx (Fast)
- Uses standard HTTP client with 10s timeout
- Follows redirects automatically
- Validates response has meaningful content
- Most cost-effective approach
Tier 2: Bright Data Scraping Browser (Reliable)
- Triggered when httpx fails or returns empty content
- Executes JavaScript, handles anti-bot measures
- More expensive but handles difficult sites
- Usage tracked and reported at end of crawl
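The two tiers compose as shown below. This is a sketch of the control flow only: tier1_fetch and tier2_fetch are stand-ins for the real httpx and Bright Data calls, and the 200-character fallback threshold is an assumption.

```python
import asyncio

async def tier1_fetch(url: str) -> str:
    # Stand-in for the httpx request (10s timeout, redirects followed).
    return "<html>" + "page content " * 30 + "</html>"

async def tier2_fetch(url: str) -> str:
    # Stand-in for the Bright Data Scraping Browser request.
    return "<html>rendered with JavaScript</html>"

async def fetch_page(url: str, use_bright_data: bool = True):
    try:
        html = await tier1_fetch(url)  # Tier 1: fast and cheap
    except Exception:
        html = None
    # Fall back when the response is missing or suspiciously small
    # (assumed threshold; the real check is has_meaningful_content).
    if (html is None or len(html) < 200) and use_bright_data:
        html = await tier2_fetch(url)  # Tier 2: slower but more reliable
    return html
```

Trying the cheap client first keeps costs down; the expensive browser only runs for pages that actually need it.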
Content Validation
The has_meaningful_content() function checks for:
- Non-empty body content
- Absence of common blocking indicators
- Minimum content threshold
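A minimal sketch of those three checks. The marker list and the character threshold are assumptions for illustration, not the real implementation's values.

```python
# Assumed blocking indicators and threshold; tune to match real traffic.
BLOCK_INDICATORS = (
    "access denied",
    "captcha",
    "please enable javascript",
)
MIN_BODY_CHARS = 200  # assumed minimum-content threshold

def has_meaningful_content(html):
    if not html:
        return False  # empty body
    text = html.lower()
    if any(marker in text for marker in BLOCK_INDICATORS):
        return False  # looks like a block or challenge page
    return len(text) >= MIN_BODY_CHARS  # too short to be a real page
```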
Related Modules
- scout - URL normalization, link extraction, sitemap parsing
- text - Title/description/text extraction and snippet creation
- scraping_browser_client - Bright Data Scraping Browser integration
- state - Crawl state management (queue, visited set)
Notes
- All URLs are normalized before processing
- Visited tracking prevents duplicate crawls
- Queue implements BFS traversal
- Graceful degradation: continues even if some pages fail
- Warning logged if fewer pages found than requested