The crawler module provides the core crawling functionality for the llms.txt Generator. It handles webpage fetching, content extraction, and intelligent fallback from httpx to the Bright Data Scraping Browser.

Overview

The LLMCrawler class crawls websites starting from a base URL, extracting meaningful content from each page. It supports:
  • Sitemap-based crawling (tries /sitemap.xml and /sitemap_index.xml)
  • Breadth-first search (BFS) fallback when no sitemap exists
  • Dual-fetching strategy: httpx → Bright Data Scraping Browser
  • Content validation to detect blocked/empty responses
  • Configurable page limits and description lengths

Classes

PageInfo

Data class representing extracted information from a single page.
  • url (str, required): The full URL of the page
  • title (str, required): Extracted page title (from <title> or <h1>)
  • description (str, required): Page description (from meta tags or content)
  • snippet (str, required): Truncated text snippet from page content
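
The shape of PageInfo can be sketched as a plain dataclass (field names and types are taken from the table above; the actual class definition in the module may differ):

```python
from dataclasses import dataclass

@dataclass
class PageInfo:
    """Extracted information for a single crawled page."""
    url: str          # full URL of the page
    title: str        # from <title> or the first <h1>
    description: str  # from meta tags or page content
    snippet: str      # truncated text snippet

# Example instance:
page = PageInfo(
    url="https://example.com/docs",
    title="Docs",
    description="Project documentation",
    snippet="Welcome to the docs...",
)
```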

LLMCrawler

Main crawler class that orchestrates the crawling process.

Constructor

LLMCrawler(
    base_url: str,
    max_pages: int,
    desc_length: int,
    log_callback: Callable,
    brightdata_api_key: str | None = None,
    brightdata_enabled: bool = True,
    brightdata_zone: str = "scraping_browser1",
    brightdata_password: str | None = None
)
  • base_url (str, required): The starting URL to crawl (will be normalized)
  • max_pages (int, required): Maximum number of pages to crawl
  • desc_length (int, required): Maximum length for generated snippets
  • log_callback (Callable, required): Async function for logging progress (receives string messages)
  • brightdata_api_key (str | None, default None): API key for Bright Data Scraping Browser
  • brightdata_enabled (bool, default True): Whether to enable the Bright Data fallback
  • brightdata_zone (str, default "scraping_browser1"): Bright Data zone to use
  • brightdata_password (str | None, default None): Optional password for Bright Data authentication

Methods

run()
Executes the crawl and returns extracted page information.
async def run() -> list[PageInfo]
Returns:
  • pages (list[PageInfo]): List of successfully crawled pages with extracted content
Behavior:
  1. Attempts to load sitemap (/sitemap.xml or /sitemap_index.xml)
  2. If sitemap exists, uses those URLs; otherwise performs BFS crawl
  3. For each URL:
    • First tries httpx (fast)
    • Validates content for meaningful data
    • Falls back to Bright Data if needed
    • Extracts title, description, and text
    • Discovers and queues new links
  4. Stops when max_pages reached or queue exhausted
  5. Attempts up to max_pages * 3 URLs to handle failures
  6. Logs usage stats if Bright Data was used

Usage Examples

Basic Crawl

from backend.crawler import LLMCrawler

async def log(msg: str):
    print(msg)

# Create crawler
crawler = LLMCrawler(
    base_url="https://example.com",
    max_pages=20,
    desc_length=150,
    log_callback=log
)

# Run the crawl
pages = await crawler.run()

for page in pages:
    print(f"{page.title}: {page.url}")
    print(f"  {page.snippet}")

With Bright Data Fallback

crawler = LLMCrawler(
    base_url="https://example.com",
    max_pages=50,
    desc_length=200,
    log_callback=log,
    brightdata_api_key="your_api_key_here",
    brightdata_enabled=True
)

pages = await crawler.run()

# Check logs for Bright Data usage stats:
# "Scraping Browser usage: 5 requests, 4 successful (80%), ~$0.05 estimated cost"

Sitemap-First Crawl

# If site has /sitemap.xml, crawler will use it automatically
crawler = LLMCrawler(
    base_url="https://docs.example.com",
    max_pages=100,
    desc_length=250,
    log_callback=log
)

pages = await crawler.run()

# Logs will show:
# "Using sitemap: found 150 URLs"
# Then crawls up to max_pages from those URLs

Fetching Strategy

The crawler uses a two-tier fetching strategy:

Tier 1: httpx (Fast)

  • Uses standard HTTP client with 10s timeout
  • Follows redirects automatically
  • Validates response has meaningful content
  • Most cost-effective approach

Tier 2: Bright Data Scraping Browser (Reliable)

  • Triggered when httpx fails or returns empty content
  • Executes JavaScript, handles anti-bot measures
  • More expensive but handles difficult sites
  • Usage tracked and reported at end of crawl

Content Validation

The has_meaningful_content() function checks for:
  • Non-empty body content
  • Absence of common blocking indicators
  • Minimum content threshold

Related Modules

  • scout - URL normalization, link extraction, sitemap parsing
  • text - Title/description/text extraction and snippet creation
  • scraping_browser_client - Bright Data Scraping Browser integration
  • state - Crawl state management (queue, visited set)

Notes

  • All URLs are normalized before processing
  • Visited tracking prevents duplicate crawls
  • Queue implements BFS traversal
  • Graceful degradation: continues even if some pages fail
  • Warning logged if fewer pages found than requested
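
The normalization and visited-tracking behavior noted above can be sketched as follows (a simplified normalizer; the real logic lives in the scout module):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase scheme/host, drop the fragment, strip a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.query, ""))  # fragment removed

visited: set[str] = set()
for raw in ("https://Example.com/docs/", "https://example.com/docs#intro"):
    visited.add(normalize_url(raw))

# Both variants normalize to the same URL, so only one entry remains.
```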
