Overview

The llms.txt Generator uses an intelligent crawling system that automatically discovers and extracts content from websites. The crawler employs a breadth-first search (BFS) algorithm with sitemap support, configurable limits, and fallback mechanisms for JavaScript-heavy sites.

Key Features

Sitemap Discovery

Automatically detects and parses XML sitemaps for efficient crawling

BFS Traversal

Breadth-first search ensures comprehensive coverage of site structure

Smart Filtering

Skips irrelevant pages (admin, API, media files) automatically

Content Detection

Validates meaningful content and escalates to browser rendering when needed

How It Works

Step 1: Sitemap Check

The crawler first attempts to locate a sitemap at /sitemap.xml or /sitemap_index.xml:
backend/crawler/llm_crawler.py
async def _try_sitemap(self, client: httpx.AsyncClient) -> list[str]:
    for path in ['/sitemap.xml', '/sitemap_index.xml']:
        try:
            resp = await client.get(f"{self.state.base_url}{path}")
            if resp.status_code == 200:
                return parse_sitemap(resp.text, self.state.base_url)
        except httpx.HTTPError:
            # Request failed; try the next candidate path
            continue
    return []
If found, URLs are extracted and filtered to match the base domain.
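
The `parse_sitemap` helper itself is not shown above. A minimal sketch of what it might look like, assuming standard sitemap XML and a same-domain filter (the namespace handling and filtering logic here are assumptions, not the actual implementation):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def parse_sitemap(xml_text: str, base_url: str) -> list[str]:
    # Collect <loc> entries, keeping only URLs on the same host as base_url.
    # Sitemap files are namespaced, so match on the local tag name.
    root = ET.fromstring(xml_text)
    base_host = urlparse(base_url).netloc
    urls = []
    for element in root.iter():
        if element.tag.endswith('loc') and element.text:
            url = element.text.strip()
            if urlparse(url).netloc == base_host:
                urls.append(url)
    return urls
```

A real implementation would also recurse into `<sitemap>` entries when the file is a sitemap index.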
Step 2: Queue Initialization

URLs are loaded into a queue for processing:
backend/crawler/llm_crawler.py
sitemap_urls = await self._try_sitemap(client)
if sitemap_urls:
    await self.log(f"Using sitemap: found {len(sitemap_urls)} URLs")
    self.state.queue.clear()
    self.state.queue.append(self.state.base_url)
    for url in sitemap_urls[:max_attempts]:
        if url != self.state.base_url:
            self.state.queue.append(url)
else:
    await self.log("No sitemap found, using BFS crawl")
Step 3: BFS Crawling

Pages are visited in breadth-first order until limits are reached:
backend/crawler/llm_crawler.py
while (self.state.queue and
       len(pages) < self.state.max_pages and
       attempts < max_attempts):
    attempts += 1
    url = self.state.queue.popleft()

    if url in self.state.visited:
        continue
    self.state.visited.add(url)

    # Fetch and parse page
    html = await self._fetch_page(url)
    if not html:
        continue
    soup = BeautifulSoup(html, 'html.parser')

    # Extract content
    title = extract_title(soup)
    description = extract_description(soup)
    text = extract_text(soup)
    snippet = create_snippet(text, self.desc_length)

    pages.append(PageInfo(
        url=url,
        title=title,
        description=description,
        snippet=snippet
    ))

    # Discover new links
    links = extract_links(html, url)
    for link in links:
        if link not in self.state.visited:
            self.state.queue.append(link)
Step 4: Link Discovery

Each page is parsed to extract internal links:
backend/crawler/scout.py
def extract_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, 'html.parser')
    links = []

    for tag in soup.find_all('a', href=True):
        href = tag['href']
        absolute_url = urljoin(base_url, href)
        normalized = normalize_url(absolute_url)

        if same_domain(normalized, base_url) and not should_skip(normalized):
            links.append(normalized)

    return list(set(links))
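
The `normalize_url` and `same_domain` helpers used above are not shown. One plausible sketch, assuming normalization means lowercasing the host and stripping fragments and trailing slashes (these choices are assumptions, not the actual implementation):

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    # Hypothetical normalization: lowercase host, drop fragment,
    # strip trailing slash so /docs and /docs/ dedupe to one URL.
    p = urlparse(url)
    path = p.path.rstrip('/') or '/'
    return urlunparse((p.scheme, p.netloc.lower(), path, '', p.query, ''))

def same_domain(url: str, base_url: str) -> bool:
    # Compare hosts case-insensitively
    return urlparse(url).netloc.lower() == urlparse(base_url).netloc.lower()
```

Normalization matters here because `list(set(links))` only deduplicates exact string matches.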

Smart Filtering

The crawler automatically skips irrelevant URLs using pattern matching:
backend/crawler/scout.py
SKIP_PATTERNS = [
    '/logout', '/login', '/admin', '/api/', '/jobs/', '/sitemap',
    '/_metadata', '.pdf', '.zip', '.jpg', '.png', '.gif', '.xml',
    '/feed', '/rss', '.atom'
]

def should_skip(url: str) -> bool:
    parsed = urlparse(url)
    # Skip URLs with long query strings
    if parsed.query and len(parsed.query) > 50:
        return True
    return any(pattern in url for pattern in SKIP_PATTERNS)
Why filter? Skipping administrative pages, media files, and API endpoints ensures the crawler focuses on human-readable content that’s relevant for LLM indexing.
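
To see the filter in action, the pattern list and function above are repeated here so the snippet runs on its own:

```python
from urllib.parse import urlparse

SKIP_PATTERNS = [
    '/logout', '/login', '/admin', '/api/', '/jobs/', '/sitemap',
    '/_metadata', '.pdf', '.zip', '.jpg', '.png', '.gif', '.xml',
    '/feed', '/rss', '.atom'
]

def should_skip(url: str) -> bool:
    parsed = urlparse(url)
    # Skip URLs with long query strings
    if parsed.query and len(parsed.query) > 50:
        return True
    return any(pattern in url for pattern in SKIP_PATTERNS)

print(should_skip('https://example.com/admin/users'))   # True: matches '/admin'
print(should_skip('https://example.com/docs/intro'))    # False: ordinary page
```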

Content Detection

The crawler validates that fetched pages contain meaningful content:
backend/crawler/llm_crawler.py
html = None
try:
    await self.log("  → Trying httpx...")
    response = await client.get(url)
    response.raise_for_status()
    httpx_html = response.text

    if has_meaningful_content(httpx_html):
        await self.log("  ✓ httpx succeeded")
        html = httpx_html
    else:
        await self.log("  ✗ httpx returned empty/blocked content")
except Exception as httpx_error:
    await self.log(f"  ✗ httpx failed: {str(httpx_error)}")

# Escalate to Bright Data if needed
if html is None and self.brightdata_client is not None:
    await self.log("  → Escalating to Bright Data Scraping Browser...")
    html = await self.brightdata_client.fetch(url)
For JavaScript-heavy sites that return empty content, the crawler automatically escalates to Bright Data’s Scraping Browser with full browser rendering.
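
The `has_meaningful_content` check is not shown above. A rough tag-stripping heuristic along these lines would work; the regexes and the 200-character threshold are assumptions for illustration:

```python
import re

def has_meaningful_content(html: str, min_chars: int = 200) -> bool:
    # Hypothetical heuristic: remove script/style blocks and all tags,
    # then require a minimum amount of visible text. JS-rendered SPA
    # shells typically fail this check, triggering browser escalation.
    text = re.sub(r'(?is)<(script|style)[^>]*>.*?</\1>', ' ', html)
    text = re.sub(r'(?s)<[^>]+>', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return len(text) >= min_chars
```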

Configuration

Crawl behavior can be customized via WebSocket parameters:
maxPages (integer, default: 50)
Maximum number of pages to crawl before stopping

descLength (integer, default: 500)
Character limit for page descriptions/snippets

useBrightdata (boolean, default: true)
Enable Bright Data Scraping Browser for JavaScript-heavy sites

Example Usage

const ws = new WebSocket('wss://api.llmstxt.cloud/ws/crawl?token=YOUR_TOKEN');

ws.onopen = () => {
  ws.send(JSON.stringify({
    url: 'https://example.com',
    maxPages: 100,
    descLength: 500,
    useBrightdata: true
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  if (message.type === 'log') {
    console.log(message.content);
    // "Using sitemap: found 234 URLs"
    // "Visiting: https://example.com/docs"
    // "✓ httpx succeeded"
  }
};

Crawl State Management

The crawler maintains state to prevent duplicate visits:
backend/crawler/state.py
from dataclasses import dataclass, field
from collections import deque
from typing import Set, Deque

@dataclass
class CrawlState:
    base_url: str
    max_pages: int
    visited: Set[str] = field(default_factory=set)
    queue: Deque[str] = field(default_factory=deque)
  • visited: Set of already-crawled URLs (prevents loops)
  • queue: Deque for BFS ordering (FIFO)
  • base_url: Root domain for same-domain filtering
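
A minimal interaction with this state object; the dataclass is repeated here so the snippet runs standalone:

```python
from dataclasses import dataclass, field
from collections import deque
from typing import Set, Deque

@dataclass
class CrawlState:
    base_url: str
    max_pages: int
    visited: Set[str] = field(default_factory=set)
    queue: Deque[str] = field(default_factory=deque)

# Seed the queue, then pop and mark in BFS (FIFO) order
state = CrawlState(base_url="https://example.com", max_pages=50)
state.queue.append(state.base_url)
state.queue.append("https://example.com/docs")

url = state.queue.popleft()   # the root comes out first
state.visited.add(url)        # never revisit it
```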

Performance Characteristics

Sitemap Mode

  • Speed: Fast (parallel sitemap parsing)
  • Coverage: Complete (all sitemap URLs)
  • Best for: Large sites with comprehensive sitemaps

BFS Mode

  • Speed: Moderate (sequential page crawling)
  • Coverage: Limited by max_pages
  • Best for: Small sites or those without sitemaps

Best Practices

  • Start with 50 pages for small sites; increase to 200-500 for larger documentation sites. Very high limits (>1000) may time out.
  • If a site has a sitemap, the crawler finds it automatically. For sitemap indexes, all child sitemaps are parsed recursively.
  • Single-page applications (React, Vue, Angular) require JavaScript rendering. Set useBrightdata: true for these sites.
  • Watch for "✗ httpx returned empty/blocked content" messages: these indicate pages that need browser rendering.

Troubleshooting

Cause: Site may block the user agent or require JavaScript rendering.
Solution: Enable useBrightdata: true to use browser-based crawling.

Cause: Each page takes time to fetch and parse. Very high page counts may exceed the WebSocket timeout.
Solution: Reduce maxPages or increase the timeout in your WebSocket client.

Cause: BFS discovers links from navigation/footer that aren't relevant.
Solution: The skip patterns filter most issues. For additional filtering, consider adjusting SKIP_PATTERNS in crawler/scout.py.

Next Steps

Real-time Streaming

Learn how crawl progress is streamed live via WebSocket

LLM Enhancement

Enhance crawled content with AI-powered optimization
