Overview

The llms.txt Generator provides real-time feedback during crawling through WebSocket connections. Every step of the crawling process is streamed to the client, allowing users to monitor progress, debug issues, and receive results as they’re generated.

Why WebSockets?

Traditional HTTP requests are synchronous: you send a request and wait for the complete response. Crawling a website can take 30-120 seconds depending on its size, which makes a single blocking request a poor user experience.

HTTP Request

  • Single request/response
  • No progress updates
  • Long wait time
  • No visibility into errors

WebSocket Stream

  • Persistent bidirectional connection
  • Real-time log streaming
  • Immediate feedback
  • Live error reporting

WebSocket Endpoint

The API exposes a single WebSocket endpoint for crawling:
backend/main.py
@app.websocket("/ws/crawl")
async def websocket_crawl(websocket: WebSocket):
    # Authentication
    token = websocket.query_params.get("token")
    if token:
        if not validate_token(token):
            await websocket.close(code=1008, reason="Invalid or expired token")
            return
    elif settings.api_key:
        api_key = websocket.query_params.get("api_key")
        if api_key != settings.api_key:
            await websocket.close(code=1008, reason="Unauthorized")
            return

    await websocket.accept()

    try:
        # Receive crawl request
        data = await websocket.receive_text()
        payload = json.loads(data)

        url = str(payload['url'])
        max_pages = payload.get('maxPages', 50)
        desc_length = payload.get('descLength', 500)
        use_brightdata = payload.get('useBrightdata', settings.brightdata_enabled)

        # Log callback for real-time streaming
        async def log(message: str):
            await websocket.send_json({"type": "log", "content": message})

        # Start crawling with live logging
        crawler = LLMCrawler(
            url, max_pages, desc_length, log,
            brightdata_api_key=settings.brightdata_api_key,
            brightdata_enabled=use_brightdata,
            brightdata_zone=settings.brightdata_zone,
            brightdata_password=settings.brightdata_password
        )
        # run() streams logs via the callback and returns the generated llms.txt
        llms_txt = await crawler.run()

        # Stream result
        await websocket.send_json({"type": "result", "content": llms_txt})

        # Stream hosted URL
        hosted_url = await save_llms_txt(url, llms_txt, log)
        if hosted_url:
            await websocket.send_json({"type": "url", "content": hosted_url})

    except WebSocketDisconnect:
        pass
    except Exception as e:
        await websocket.send_json({"type": "error", "content": str(e)})
    finally:
        await websocket.close()
The log callback is passed directly to the crawler, enabling real-time message streaming at every step of the process.
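This callback-injection pattern can be shown with a minimal, self-contained sketch (a hypothetical stand-in for `LLMCrawler`, not its real implementation): the crawler only knows it was given an async `log` function, and the endpoint decides what that function does, here collecting messages into a list where the real endpoint calls `websocket.send_json`.

```python
import asyncio

async def demo_crawl(urls, log):
    # Hypothetical simplified crawl loop: report each page through the
    # injected callback, then emit a completion summary.
    for url in urls:
        await log(f"Visiting: {url}")
    await log(f"Crawl complete: {len(urls)} pages")

async def main():
    messages = []

    async def log(message: str):
        # The real endpoint does: await websocket.send_json({...})
        messages.append({"type": "log", "content": message})

    await demo_crawl(["https://example.com", "https://example.com/docs"], log)
    return messages

messages = asyncio.run(main())
```

Because the crawler never touches the WebSocket directly, the same crawl code works unchanged whether logs go to a socket, a file, or a test harness.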

Message Types

The WebSocket sends JSON messages with different types:

Log Messages

Progress updates and status information:
{
  "type": "log",
  "content": "Using sitemap: found 143 URLs"
}
{
  "type": "log",
  "content": "Visiting: https://example.com/docs/api"
}
{
  "type": "log",
  "content": "  → Trying httpx..."
}
{
  "type": "log",
  "content": "  ✓ httpx succeeded"
}

Result Message

The complete llms.txt content:
{
  "type": "result",
  "content": "# Example.com\n\n## API Documentation\n\nhttps://example.com/docs/api\n\n> Complete API reference..."
}

URL Message

The hosted CDN URL for the generated file:
{
  "type": "url",
  "content": "https://pub-abc123.r2.dev/example.com/llms.txt"
}

Error Message

Any errors encountered during crawling:
{
  "type": "error",
  "content": "Failed to fetch content from https://example.com/broken"
}

Authentication

WebSocket connections require authentication via query parameters:

Client Implementation

import { useEffect, useRef, useState } from 'react';

export function useCrawlWebSocket() {
  const [logs, setLogs] = useState([]);
  const [result, setResult] = useState(null);
  const [url, setUrl] = useState(null);
  const [error, setError] = useState(null);
  const [isConnected, setIsConnected] = useState(false);
  const wsRef = useRef(null);

  const connect = async (crawlConfig) => {
    // Get JWT token
    const tokenRes = await fetch('/api/token', {
      method: 'POST',
      headers: { 'X-API-Key': process.env.API_KEY }
    });
    const { token } = await tokenRes.json();

    // Connect to WebSocket
    const ws = new WebSocket(
      `wss://api.llmstxt.cloud/ws/crawl?token=${token}`
    );

    ws.onopen = () => {
      setIsConnected(true);
      ws.send(JSON.stringify(crawlConfig));
    };

    ws.onmessage = (event) => {
      const message = JSON.parse(event.data);

      switch (message.type) {
        case 'log':
          setLogs(prev => [...prev, message.content]);
          break;
        case 'result':
          setResult(message.content);
          break;
        case 'url':
          setUrl(message.content);
          break;
        case 'error':
          setError(message.content);
          break;
      }
    };

    ws.onerror = (error) => {
      setError('WebSocket error occurred');
      console.error(error);
    };

    ws.onclose = () => {
      setIsConnected(false);
    };

    wsRef.current = ws;
  };

  const disconnect = () => {
    if (wsRef.current) {
      wsRef.current.close();
      wsRef.current = null;
    }
  };

  return { logs, result, url, error, isConnected, connect, disconnect };
}

Log Streaming in Action

Here’s what a typical crawl log stream looks like:
[LOG] Using sitemap: found 143 URLs
[LOG] Visiting: https://example.com
[LOG]   → Trying httpx...
[LOG]   ✓ httpx succeeded
[LOG] Visiting: https://example.com/docs
[LOG]   → Trying httpx...
[LOG]   ✓ httpx succeeded
[LOG] Visiting: https://example.com/api
[LOG]   → Trying httpx...
[LOG]   ✗ httpx returned empty/blocked content
[LOG]   → Escalating to Bright Data Scraping Browser...
[LOG]   ✓ Scraping Browser succeeded
[LOG] Visiting: https://example.com/guides
[LOG]   → Trying httpx...
[LOG]   ✓ httpx succeeded
[LOG] Crawl complete: 50 pages
[LOG] Checking for .md versions of pages...
[LOG] Found 12 pages with .md versions
[LOG] Uploading to R2...
[LOG] Upload complete: https://pub-abc123.r2.dev/example.com/llms.txt
[RESULT] # Example.com\n\n...
[URL] https://pub-abc123.r2.dev/example.com/llms.txt

Error Handling

WebSocket errors should be handled gracefully:
ws.onerror = (error) => {
  console.error('WebSocket error:', error);
  // Show user-friendly error message
};

ws.onclose = (event) => {
  if (event.code === 1008) {
    // Authentication failure
    console.error('Authentication failed:', event.reason);
  } else if (event.code !== 1000) {
    // Abnormal closure
    console.error('Connection closed unexpectedly:', event.code);
  }
};
WebSocket connections have a timeout. If the crawl takes longer than the configured timeout (typically 5-10 minutes), the connection will be closed. Consider implementing reconnection logic for long-running crawls.
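A common way to schedule those reconnection attempts is a capped exponential backoff. The helper below is a sketch (deterministic for clarity; production clients usually add random jitter, and must also fetch a fresh JWT before reconnecting, since tokens expire):

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    # Delay before attempt n doubles each time, capped at `cap` seconds.
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]
```

For example, five retries yield waits of 1, 2, 4, 8, and 16 seconds, and later attempts plateau at the 30-second cap instead of growing unbounded.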

Connection Lifecycle

1. Authentication: The client obtains a JWT token or uses the API key directly.
2. Connection: The WebSocket connection is established with credentials passed as query parameters.
3. Request: The client sends a JSON payload with the crawl configuration.
4. Streaming: The server streams log messages, progress updates, and status information.
5. Result: The complete llms.txt content is sent when crawling completes.
6. URL: The hosted CDN URL is sent after a successful upload.
7. Closure: The connection closes gracefully after all data is transmitted.

Best Practices

  • Use short-lived tokens: JWT tokens expire after 5 minutes, limiting the damage if one is intercepted. Generate a fresh token for each crawl session.
  • Batch log rendering: if logs arrive faster than they can be displayed, buffer them and render in batches to avoid UI performance issues.
  • Reconnect with backoff: for long-running crawls, implement exponential-backoff reconnection logic to handle temporary network issues.
  • Handle partial results: if the connection drops before completion, you may receive partial results. Store intermediate data and support resuming.
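The batching advice above can be sketched as a small buffer that flushes every N messages, so the UI re-renders once per batch rather than once per log line (class and field names here are illustrative, not part of the generator):

```python
class LogBuffer:
    """Collect incoming log lines and release them in batches."""

    def __init__(self, batch_size: int = 10):
        self.batch_size = batch_size
        self.pending: list[str] = []
        self.flushed: list[list[str]] = []  # stands in for UI render calls

    def add(self, line: str) -> None:
        # Accumulate until a full batch is ready, then flush automatically.
        self.pending.append(line)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Emit whatever is pending, e.g. on a timer or at crawl completion.
        if self.pending:
            self.flushed.append(list(self.pending))
            self.pending.clear()
```

In a real client you would also flush on a short timer so the last partial batch is not held back until the crawl ends.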

Next Steps

Intelligent Crawling

Learn how the BFS crawler discovers and extracts content

Auto Updates

Set up scheduled recrawls to keep content fresh
