
Overview

The WebSocket endpoint provides real-time bidirectional communication for crawling websites and generating llms.txt files. It streams progress updates, logs, and results as the crawl happens.

Endpoint

WS /ws/crawl

Authentication

The endpoint supports two authentication methods.

Method 1: JWT Token

Obtain a short-lived JWT token from the /auth/token endpoint and pass it as a query parameter:
ws://your-backend.com/ws/crawl?token=YOUR_JWT_TOKEN

Method 2: API Key

Pass your API key directly as a query parameter:
ws://your-backend.com/ws/crawl?api_key=YOUR_API_KEY
API keys are long-lived credentials. For production use, prefer JWT tokens, which expire after 5 minutes.

Request Format

After connecting, send a JSON payload with the crawl configuration:
url (string, required)
The base URL of the website to crawl. Must be a valid HTTP/HTTPS URL. Example: "https://example.com"

maxPages (integer, default: 50)
Maximum number of pages to crawl. Used to prevent excessive crawling. Range: 1-1000

descLength (integer, default: 500)
Maximum length of description excerpts in characters. Truncated at semantic boundaries. Range: 100-2000

enableAutoUpdate (boolean, default: false)
Enable automatic periodic recrawls for this site. Stores site metadata in the database.

recrawlIntervalMinutes (integer, default: 10080)
Minutes between automatic recrawls (default: 7 days). Only used if enableAutoUpdate is true. Common values:
  • 360 (6 hours)
  • 1440 (1 day)
  • 10080 (1 week)

llmEnhance (boolean, default: false)
Use AI (Grok 4.1-Fast) to enhance and optimize the generated llms.txt content. Requires OPENROUTER_API_KEY and LLM_ENHANCEMENT_ENABLED=true.

useBrightdata (boolean, default: true)
Use Brightdata's Scraping Browser for JavaScript-heavy sites. Falls back to Playwright if unavailable.
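The exact boundary logic the server applies when truncating descriptions to descLength is not specified here. Purely as an illustration, a "semantic boundary" cut might prefer the last sentence end before the limit, then fall back to a word break (the `truncate_description` helper is hypothetical):

```python
def truncate_description(text: str, limit: int = 500) -> str:
    """Illustrative sketch: cut at the last sentence end (or word break) before `limit`."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Prefer a sentence boundary; fall back to a word boundary.
    for boundary in (". ", "! ", "? "):
        idx = cut.rfind(boundary)
        if idx != -1:
            return cut[: idx + 1]
    return cut.rsplit(" ", 1)[0]
```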

Example Request

{
  "url": "https://docs.example.com",
  "maxPages": 100,
  "descLength": 600,
  "enableAutoUpdate": true,
  "recrawlIntervalMinutes": 1440,
  "llmEnhance": false,
  "useBrightdata": true
}
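Before sending, the configuration can be checked client-side against the documented ranges. The `validate_config` helper below is a sketch, not part of any SDK, and the server may enforce additional checks:

```python
def validate_config(cfg: dict) -> list[str]:
    """Check a crawl request against the documented constraints; return a list of problems."""
    errors = []
    if not cfg.get("url", "").startswith(("http://", "https://")):
        errors.append("url must be a valid HTTP/HTTPS URL")
    if not 1 <= cfg.get("maxPages", 50) <= 1000:
        errors.append("maxPages must be between 1 and 1000")
    if not 100 <= cfg.get("descLength", 500) <= 2000:
        errors.append("descLength must be between 100 and 2000")
    return errors
```

Sending only configurations that return an empty error list avoids round-tripping requests the server would reject.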

Response Format

The server sends JSON messages with different types throughout the crawl process:

Log Messages

type (string): Always "log" for progress updates
content (string): Human-readable log message describing the current operation

{
  "type": "log",
  "content": "Crawling page 5/100..."
}

Result Message

type (string): Always "result" for the generated llms.txt content
content (string): The complete generated llms.txt file in Markdown format

{
  "type": "result",
  "content": "# Example Documentation\n\n> https://docs.example.com\n\n## Getting Started\n\n> https://docs.example.com/start\n\nLearn how to..."
}

URL Message

type (string): Always "url" for the hosted file URL
content (string): Public CDN URL where the llms.txt file is hosted (Cloudflare R2)

{
  "type": "url",
  "content": "https://pub-abc123.r2.dev/llms/example-com.txt"
}

Error Message

type (string): Always "error" for error conditions
content (string): Error description

{
  "type": "error",
  "content": "Failed to fetch page: Connection timeout"
}

Connection Flow

  1. Connect: Open WebSocket with authentication parameter
  2. Authenticate: Server validates token/API key
  3. Send Request: Client sends crawl configuration JSON
  4. Receive Logs: Server streams progress updates in real-time
  5. Receive Result: Server sends complete llms.txt content
  6. Receive URL: Server sends hosted file URL (if R2 configured)
  7. Close: Connection closes automatically after completion

Error Handling

Authentication Errors

Invalid Token
WebSocket closed with code 1008: Invalid or expired token
Missing API Key
WebSocket closed with code 1008: Unauthorized

Runtime Errors

Runtime errors are sent as JSON error messages before the connection closes:
{
  "type": "error",
  "content": "Invalid URL format"
}
Common error messages:
  • "Invalid URL format"
  • "Failed to fetch page: [details]"
  • "Crawl timeout exceeded"
  • "Maximum pages limit reached"
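Which of these errors are worth retrying is not specified. One reasonable guess, sketched below, is to treat transport and timeout failures as transient and validation errors as fatal (the `is_retryable` helper and its classification are assumptions, not API behavior):

```python
# Assumed classification: fetch/timeout failures are transient, the rest are fatal.
RETRYABLE_PREFIXES = ("Failed to fetch page:", "Crawl timeout exceeded")

def is_retryable(error_content: str) -> bool:
    """Return True if the error message looks transient enough to retry."""
    return error_content.startswith(RETRYABLE_PREFIXES)
```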

Example Implementation

JavaScript/TypeScript

const ws = new WebSocket('wss://api.example.com/ws/crawl?token=YOUR_TOKEN');

ws.onopen = () => {
  // Send crawl request
  ws.send(JSON.stringify({
    url: 'https://docs.example.com',
    maxPages: 50,
    descLength: 500,
    enableAutoUpdate: false
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  switch (message.type) {
    case 'log':
      console.log('Progress:', message.content);
      break;
    case 'result':
      console.log('Generated llms.txt:', message.content);
      break;
    case 'url':
      console.log('Hosted at:', message.content);
      break;
    case 'error':
      console.error('Error:', message.content);
      break;
  }
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = (event) => {
  console.log('Connection closed:', event.code, event.reason);
};

Python

import asyncio
import websockets
import json

async def crawl_site():
    uri = "wss://api.example.com/ws/crawl?token=YOUR_TOKEN"
    
    async with websockets.connect(uri) as websocket:
        # Send crawl request
        await websocket.send(json.dumps({
            "url": "https://docs.example.com",
            "maxPages": 50,
            "descLength": 500,
            "enableAutoUpdate": False
        }))
        
        # Receive messages
        async for message in websocket:
            data = json.loads(message)
            
            if data["type"] == "log":
                print(f"Progress: {data['content']}")
            elif data["type"] == "result":
                print(f"Generated: {data['content'][:100]}...")
            elif data["type"] == "url":
                print(f"Hosted at: {data['content']}")
            elif data["type"] == "error":
                print(f"Error: {data['content']}")

asyncio.run(crawl_site())

Rate Limits

  • No explicit rate limits on the WebSocket endpoint
  • Crawling is limited by maxPages parameter
  • Consider backend resource usage when setting high maxPages values
  • Use enableAutoUpdate to avoid manual repeated crawls

Best Practices

  1. Use JWT tokens instead of API keys for better security
  2. Set reasonable maxPages limits (50-100 for most sites)
  3. Enable auto-update for sites that change frequently
  4. Handle all message types in your client code
  5. Implement reconnection logic for production use
  6. Validate URLs before sending to prevent errors
  7. Use Brightdata (useBrightdata: true) for JavaScript-heavy sites
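Practice 5 (reconnection logic) can be sketched with capped exponential backoff plus jitter; the `backoff_delays` helper below is illustrative, and the base/cap values are arbitrary choices:

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delays (in seconds) before each reconnection attempt: exponential growth,
    capped at `cap`, with random jitter to avoid synchronized reconnect storms."""
    return [min(cap, base * 2 ** n) * random.uniform(0.5, 1.0) for n in range(attempts)]
```

A client would sleep for each delay in turn before re-opening the WebSocket, resetting the sequence after a successful crawl.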
