The llms.txt Generator exposes a WebSocket API that allows programmatic generation of llms.txt files. This guide covers authentication, message formats, and implementation patterns.
Overview
The API uses WebSockets for real-time bidirectional communication, allowing you to:
Send crawl requests with custom parameters
Receive real-time progress updates
Get the generated llms.txt content
Retrieve hosted CDN URLs
Authentication
The API supports two authentication methods:
JWT Token (Recommended)
API Key (Direct)
Generate a short-lived token from the /auth/token endpoint.
Request a token
curl -X POST https://your-backend.com/auth/token \
-H "X-API-Key: your-api-key"
Response:
{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expires_in": 300
}
Tokens are valid for 5 minutes (300 seconds).
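Because tokens expire after 300 seconds, long-running clients should refresh them before reuse. A minimal sketch of a token cache (the `fetch_token` callable and the 30-second refresh margin are assumptions; wire `fetch_token` to your own wrapper around the /auth/token request above):

```python
import time

# Hypothetical helper: caches the JWT returned by /auth/token and re-fetches
# it shortly before the 300-second expiry. `fetch_token` is any callable
# returning a (token, expires_in) pair.
class TokenCache:
    def __init__(self, fetch_token, refresh_margin=30):
        self._fetch = fetch_token
        self._margin = refresh_margin  # refresh this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Re-fetch when no token is cached or the cached one is about to expire.
        if self._token is None or time.time() >= self._expires_at - self._margin:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = time.time() + expires_in
        return self._token
```

Each call to `get()` returns a valid token, fetching a new one only when needed.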
Connect with token
const token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...";
const ws = new WebSocket(`wss://your-backend.com/ws/crawl?token=${token}`);
API Key (Direct)
Use your API key directly in the WebSocket query string:
const apiKey = "your-api-key";
const ws = new WebSocket(`wss://your-backend.com/ws/crawl?api_key=${apiKey}`);
This method exposes your API key in the connection URL. Use JWT tokens for production applications.
Setting Up API Key
Configure authentication in your backend .env file:
API_KEY=your-generated-api-key
Generate a secure API key (for example, with OpenSSL):
openssl rand -hex 32
WebSocket Endpoint
URL: wss://your-backend.com/ws/crawl
Query Parameters:
token (string, optional): JWT authentication token
api_key (string, optional): Direct API key authentication
One authentication method (token or api_key) is required whenever API_KEY is configured in the backend. If no API_KEY is configured, the endpoint accepts unauthenticated connections.
After establishing the WebSocket connection, send a JSON payload to initiate crawling:
Message Schema
url (string, required): The base URL of the website to crawl. Must include the protocol (http:// or https://). Example: "https://example.com"
maxPages (number, required): Maximum number of pages to crawl. Range: 1-200. Example: 50
descLength (number, required): Character limit for page description excerpts. Range: 100-2000. Example: 500
enableAutoUpdate (boolean, optional): Enable scheduled recrawls for this site. Requires Supabase configuration.
recrawlIntervalMinutes (number, optional): Minutes between scheduled recrawls (default: 10080, i.e. 7 days). Only used when enableAutoUpdate is true.
llmEnhance (boolean, optional): Enable LLM-powered content enhancement. Requires LLM_ENHANCEMENT_ENABLED=true in backend config.
useBrightdata (boolean, optional): Use Brightdata proxy for JavaScript rendering. Falls back to the backend's BRIGHTDATA_ENABLED setting if not specified.
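Validating a payload client-side before sending it avoids a round trip for obviously bad input. A sketch using the ranges listed above (the checks mirror the documented ranges; whether the server enforces exactly these is an assumption):

```python
# Client-side validation for the crawl payload, using the documented ranges.
# Field names match the message schema.
def validate_payload(payload: dict) -> dict:
    url = payload.get("url", "")
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must include the protocol (http:// or https://)")
    max_pages = payload.get("maxPages", 50)
    if not 1 <= max_pages <= 200:
        raise ValueError("maxPages must be between 1 and 200")
    desc_length = payload.get("descLength", 500)
    if not 100 <= desc_length <= 2000:
        raise ValueError("descLength must be between 100 and 2000")
    return payload
```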
Example Request
Basic Request
With Auto-Update
Full Configuration
{
  "url": "https://example.com",
  "maxPages": 50,
  "descLength": 500
}
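The "With Auto-Update" and "Full Configuration" variants follow the same schema; example payloads (field values are illustrative):

```json
{
  "url": "https://example.com",
  "maxPages": 50,
  "descLength": 500,
  "enableAutoUpdate": true,
  "recrawlIntervalMinutes": 10080
}
```

```json
{
  "url": "https://example.com",
  "maxPages": 100,
  "descLength": 1000,
  "enableAutoUpdate": true,
  "recrawlIntervalMinutes": 10080,
  "llmEnhance": true,
  "useBrightdata": true
}
```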
The server sends JSON messages with a type and content field:
Message Types
log: Progress updates and informational messages.
{
  "type": "log",
  "content": "Crawling page 5/50: API Documentation"
}
result: The complete generated llms.txt content.
{
  "type": "result",
  "content": "# Example Site\n\n> A comprehensive platform...\n\n## Documentation\n..."
}
url: The public CDN URL where the llms.txt file is hosted.
{
  "type": "url",
  "content": "https://pub-abc123.r2.dev/example-com-xyz789.txt"
}
error: Error messages when something goes wrong.
{
  "type": "error",
  "content": "Failed to fetch https://example.com: Connection timeout"
}
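The four message types can be handled with a small dispatcher. A minimal sketch that operates on already-parsed message dicts (no network; the RuntimeError on "error" is one reasonable policy, not the only one):

```python
# Collects server messages by type: accumulates logs, captures the final
# result text and hosted URL, and raises if an error message arrives.
def collect_messages(messages):
    logs, result, hosted_url = [], None, None
    for msg in messages:
        kind, content = msg.get("type"), msg.get("content")
        if kind == "log":
            logs.append(content)
        elif kind == "result":
            result = content
        elif kind == "url":
            hosted_url = content
        elif kind == "error":
            raise RuntimeError(f"Crawl error: {content}")
    return logs, result, hosted_url
```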
Implementation Examples
JavaScript/TypeScript
import { useState, useCallback, useRef } from 'react';

interface CrawlPayload {
  url: string;
  maxPages: number;
  descLength: number;
  enableAutoUpdate?: boolean;
  recrawlIntervalMinutes?: number;
  llmEnhance?: boolean;
  useBrightdata?: boolean;
}

export function useLLMSTxtGenerator() {
  const [logs, setLogs] = useState<string[]>([]);
  const [result, setResult] = useState<string>("");
  const [hostedUrl, setHostedUrl] = useState<string>("");
  const [isGenerating, setIsGenerating] = useState(false);
  const wsRef = useRef<WebSocket | null>(null);

  const generate = useCallback(async (payload: CrawlPayload) => {
    setLogs(["Connecting..."]);
    setResult("");
    setHostedUrl("");
    setIsGenerating(true);

    try {
      // Get JWT token
      const tokenRes = await fetch('/api/auth/token', { method: 'POST' });
      const { token } = await tokenRes.json();

      // Connect to WebSocket
      const ws = new WebSocket(
        `wss://your-backend.com/ws/crawl?token=${token}`
      );
      wsRef.current = ws;

      ws.onopen = () => {
        setLogs(prev => [...prev, `Starting crawl of ${payload.url}...`]);
        ws.send(JSON.stringify(payload));
      };

      ws.onmessage = (event) => {
        const data = JSON.parse(event.data);
        switch (data.type) {
          case "log":
            setLogs(prev => [...prev, data.content]);
            break;
          case "result":
            setResult(data.content);
            break;
          case "url":
            setHostedUrl(data.content);
            break;
          case "error":
            setLogs(prev => [...prev, `ERROR: ${data.content}`]);
            break;
        }
      };

      ws.onerror = () => {
        setLogs(prev => [...prev, "Connection error"]);
        setIsGenerating(false);
      };

      ws.onclose = () => {
        setIsGenerating(false);
      };
    } catch (error) {
      setLogs(prev => [...prev, `Error: ${error}`]);
      setIsGenerating(false);
    }
  }, []);

  const cancel = useCallback(() => {
    wsRef.current?.close();
    wsRef.current = null;
    setIsGenerating(false);
  }, []);

  return { logs, result, hostedUrl, isGenerating, generate, cancel };
}
Python
import asyncio
import json
import websockets
import httpx
from typing import Optional, Callable


class LLMSTxtGenerator:
    def __init__(self, backend_url: str, api_key: str):
        self.backend_url = backend_url
        self.api_key = api_key
        self.ws_url = backend_url.replace('https://', 'wss://').replace('http://', 'ws://')

    async def get_token(self) -> str:
        """Get JWT token for authentication."""
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.backend_url}/auth/token",
                headers={"X-API-Key": self.api_key}
            )
            response.raise_for_status()
            return response.json()["token"]

    async def generate(
        self,
        url: str,
        max_pages: int = 50,
        desc_length: int = 500,
        enable_auto_update: bool = False,
        recrawl_interval_minutes: int = 10080,
        llm_enhance: bool = False,
        use_brightdata: bool = True,
        on_log: Optional[Callable[[str], None]] = None
    ) -> tuple[str, Optional[str]]:
        """Generate llms.txt for a website.

        Returns:
            (result, hosted_url) tuple
        """
        token = await self.get_token()
        ws_url = f"{self.ws_url}/ws/crawl?token={token}"

        result = None
        hosted_url = None

        async with websockets.connect(ws_url) as websocket:
            # Send crawl request
            await websocket.send(json.dumps({
                "url": url,
                "maxPages": max_pages,
                "descLength": desc_length,
                "enableAutoUpdate": enable_auto_update,
                "recrawlIntervalMinutes": recrawl_interval_minutes,
                "llmEnhance": llm_enhance,
                "useBrightdata": use_brightdata
            }))

            # Receive messages
            async for message in websocket:
                data = json.loads(message)
                msg_type = data.get("type")
                content = data.get("content")

                if msg_type == "log":
                    if on_log:
                        on_log(content)
                    else:
                        print(f"[LOG] {content}")
                elif msg_type == "result":
                    result = content
                elif msg_type == "url":
                    hosted_url = content
                elif msg_type == "error":
                    raise Exception(f"Crawl error: {content}")

        return result, hosted_url


# Usage
async def main():
    generator = LLMSTxtGenerator(
        backend_url="https://your-backend.com",
        api_key="your-api-key"
    )

    result, hosted_url = await generator.generate(
        url="https://example.com",
        max_pages=50,
        desc_length=500,
        enable_auto_update=True
    )

    print("Generated llms.txt:")
    print(result)
    print(f"\nHosted at: {hosted_url}")


if __name__ == "__main__":
    asyncio.run(main())
Error Handling
Connection Errors
WebSocket Errors
Python Exception Handling
ws.onerror = (error) => {
  console.error('WebSocket error:', error);
  // Handle connection failures
};

ws.onclose = (event) => {
  if (event.code === 1008) {
    console.error('Authentication failed');
  } else if (event.code === 1006) {
    console.error('Connection closed abnormally');
  }
};
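On the Python side, the `websockets` library raises exceptions (such as `ConnectionClosedError`) rather than firing events, but the close codes carry the same meaning. A small helper mapping codes to messages (1008 and 1006 mirror the JavaScript handler above; the other descriptions follow RFC 6455, and using 1008 for failed auth is an assumption about the backend):

```python
# Maps WebSocket close codes to human-readable explanations.
# 1008 (policy violation) is assumed to indicate failed authentication;
# 1006 signals an abnormal closure with no close frame.
CLOSE_CODES = {
    1000: "Normal closure",
    1006: "Connection closed abnormally",
    1008: "Authentication failed (policy violation)",
    1011: "Server encountered an internal error",
}

def describe_close(code: int) -> str:
    return CLOSE_CODES.get(code, f"Closed with code {code}")
```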
Server-Side Errors
The server sends error messages with type: "error":
{
  "type": "error",
  "content": "Failed to fetch https://example.com: Connection timeout"
}
Common error messages:
"Failed to fetch <url>: Connection timeout" - Target site is unreachable
"Invalid URL format" - URL validation failed
"Max pages must be between 1 and 200" - Invalid parameter
"Crawl interrupted" - Unexpected crawl termination
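Transient failures such as connection timeouts are worth retrying with backoff, while validation errors are not. A generic sketch for wrapping any async generate call (the substring check for "timeout" is an assumption; adjust it to the error messages you actually observe):

```python
import asyncio

# Retries an async callable on exceptions whose message looks transient,
# backing off exponentially between attempts. Non-transient errors and the
# final failed attempt are re-raised.
async def generate_with_retry(generate, *args, attempts=3, base_delay=1.0, **kwargs):
    for attempt in range(attempts):
        try:
            return await generate(*args, **kwargs)
        except Exception as exc:
            transient = "timeout" in str(exc).lower()
            if not transient or attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```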
Rate Limiting
The API does not currently implement rate limiting at the application level. Consider implementing rate limiting in your client code or using a reverse proxy (CloudFlare, nginx) for production deployments.
Client-side rate limiting example:
class RateLimitedGenerator {
  constructor(maxConcurrent = 3) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.queue = [];
  }

  async generate(config) {
    // Wait for a free slot when at the concurrency limit
    if (this.active >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }
    this.active++;
    try {
      // actualGenerate is your real generation call (e.g. the hook above)
      return await actualGenerate(config);
    } finally {
      this.active--;
      // Wake the next queued caller, if any
      if (this.queue.length > 0) {
        this.queue.shift()();
      }
    }
  }
}
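The same pattern in Python falls out of `asyncio.Semaphore`; a sketch assuming an async `generate` coroutine like the one in the class above:

```python
import asyncio

# Limits concurrent generations to `max_concurrent`; extra calls wait until
# a slot frees up, mirroring the JavaScript queue-based limiter above.
class RateLimitedGenerator:
    def __init__(self, generate, max_concurrent=3):
        self._generate = generate
        self._semaphore = asyncio.Semaphore(max_concurrent)

    async def generate(self, *args, **kwargs):
        async with self._semaphore:
            return await self._generate(*args, **kwargs)
```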
Testing
Using wscat
Test the WebSocket API from the command line:
# Connect with API key
wscat -c "wss://your-backend.com/ws/crawl?api_key=your-key"
# Send crawl request (after connection)
{"url": "https://example.com", "maxPages": 10, "descLength": 300}
Health Check
Verify the backend is running:
curl https://your-backend.com/health
Expected response: HTTP 200 with a JSON status body.
Next Steps
Configuration: Learn about all environment variables and settings
Web Interface: Use the user-friendly web UI instead of the API
API Reference: View the complete API specification
Deployment: Deploy your own instance