Overview
The llms.txt Generator provides real-time feedback during crawling through WebSocket connections. Every step of the crawling process is streamed to the client, allowing users to monitor progress, debug issues, and receive results as they’re generated.
Why WebSockets?
Traditional HTTP requests are synchronous: you send a request and wait for the complete response. Crawling a website can take 30-120 seconds depending on its size, which makes a single blocking request a poor user experience.
HTTP Request

- Single request/response
- No progress updates
- Long wait time
- No visibility into errors

WebSocket Stream

- Persistent bidirectional connection
- Real-time log streaming
- Immediate feedback
- Live error reporting
WebSocket Endpoint
The API exposes a single WebSocket endpoint for crawling:
```python
@app.websocket("/ws/crawl")
async def websocket_crawl(websocket: WebSocket):
    # Authentication
    token = websocket.query_params.get("token")
    if token:
        if not validate_token(token):
            await websocket.close(code=1008, reason="Invalid or expired token")
            return
    elif settings.api_key:
        api_key = websocket.query_params.get("api_key")
        if api_key != settings.api_key:
            await websocket.close(code=1008, reason="Unauthorized")
            return

    await websocket.accept()

    try:
        # Receive crawl request
        data = await websocket.receive_text()
        payload = json.loads(data)
        url = str(payload['url'])
        max_pages = payload.get('maxPages', 50)
        desc_length = payload.get('descLength', 500)
        use_brightdata = payload.get('useBrightdata', settings.brightdata_enabled)

        # Log callback for real-time streaming
        async def log(message: str):
            await websocket.send_json({"type": "log", "content": message})

        # Start crawling with live logging
        crawler = LLMCrawler(
            url, max_pages, desc_length, log,
            brightdata_api_key=settings.brightdata_api_key,
            brightdata_enabled=use_brightdata,
            brightdata_zone=settings.brightdata_zone,
            brightdata_password=settings.brightdata_password
        )
        pages = await crawler.run()
        llms_txt = generate_llms_txt(pages)  # generation step (helper name illustrative)

        # Stream result
        await websocket.send_json({"type": "result", "content": llms_txt})

        # Stream hosted URL
        hosted_url = await save_llms_txt(url, llms_txt, log)
        if hosted_url:
            await websocket.send_json({"type": "url", "content": hosted_url})
    except WebSocketDisconnect:
        pass
    except Exception as e:
        await websocket.send_json({"type": "error", "content": str(e)})
    finally:
        await websocket.close()
```
The log callback is passed directly to the crawler, enabling real-time message streaming at every step of the process.
Message Types
The WebSocket sends JSON messages with different types:
Log Messages
Progress updates and status information:
```json
{
  "type": "log",
  "content": "Using sitemap: found 143 URLs"
}
```

```json
{
  "type": "log",
  "content": "Visiting: https://example.com/docs/api"
}
```

```json
{
  "type": "log",
  "content": "  → Trying httpx..."
}
```

```json
{
  "type": "log",
  "content": "  ✓ httpx succeeded"
}
```
Result Message
The complete llms.txt content:
```json
{
  "type": "result",
  "content": "# Example.com\n\n## API Documentation\n\nhttps://example.com/docs/api\n\n> Complete API reference..."
}
```
URL Message
The hosted CDN URL for the generated file:
```json
{
  "type": "url",
  "content": "https://pub-abc123.r2.dev/example.com/llms.txt"
}
```
Error Message
Any errors encountered during crawling:
```json
{
  "type": "error",
  "content": "Failed to fetch content from https://example.com/broken"
}
```
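Because every frame carries a `type` field, a client can route incoming frames with a small dispatcher. A sketch in Python; the `state` dictionary keys are illustrative:

```python
import json


def dispatch(frame: str, state: dict) -> dict:
    """Route one WebSocket frame by its `type` field into client state."""
    message = json.loads(frame)
    kind, content = message["type"], message["content"]
    if kind == "log":
        state.setdefault("logs", []).append(content)
    elif kind == "result":
        state["result"] = content   # complete llms.txt text
    elif kind == "url":
        state["url"] = content      # hosted CDN URL
    elif kind == "error":
        state["error"] = content
    else:
        raise ValueError(f"unknown message type: {kind!r}")
    return state
```

Failing loudly on an unknown `type` makes protocol additions visible during development; a production client might prefer to ignore unrecognized frames instead.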
Authentication
WebSocket connections require authentication via query parameters:
JWT Token (Recommended)

Obtain a short-lived token from the /auth/token endpoint:

```bash
curl -X POST https://api.llmstxt.cloud/auth/token \
  -H "X-API-Key: your_api_key"
```

Response:

```json
{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expires_in": 300
}
```

Use the token in the WebSocket connection:

```javascript
const ws = new WebSocket(
  'wss://api.llmstxt.cloud/ws/crawl?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
);
```

Direct API Key

Pass the API key directly in query parameters:

```javascript
const ws = new WebSocket(
  'wss://api.llmstxt.cloud/ws/crawl?api_key=your_api_key'
);
```

Only use this method in server-side code. Never expose API keys in client-side JavaScript.
Client Implementation
React Hook
```javascript
import { useEffect, useRef, useState } from 'react';

export function useCrawlWebSocket() {
  const [logs, setLogs] = useState([]);
  const [result, setResult] = useState(null);
  const [url, setUrl] = useState(null);
  const [error, setError] = useState(null);
  const [isConnected, setIsConnected] = useState(false);
  const wsRef = useRef(null);

  const connect = async (crawlConfig) => {
    // Get JWT token
    const tokenRes = await fetch('/api/token', {
      method: 'POST',
      headers: { 'X-API-Key': process.env.API_KEY }
    });
    const { token } = await tokenRes.json();

    // Connect to WebSocket
    const ws = new WebSocket(
      `wss://api.llmstxt.cloud/ws/crawl?token=${token}`
    );

    ws.onopen = () => {
      setIsConnected(true);
      ws.send(JSON.stringify(crawlConfig));
    };

    ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      switch (message.type) {
        case 'log':
          setLogs(prev => [...prev, message.content]);
          break;
        case 'result':
          setResult(message.content);
          break;
        case 'url':
          setUrl(message.content);
          break;
        case 'error':
          setError(message.content);
          break;
      }
    };

    ws.onerror = (error) => {
      setError('WebSocket error occurred');
      console.error(error);
    };

    ws.onclose = () => {
      setIsConnected(false);
    };

    wsRef.current = ws;
  };

  const disconnect = () => {
    if (wsRef.current) {
      wsRef.current.close();
      wsRef.current = null;
    }
  };

  return { logs, result, url, error, isConnected, connect, disconnect };
}
```
Log Streaming in Action
Here’s what a typical crawl log stream looks like:
```
[LOG] Using sitemap: found 143 URLs
[LOG] Visiting: https://example.com
[LOG]   → Trying httpx...
[LOG]   ✓ httpx succeeded
[LOG] Visiting: https://example.com/docs
[LOG]   → Trying httpx...
[LOG]   ✓ httpx succeeded
[LOG] Visiting: https://example.com/api
[LOG]   → Trying httpx...
[LOG]   ✗ httpx returned empty/blocked content
[LOG]   → Escalating to Bright Data Scraping Browser...
[LOG]   ✓ Scraping Browser succeeded
[LOG] Visiting: https://example.com/guides
[LOG]   → Trying httpx...
[LOG]   ✓ httpx succeeded
[LOG] Crawl complete: 50 pages
[LOG] Checking for .md versions of pages...
[LOG] Found 12 pages with .md versions
[LOG] Uploading to R2...
[LOG] Upload complete: https://pub-abc123.r2.dev/example.com/llms.txt
[RESULT] # Example.com\n\n...
[URL] https://pub-abc123.r2.dev/example.com/llms.txt
```
Error Handling
WebSocket errors should be handled gracefully:
```javascript
ws.onerror = (error) => {
  console.error('WebSocket error:', error);
  // Show user-friendly error message
};

ws.onclose = (event) => {
  if (event.code === 1008) {
    // Authentication failure
    console.error('Authentication failed:', event.reason);
  } else if (event.code !== 1000) {
    // Abnormal closure
    console.error('Connection closed unexpectedly:', event.code);
  }
};
```
WebSocket connections have a timeout. If the crawl takes longer than the configured timeout (typically 5-10 minutes), the connection will be closed. Consider implementing reconnection logic for long-running crawls.
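One way to implement that reconnection logic is a capped exponential backoff schedule. A sketch; the base delay and cap are illustrative values, not configured anywhere in the API:

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays (seconds) to wait before each reconnection attempt.

    Doubles the delay on every attempt (base * 2^n) and caps it so a
    long outage never produces multi-minute waits between retries.
    """
    return [min(base * (2 ** n), cap) for n in range(attempts)]
```

A client would sleep for each delay in turn before retrying the connection, resetting the schedule once a connection succeeds. Adding random jitter to each delay avoids reconnection stampedes when many clients drop at once.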
Connection Lifecycle
1. Authentication: The client obtains a JWT token or uses an API key directly.
2. Connection: The WebSocket connection is established with authentication in query params.
3. Request: The client sends a JSON payload with the crawl configuration.
4. Streaming: The server streams log messages, progress updates, and status information.
5. Result: The complete llms.txt content is sent when crawling completes.
6. URL: The hosted CDN URL is sent after a successful upload.
7. Closure: The connection closes gracefully after all data is transmitted.
Best Practices
- Use JWT tokens in production: JWT tokens expire after 5 minutes, limiting the damage if one is intercepted. Generate a fresh token for each crawl session.
- Buffer log messages: If logs arrive faster than they can be displayed, buffer them and render in batches to avoid UI performance issues.
- Reconnect with backoff: For long-running crawls, implement exponential backoff reconnection logic to handle temporary network issues.
- Handle partial results: If the connection drops before completion, you may receive partial results. Store intermediate data and allow resume functionality.
Next Steps
- Intelligent Crawling: Learn how the BFS crawler discovers and extracts content.
- Auto Updates: Set up scheduled recrawls to keep content fresh.