Overview
The cron recrawl endpoint triggers automatic recrawls of all sites that are due for updates. This is an internal endpoint designed to be called by scheduled tasks (AWS Lambda, GitHub Actions, cron jobs, etc.) to maintain up-to-date llms.txt files for enrolled sites.
This endpoint runs recrawls in the background and returns immediately. It does not wait for crawls to complete.
Endpoint
POST /internal/cron/recrawl
Authentication
This endpoint requires authentication via the X-Cron-Secret header:
Secret token configured in the backend’s CRON_SECRET environment variable. This prevents unauthorized triggering of recrawls.
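As a sketch, the server-side check might look like the following (a pure-Python illustration, not the actual backend code; only the header name and the CRON_SECRET variable come from this doc). Using a constant-time comparison avoids leaking information through response timing:

```python
import hmac

def verify_cron_secret(header_value, expected):
    """Check the X-Cron-Secret header value against the configured CRON_SECRET.

    hmac.compare_digest performs a constant-time comparison, so rejecting a
    wrong secret takes the same time regardless of how many characters match.
    """
    if not header_value or not expected:
        return False
    return hmac.compare_digest(header_value, expected)
```

A request whose header is missing or does not match would then get the 401 response shown below.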
Request
No request body is required. Authentication is handled entirely through the header.
Example Request
curl -X POST https://api.example.com/internal/cron/recrawl \
-H "X-Cron-Secret: your-cron-secret-here"
const response = await fetch('https://api.example.com/internal/cron/recrawl', {
  method: 'POST',
  headers: {
    'X-Cron-Secret': process.env.CRON_SECRET
  }
});

const data = await response.json();
console.log(data.message);
import httpx

response = httpx.post(
    "https://api.example.com/internal/cron/recrawl",
    headers={"X-Cron-Secret": "your-cron-secret"}
)
print(response.json())
Response
Success Response (200)
status: Always "triggered" when the recrawl background task is successfully queued.
message: Human-readable confirmation message.
{
"status": "triggered",
"message": "Recrawl started in background"
}
The endpoint returns immediately after queuing the background task. It does not wait for recrawls to complete or report their results.
Error Response (401)
Returned when the cron secret is missing, invalid, or doesn’t match the configured value.
{
"detail": "Unauthorized"
}
How It Works
1. Site Selection
The recrawl process:
- Queries the crawl_sites table for sites where next_crawl_at <= NOW()
- Retrieves site configuration (max_pages, desc_length, recrawl_interval_minutes)
- Processes each site sequentially
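The selection logic can be sketched in plain Python (illustrative only; the real query runs against the database, and the field names follow the crawl_sites schema shown later):

```python
from datetime import datetime, timezone

def select_due_sites(sites):
    """Return sites whose next_crawl_at has passed (next_crawl_at <= NOW())."""
    now = datetime.now(timezone.utc)
    return [
        s for s in sites
        if s.get("next_crawl_at") is not None and s["next_crawl_at"] <= now
    ]

# Example: one overdue site, one scheduled in the future, one never scheduled.
sites = [
    {"base_url": "https://a.example", "next_crawl_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"base_url": "https://b.example", "next_crawl_at": datetime(2099, 1, 1, tzinfo=timezone.utc)},
    {"base_url": "https://c.example", "next_crawl_at": None},
]
due = select_due_sites(sites)  # only https://a.example is due
```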
2. Change Detection
For each site, the system:
- Checks the sentinel URL (typically sitemap.xml) for changes
- Compares the hash of the new content with the stored latest_llms_hash
- Skips the crawl if the content hasn't changed (optimization)
- Runs a full recrawl if changes are detected or the sentinel is unavailable
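A minimal sketch of the decision (the choice of SHA-256 here is an assumption; the doc only states that a hash of the sentinel content is compared with latest_llms_hash):

```python
import hashlib

def needs_recrawl(sentinel_content, stored_hash):
    """Decide whether a full recrawl is needed for a site.

    sentinel_content: bytes fetched from the sentinel URL, or None if the
    fetch failed (sentinel unavailable, so recrawl to be safe).
    stored_hash: the site's latest_llms_hash, or None if never crawled.
    """
    if sentinel_content is None or stored_hash is None:
        return True
    new_hash = hashlib.sha256(sentinel_content).hexdigest()
    return new_hash != stored_hash
```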
3. Scheduling
After each check:
- Content unchanged: Updates next_crawl_at based on recrawl_interval_minutes
- Content changed: Regenerates llms.txt, uploads it to R2, and updates the database
- Adaptive scheduling: Adjusts the interval based on change frequency (future feature)
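The rescheduling arithmetic is straightforward; as a sketch:

```python
from datetime import datetime, timedelta, timezone

def schedule_next_crawl(recrawl_interval_minutes, now=None):
    """Compute the new next_crawl_at after a check completes."""
    now = now or datetime.now(timezone.utc)
    return now + timedelta(minutes=recrawl_interval_minutes)

# With the default weekly interval (10080 minutes), a site checked on
# Jan 1 is next due on Jan 8.
checked_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
next_at = schedule_next_crawl(10080, now=checked_at)
```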
Background Task
The recrawl logic is implemented in /backend/main.py:37-43 as a FastAPI background task:
async def run_recrawl_in_background():
    try:
        print("[RECRAWL] Starting background recrawl...")
        results = await recrawl_due_sites()
        print(f"[RECRAWL] Completed: {results}")
    except Exception as e:
        print(f"[RECRAWL] Error: {e}")
The actual recrawl implementation is in /backend/recrawl.py.
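The fire-and-forget behavior can be illustrated with plain asyncio (a self-contained sketch; recrawl_due_sites here is a stub, and in the real endpoint FastAPI's BackgroundTasks does the scheduling):

```python
import asyncio

log = []

async def recrawl_due_sites():
    # Stub standing in for the real implementation in /backend/recrawl.py.
    await asyncio.sleep(0)
    return {"checked": 0, "updated": 0, "skipped": 0, "errors": 0}

async def run_recrawl_in_background():
    log.append("[RECRAWL] Starting background recrawl...")
    results = await recrawl_due_sites()
    log.append(f"[RECRAWL] Completed: {results}")

async def handle_cron_recrawl():
    # Schedule the work and build the response without awaiting the task:
    # this is why the 200 body says nothing about crawl results.
    task = asyncio.create_task(run_recrawl_in_background())
    response = {"status": "triggered", "message": "Recrawl started in background"}
    # The real endpoint returns here; this demo waits so the background
    # task finishes before the event loop shuts down.
    await task
    return response

response = asyncio.run(handle_cron_recrawl())
```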
Scheduling Examples
AWS Lambda + EventBridge
Lambda Function (lambda_handler.py):
import httpx
import os
def lambda_handler(event, context):
    response = httpx.post(
        os.environ['API_URL'] + '/internal/cron/recrawl',
        headers={'X-Cron-Secret': os.environ['CRON_SECRET']},
        timeout=30
    )
    return {
        'statusCode': response.status_code,
        'body': response.text
    }
EventBridge Rule (Terraform):
resource "aws_cloudwatch_event_rule" "recrawl" {
  name                = "llmstxt-recrawl"
  description         = "Trigger recrawl every 6 hours"
  schedule_expression = "rate(6 hours)"
}

resource "aws_cloudwatch_event_target" "lambda" {
  rule      = aws_cloudwatch_event_rule.recrawl.name
  target_id = "RecrawlLambda"
  arn       = aws_lambda_function.recrawl.arn
}
GitHub Actions
name: Scheduled Recrawl

on:
  schedule:
    # Run every 6 hours
    - cron: '0 */6 * * *'
  workflow_dispatch: # Allow manual triggers

jobs:
  recrawl:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger Recrawl
        run: |
          curl -X POST ${{ secrets.API_URL }}/internal/cron/recrawl \
            -H "X-Cron-Secret: ${{ secrets.CRON_SECRET }}"
Vercel Cron
vercel.json:
{
  "crons": [{
    "path": "/api/trigger-recrawl",
    "schedule": "0 */6 * * *"
  }]
}
pages/api/trigger-recrawl.ts:
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse
) {
  // Verify Vercel cron secret
  if (req.headers.authorization !== `Bearer ${process.env.CRON_SECRET}`) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  const response = await fetch(
    `${process.env.BACKEND_URL}/internal/cron/recrawl`,
    {
      method: 'POST',
      headers: { 'X-Cron-Secret': process.env.CRON_SECRET! }
    }
  );

  const data = await response.json();
  res.status(response.status).json(data);
}
Traditional Cron
# /etc/crontab (system crontab: entries include a user field and must be a single line)
CRON_SECRET=your-cron-secret-here
# Run every 6 hours
0 */6 * * * root curl -X POST https://api.example.com/internal/cron/recrawl -H "X-Cron-Secret: $CRON_SECRET"
Configuration
Environment Variables
# Backend .env
CRON_SECRET=your-secure-cron-secret-here
# Database connection (required for recrawls)
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_KEY=your-anon-key
# Storage (required for uploading updated llms.txt)
R2_ENDPOINT=https://xxx.r2.cloudflarestorage.com
R2_ACCESS_KEY=your-access-key
R2_SECRET_KEY=your-secret-key
R2_BUCKET=llms-txt
R2_PUBLIC_DOMAIN=https://pub-xxx.r2.dev
Generate a secure cron secret, for example:
openssl rand -hex 32
Recrawl Intervals
When users enable auto-update via the WebSocket endpoint, they can specify:
- Default: 10080 minutes (7 days)
- Common values:
- 360 minutes (6 hours)
- 1440 minutes (1 day)
- 10080 minutes (7 days)
The cron job should run more frequently than the shortest interval you want to support.
Monitoring
Check logs to monitor recrawl status:
# AWS CloudWatch
aws logs tail /ecs/llmstxt-api --follow --filter-pattern "RECRAWL"
# Docker logs
docker logs -f llmstxt-backend | grep RECRAWL
Expected log output:
[RECRAWL] Starting background recrawl...
[RECRAWL] Completed: {'checked': 15, 'updated': 3, 'skipped': 12, 'errors': 0}
Database Schema
The endpoint relies on the crawl_sites table structure:
CREATE TABLE crawl_sites (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
base_url TEXT UNIQUE NOT NULL,
recrawl_interval_minutes INTEGER DEFAULT 10080,
max_pages INTEGER DEFAULT 50,
desc_length INTEGER DEFAULT 500,
last_crawled_at TIMESTAMP WITH TIME ZONE,
next_crawl_at TIMESTAMP WITH TIME ZONE,
latest_llms_hash TEXT,
latest_llms_url TEXT,
sentinel_url TEXT,
sitemap_newest_lastmod TIMESTAMP WITH TIME ZONE,
avg_change_interval_minutes FLOAT,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
Key fields:
next_crawl_at: Determines if site is due for recrawl
latest_llms_hash: Used for change detection
sentinel_url: Quick check endpoint (usually sitemap.xml)
Error Codes
| Status Code | Description | Reason |
|---|---|---|
| 200 | Success | Recrawl task queued successfully |
| 401 | Unauthorized | Missing or invalid X-Cron-Secret header |
| 503 | Service Unavailable | Database connection failed (rare) |
Performance Considerations
- Background Processing: Returns immediately, doesn't block the caller
- Sequential Crawling: Processes sites one at a time to manage resources
- Smart Skipping: Avoids full crawls when content unchanged
- Timeout Handling: Long-running crawls may timeout; monitor logs
Best Practices
- Run frequently: Schedule every 1-6 hours to ensure timely updates
- Monitor logs: Set up alerts for recrawl errors
- Secure the secret: Use environment variables, never commit to git
- Idempotent calls: Safe to call multiple times; won’t duplicate work
- Database backups: Ensure Supabase backups are enabled
Related
- WebSocket Crawl - Enable auto-update when generating llms.txt
- Webhooks - Trigger immediate recrawl for specific site