
Overview

The cron recrawl endpoint triggers automatic recrawls of all sites that are due for updates. This is an internal endpoint designed to be called by scheduled tasks (AWS Lambda, GitHub Actions, cron jobs, etc.) to maintain up-to-date llms.txt files for enrolled sites.
This endpoint runs recrawls in the background and returns immediately. It does not wait for crawls to complete.

Endpoint

POST /internal/cron/recrawl

Authentication

This endpoint requires authentication via the X-Cron-Secret header:
X-Cron-Secret
string
required
Secret token configured in the backend’s CRON_SECRET environment variable. This prevents unauthorized triggering of recrawls.

Request

No request body is required. Authentication is handled entirely through the header.

Example Request

cURL:

curl -X POST https://api.example.com/internal/cron/recrawl \
  -H "X-Cron-Secret: your-cron-secret-here"

JavaScript:

const response = await fetch('https://api.example.com/internal/cron/recrawl', {
  method: 'POST',
  headers: {
    'X-Cron-Secret': process.env.CRON_SECRET
  }
});

const data = await response.json();
console.log(data.message);

Python:

import httpx

response = httpx.post(
    "https://api.example.com/internal/cron/recrawl",
    headers={"X-Cron-Secret": "your-cron-secret"}
)

print(response.json())

Response

Success Response (200)

status
string
Always "triggered" when the recrawl background task is successfully queued.
message
string
Human-readable confirmation message.
{
  "status": "triggered",
  "message": "Recrawl started in background"
}
The endpoint returns immediately after queuing the background task. It does not wait for recrawls to complete or report their results.

Error Response (401)

Returned when the cron secret is missing, invalid, or doesn’t match the configured value.
{
  "detail": "Unauthorized"
}

How It Works

1. Site Selection

The recrawl process:
  1. Queries the crawl_sites table for sites where next_crawl_at <= NOW()
  2. Retrieves site configuration (max_pages, desc_length, recrawl_interval_minutes)
  3. Processes each site sequentially
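The selection step is equivalent to SQL like SELECT * FROM crawl_sites WHERE next_crawl_at <= NOW(). As a minimal in-memory sketch of the same filter (the helper name is hypothetical; the real implementation is in /backend/recrawl.py):

```python
from datetime import datetime, timezone

def select_due_sites(sites: list[dict]) -> list[dict]:
    """Return sites whose next_crawl_at has passed (next_crawl_at <= NOW())."""
    now = datetime.now(timezone.utc)
    return [
        s for s in sites
        if s.get("next_crawl_at") is not None and s["next_crawl_at"] <= now
    ]
```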

2. Change Detection

For each site, the system:
  1. Checks the sentinel URL (typically sitemap.xml) for changes
  2. Compares the hash of the new content with the stored latest_llms_hash
  3. Skips the crawl if the content hasn’t changed (an optimization)
  4. Runs a full recrawl if changes are detected or the sentinel is unavailable

3. Scheduling

After each check:
  • Content unchanged: Updates next_crawl_at based on recrawl_interval_minutes
  • Content changed: Regenerates llms.txt, uploads to R2, updates database
  • Adaptive scheduling: Adjusts interval based on change frequency (future feature)
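Updating next_crawl_at is a simple offset from the current time; a sketch (helper name hypothetical, default taken from the schema below):

```python
from datetime import datetime, timedelta, timezone

def next_crawl_time(interval_minutes: int = 10080) -> datetime:
    """Schedule the next check recrawl_interval_minutes from now
    (default 10080 minutes = 7 days)."""
    return datetime.now(timezone.utc) + timedelta(minutes=interval_minutes)
```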

Background Task

The recrawl logic is implemented in /backend/main.py:37-43 as a FastAPI background task:
async def run_recrawl_in_background():
    try:
        print("[RECRAWL] Starting background recrawl...")
        results = await recrawl_due_sites()
        print(f"[RECRAWL] Completed: {results}")
    except Exception as e:
        print(f"[RECRAWL] Error: {e}")
The actual recrawl implementation is in /backend/recrawl.py.

Scheduling Examples

AWS Lambda + EventBridge

Lambda Function (lambda_handler.py):
import httpx
import os

def lambda_handler(event, context):
    response = httpx.post(
        os.environ['API_URL'] + '/internal/cron/recrawl',
        headers={'X-Cron-Secret': os.environ['CRON_SECRET']},
        timeout=30
    )
    
    return {
        'statusCode': response.status_code,
        'body': response.text
    }
EventBridge Rule (Terraform):
resource "aws_cloudwatch_event_rule" "recrawl" {
  name                = "llmstxt-recrawl"
  description         = "Trigger recrawl every 6 hours"
  schedule_expression = "rate(6 hours)"
}

resource "aws_cloudwatch_event_target" "lambda" {
  rule      = aws_cloudwatch_event_rule.recrawl.name
  target_id = "RecrawlLambda"
  arn       = aws_lambda_function.recrawl.arn
}

GitHub Actions

name: Scheduled Recrawl

on:
  schedule:
    # Run every 6 hours
    - cron: '0 */6 * * *'
  workflow_dispatch: # Allow manual triggers

jobs:
  recrawl:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger Recrawl
        run: |
          curl -X POST ${{ secrets.API_URL }}/internal/cron/recrawl \
            -H "X-Cron-Secret: ${{ secrets.CRON_SECRET }}"

Vercel Cron

vercel.json:
{
  "crons": [{
    "path": "/api/trigger-recrawl",
    "schedule": "0 */6 * * *"
  }]
}
pages/api/trigger-recrawl.ts:
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse
) {
  // Verify Vercel cron secret
  if (req.headers.authorization !== `Bearer ${process.env.CRON_SECRET}`) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  const response = await fetch(
    `${process.env.BACKEND_URL}/internal/cron/recrawl`,
    {
      method: 'POST',
      headers: { 'X-Cron-Secret': process.env.CRON_SECRET! }
    }
  );

  const data = await response.json();
  res.status(response.status).json(data);
}

Traditional Cron

# /etc/crontab (note: the system crontab requires a user field,
# and cron does not support backslash line continuation)
CRON_SECRET=your-cron-secret-here
# Run every 6 hours
0 */6 * * * root curl -X POST https://api.example.com/internal/cron/recrawl -H "X-Cron-Secret: $CRON_SECRET"

Configuration

Environment Variables

# Backend .env
CRON_SECRET=your-secure-cron-secret-here

# Database connection (required for recrawls)
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_KEY=your-anon-key

# Storage (required for uploading updated llms.txt)
R2_ENDPOINT=https://xxx.r2.cloudflarestorage.com
R2_ACCESS_KEY=your-access-key
R2_SECRET_KEY=your-secret-key
R2_BUCKET=llms-txt
R2_PUBLIC_DOMAIN=https://pub-xxx.r2.dev
Generate a secure cron secret:
openssl rand -base64 32

Recrawl Intervals

When users enable auto-update via the WebSocket endpoint, they can specify:
  • Default: 10080 minutes (7 days)
  • Common values:
    • 360 minutes (6 hours)
    • 1440 minutes (1 day)
    • 10080 minutes (7 days)
The cron job should run more frequently than the shortest interval you want to support.
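To see why: a site only becomes eligible once next_crawl_at passes, and then may wait up to one full cron period before the next run picks it up. A rough upper bound on the gap between consecutive crawls of one site (helper name hypothetical):

```python
def worst_case_staleness(interval_minutes: int, cron_period_minutes: int) -> int:
    """Upper bound in minutes between consecutive crawls of a site:
    its recrawl interval plus up to one full cron period of waiting."""
    return interval_minutes + cron_period_minutes
```

For example, a 360-minute interval with a 6-hour cron can stretch to roughly 12 hours between crawls.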

Monitoring

Check logs to monitor recrawl status:
# AWS CloudWatch
aws logs tail /ecs/llmstxt-api --follow --filter-pattern "RECRAWL"

# Docker logs
docker logs -f llmstxt-backend | grep RECRAWL
Expected log output:
[RECRAWL] Starting background recrawl...
[RECRAWL] Completed: {'checked': 15, 'updated': 3, 'skipped': 12, 'errors': 0}

Database Schema

The endpoint relies on the crawl_sites table structure:
CREATE TABLE crawl_sites (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    base_url TEXT UNIQUE NOT NULL,
    recrawl_interval_minutes INTEGER DEFAULT 10080,
    max_pages INTEGER DEFAULT 50,
    desc_length INTEGER DEFAULT 500,
    last_crawled_at TIMESTAMP WITH TIME ZONE,
    next_crawl_at TIMESTAMP WITH TIME ZONE,
    latest_llms_hash TEXT,
    latest_llms_url TEXT,
    sentinel_url TEXT,
    sitemap_newest_lastmod TIMESTAMP WITH TIME ZONE,
    avg_change_interval_minutes FLOAT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
Key fields:
  • next_crawl_at: Determines if site is due for recrawl
  • latest_llms_hash: Used for change detection
  • sentinel_url: Quick check endpoint (usually sitemap.xml)

Error Codes

Status Code | Description         | Reason
200         | Success             | Recrawl task queued successfully
401         | Unauthorized        | Missing or invalid X-Cron-Secret header
503         | Service Unavailable | Database connection failed (rare)

Performance Considerations

  1. Background Processing: Returns immediately, doesn’t block
  2. Sequential Crawling: Processes sites one at a time to manage resources
  3. Smart Skipping: Avoids full crawls when content unchanged
  4. Timeout Handling: Long-running crawls may timeout; monitor logs

Best Practices

  1. Run frequently: Schedule every 1-6 hours to ensure timely updates
  2. Monitor logs: Set up alerts for recrawl errors
  3. Secure the secret: Use environment variables, never commit to git
  4. Idempotent calls: Safe to call multiple times; won’t duplicate work
  5. Database backups: Ensure Supabase backups are enabled

Related

  • WebSocket Crawl - Enable auto-update when generating llms.txt
  • Webhooks - Trigger an immediate recrawl for a specific site
