The llms.txt Generator supports automatic updates to keep your generated files synchronized with website changes. When enabled, sites are enrolled in a scheduled recrawl system that monitors for changes and regenerates content automatically.
Auto-updates run on AWS Lambda triggered by EventBridge (CloudWatch Events) every 6 hours by default.
```sql
CREATE TABLE crawl_sites (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    base_url TEXT UNIQUE NOT NULL,
    max_pages INTEGER DEFAULT 50,
    desc_length INTEGER DEFAULT 500,
    recrawl_interval_minutes INTEGER DEFAULT 10080,

    -- Crawl results
    last_crawled_at TIMESTAMP WITH TIME ZONE,
    last_changed_at TIMESTAMP WITH TIME ZONE,
    latest_llms_hash TEXT,
    latest_llms_url TEXT,

    -- Change detection
    sentinel_url TEXT,
    sentinel_last_modified TIMESTAMP WITH TIME ZONE,
    avg_change_interval_minutes FLOAT,

    -- Scheduling
    next_crawl_at TIMESTAMP WITH TIME ZONE,

    -- Metadata
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_next_crawl ON crawl_sites (next_crawl_at)
    WHERE next_crawl_at IS NOT NULL;
```
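The partial index on next_crawl_at suggests that each scheduled run looks up the sites whose next_crawl_at has already passed. The following is a minimal sketch of that selection, not the project's actual code: the function name `select_due_sites` is hypothetical, and rows are filtered in memory as plain dictionaries, whereas a real run would more likely issue a query such as `WHERE next_crawl_at <= NOW()` against the table.

```python
from datetime import datetime, timedelta, timezone


def select_due_sites(rows: list[dict], now: datetime) -> list[dict]:
    """Return rows whose next_crawl_at is set and not in the future, oldest first."""
    due = [
        row for row in rows
        if row.get("next_crawl_at") is not None and row["next_crawl_at"] <= now
    ]
    return sorted(due, key=lambda row: row["next_crawl_at"])


# Illustrative data only; the URLs are placeholders.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"base_url": "https://a.example", "next_crawl_at": now - timedelta(hours=2)},
    {"base_url": "https://b.example", "next_crawl_at": now + timedelta(days=1)},
    {"base_url": "https://c.example", "next_crawl_at": None},  # not scheduled
    {"base_url": "https://d.example", "next_crawl_at": now - timedelta(days=3)},
]
due_urls = [row["base_url"] for row in select_due_sites(rows, now)]
```

Here only the two sites with a past next_crawl_at are selected, with the most overdue site first. Rows with a NULL next_crawl_at are never picked up, which matches the `WHERE next_crawl_at IS NOT NULL` condition on the index.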
The system includes scaffolding for adaptive scheduling based on observed change frequency:
backend/scheduling.py
```python
from datetime import datetime, timedelta

MIN_INTERVAL_MINUTES = 60      # 1 hour minimum
MAX_INTERVAL_MINUTES = 10080   # 7 days maximum
EWMA_ALPHA = 0.5               # Moderate adaptation rate


def compute_next_crawl(
    site_recrawl_interval: int,
    avg_change_interval: float | None,
    content_changed: bool,
    last_changed_at: datetime | None,
    now: datetime,
    adaptive_enabled: bool = False,
) -> tuple[datetime, float | None, datetime | None]:
    # Returns (next crawl time, average change interval, last changed time)
    if not adaptive_enabled:
        # Fixed schedule: always use the site's configured interval
        interval_minutes = site_recrawl_interval
        return (
            now + timedelta(minutes=interval_minutes),
            avg_change_interval,
            last_changed_at if not content_changed else now,
        )

    # Track average change interval using EWMA
    if avg_change_interval is None:
        avg_change_interval = float(site_recrawl_interval)

    if content_changed and last_changed_at:
        minutes_since_change = (now - last_changed_at).total_seconds() / 60
        avg_change_interval = (
            EWMA_ALPHA * minutes_since_change
            + (1 - EWMA_ALPHA) * avg_change_interval
        )

    # Use learned interval, clamped to min/max
    effective_interval = max(
        MIN_INTERVAL_MINUTES,
        min(MAX_INTERVAL_MINUTES, avg_change_interval),
    )

    return (
        now + timedelta(minutes=effective_interval),
        avg_change_interval,
        now if content_changed else last_changed_at,
    )
```
Adaptive scheduling is currently disabled (adaptive_enabled=False). When enabled, sites that change frequently are crawled more often, while stable sites are crawled less frequently.
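To make the adaptive behavior concrete, here is a small worked example of the EWMA update and clamping, using the same constant values as backend/scheduling.py. The helper names `ewma_step` and `clamp_interval` are introduced only for illustration, and the observed gaps are made-up numbers.

```python
EWMA_ALPHA = 0.5
MIN_INTERVAL_MINUTES = 60
MAX_INTERVAL_MINUTES = 10080


def ewma_step(avg: float, observed: float, alpha: float = EWMA_ALPHA) -> float:
    """Blend a newly observed gap between changes into the running average."""
    return alpha * observed + (1 - alpha) * avg


def clamp_interval(minutes: float) -> float:
    """Keep the crawl interval within the configured minimum and maximum."""
    return max(MIN_INTERVAL_MINUTES, min(MAX_INTERVAL_MINUTES, minutes))


avg = 10080.0  # seeded from the 7-day default interval
history = []
for observed_gap in (1440.0, 1440.0, 30.0):  # minutes between detected changes
    avg = ewma_step(avg, observed_gap)
    history.append((avg, clamp_interval(avg)))

# history == [(5760.0, 5760.0), (3600.0, 3600.0), (1815.0, 1815.0)]
```

With an alpha of 0.5, each observed change halves the distance between the current average and the new gap, so a site that starts on the weekly default and changes daily moves to a 4-day interval after one change and a 2.5-day interval after two. The clamp only takes effect at the extremes: an average of 20 minutes would be raised to 60, and one of 50,000 minutes would be capped at 10080.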
Check: next_crawl_at in database is in the past
Check: Lambda function is being triggered by EventBridge
Check: Site’s sitemap hasn’t indicated changes (review sentinel_last_modified)
All crawls failing
Check: Lambda has correct environment variables (API keys, secrets)
Check: Lambda has network access to external services (Supabase, R2, target sites)
Check: Lambda timeout is sufficient (recommend 300 seconds)
Content not updating despite changes
Cause: Sitemap <lastmod> not updated, or hash collision (extremely rare)
Solution: Manually trigger recrawl via webhook, or update next_crawl_at to force immediate crawl
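The two causes above correspond to two change signals in the schema: the sitemap timestamp (sentinel_last_modified) and the content hash (latest_llms_hash). The sketch below shows how such checks might behave and why a stale `<lastmod>` can hide a real change. It rests on assumptions: the hash algorithm is not stated here, so SHA-256 is used as a stand-in, and the function names are hypothetical.

```python
from __future__ import annotations

import hashlib
from datetime import datetime, timedelta, timezone


def content_hash(text: str) -> str:
    """Hash generated content (SHA-256 is an assumption, not confirmed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sentinel_changed(stored: datetime | None, observed: datetime | None) -> bool:
    """Treat a missing timestamp as 'changed'; otherwise require a newer lastmod."""
    if stored is None or observed is None:
        return True
    return observed > stored


def content_changed(latest_hash: str | None, new_text: str) -> bool:
    """Compare the stored hash with the hash of freshly generated content."""
    return latest_hash != content_hash(new_text)


stored = datetime(2025, 1, 1, tzinfo=timezone.utc)

# The page was edited, but the sitemap still reports the old lastmod,
# so a sentinel-only check sees nothing new.
stale_sitemap = sentinel_changed(stored, stored)

# Once lastmod moves forward, the change is picked up.
fresh_sitemap = sentinel_changed(stored, stored + timedelta(days=1))

# The hash check catches the edit regardless of the sitemap.
hash_differs = content_changed(content_hash("old body"), "new body")
```

In this sketch a site whose sitemap never updates `<lastmod>` would be skipped by the sentinel check even though its content hash differs, which is the situation where forcing a crawl becomes necessary.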
High costs
Cause: Too many enrolled sites, or intervals too short
Solution: Increase recrawl intervals, or remove inactive sites from database
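A rough way to see how site count and interval length drive crawl volume is to estimate the number of crawls per 30 days. This is a back-of-the-envelope sketch with a hypothetical function name. It treats every due crawl as a full crawl, so it is an upper bound if unchanged sites are skipped cheaply.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def estimated_crawls_per_30_days(intervals_minutes: list[int]) -> float:
    """Sum the expected number of crawls for each site's recrawl interval."""
    return sum(MINUTES_PER_30_DAYS / interval for interval in intervals_minutes)


# 100 sites on the 7-day default versus the same sites on a 6-hour interval.
weekly = estimated_crawls_per_30_days([10080] * 100)
six_hourly = estimated_crawls_per_30_days([360] * 100)
```

For 100 sites, moving from the weekly default to a 6-hour interval raises the estimate from roughly 429 crawls to 12,000 crawls per 30 days, a 28-fold increase.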