
System Architecture

The llms.txt Generator is a production-grade distributed system that automatically generates and maintains llms.txt files for websites. The architecture follows a microservices pattern with clear separation between the web frontend, API backend, and automated recrawl services.

Architecture Diagram

┌─────────────────┐
│   Next.js UI    │  (Vercel)
│   WebSocket     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  FastAPI API    │  (AWS ECS Fargate)
│  WebSocket      │  ← Cloudflare Tunnel
└────────┬────────┘
         │
    ┌────┴─────────────┐
    │                  │
    ▼                  ▼
┌──────────────┐  ┌──────────────┐
│   Crawler    │  │   Storage    │
│  Playwright  │  │   Supabase   │
│  Brightdata  │  │   R2 CDN     │
└──────────────┘  └──────────────┘

┌─────────┐
│ Lambda  │  (Scheduled Recrawls)
│ Cron    │  EventBridge: Every 6h
└────┬────┘
     │
     ▼  HTTP POST → FastAPI /internal/cron/recrawl

Core Components

Frontend (Next.js)

  • Hosting: Vercel
  • Framework: Next.js 15 with TypeScript
  • Communication: WebSocket API for real-time updates
  • UI: Tailwind CSS for responsive design

Backend (FastAPI)

  • Hosting: AWS ECS Fargate containers
  • Framework: Python 3.11 + FastAPI
  • Protocol: WebSocket for bidirectional communication
  • Load Balancing: Application Load Balancer (ALB)

Crawler Engine

  • Browser Automation: Playwright (Chromium)
  • JS-Heavy Sites: Brightdata proxy integration
  • Algorithm: BFS traversal with sitemap detection
  • Content Extraction: BeautifulSoup4 for HTML parsing
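The BFS traversal can be sketched in a few lines; `fetch_links` is a stand-in for the real Playwright-backed page fetch and link extraction, and the page cap mirrors the queue-based memory limits described below:

```python
# Minimal sketch of the BFS crawl order; fetch_links(url) is assumed to
# return the same-site links found on a page (network code omitted).
from collections import deque

def bfs_crawl(start_url, fetch_links, max_pages=100):
    seen, order = {start_url}, []
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # a real crawler would extract content here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

When a sitemap is detected, the queue can simply be seeded with the sitemap URLs instead of discovered links.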

Storage Layer

  • Database: Supabase (PostgreSQL)
  • Object Storage: Cloudflare R2
  • Public CDN: R2 public domain for llms.txt files
  • Persistence: Crawl metadata and scheduling data

Component Interactions

User-Initiated Crawl Flow

  1. User submits a website URL through the Next.js frontend
  2. WebSocket connection established with FastAPI backend
  3. Crawler engine performs BFS traversal or sitemap-based crawl
  4. Real-time logs streamed back to frontend via WebSocket
  5. Generated llms.txt saved to R2 storage
  6. Public URL returned to user for immediate access
  7. Metadata saved to Supabase for scheduled updates (if enabled)

Automated Recrawl Flow

  1. EventBridge rule triggers Lambda function every 6 hours
  2. Lambda queries Supabase for sites due for recrawl
  3. HTTP POST sent to FastAPI /internal/cron/recrawl endpoint
  4. Background task processes each site sequentially
  5. Updated llms.txt saved to R2, replacing previous version
  6. Metadata updated in Supabase with new timestamp

The system supports both on-demand and scheduled crawling, ensuring llms.txt files stay synchronized with website changes.
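The trigger side of this flow can be sketched as a small Lambda handler; only the endpoint path comes from the steps above, while the environment-variable names, internal auth header, and payload shape are assumptions:

```python
# Hedged sketch of the EventBridge-invoked Lambda that kicks off recrawls.
# Header name, env vars, and payload are illustrative assumptions.
import json
import os
import urllib.request

def build_recrawl_request(base_url: str, token: str) -> urllib.request.Request:
    return urllib.request.Request(
        base_url.rstrip("/") + "/internal/cron/recrawl",
        data=json.dumps({"source": "eventbridge"}).encode(),
        headers={"Content-Type": "application/json",
                 "X-Internal-Token": token},
        method="POST",
    )

def handler(event, context):
    req = build_recrawl_request(os.environ["API_BASE_URL"],
                                os.environ["INTERNAL_TOKEN"])
    with urllib.request.urlopen(req, timeout=30) as resp:
        return {"statusCode": resp.status}
```

Keeping the Lambda this thin pushes the actual recrawl work into the FastAPI background task, where the crawler and storage clients already live.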

Design Principles

Scalability

  • Containerized deployment on ECS Fargate allows horizontal scaling
  • Auto-scaling based on CPU/memory utilization
  • Stateless API design enables multiple backend instances
  • Queue-based crawling prevents memory overflow on large sites

Reliability

  • Health checks at ALB and ECS task levels
  • Automatic failover for unhealthy containers
  • Retry logic for failed page fetches (up to 3 attempts per page)
  • CloudWatch monitoring for logs and metrics

Security

  • API key authentication for WebSocket endpoints
  • JWT token support for time-limited access
  • CORS protection with configurable origins
  • TLS/SSL encryption for all traffic (HTTPS/WSS)
  • Secrets management via environment variables
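For illustration, a time-limited HS256 token can be issued and verified with the standard library alone; a real deployment would use a JWT library, and the 15-minute TTL here is an assumption:

```python
# Dependency-free sketch of a time-limited HS256 (JWT-style) token.
import base64
import hashlib
import hmac
import json
import time

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(secret: str, ttl_seconds: int = 900) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"exp": int(time.time()) + ttl_seconds}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str, secret: str) -> bool:
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return False  # malformed token
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(secret.encode(), signing_input,
                             hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False  # signature mismatch
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    return claims.get("exp", 0) > time.time()  # reject expired tokens
```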

Performance

  • Concurrent crawling with asyncio for I/O-bound operations
  • CDN delivery of generated llms.txt files via R2
  • Sitemap detection for faster discovery of pages
  • Brightdata integration for JavaScript-heavy sites
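The asyncio concurrency pattern might look like the sketch below, with `fetch_page` standing in for the real Playwright or HTTP fetch; the semaphore bound is an assumption:

```python
# Sketch of bounding crawl concurrency with asyncio; fetch_page is a
# placeholder for the real Playwright/HTTP fetch coroutine.
import asyncio

async def fetch_all(urls, fetch_page, max_concurrency=10):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:  # at most max_concurrency fetches in flight
            return url, await fetch_page(url)

    results = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(results)
```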

Next Steps

Technology Stack

Detailed breakdown of backend, frontend, and infrastructure technologies

Infrastructure

AWS infrastructure components and Terraform configuration

Data Flow

Step-by-step data flow from crawl request to llms.txt generation

Deployment Guide

Complete deployment instructions for production
