System Architecture
The llms.txt Generator is a production-grade distributed system that automatically generates and maintains llms.txt files for websites. The architecture follows a microservices pattern with clear separation between the web frontend, API backend, and automated recrawl services.
Architecture Diagram
Core Components
Frontend (Next.js)
- Hosting: Vercel
- Framework: Next.js 15 with TypeScript
- Communication: WebSocket API for real-time updates
- UI: Tailwind CSS for responsive design
Backend (FastAPI)
- Hosting: AWS ECS Fargate containers
- Framework: Python 3.11 + FastAPI
- Protocol: WebSocket for bidirectional communication
- Load Balancing: Application Load Balancer (ALB)
Crawler Engine
- Browser Automation: Playwright (Chromium)
- JS-Heavy Sites: Brightdata proxy integration
- Algorithm: BFS traversal with sitemap detection
- Content Extraction: BeautifulSoup4 for HTML parsing
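The BFS traversal named above can be sketched as follows. This is a minimal illustration, not the project's crawler: `bfs_crawl`, the `get_links` callback, and the `max_pages` cap are hypothetical names, and link extraction (Playwright plus BeautifulSoup4 in the real system) is replaced by a lookup in a small in-memory site so the example runs on its own.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def bfs_crawl(start_url: str,
              get_links: Callable[[str], Iterable[str]],
              max_pages: int = 50) -> list[str]:
    """Visit pages breadth-first, staying on the start URL's host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited: list[str] = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical site used only for this example.
FAKE_SITE = {
    "https://example.com/": ["/docs", "/blog", "https://other.test/"],
    "https://example.com/docs": ["/docs/api", "/"],
    "https://example.com/blog": ["/docs#intro"],
    "https://example.com/docs/api": [],
}
order = bfs_crawl("https://example.com/", lambda u: FAKE_SITE.get(u, []))
```

Pages closest to the start URL are visited first, which tends to surface a site's main sections before deep leaf pages.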
Storage Layer
- Database: Supabase (PostgreSQL)
- Object Storage: Cloudflare R2
- Public CDN: R2 public domain for llms.txt files
- Persistence: Crawl metadata and scheduling data
Component Interactions
User-Initiated Crawl Flow
- User submits a website URL through the Next.js frontend
- WebSocket connection established with FastAPI backend
- Crawler engine performs BFS traversal or sitemap-based crawl
- Real-time logs streamed back to frontend via WebSocket
- Generated llms.txt saved to R2 storage
- Public URL returned to user for immediate access
- Metadata saved to Supabase for scheduled updates (if enabled)
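For the sitemap-based branch of the crawl step, a sketch of reading page URLs from a sitemap might look like the following. It assumes a standard sitemaps.org `urlset` document; `urls_from_sitemap` is a hypothetical helper, and returning an empty list on malformed XML (so the caller can fall back to BFS) is an assumption about the fallback behavior, not something the flow above specifies.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry, or [] if the XML is invalid."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # Assumed behavior: caller falls back to BFS traversal.
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""
pages = urls_from_sitemap(SAMPLE)
```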
Automated Recrawl Flow
- EventBridge rule triggers Lambda function every 6 hours
- Lambda queries Supabase for sites due for recrawl
- HTTP POST sent to FastAPI /internal/cron/recrawl endpoint
- Background task processes each site sequentially
- Updated llms.txt saved to R2, replacing previous version
- Metadata updated in Supabase with new timestamp
The system supports both on-demand and scheduled crawling, ensuring llms.txt files stay synchronized with website changes.
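The selection step of the automated recrawl flow can be illustrated with a small sketch. The row fields (`url`, `last_crawled_at`, `interval_hours`) and the request body shape are assumptions made for this example; the actual Supabase schema and the payload expected by /internal/cron/recrawl are not described here.

```python
import json
from datetime import datetime, timedelta, timezone


def sites_due_for_recrawl(sites: list[dict], now: datetime) -> list[str]:
    """Return URLs whose last crawl is at least one interval old."""
    due = []
    for site in sites:
        next_run = site["last_crawled_at"] + timedelta(hours=site["interval_hours"])
        if next_run <= now:
            due.append(site["url"])
    return due


def build_recrawl_payload(urls: list[str]) -> bytes:
    """JSON body for the POST to the recrawl endpoint (shape is assumed)."""
    return json.dumps({"urls": urls}).encode("utf-8")


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [  # Stand-in for rows returned by the database query.
    {"url": "https://a.example", "last_crawled_at": now - timedelta(hours=25), "interval_hours": 24},
    {"url": "https://b.example", "last_crawled_at": now - timedelta(hours=3), "interval_hours": 24},
]
due = sites_due_for_recrawl(rows, now)
payload = build_recrawl_payload(due)
```

Keeping the due-date check as a pure function makes it easy to test without a database or a scheduled trigger.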
Design Principles
Scalability
- Containerized deployment on ECS Fargate allows horizontal scaling
- Auto-scaling based on CPU/memory utilization
- Stateless API design enables multiple backend instances
- Queue-based crawling prevents memory overflow on large sites
Reliability
- Health checks at ALB and ECS task levels
- Automatic failover for unhealthy containers
- Retry logic for failed page fetches (3x max pages attempted)
- CloudWatch monitoring for logs and metrics
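The retry policy is not spelled out in detail above, so the following is only one plausible shape for it: a wrapper that retries a failed fetch with exponential backoff. The name `fetch_with_retry`, the default of three attempts, and the backoff delays are all assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_with_retry(fetch: Callable[[str], Awaitable[str]],
                           url: str,
                           attempts: int = 3,
                           base_delay: float = 0.5) -> str:
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


calls = {"count": 0}


async def flaky_fetch(url: str) -> str:
    """Fake fetcher that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>{url}</html>"


result = asyncio.run(fetch_with_retry(flaky_fetch, "https://example.com/", base_delay=0))
```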
Security
- API key authentication for WebSocket endpoints
- JWT token support for time-limited access
- CORS protection with configurable origins
- TLS/SSL encryption for all traffic (HTTPS/WSS)
- Secrets management via environment variables
Performance
- Concurrent crawling with asyncio for I/O-bound operations
- CDN delivery of generated llms.txt files via R2
- Sitemap detection for faster discovery of pages
- Brightdata integration for JavaScript-heavy sites
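Concurrent crawling with asyncio is commonly bounded with a semaphore so that I/O overlaps without opening an unbounded number of connections. The sketch below assumes that pattern; `crawl_concurrently`, the `max_concurrency` limit, and the fake fetcher are illustrative and not taken from the codebase.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_concurrently(urls: list[str],
                             fetch: Callable[[str], Awaitable[int]],
                             max_concurrency: int = 5) -> dict[str, int]:
    """Fetch all URLs, with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> tuple[str, int]:
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(pairs)


stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url: str) -> int:
    """Stand-in for a network fetch; records how many calls overlap."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # Simulated network latency.
    stats["in_flight"] -= 1
    return len(url)


urls = [f"https://example.com/page/{i}" for i in range(10)]
results = asyncio.run(crawl_concurrently(urls, fake_fetch, max_concurrency=3))
```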
Next Steps
- Technology Stack: Detailed breakdown of backend, frontend, and infrastructure technologies
- Infrastructure: AWS infrastructure components and Terraform configuration
- Data Flow: Step-by-step data flow from crawl request to llms.txt generation
- Deployment Guide: Complete deployment instructions for production