
Quick Start Guide

Get the llms.txt Generator running locally in under 5 minutes. This guide will walk you through setting up both the backend and frontend for development.

Prerequisites

Before you begin, ensure you have the following installed:

  • Python 3.11+ (required for the FastAPI backend)
  • Node.js 20+ (required for the Next.js frontend)
  • Git (for cloning the repository)
Docker is optional but recommended for simplified deployment. See the Docker Setup section below.

Installation

Step 1: Clone the Repository

Clone the project to your local machine:
git clone <your-repo-url>
cd llmstxt
Step 2: Backend Setup

Set up the Python environment and install dependencies:
cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
The requirements.txt includes:
fastapi
uvicorn[standard]
httpx
beautifulsoup4
pydantic
pydantic-settings
python-dotenv
boto3
pytest
pytest-asyncio
supabase==2.10.0
playwright
PyJWT[crypto]
Install Playwright browsers:
playwright install chromium
Step 3: Backend Environment Configuration

Create your backend environment file:
cp .env.example .env
Edit .env with your configuration:
# CORS Configuration
CORS_ORIGINS=http://localhost:3000

# Cloudflare R2 Storage (Required)
R2_ENDPOINT=https://your-account-id.r2.cloudflarestorage.com
R2_ACCESS_KEY=your-access-key
R2_SECRET_KEY=your-secret-key
R2_BUCKET=llms-txt
R2_PUBLIC_DOMAIN=https://your-public-domain.com

# Supabase Database (Required)
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key

# Cron Secret for scheduled updates
CRON_SECRET=your-secret-token

# Brightdata Proxy (Optional - for JS-heavy sites)
BRIGHTDATA_API_KEY=your-customer-id-here
BRIGHTDATA_ENABLED=true
BRIGHTDATA_ZONE=scraping_browser1
BRIGHTDATA_PASSWORD=your-zone-password-here
At minimum, you need to configure R2 storage and Supabase. The Brightdata proxy is optional and only needed for JavaScript-heavy websites.
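The backend loads these variables with pydantic-settings (listed in requirements.txt). Before starting the server, a quick sanity check can flag missing required settings. This is a minimal stdlib sketch, not project code: the variable names come from the .env example above, and `check_required_env` is an illustrative name.

```python
import os

# Required variables per the .env example above (Brightdata is optional)
REQUIRED_VARS = [
    "R2_ENDPOINT", "R2_ACCESS_KEY", "R2_SECRET_KEY",
    "R2_BUCKET", "R2_PUBLIC_DOMAIN",
    "SUPABASE_URL", "SUPABASE_KEY",
]

def check_required_env(environ=os.environ):
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Run it against your shell environment (or a parsed .env) and refuse to start if the returned list is non-empty.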
Step 4: Frontend Setup

In a new terminal, navigate to the frontend directory:
cd frontend
npm install
Create the frontend environment file:
cp .env.example .env.local
Edit .env.local:
NEXT_PUBLIC_WS_URL=ws://localhost:8000/ws/crawl
Step 5: Start the Servers

Start both servers in separate terminals.

Terminal 1 (backend):
cd backend
source venv/bin/activate
uvicorn main:app --reload --port 8000

Terminal 2 (frontend):
cd frontend
npm run dev
You should see:
  • Backend: INFO: Uvicorn running on http://127.0.0.1:8000
  • Frontend: Ready on http://localhost:3000
Step 6: Access the Application

Open your browser and navigate to:
  • Frontend: http://localhost:3000
  • API docs (FastAPI Swagger UI): http://localhost:8000/docs

Generate Your First llms.txt

Now that everything is running, let’s generate your first llms.txt file:
Step 1: Open the Web Interface

Navigate to http://localhost:3000 in your browser.
Step 2: Enter a Website URL

Enter a website URL you want to crawl. For testing, try:
  • https://docs.python.org
  • https://fastapi.tiangolo.com
  • Your own documentation site
Step 3: Configure Crawl Parameters

Adjust the settings based on your needs:
  • Max Pages: Number of pages to crawl (default: 50)
  • Description Length: Character limit for page excerpts (default: 500)
  • Enable Auto-Update: Schedule periodic recrawls (optional)
  • Recrawl Interval: Minutes between updates (default: 360)
  • LLM Enhancement: AI-powered optimization (optional)
  • Use Brightdata: For JavaScript-heavy sites (optional)
Step 4: Start Crawling

Click “Generate llms.txt” and watch the real-time progress in the log window. You’ll see messages like:
Starting crawl of https://example.com
Crawling page 1/50...
Crawling page 2/50...
Found 25 pages
Checking for .md versions of pages...
Found 3 pages with .md versions
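The page-count lines follow a regular pattern, so a client could derive a progress bar from them. A minimal sketch, assuming the log format shown above; `parse_progress` is a hypothetical helper, not part of the project:

```python
import re

# Matches progress lines such as "Crawling page 2/50..."
PROGRESS_RE = re.compile(r"Crawling page (\d+)/(\d+)")

def parse_progress(line):
    """Return completion as a float in [0, 1], or None for non-progress lines."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    current, total = int(m.group(1)), int(m.group(2))
    return current / total if total else None
```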
Step 5: Get Your Results

Once complete, you’ll receive:
  • Generated llms.txt content (viewable in browser)
  • Download button for the file
  • Public CDN URL for hosting
  • Copy button for quick sharing

Understanding the WebSocket API

The backend uses WebSockets for real-time communication. Here’s how the protocol works:

Connection

const ws = new WebSocket('ws://localhost:8000/ws/crawl?api_key=YOUR_KEY');

Send Request

{
  "url": "https://example.com",
  "maxPages": 50,
  "descLength": 500,
  "enableAutoUpdate": false,
  "recrawlIntervalMinutes": 360,
  "llmEnhance": false,
  "useBrightdata": false
}
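All fields except url are optional and fall back to the defaults shown above. A small helper can assemble the payload and catch typos in option names. This is an illustrative sketch, not part of the project API:

```python
# Documented defaults for the /ws/crawl request payload
CRAWL_DEFAULTS = {
    "maxPages": 50,
    "descLength": 500,
    "enableAutoUpdate": False,
    "recrawlIntervalMinutes": 360,
    "llmEnhance": False,
    "useBrightdata": False,
}

def build_crawl_request(url, **overrides):
    """Build the crawl request dict, rejecting unknown option names."""
    unknown = set(overrides) - set(CRAWL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    return {"url": url, **CRAWL_DEFAULTS, **overrides}
```

Serialize the result with json.dumps before sending it over the socket.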

Receive Messages

The server streams JSON messages, each tagged with a type: log (progress output), result (the generated llms.txt content), url (the hosted file URL), and error. For example, a log message:
{
  "type": "log",
  "content": "Crawling page 1/50..."
}
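A client can dispatch on the type field to accumulate log lines and capture the final outputs. A minimal sketch of such a dispatcher; `handle_message` is illustrative, not part of the project:

```python
import json

def handle_message(raw, state):
    """Dispatch one server message; mutates `state` and returns the type."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "log":
        state.setdefault("logs", []).append(msg["content"])
    elif kind == "result":
        state["llms_txt"] = msg["content"]
    elif kind == "url":
        state["hosted_url"] = msg["content"]
    elif kind == "error":
        state["error"] = msg["content"]
    return kind
```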

Implementation Example

Here’s the core WebSocket handler from the backend:
import json

from fastapi import WebSocket

@app.websocket("/ws/crawl")
async def websocket_crawl(websocket: WebSocket):
    # Validate API key
    api_key = websocket.query_params.get("api_key")
    if api_key != settings.api_key:
        await websocket.close(code=1008, reason="Unauthorized")
        return

    await websocket.accept()

    try:
        # Receive configuration
        data = await websocket.receive_text()
        payload = json.loads(data)

        url = str(payload['url'])
        max_pages = payload.get('maxPages', 50)
        desc_length = payload.get('descLength', 500)

        # Log function for real-time updates
        async def log(message: str):
            await websocket.send_json({"type": "log", "content": message})

        # Start crawling
        crawler = LLMCrawler(
            url,
            max_pages,
            desc_length,
            log,
            brightdata_enabled=payload.get('useBrightdata', False)
        )
        pages = await crawler.run()

        # Map page URLs to any .md variants found during the crawl
        # (simplified here; the real handler populates this from the .md check)
        md_url_map = {}

        # Format output
        llms_txt = format_llms_txt(url, pages, md_url_map)

        # Send result
        await websocket.send_json({"type": "result", "content": llms_txt})

        # Save to storage
        hosted_url = await save_llms_txt(url, llms_txt, log)
        await websocket.send_json({"type": "url", "content": hosted_url})

    except Exception as e:
        await websocket.send_json({"type": "error", "content": str(e)})
    finally:
        await websocket.close()

Docker Setup (Optional)

For a simpler setup, use Docker Compose:
Step 1: Configure Environment Files

Create .env files as described in steps 3-4 above.
Step 2: Start Services

docker-compose up -d
This starts both backend and frontend:
version: "3.9"

services:
  backend:
    build: ./backend
    ports:
      - "8000:8000"
    env_file:
      - ./backend/.env
    restart: unless-stopped

  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    env_file:
      - ./frontend/.env.local
    environment:
      - NEXT_PUBLIC_WS_URL=ws://localhost:8000/ws/crawl
    restart: unless-stopped
    depends_on:
      - backend
Step 3: Access the Application

Same URLs as the manual setup:
  • Frontend: http://localhost:3000
  • Backend: http://localhost:8000

Troubleshooting

Backend won’t start: make sure you’ve activated the virtual environment and installed dependencies:
cd backend
source venv/bin/activate
pip install -r requirements.txt

Playwright errors: install the Chromium browser:
playwright install chromium

Frontend can’t connect to the backend: verify that:
  1. The backend is running on port 8000
  2. CORS_ORIGINS includes your frontend URL
  3. The API key is configured (if required)
  4. The browser console shows no error messages

Uploads fail: ensure your R2 credentials are correct:
  • Endpoint URL format: https://<account-id>.r2.cloudflarestorage.com
  • Access key and secret key are valid
  • Bucket exists and is accessible
  • Public domain is configured correctly

WebSocket won’t connect: check NEXT_PUBLIC_WS_URL in .env.local:
  • It should be ws://localhost:8000/ws/crawl for local development
  • Use wss:// for production with HTTPS
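Rather than hard-coding the scheme, a client can derive the WebSocket URL from the site origin. A hypothetical helper (not part of the project) that maps http to ws and https to wss:

```python
from urllib.parse import urlsplit, urlunsplit

def websocket_url(http_origin, path="/ws/crawl"):
    """Derive the WebSocket URL from an http(s) origin: http -> ws, https -> wss."""
    parts = urlsplit(http_origin)
    scheme = {"http": "ws", "https": "wss"}[parts.scheme]
    return urlunsplit((scheme, parts.netloc, path, "", ""))
```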

Next Steps

Configuration Guide

Learn about all configuration options and environment variables

API Reference

Explore the full API documentation and endpoints

Deployment

Deploy to AWS with Terraform for production use

Architecture

Deep dive into system architecture and components
For production deployment, see the Deployment Guide which covers AWS ECS, Lambda, and infrastructure setup with Terraform.
