The formatter module converts crawled page data into the llms.txt markdown format. It handles URL cleaning, markdown file detection, content organization by sections, and automatic tagging.
## Overview
The formatter takes a list of PageInfo objects and generates a structured markdown file following the llms.txt specification:
- Site title and summary from homepage
- Pages organized by URL path sections
- Optional markdown file linking (`.md` variants)
- Automatic content tagging
- Primary/secondary content separation
## Core Functions

### format_llms_txt()

Generates the complete llms.txt formatted output.

```python
def format_llms_txt(
    base_url: str,
    pages: list[PageInfo],
    md_url_map: Dict[str, str] = None
) -> str
```
**Parameters:**

- `base_url` (`str`): The base URL of the crawled site
- `pages` (`list[PageInfo]`): List of crawled pages (the first page should be the homepage)
- `md_url_map` (`Dict[str, str]`, default `None`): Optional mapping from HTML URLs to markdown URLs

**Returns:** Formatted llms.txt content as a string
**Output Format:**

```markdown
# Site Title

> Site summary from homepage

## Section Name

- [Page Title](url): Description with tags
- [Another Page](url): Description

## Optional

- [Privacy Policy](url): Privacy information
- [Terms of Service](url): Legal terms
```
## URL Processing

### clean_url()

Removes query parameters and fragments from URLs.

```python
def clean_url(url: str) -> str
```

**Returns:** URL with only scheme, netloc, and path
**Example:**

```python
clean_url("https://example.com/docs?v=2#section")
# Returns: "https://example.com/docs"
```
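The module's actual implementation isn't shown here; a minimal sketch of this behavior using the standard library's `urllib.parse` might look like:

```python
from urllib.parse import urlsplit, urlunsplit

def clean_url(url: str) -> str:
    # Keep scheme, netloc, and path; drop the query string and fragment.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```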
### get_md_url()

Converts an HTML URL to its potential markdown equivalent.

```python
def get_md_url(url: str) -> str
```

**Returns:** Converted markdown URL path
**Conversion Rules:**

- `page.html` → `page.html.md`
- `/docs/` → `/docs/index.html.md`
- `/about` → `/about.md`
**Example:**

```python
get_md_url("https://example.com/docs/")
# Returns: "https://example.com/docs/index.html.md"

get_md_url("https://example.com/guide.html")
# Returns: "https://example.com/guide.html.md"
```
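The conversion rules above reduce to two cases, so a sketch of an implementation (an assumption, not the module's actual code) can be very small:

```python
def get_md_url(url: str) -> str:
    # Directory-style URLs get an index.html.md candidate;
    # everything else (with or without .html) just gets .md appended.
    if url.endswith("/"):
        return url + "index.html.md"
    return url + ".md"
```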
### check_md_exists()

Checks if a markdown version of a URL exists.

```python
async def check_md_exists(url: str, timeout: float = 5.0) -> bool
```

**Parameters:**

- `timeout` (`float`, default `5.0`): Request timeout in seconds

**Returns:** `True` if the markdown version returns HTTP 200
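As a sketch of the check, the HTTP call can be injected as a parameter so the real client (whatever HTTP library the module uses) or a stub can be supplied; the `head` parameter is an illustration-only assumption, since the documented signature takes just `url` and `timeout`:

```python
import asyncio
from typing import Awaitable, Callable

async def check_md_exists(
    url: str,
    head: Callable[[str, float], Awaitable[int]],
    timeout: float = 5.0,
) -> bool:
    # Treat network errors and timeouts as "no markdown version".
    try:
        return await head(url, timeout) == 200
    except Exception:
        return False
```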
### get_md_url_map()

Builds a mapping of HTML to markdown URLs for all pages.

```python
async def get_md_url_map(pages: list[PageInfo]) -> Dict[str, str]
```

**Returns:** Mapping from clean HTML URLs to markdown URLs (or the original URL if no `.md` variant is found)

**Behavior:**
- Sends HEAD requests concurrently for all pages
- Checks for markdown content types
- Falls back to original URL if no markdown exists
- Uses `asyncio.gather` for parallel requests
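The behavior above can be sketched with an injected existence check (the real function takes `PageInfo` objects and does its own HTTP requests; plain URL strings and the `md_exists` parameter are simplifications for illustration):

```python
import asyncio
from typing import Awaitable, Callable, Dict

def _md_candidate(url: str) -> str:
    # Same conversion rules as get_md_url (trailing slash -> index.html.md).
    return url + "index.html.md" if url.endswith("/") else url + ".md"

async def get_md_url_map(
    urls: list[str],
    md_exists: Callable[[str], Awaitable[bool]],
) -> Dict[str, str]:
    candidates = [_md_candidate(u) for u in urls]
    # Check all candidates concurrently.
    found = await asyncio.gather(*(md_exists(c) for c in candidates))
    # Fall back to the original URL when no markdown variant exists.
    return {u: (c if ok else u) for u, c, ok in zip(urls, candidates, found)}
```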
## Text Processing

### truncate()

Truncates text to the specified length, appending an ellipsis.

```python
def truncate(text: str, length: int = 150) -> str
```

**Returns:** Text truncated with "…" if it exceeds `length`; otherwise the original text
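A minimal sketch of this helper (whether the ellipsis counts toward the limit is an assumption not specified above):

```python
def truncate(text: str, length: int = 150) -> str:
    # Return text unchanged if it fits; otherwise cut and append an ellipsis.
    if len(text) <= length:
        return text
    return text[:length].rstrip() + "…"
```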
### get_site_title()

Extracts the site title from the homepage, with a fallback to the domain name.

```python
def get_site_title(homepage: PageInfo, base_url: str) -> str
```

**Returns:** Site title (max 80 chars)

**Logic:**
- Uses the homepage title if it is meaningful
- Falls back to the cleaned domain name for generic titles (“Home”, “Welcome”, “Index”)
- Truncates the result to 80 characters
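The logic above might look like the following sketch; it takes the title string directly instead of a `PageInfo`, and the exact domain-cleaning steps are assumptions:

```python
from urllib.parse import urlsplit

GENERIC_TITLES = {"home", "welcome", "index"}

def get_site_title(title, base_url: str) -> str:
    # Use the homepage title when it is present and not generic.
    if title and title.strip().lower() not in GENERIC_TITLES:
        return title.strip()[:80]
    # Fall back to a cleaned-up domain name.
    domain = urlsplit(base_url).netloc
    domain = domain.removeprefix("www.").rsplit(".", 1)[0]
    return domain.replace("-", " ").title()[:80]
```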
### get_summary()

Extracts the site summary from the homepage.

```python
def get_summary(homepage: PageInfo) -> str
```

**Returns:** Site summary (max 200 chars)

**Precedence:**
- Homepage description
- Homepage snippet
- “No description available”
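The precedence chain maps directly onto Python's `or` fallback idiom; this sketch takes the description and snippet fields directly rather than a `PageInfo` object:

```python
def get_summary(description, snippet) -> str:
    # Precedence: description, then snippet, then a fixed fallback string.
    text = description or snippet or "No description available"
    return text[:200]
```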
## Section Processing

### clean_section_name()

Cleans up section names for display.

```python
def clean_section_name(name: str) -> str
```

**Parameters:**

- `name` (`str`): Raw section name from URL path

**Returns:** Capitalized, human-readable section name

**Transformations:**
- Replaces hyphens and underscores with spaces
- Capitalizes words
- Uppercases known abbreviations (API, REST, GraphQL, SDK, CLI, UI, UX, FAQ, RSS)
- Defaults to “Main” for empty names
**Examples:**

```python
clean_section_name("api-reference")
# Returns: "API Reference"

clean_section_name("getting_started")
# Returns: "Getting Started"

clean_section_name("")
# Returns: "Main"
```
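A sketch of these transformations; a casing map is used here (an implementation assumption) so that mixed-case names like “GraphQL” come out as listed above rather than fully uppercased:

```python
ABBREVIATIONS = {
    "api": "API", "rest": "REST", "graphql": "GraphQL", "sdk": "SDK",
    "cli": "CLI", "ui": "UI", "ux": "UX", "faq": "FAQ", "rss": "RSS",
}

def clean_section_name(name: str) -> str:
    if not name:
        return "Main"
    # Hyphens and underscores become spaces; each word is capitalized
    # unless it is a known abbreviation.
    words = name.replace("-", " ").replace("_", " ").split()
    return " ".join(ABBREVIATIONS.get(w.lower(), w.capitalize()) for w in words)
```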
### is_secondary_section()

Determines if a section is secondary/optional content.

```python
def is_secondary_section(section_name: str) -> bool
```

**Returns:** `True` if the section matches secondary patterns

**Secondary Patterns:**
- Legal: privacy, terms, legal, cookie, disclaimer
- Meta: sitemap, changelog, release
- Community: contributing, code-of-conduct, governance, license
- Company: about, team, career, job, contact, company
- Social: twitter, github, linkedin, facebook, social
- Archive: archive, old, legacy, deprecated
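A sketch of the check against these patterns; whether matching is by substring or exact name is not specified above, so substring matching is an assumption here:

```python
SECONDARY_PATTERNS = (
    "privacy", "terms", "legal", "cookie", "disclaimer",
    "sitemap", "changelog", "release",
    "contributing", "code-of-conduct", "governance", "license",
    "about", "team", "career", "job", "contact", "company",
    "twitter", "github", "linkedin", "facebook", "social",
    "archive", "old", "legacy", "deprecated",
)

def is_secondary_section(section_name: str) -> bool:
    # Case-insensitive substring match against the pattern list.
    name = section_name.lower()
    return any(p in name for p in SECONDARY_PATTERNS)
```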
## Usage Examples

```python
from backend.formatter import format_llms_txt
from backend.crawler import PageInfo

pages = [
    PageInfo(
        url="https://example.com",
        title="Example Site",
        description="A great example site",
        snippet="Welcome to our example site...",
    ),
    PageInfo(
        url="https://example.com/docs/intro",
        title="Introduction",
        description="Getting started guide",
        snippet="Learn the basics...",
    ),
    PageInfo(
        url="https://example.com/api/reference",
        title="API Reference",
        description="Complete API documentation",
        snippet="API endpoints...",
    ),
]

output = format_llms_txt("https://example.com", pages)
print(output)
```
**Output:**

```markdown
# Example Site

> A great example site

## Docs

- [Introduction](https://example.com/docs/intro): Getting started guide

## API

- [API Reference](https://example.com/api/reference): Complete API documentation
```
### With Markdown URL Mapping

```python
from backend.formatter import format_llms_txt, get_md_url_map

pages = await crawler.run()
md_url_map = await get_md_url_map(pages)

output = format_llms_txt(
    base_url="https://docs.example.com",
    pages=pages,
    md_url_map=md_url_map,
)
# Links will point to .md files where available
```
### Section Organization

```python
pages = [
    PageInfo(url="https://ex.com", title="Home", description="Main", snippet="..."),
    PageInfo(url="https://ex.com/docs/guide", title="Guide", description="Guide", snippet="..."),
    PageInfo(url="https://ex.com/api/auth", title="Auth", description="Auth", snippet="..."),
    PageInfo(url="https://ex.com/privacy", title="Privacy", description="Privacy", snippet="..."),
]

output = format_llms_txt("https://ex.com", pages)
# Output has primary sections (Docs, API) followed by:
# ## Optional
# - [Privacy](https://ex.com/privacy): Privacy
```
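Sections come from the first URL path segment ("Pages organized by URL path sections" above). A hypothetical helper illustrating that grouping rule:

```python
from urllib.parse import urlsplit

def section_for(url: str) -> str:
    # The section is the first non-empty path segment;
    # the homepage (empty path) has no section.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0] if segments else ""
```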
## Content Tagging

The formatter integrates with the tagger module:

```python
from tagger import assign_tags, format_description_with_tags

tags = assign_tags(page, section_name=section)
desc_with_tags = format_description_with_tags(desc, tags)
```
Tags appear inline in descriptions:

```markdown
- [API Auth](url): Authentication endpoints #api #auth
```
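Based on the inline format shown above, `format_description_with_tags` plausibly appends space-separated `#tags` to the description; this is a sketch inferred from the example output, not the tagger module's actual code:

```python
def format_description_with_tags(desc: str, tags: list) -> str:
    # Append each tag as "#tag", space-separated, after the description.
    if not tags:
        return desc
    return desc + " " + " ".join("#" + t for t in tags)
```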
## Related Modules

- `crawler` - Provides `PageInfo` objects
- `tagger` - Assigns and formats content tags
- `storage` - Saves formatted output to R2