The formatter module converts crawled page data into the llms.txt markdown format. It handles URL cleaning, markdown file detection, content organization by sections, and automatic tagging.

Overview

The formatter takes a list of PageInfo objects and generates a structured markdown file following the llms.txt specification:
  • Site title and summary from homepage
  • Pages organized by URL path sections
  • Optional markdown file linking (.md variants)
  • Automatic content tagging
  • Primary/secondary content separation

Core Functions

format_llms_txt()

Generates the complete llms.txt formatted output.
def format_llms_txt(
    base_url: str,
    pages: list[PageInfo],
    md_url_map: Dict[str, str] = None
) -> str
Parameters:
  • base_url (str, required): The base URL of the crawled site
  • pages (list[PageInfo], required): List of crawled pages (the first page should be the homepage)
  • md_url_map (Dict[str, str], default None): Optional mapping from HTML URLs to markdown URLs
Returns:
  • output (str): Formatted llms.txt content as a string
Output Format:
# Site Title

> Site summary from homepage

## Section Name

- [Page Title](url): Description with tags
- [Another Page](url): Description

## Optional

- [Privacy Policy](url): Privacy information
- [Terms of Service](url): Legal terms

URL Processing

clean_url()

Removes query parameters and fragments from URLs.
def clean_url(url: str) -> str
Parameters:
  • url (str, required): URL to clean
Returns:
  • clean_url (str): URL with only scheme, netloc, and path
Example:
clean_url("https://example.com/docs?v=2#section")
# Returns: "https://example.com/docs"
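The behavior above could be sketched with the standard library's urllib.parse (a minimal sketch, not necessarily the module's actual implementation):

```python
from urllib.parse import urlparse, urlunparse

def clean_url(url: str) -> str:
    # Keep only scheme, netloc, and path; drop params, query, and fragment.
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))
```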

get_md_url()

Converts an HTML URL to its potential markdown equivalent.
def get_md_url(url: str) -> str
Parameters:
  • url (str, required): Original URL
Returns:
  • md_url (str): Converted markdown URL path
Conversion Rules:
  • page.html → page.html.md
  • /docs/ → /docs/index.html.md
  • /about → /about.md
Example:
get_md_url("https://example.com/docs/")
# Returns: "https://example.com/docs/index.html.md"

get_md_url("https://example.com/guide.html")
# Returns: "https://example.com/guide.html.md"
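The conversion rules reduce to two cases, which could be sketched like this (a simplified sketch of the rules listed above, not the actual implementation):

```python
def get_md_url(url: str) -> str:
    # A trailing slash is treated as a directory; anything else
    # (page.html or an extensionless path) just gets ".md" appended.
    if url.endswith("/"):
        return url + "index.html.md"
    return url + ".md"
```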

check_md_exists()

Checks if a markdown version of a URL exists.
async def check_md_exists(url: str, timeout: float = 5.0) -> bool
Parameters:
  • url (str, required): URL to check
  • timeout (float, default 5.0): Request timeout in seconds
Returns:
  • exists (bool): True if the markdown version returns HTTP 200

get_md_url_map()

Builds a mapping of HTML to markdown URLs for all pages.
async def get_md_url_map(pages: list[PageInfo]) -> Dict[str, str]
Parameters:
  • pages (list[PageInfo], required): List of pages to check
Returns:
  • md_map (Dict[str, str]): Mapping from clean HTML URLs to markdown URLs (or the original URL if no .md variant is found)
Behavior:
  • Sends HEAD requests concurrently for all pages
  • Checks for markdown content types
  • Falls back to original URL if no markdown exists
  • Uses asyncio.gather for parallel requests
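The gather-based concurrency pattern could be sketched as follows. The `check` parameter is a hypothetical injection point standing in for the HEAD request the real module sends (see check_md_exists); it keeps the sketch runnable without touching the network:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

async def get_md_url_map(
    urls: List[str],
    check: Optional[Callable[[str], Awaitable[bool]]] = None,
) -> Dict[str, str]:
    """Map each URL to its .md variant when one exists, else to itself."""

    async def probe(url: str) -> str:
        # Simplified candidate URL; the real rules live in get_md_url().
        md_url = url.rstrip("/") + ".md"
        if check is not None and await check(md_url):
            return md_url
        return url  # fall back to the original URL

    # Fire all probes concurrently, as described above.
    results = await asyncio.gather(*(probe(u) for u in urls))
    return dict(zip(urls, results))
```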

Text Processing

truncate()

Truncates text to specified length with ellipsis.
def truncate(text: str, length: int = 150) -> str
Parameters:
  • text (str, required): Text to truncate
  • length (int, default 150): Maximum character length
Returns:
  • truncated (str): The text truncated with "…" if it exceeds the length limit, otherwise the original text
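A minimal sketch of this behavior (the exact cut point is an assumption; here the result, ellipsis included, never exceeds `length`):

```python
def truncate(text: str, length: int = 150) -> str:
    # Return unchanged text when it already fits.
    if len(text) <= length:
        return text
    # Otherwise cut just short of the limit and append an ellipsis.
    return text[: length - 1].rstrip() + "…"
```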

get_site_title()

Extracts site title from homepage, with fallback to domain name.
def get_site_title(homepage: PageInfo, base_url: str) -> str
Parameters:
  • homepage (PageInfo, required): Homepage page info
  • base_url (str, required): Site base URL
Returns:
  • title (str): Site title (max 80 chars)
Logic:
  • Uses homepage title if meaningful
  • Falls back to cleaned domain name for generic titles (“Home”, “Welcome”, “Index”)
  • Truncated to 80 characters
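The fallback logic above could be sketched like this. The real function takes a PageInfo; here the homepage title is passed directly to keep the example self-contained, and the set of "generic" titles is assumed from the list above:

```python
from urllib.parse import urlparse

# Assumed set of generic titles, per the description above.
GENERIC_TITLES = {"home", "welcome", "index"}

def get_site_title(title: str, base_url: str) -> str:
    # Use the homepage title when it is present and non-generic.
    if title and title.strip().lower() not in GENERIC_TITLES:
        return title.strip()[:80]
    # Otherwise fall back to a cleaned domain name.
    domain = urlparse(base_url).netloc.split(":")[0]
    return domain.removeprefix("www.")[:80]
```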

get_summary()

Extracts site summary from homepage.
def get_summary(homepage: PageInfo) -> str
Parameters:
  • homepage (PageInfo, required): Homepage page info
Returns:
  • summary (str): Site summary (max 200 chars)
Precedence:
  1. Homepage description
  2. Homepage snippet
  3. “No description available”
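The precedence chain could be sketched as follows (the real function reads these fields from a PageInfo object; passing them directly is a simplification):

```python
def get_summary(description: str, snippet: str) -> str:
    # Try each source in order of precedence.
    for candidate in (description, snippet):
        if candidate and candidate.strip():
            return candidate.strip()[:200]
    # Final fallback when neither field has content.
    return "No description available"
```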

Section Processing

clean_section_name()

Cleans up section names for display.
def clean_section_name(name: str) -> str
Parameters:
  • name (str, required): Raw section name from URL path
Returns:
  • clean_name (str): Capitalized, human-readable section name
Transformations:
  • Replaces hyphens and underscores with spaces
  • Capitalizes words
  • Uppercases known abbreviations (API, REST, GraphQL, SDK, CLI, UI, UX, FAQ, RSS)
  • Defaults to “Main” for empty names
Examples:
clean_section_name("api-reference")
# Returns: "API Reference"

clean_section_name("getting_started")
# Returns: "Getting Started"

clean_section_name("")
# Returns: "Main"
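These transformations could be sketched with a casing map (the exact map is an assumption based on the abbreviation list above):

```python
# Assumed casing map for the abbreviations listed above.
ABBREVIATIONS = {
    "api": "API", "rest": "REST", "graphql": "GraphQL", "sdk": "SDK",
    "cli": "CLI", "ui": "UI", "ux": "UX", "faq": "FAQ", "rss": "RSS",
}

def clean_section_name(name: str) -> str:
    # Empty names fall back to the default section.
    if not name:
        return "Main"
    # Hyphens and underscores become word separators.
    words = name.replace("-", " ").replace("_", " ").split()
    # Known abbreviations keep their canonical casing; other words are capitalized.
    return " ".join(ABBREVIATIONS.get(w.lower(), w.capitalize()) for w in words)
```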

is_secondary_section()

Determines if a section is secondary/optional content.
def is_secondary_section(section_name: str) -> bool
Parameters:
  • section_name (str, required): Section name to check
Returns:
  • is_secondary (bool): True if the section matches a secondary pattern
Secondary Patterns:
  • Legal: privacy, terms, legal, cookie, disclaimer
  • Meta: sitemap, changelog, release
  • Community: contributing, code-of-conduct, governance, license
  • Company: about, team, career, job, contact, company
  • Social: twitter, github, linkedin, facebook, social
  • Archive: archive, old, legacy, deprecated
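A simple substring check over the pattern list above would look like this (a sketch; the real matching rules may differ):

```python
# Keyword list assembled from the secondary patterns above.
SECONDARY_KEYWORDS = (
    "privacy", "terms", "legal", "cookie", "disclaimer",
    "sitemap", "changelog", "release",
    "contributing", "code-of-conduct", "governance", "license",
    "about", "team", "career", "job", "contact", "company",
    "twitter", "github", "linkedin", "facebook", "social",
    "archive", "old", "legacy", "deprecated",
)

def is_secondary_section(section_name: str) -> bool:
    # Case-insensitive substring match against the known patterns.
    name = section_name.lower()
    return any(keyword in name for keyword in SECONDARY_KEYWORDS)
```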

Usage Examples

Basic Formatting

from backend.formatter import format_llms_txt
from backend.crawler import PageInfo

pages = [
    PageInfo(
        url="https://example.com",
        title="Example Site",
        description="A great example site",
        snippet="Welcome to our example site..."
    ),
    PageInfo(
        url="https://example.com/docs/intro",
        title="Introduction",
        description="Getting started guide",
        snippet="Learn the basics..."
    ),
    PageInfo(
        url="https://example.com/api/reference",
        title="API Reference",
        description="Complete API documentation",
        snippet="API endpoints..."
    )
]

output = format_llms_txt("https://example.com", pages)
print(output)
Output:
# Example Site

> A great example site

## Docs

- [Introduction](https://example.com/docs/intro): Getting started guide

## API

- [API Reference](https://example.com/api/reference): Complete API documentation

With Markdown URL Mapping

from backend.formatter import format_llms_txt, get_md_url_map

pages = await crawler.run()
md_url_map = await get_md_url_map(pages)

output = format_llms_txt(
    base_url="https://docs.example.com",
    pages=pages,
    md_url_map=md_url_map
)

# Links will point to .md files where available

Section Organization

pages = [
    PageInfo(url="https://ex.com", title="Home", description="Main", snippet="..."),
    PageInfo(url="https://ex.com/docs/guide", title="Guide", description="Guide", snippet="..."),
    PageInfo(url="https://ex.com/api/auth", title="Auth", description="Auth", snippet="..."),
    PageInfo(url="https://ex.com/privacy", title="Privacy", description="Privacy", snippet="..."),
]

output = format_llms_txt("https://ex.com", pages)

# Output has primary sections (Docs, API) followed by:
# ## Optional
# - [Privacy](https://ex.com/privacy): Privacy

Content Tagging

The formatter integrates with the tagger module:
from tagger import assign_tags, format_description_with_tags

tags = assign_tags(page, section_name=section)
desc_with_tags = format_description_with_tags(desc, tags)
Tags appear inline in descriptions:
- [API Auth](url): Authentication endpoints #api #auth

Related Modules

  • crawler - Provides PageInfo objects
  • tagger - Assigns and formats content tags
  • storage - Saves formatted output to R2
