
Overview

The DocSearch crawler is the backbone of the DocSearch service. It automatically extracts content from your documentation website, transforms it into searchable records, and pushes it to an Algolia index. This enables fast, relevant search results for your users.
DocSearch is now built on the Algolia Crawler, which provides a web interface to create, monitor, edit, and start your crawlers.

How It Works

The DocSearch crawler follows a systematic process to index your documentation:
Step 1: Discover Pages

The crawler starts from your configured start_urls and recursively follows internal links to discover all documentation pages on your site.
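This discovery step can be sketched in a few lines of Python. The `link_graph` dictionary below is a stand-in for fetching a page and parsing its links; the real crawler fetches over HTTP.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical sketch of the discovery step: a breadth-first traversal that
# follows only links on the same host as the start URL. `link_graph` stands
# in for fetching a page and reading its <a href> values.
def discover_pages(start_url, link_graph):
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue:
        page = queue.pop(0)
        for href in link_graph.get(page, []):
            url = urljoin(page, href)  # resolve relative links
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return seen

# Example: /docs links to a child page and to an external site.
graph = {
    "https://example.com/docs": ["/docs/install", "https://other.org/"],
    "https://example.com/docs/install": ["/docs"],
}
print(sorted(discover_pages("https://example.com/docs", graph)))
# → ['https://example.com/docs', 'https://example.com/docs/install']
```

Note that the external link to `other.org` is discovered but never queued, matching the crawl-scope rules described later.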
Step 2: Extract Content

Using CSS selectors (or XPath), the crawler extracts structured content from each page’s HTML markup. It identifies headings (h1, h2, h3, etc.) to build a hierarchy and extracts text content from paragraphs and lists.
Step 3: Build Records

The extracted content is transformed into JSON records with a hierarchical structure:
  • lvl0 - Typically the page title or h1
  • lvl1 - Usually h2 headings
  • lvl2 - Usually h3 headings
  • lvl3, lvl4, lvl5 - Deeper heading levels
  • text - Paragraph and list content
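As a rough illustration (not the crawler's actual code), here is how such records could be assembled from an ordered stream of headings and text nodes: each text node inherits the most recent heading seen at every level above it.

```python
# Illustrative sketch of record building. Names and structure are
# simplified; the real crawler handles more levels and metadata.
LEVELS = {"h1": "lvl0", "h2": "lvl1", "h3": "lvl2", "h4": "lvl3"}

def build_records(nodes, url):
    hierarchy = {}
    records = []
    for tag, text in nodes:
        if tag in LEVELS:
            lvl = LEVELS[tag]
            hierarchy[lvl] = text
            # a new heading resets any deeper levels from the previous section
            for deeper in list(hierarchy):
                if deeper > lvl:
                    del hierarchy[deeper]
        else:  # paragraph or list content
            records.append({**hierarchy, "text": text, "url": url})
    return records

nodes = [
    ("h1", "Getting Started"),
    ("p", "Welcome to our documentation."),
    ("h2", "Installation"),
    ("p", "Install the package using npm:"),
]
for record in build_records(nodes, "https://example.com/docs"):
    print(record)
```

The second record carries both `lvl0` ("Getting Started") and `lvl1` ("Installation"), which is what lets search results show the full breadcrumb for a match.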
Step 4: Index to Algolia

The records are pushed to your Algolia index, replacing the previous version. This ensures your search is always up-to-date with your latest documentation.

Crawler Architecture

Legacy Python Crawler

The original DocSearch scraper is a Python-based tool built on the Scrapy framework. It’s open source and available in the algolia/docsearch-scraper repository.

Docker Image

The legacy crawler is packaged as a Docker image for easy deployment.

Modern Algolia Crawler

For the free DocSearch program, Algolia now uses its modern Crawler infrastructure, which provides:
  • Web Interface: Manage crawlers from the Crawler Dashboard
  • Live Editor: Edit and test your configuration in real-time
  • Monitoring: Track crawl statistics, errors, and performance
  • Scheduling: Automatic weekly crawls by default
  • Manual Triggers: Start crawls on-demand when you update your docs

Content Extraction

HTML Structure Requirements

The crawler works best with well-structured HTML markup:
<article>
  <h1>Getting Started</h1>
  <p>Welcome to our documentation.</p>
  
  <h2>Installation</h2>
  <p>Install the package using npm:</p>
  <code>npm install example</code>
  
  <h2>Configuration</h2>
  <h3>Basic Setup</h3>
  <p>Configure your application...</p>
</article>

Selector-Based Extraction

The crawler uses CSS selectors to target specific elements:
{
  "selectors": {
    "lvl0": "article h1",
    "lvl1": "article h2",
    "lvl2": "article h3",
    "lvl3": "article h4",
    "text": "article p, article li"
  }
}
The text selector is mandatory. We highly recommend setting at least lvl0, lvl1, and lvl2 for optimal search relevance.
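To see what such descendant selectors match, here is a minimal stand-in for the extraction engine using only Python's standard-library `html.parser`. It emulates `article h1` through `article h3` and `article p` by tracking whether parsing is inside an `<article>`; the real crawler supports full CSS selectors (and XPath), which this sketch does not.

```python
from html.parser import HTMLParser

# Maps tags (matched only inside <article>) to DocSearch record levels.
SELECTOR_MAP = {"h1": "lvl0", "h2": "lvl1", "h3": "lvl2", "p": "text"}

class SelectorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_article = 0   # depth of nested <article> elements
        self.current = None   # record level of the tag being read, if any
        self.matches = []     # (level, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif self.in_article and tag in SELECTOR_MAP:
            self.current = SELECTOR_MAP[tag]

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article -= 1
        elif self.current and tag in SELECTOR_MAP:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.matches.append((self.current, data.strip()))

html = """<article>
  <h1>Getting Started</h1>
  <p>Welcome to our documentation.</p>
  <h2>Installation</h2>
</article>"""

parser = SelectorExtractor()
parser.feed(html)
print(parser.matches)
# → [('lvl0', 'Getting Started'), ('text', 'Welcome to our documentation.'), ('lvl1', 'Installation')]
```

Content outside `<article>` (navigation, footers) would never match, which is exactly why scoping selectors to your main content container improves relevance.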

Crawl Frequency

For sites enrolled in the free DocSearch program:
  • Default Schedule: Crawls run once per week automatically
  • Manual Crawls: Trigger on-demand from the Crawler Dashboard
  • Updates: Changes to your documentation are reflected after the next crawl
If you need real-time indexing or more frequent crawls, consider running your own crawler or upgrading to a paid Algolia plan.

JavaScript Rendering

By default, the crawler expects server-side rendered content. If your site uses client-side rendering:
{
  "js_render": true,
  "js_wait": 2
}
Client-side crawling is significantly slower than server-side crawling. We strongly recommend implementing server-side rendering for your documentation.

Crawl Scope

The crawler automatically:
  • ✅ Follows all internal links within your domain
  • ✅ Respects start_urls as entry points
  • ❌ Does not follow external links to other domains
  • ❌ Stops at URLs matching stop_urls patterns
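These scope rules can be sketched as a single predicate. This simplified version treats stop_urls as plain URL prefixes; the real crawler also supports pattern matching.

```python
from urllib.parse import urlparse

# Sketch of the scope rules above: a URL is crawled only if it shares the
# start URL's host and does not begin with any stop_urls prefix.
def in_scope(url, start_url, stop_urls=()):
    if urlparse(url).netloc != urlparse(start_url).netloc:
        return False  # external link: never followed
    return not any(url.startswith(stop) for stop in stop_urls)

start = "https://example.com/docs"
stops = ["https://example.com/docs/archive", "https://example.com/blog"]

print(in_scope("https://example.com/docs/guide", start, stops))        # True
print(in_scope("https://example.com/docs/archive/old", start, stops))  # False
print(in_scope("https://example.com/blog/post", start, stops))         # False
print(in_scope("https://other.org/docs", start, stops))                # False
```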

Example Configuration

{
  "index_name": "my-docs",
  "start_urls": [
    "https://example.com/docs"
  ],
  "stop_urls": [
    "https://example.com/docs/archive",
    "https://example.com/blog"
  ]
}

Data Privacy

What Gets Indexed

The crawler extracts:
  • Text content from your documentation pages
  • Heading structure and hierarchy
  • URL paths and anchor links
  • Custom metadata (if configured)
The crawler does NOT extract:
  • Images (only alt text if specified)
  • CSS or JavaScript code
  • Forms or interactive elements
  • Content behind authentication (unless credentials provided)

Data Storage

All indexed data is:

Next Steps

Configuration

Learn how to configure the crawler for your documentation

Apply to DocSearch

Get free crawler hosting for your open source project

Troubleshooting

No Results After Crawl

  1. Check that your selectors match your HTML structure
  2. Verify that pages are accessible (not behind authentication)
  3. Ensure pages are server-side rendered, or enable js_render for client-rendered sites
  4. Review crawl logs in the Crawler Dashboard

Incomplete Indexing

  1. Check stop_urls configuration
  2. Verify all pages are linked from start_urls
  3. Look for broken internal links
  4. Confirm sitemap URLs (if using sitemap-based crawling)

Poor Search Results

  1. Review selector hierarchy (lvl0 through lvl5)
  2. Exclude irrelevant content using selectors_exclude
  3. Configure synonyms for common terms
  4. Adjust page_rank for important pages
For technical support with the Algolia Crawler, reach out via the Algolia support page.
