
Overview

The DocSearch crawler is the backbone of the DocSearch service. It automatically extracts content from your documentation website, transforms it into searchable records, and pushes it to an Algolia index. This enables fast, relevant search results for your users.
DocSearch is now built on the Algolia Crawler, which provides a web interface to create, monitor, edit, and start your crawlers.

How It Works

The DocSearch crawler follows a systematic process to index your documentation:
Step 1: Discover Pages

The crawler starts from your configured start_urls and recursively follows internal links to discover all documentation pages on your site.
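This discovery step can be sketched in a few lines of Python. The `link_graph` dictionary below is a stand-in for fetching a page and parsing its links; the real crawler fetches over HTTP.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical sketch of the discovery step: a breadth-first traversal that
# follows only links on the same host as the start URL. `link_graph` stands
# in for fetching a page and reading its <a href> values.
def discover_pages(start_url, link_graph):
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue:
        page = queue.pop(0)
        for href in link_graph.get(page, []):
            url = urljoin(page, href)  # resolve relative links
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return seen

# Example: /docs links to a child page and to an external site.
graph = {
    "https://example.com/docs": ["/docs/install", "https://other.org/"],
    "https://example.com/docs/install": ["/docs"],
}
print(sorted(discover_pages("https://example.com/docs", graph)))
# → ['https://example.com/docs', 'https://example.com/docs/install']
```

Note that the external link to `other.org` is discovered but never queued, matching the crawl-scope rules described later.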
Step 2: Extract Content

Using CSS selectors (or XPath), the crawler extracts structured content from each page’s HTML markup. It identifies headings (h1, h2, h3, etc.) to build a hierarchy and extracts text content from paragraphs and lists.
Step 3: Build Records

The extracted content is transformed into JSON records with a hierarchical structure:
  • lvl0 - Typically the page title or h1
  • lvl1 - Usually h2 headings
  • lvl2 - Usually h3 headings
  • lvl3, lvl4, lvl5 - Deeper heading levels
  • text - Paragraph and list content
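As a rough illustration (not the crawler's actual code), here is how such records could be assembled from an ordered stream of headings and text nodes: each text node inherits the most recent heading seen at every level above it.

```python
# Illustrative sketch of record building. Names and structure are
# simplified; the real crawler handles more levels and metadata.
LEVELS = {"h1": "lvl0", "h2": "lvl1", "h3": "lvl2", "h4": "lvl3"}

def build_records(nodes, url):
    hierarchy = {}
    records = []
    for tag, text in nodes:
        if tag in LEVELS:
            lvl = LEVELS[tag]
            hierarchy[lvl] = text
            # a new heading resets any deeper levels from the previous section
            for deeper in list(hierarchy):
                if deeper > lvl:
                    del hierarchy[deeper]
        else:  # paragraph or list content
            records.append({**hierarchy, "text": text, "url": url})
    return records

nodes = [
    ("h1", "Getting Started"),
    ("p", "Welcome to our documentation."),
    ("h2", "Installation"),
    ("p", "Install the package using npm:"),
]
for record in build_records(nodes, "https://example.com/docs"):
    print(record)
```

The second record carries both `lvl0` ("Getting Started") and `lvl1` ("Installation"), which is what lets search results show the full breadcrumb for a match.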
Step 4: Index to Algolia

The records are pushed to your Algolia index, replacing the previous version. This ensures your search is always up-to-date with your latest documentation.

Crawler Architecture

Legacy Python Crawler

The original DocSearch scraper is a Python-based tool built on the Scrapy framework. It’s open source and available in the algolia/docsearch-scraper repository.

Docker Image

The legacy crawler is packaged as a Docker image for easy deployment.

Modern Algolia Crawler

For the free DocSearch program, Algolia now uses its modern Crawler infrastructure, which provides:
  • Web Interface: Manage crawlers from the Crawler Dashboard
  • Live Editor: Edit and test your configuration in real-time
  • Monitoring: Track crawl statistics, errors, and performance
  • Scheduling: Automatic weekly crawls by default
  • Manual Triggers: Start crawls on-demand when you update your docs

Content Extraction

HTML Structure Requirements

The crawler works best with well-structured HTML markup:
<article>
  <h1>Getting Started</h1>
  <p>Welcome to our documentation.</p>
  
  <h2>Installation</h2>
  <p>Install the package using npm:</p>
  <code>npm install example</code>
  
  <h2>Configuration</h2>
  <h3>Basic Setup</h3>
  <p>Configure your application...</p>
</article>

Selector-Based Extraction

The crawler uses CSS selectors to target specific elements:
{
  "selectors": {
    "lvl0": "article h1",
    "lvl1": "article h2",
    "lvl2": "article h3",
    "lvl3": "article h4",
    "text": "article p, article li"
  }
}
The text selector is mandatory. We highly recommend setting at least lvl0, lvl1, and lvl2 for optimal search relevance.
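To see what such descendant selectors match, here is a minimal stand-in for the extraction engine using only Python's standard-library `html.parser`. It emulates `article h1` through `article h3` and `article p` by tracking whether parsing is inside an `<article>`; the real crawler supports full CSS selectors (and XPath), which this sketch does not.

```python
from html.parser import HTMLParser

# Maps tags (matched only inside <article>) to DocSearch record levels.
SELECTOR_MAP = {"h1": "lvl0", "h2": "lvl1", "h3": "lvl2", "p": "text"}

class SelectorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_article = 0   # depth of nested <article> elements
        self.current = None   # record level of the tag being read, if any
        self.matches = []     # (level, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif self.in_article and tag in SELECTOR_MAP:
            self.current = SELECTOR_MAP[tag]

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article -= 1
        elif self.current and tag in SELECTOR_MAP:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.matches.append((self.current, data.strip()))

html = """<article>
  <h1>Getting Started</h1>
  <p>Welcome to our documentation.</p>
  <h2>Installation</h2>
</article>"""

parser = SelectorExtractor()
parser.feed(html)
print(parser.matches)
# → [('lvl0', 'Getting Started'), ('text', 'Welcome to our documentation.'), ('lvl1', 'Installation')]
```

Content outside `<article>` (navigation, footers) would never match, which is exactly why scoping selectors to your main content container improves relevance.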

Crawl Frequency

For sites enrolled in the free DocSearch program:
  • Default Schedule: Crawls run once per week automatically
  • Manual Crawls: Trigger on-demand from the Crawler Dashboard
  • Updates: Changes to your documentation are reflected after the next crawl
If you need real-time indexing or more frequent crawls, consider running your own crawler or upgrading to a paid Algolia plan.

JavaScript Rendering

By default, the crawler expects server-side rendered content. If your site uses client-side rendering:
{
  "js_render": true,
  "js_wait": 2
}
Client-side crawling is significantly slower than server-side crawling. We strongly recommend implementing server-side rendering for your documentation.

Crawl Scope

The crawler automatically:
  • ✅ Follows all internal links within your domain
  • ✅ Respects start_urls as entry points
  • ❌ Does not follow external links to other domains
  • ❌ Stops at URLs matching stop_urls patterns
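These scope rules can be sketched as a single predicate. This simplified version treats stop_urls as plain URL prefixes; the real crawler also supports pattern matching.

```python
from urllib.parse import urlparse

# Sketch of the scope rules above: a URL is crawled only if it shares the
# start URL's host and does not begin with any stop_urls prefix.
def in_scope(url, start_url, stop_urls=()):
    if urlparse(url).netloc != urlparse(start_url).netloc:
        return False  # external link: never followed
    return not any(url.startswith(stop) for stop in stop_urls)

start = "https://example.com/docs"
stops = ["https://example.com/docs/archive", "https://example.com/blog"]

print(in_scope("https://example.com/docs/guide", start, stops))        # True
print(in_scope("https://example.com/docs/archive/old", start, stops))  # False
print(in_scope("https://example.com/blog/post", start, stops))         # False
print(in_scope("https://other.org/docs", start, stops))                # False
```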

Example Configuration

{
  "index_name": "my-docs",
  "start_urls": [
    "https://example.com/docs"
  ],
  "stop_urls": [
    "https://example.com/docs/archive",
    "https://example.com/blog"
  ]
}

Data Privacy

What Gets Indexed

The crawler extracts:
  • Text content from your documentation pages
  • Heading structure and hierarchy
  • URL paths and anchor links
  • Custom metadata (if configured)
The crawler does NOT extract:
  • Images (only alt text if specified)
  • CSS or JavaScript code
  • Forms or interactive elements
  • Content behind authentication (unless credentials provided)

Data Storage

All indexed data is:

Next Steps

Configuration

Learn how to configure the crawler for your documentation

Apply to DocSearch

Get free crawler hosting for your open source project

Troubleshooting

No Results After Crawl

  1. Check that your selectors match your HTML structure
  2. Verify that pages are accessible (not behind authentication)
  3. Ensure pages are server-side rendered, or enable js_render for client-rendered sites
  4. Review crawl logs in the Crawler Dashboard

Incomplete Indexing

  1. Check stop_urls configuration
  2. Verify all pages are linked from start_urls
  3. Look for broken internal links
  4. Confirm sitemap URLs (if using sitemap-based crawling)

Poor Search Results

  1. Review selector hierarchy (lvl0 through lvl5)
  2. Exclude irrelevant content using selectors_exclude
  3. Configure synonyms for common terms
  4. Adjust page_rank for important pages
For technical support with the Algolia Crawler, reach out via the Algolia support page.
