
Overview

The DocSearch crawler uses a JSON configuration file to define how it should crawl and index your documentation. This configuration specifies which pages to crawl, what content to extract, and how to structure the searchable records.
The crawler itself lives in the algolia/docsearch-scraper repository, but if you're on the free DocSearch program, you manage your configuration through the Algolia Crawler Dashboard.

Basic Configuration

A minimal DocSearch configuration looks like this:
{
  "index_name": "my-documentation",
  "start_urls": [
    "https://example.com/docs"
  ],
  "selectors": {
    "lvl0": "#content header h1",
    "lvl1": "#content article h2",
    "lvl2": "#content section h3",
    "lvl3": "#content section h4",
    "text": "#content p, #content li"
  }
}

Required Fields

index_name

The name of your Algolia index where records will be stored.
{
  "index_name": "my-docs"
}
The apiKey provided by DocSearch is scoped to this specific index name and is a search-only key. You can safely commit it to version control.

start_urls

An array of URLs where the crawler begins. It will recursively follow links from these pages.
{
  "start_urls": [
    "https://example.com/docs",
    "https://example.com/api"
  ]
}

Advanced Start URLs

You can provide objects with additional options:
{
  "start_urls": [
    {
      "url": "https://example.com/docs/faq/",
      "selectors_key": "faq",
      "page_rank": 5
    },
    {
      "url": "https://example.com/docs/"
    }
  ]
}

selectors

Defines CSS selectors for extracting content and building the hierarchy.
{
  "selectors": {
    "lvl0": "header h1",
    "lvl1": "article h2",
    "lvl2": "article h3",
    "lvl3": "article h4",
    "lvl4": "article h5",
    "lvl5": "article h6",
    "text": "article p, article li"
  }
}
The text selector is required. We recommend setting at least lvl0, lvl1, and lvl2 for good search depth.
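Each match of the text selector becomes one searchable record that carries the heading hierarchy above it. A simplified sketch of the kind of record this produces (field names may vary slightly between crawler versions):

```json
{
  "url": "https://example.com/docs/getting-started#installation",
  "anchor": "installation",
  "content": "Run the install command to get started.",
  "hierarchy": {
    "lvl0": "Documentation",
    "lvl1": "Getting Started",
    "lvl2": "Installation"
  }
}
```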

Selector Options

String Selectors

The simplest form uses CSS selector strings:
{
  "selectors": {
    "lvl0": ".documentation h1"
  }
}

Object Selectors

For more control, use objects with additional properties:
{
  "selectors": {
    "lvl0": {
      "selector": ".sidebar .active",
      "global": true,
      "default_value": "Documentation"
    }
  }
}

Global Selectors

Mark selectors as global to extract the same value for all records on a page:
{
  "selectors": {
    "lvl0": {
      "selector": ".sidebar .active a",
      "global": true
    }
  }
}
Global selectors are useful for page-level metadata like the current section. Avoid making text selectors global.

Default Values

Provide fallback text when a selector matches nothing:
{
  "selectors": {
    "lvl0": {
      "selector": "header h1",
      "default_value": "Documentation"
    }
  }
}

Strip Characters

Remove decorative characters from extracted text:
{
  "selectors": {
    "lvl1": {
      "selector": "article h2",
      "strip_chars": "#›"
    }
  }
}
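The behavior can be pictured with a small sketch, assuming strip_chars trims the listed characters from both ends of the extracted text (Python str.strip-style) rather than removing them everywhere:

```javascript
// Sketch of strip_chars: trim the listed characters (plus any
// surrounding whitespace) from the ends of the extracted text.
function stripChars(text, chars) {
  const set = new Set(chars);
  let start = 0;
  let end = text.length;
  while (start < end && set.has(text[start])) start += 1;
  while (end > start && set.has(text[end - 1])) end -= 1;
  return text.slice(start, end).trim();
}

console.log(stripChars("# Getting Started ›", "#›")); // "Getting Started"
```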
Or apply globally to all selectors:
{
  "strip_chars": "#›",
  "selectors": { ... }
}

XPath Selectors

For complex DOM traversal, use XPath instead of CSS:
{
  "selectors": {
    "lvl0": {
      "selector": "//li[@class='active']/../../a",
      "type": "xpath",
      "global": true
    }
  }
}
XPath selectors can be hard to read and maintain. Test them thoroughly in your browser's DevTools console (for example with the $x() helper) before deploying.

Multiple Selector Sets

Different pages often have different markup. Use selectors_key to apply different selectors:
{
  "start_urls": [
    {
      "url": "https://example.com/docs/faq/",
      "selectors_key": "faq"
    },
    {
      "url": "https://example.com/docs/"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".docs h1",
      "lvl1": ".docs h2",
      "text": ".docs p"
    },
    "faq": {
      "lvl0": ".faq h1",
      "lvl1": ".faq .question",
      "text": ".faq .answer"
    }
  }
}
Always include a default selector set as a fallback.

Optional Configuration

stop_urls

Prevent the crawler from visiting certain URLs:
{
  "stop_urls": [
    "https://example.com/docs/changelog",
    "https://example.com/docs/archive",
    ".*/archive/.*"
  ]
}
Supports regular expressions:
{
  "stop_urls": [
    "https://example\\.com/docs/v[0-9]+/.*"
  ]
}
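Each pattern is matched as a regular expression against the URLs the crawler discovers. A quick way to sanity-check a pattern before deploying it, sketched here with a hypothetical helper (the crawler itself matches in Python, whose regex syntax differs slightly from JavaScript's for advanced features):

```javascript
// Hypothetical check: a URL is skipped when any stop_urls pattern
// matches it as a regular expression.
const stopUrls = ["https://example\\.com/docs/v[0-9]+/.*"];

function isStopped(url, patterns) {
  return patterns.some((pattern) => new RegExp(pattern).test(url));
}

console.log(isStopped("https://example.com/docs/v2/install", stopUrls)); // true
console.log(isStopped("https://example.com/docs/guide", stopUrls)); // false
```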

selectors_exclude

Remove elements from pages before extraction:
{
  "selectors_exclude": [
    ".sidebar",
    ".table-of-contents",
    "footer",
    ".deprecated"
  ]
}
Use this to exclude navigation, footers, or other repetitive content that shouldn’t be indexed.

scrape_start_urls

Skip extracting content from the start URLs themselves:
{
  "scrape_start_urls": false
}
Useful when start URLs are landing pages without actual documentation content.

min_indexed_level

Only index records with a minimum hierarchy depth:
{
  "min_indexed_level": 2
}
With min_indexed_level: 2, only records with at least lvl0, lvl1, and lvl2 set will be indexed.
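One way to picture the rule (a sketch of the filtering logic, not the crawler's actual code):

```javascript
// Sketch: a record clears min_indexed_level when every hierarchy
// level from lvl0 up to that depth is present.
function meetsMinLevel(record, minLevel) {
  for (let i = 0; i <= minLevel; i += 1) {
    if (!record[`lvl${i}`]) return false;
  }
  return true;
}

console.log(meetsMinLevel({ lvl0: "Guide", lvl1: "Setup", lvl2: "Install" }, 2)); // true
console.log(meetsMinLevel({ lvl0: "Guide", lvl1: "Setup" }, 2)); // false
```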

only_content_level

Index only text content, not headings:
{
  "only_content_level": true
}
This ignores min_indexed_level and may reduce search quality.

Advanced Features

URL Variables and Faceting

Extract variables from URLs for filtering:
{
  "start_urls": [
    {
      "url": "https://example.com/docs/(?P<lang>.*?)/(?P<version>.*?)/",
      "variables": {
        "lang": ["en", "fr", "es"],
        "version": ["latest", "v2", "v1"]
      }
    }
  ]
}
Then filter in your frontend:
docsearch({
  // ...
  algoliaOptions: {
    facetFilters: ['lang:en', 'version:latest']
  }
});

Custom Tags

Add arbitrary tags to pages:
{
  "start_urls": [
    {
      "url": "https://example.com/docs/concepts/",
      "tags": ["concepts", "beginner"]
    }
  ]
}
Filter by tags:
docsearch({
  // ...
  algoliaOptions: {
    facetFilters: ['tags:concepts']
  }
});

Page Rank

Boost specific pages in search results:
{
  "start_urls": [
    {
      "url": "https://example.com/docs/getting-started/",
      "page_rank": 10
    },
    {
      "url": "https://example.com/docs/advanced/",
      "page_rank": 1
    }
  ]
}
Pages with higher values rank first; negative values are also accepted.
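The value feeds into the index's ranking, so among hits that are otherwise equally relevant, higher-ranked pages surface first. A sketch of that tie-breaking (the field name is assumed for illustration; Algolia performs the actual ranking server-side):

```javascript
// Sketch: break ties between equally relevant hits by page rank,
// descending, so boosted pages surface first.
const hits = [
  { url: "https://example.com/docs/advanced/", pageRank: 1 },
  { url: "https://example.com/docs/getting-started/", pageRank: 10 },
];

hits.sort((a, b) => b.pageRank - a.pageRank);
console.log(hits[0].url); // "https://example.com/docs/getting-started/"
```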

Sitemap-Based Crawling

Use XML sitemaps instead of following links:
{
  "sitemap_urls": [
    "https://example.com/sitemap.xml"
  ]
}
Include alternate language versions:
{
  "sitemap_urls": ["https://example.com/sitemap.xml"],
  "sitemap_alternate_links": true
}
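With sitemap_alternate_links enabled, the crawler also visits URLs declared as alternates in the sitemap. In the sitemap protocol these appear as xhtml:link entries on each URL, for example:

```xml
<url>
  <loc>https://example.com/docs/</loc>
  <xhtml:link rel="alternate" hreflang="fr"
              href="https://example.com/fr/docs/" />
</url>
```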

JavaScript Rendering

Enable Browser Emulation

For client-side rendered sites:
{
  "js_render": true
}
This significantly slows crawling. We strongly recommend server-side rendering for documentation.

Wait for Content

Give slow sites time to render:
{
  "js_render": true,
  "js_wait": 2
}
Waits 2 seconds before extracting content.

Hash-Based URLs

For single-page apps using URL fragments:
{
  "js_render": true,
  "use_anchors": true
}

Custom User Agent

Override the default user agent:
{
  "user_agent": "CustomBot/1.0"
}
Defaults:
  • Without js_render: Algolia DocSearch Crawler
  • With js_render: Chrome headless user agent

Algolia Settings

Custom Index Settings

Override default Algolia settings:
{
  "custom_settings": {
    "separatorsToIndex": "_/",
    "attributesToSnippet": ["content:10"]
  }
}
The default settings are optimized for documentation search. Only change them if you have specific requirements.

Synonyms

Define term equivalencies:
{
  "custom_settings": {
    "synonyms": [
      ["js", "javascript"],
      ["es6", "ECMAScript6", "ECMAScript2015"],
      ["css", "stylesheet"]
    ]
  }
}

Configuration Examples

Simple Documentation Site

{
  "index_name": "my-docs",
  "start_urls": ["https://example.com/docs"],
  "selectors": {
    "lvl0": "header h1",
    "lvl1": "article h2",
    "lvl2": "article h3",
    "text": "article p, article li"
  },
  "selectors_exclude": [".sidebar", "footer"]
}

Multi-Version Documentation

{
  "index_name": "versioned-docs",
  "start_urls": [
    {
      "url": "https://example.com/docs/(?P<version>.*?)/",
      "variables": {
        "version": ["v3", "v2", "v1"]
      }
    }
  ],
  "selectors": {
    "lvl0": "nav .active",
    "lvl1": "article h1",
    "lvl2": "article h2",
    "text": "article p"
  }
}

API Reference with Guides

{
  "index_name": "api-docs",
  "start_urls": [
    {
      "url": "https://example.com/guides/",
      "selectors_key": "guides",
      "page_rank": 5
    },
    {
      "url": "https://example.com/api/",
      "selectors_key": "api",
      "page_rank": 3
    }
  ],
  "selectors": {
    "guides": {
      "lvl0": ".guide h1",
      "lvl1": ".guide h2",
      "lvl2": ".guide h3",
      "text": ".guide p"
    },
    "api": {
      "lvl0": ".api .section-name",
      "lvl1": ".api .method-name",
      "lvl2": ".api .param-name",
      "text": ".api .description"
    }
  }
}

Testing Your Configuration

1. Create Configuration: Write your JSON configuration based on your site's HTML structure.

2. Test Selectors: Use browser DevTools to verify selectors match the correct elements.

3. Run Test Crawl: If self-hosting, run a test crawl locally. For the free program, test in the Crawler Dashboard editor.

4. Verify Results: Check indexed records in the Algolia Dashboard to ensure proper extraction.

Next Steps

Getting Started

Learn how the crawler works

Apply to DocSearch

Get free hosting for your open source project

Additional Resources

Algolia Crawler Documentation

Complete Algolia Crawler reference

Example Configurations

Browse real-world DocSearch configurations
