Overview
The DocSearch crawler uses a JSON configuration file to define how it should crawl and index your documentation. This configuration specifies which pages to crawl, what content to extract, and how to structure the searchable records.
Basic Configuration
A minimal DocSearch configuration looks like this:
```json
{
  "index_name": "my-documentation",
  "start_urls": [
    "https://example.com/docs"
  ],
  "selectors": {
    "lvl0": "#content header h1",
    "lvl1": "#content article h2",
    "lvl2": "#content section h3",
    "lvl3": "#content section h4",
    "text": "#content p, #content li"
  }
}
```
Required Fields
index_name
The name of your Algolia index where records will be stored.
```json
{
  "index_name": "my-docs"
}
```
The apiKey provided by DocSearch is scoped to this specific index name and is a search-only key. You can safely commit it to version control.
start_urls
An array of URLs where the crawler begins. It will recursively follow links from these pages.
```json
{
  "start_urls": [
    "https://example.com/docs",
    "https://example.com/api"
  ]
}
```
Advanced Start URLs
You can provide objects with additional options:
```json
{
  "start_urls": [
    {
      "url": "https://example.com/docs/faq/",
      "selectors_key": "faq",
      "page_rank": 5
    },
    {
      "url": "https://example.com/docs/"
    }
  ]
}
```
selectors
Defines CSS selectors for extracting content and building the hierarchy.
```json
{
  "selectors": {
    "lvl0": "header h1",
    "lvl1": "article h2",
    "lvl2": "article h3",
    "lvl3": "article h4",
    "lvl4": "article h5",
    "lvl5": "article h6",
    "text": "article p, article li"
  }
}
```
The text selector is required. We recommend setting at least lvl0, lvl1, and lvl2 for good search depth.
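For orientation, each element matched by the text selector becomes one searchable record carrying the heading hierarchy above it. A simplified sketch of such a record (illustrative values; the exact record shape depends on the crawler version):

```json
{
  "url": "https://example.com/docs/install#requirements",
  "hierarchy": {
    "lvl0": "Guides",
    "lvl1": "Installation",
    "lvl2": "Requirements"
  },
  "content": "Node.js 14 or later is required."
}
```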
Selector Options
String Selectors
The simplest form uses CSS selector strings:
```json
{
  "selectors": {
    "lvl0": ".documentation h1"
  }
}
```
Object Selectors
For more control, use objects with additional properties:
```json
{
  "selectors": {
    "lvl0": {
      "selector": ".sidebar .active",
      "global": true,
      "default_value": "Documentation"
    }
  }
}
```
Global Selectors
Mark selectors as global to extract the same value for all records on a page:
```json
{
  "selectors": {
    "lvl0": {
      "selector": ".sidebar .active a",
      "global": true
    }
  }
}
```
Global selectors are useful for page-level metadata like the current section. Avoid making text selectors global.
Default Values
Provide fallback text when a selector matches nothing:
```json
{
  "selectors": {
    "lvl0": {
      "selector": "header h1",
      "default_value": "Documentation"
    }
  }
}
```
Strip Characters
Remove decorative characters from extracted text:
```json
{
  "selectors": {
    "lvl1": {
      "selector": "article h2",
      "strip_chars": "#›"
    }
  }
}
```
Or apply globally to all selectors:
```json
{
  "strip_chars": "#›",
  "selectors": { ... }
}
```
XPath Selectors
For complex DOM traversal, use XPath instead of CSS:
```json
{
  "selectors": {
    "lvl0": {
      "selector": "//li[@class='active']/../../a",
      "type": "xpath",
      "global": true
    }
  }
}
```
XPath selectors can be hard to read and maintain. Test them thoroughly in your browser’s DevTools console before deploying.
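A quick way to check a candidate expression is the $x helper in the DevTools console (a console-only shorthand for XPath evaluation, not available to page scripts):

```js
// Run on one of your documentation pages in the DevTools console:
$x("//li[@class='active']/../../a")
// Returns an array of matching elements; confirm it contains exactly
// the node whose text you expect the crawler to extract.
```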
Multiple Selector Sets
Different pages often have different markup. Use selectors_key to apply different selectors:
```json
{
  "start_urls": [
    {
      "url": "https://example.com/docs/faq/",
      "selectors_key": "faq"
    },
    {
      "url": "https://example.com/docs/"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".docs h1",
      "lvl1": ".docs h2",
      "text": ".docs p"
    },
    "faq": {
      "lvl0": ".faq h1",
      "lvl1": ".faq .question",
      "text": ".faq .answer"
    }
  }
}
```
Always include a default selector set as a fallback.
Optional Configuration
stop_urls
Prevent the crawler from visiting certain URLs:
```json
{
  "stop_urls": [
    "https://example.com/docs/changelog",
    "https://example.com/docs/archive",
    ".*/archive/.*"
  ]
}
```
Supports regular expressions:
```json
{
  "stop_urls": [
    "https://example\\.com/docs/v[0-9]+/.*"
  ]
}
```
selectors_exclude
Remove elements from pages before extraction:
```json
{
  "selectors_exclude": [
    ".sidebar",
    ".table-of-contents",
    "footer",
    ".deprecated"
  ]
}
```
Use this to exclude navigation, footers, or other repetitive content that shouldn’t be indexed.
scrape_start_urls
Skip extracting content from the start URLs themselves:
```json
{
  "scrape_start_urls": false
}
```
Useful when start URLs are landing pages without actual documentation content.
min_indexed_level
Only index records with a minimum hierarchy depth:
```json
{
  "min_indexed_level": 2
}
```
With min_indexed_level: 2, only records with at least lvl0, lvl1, and lvl2 set will be indexed.
only_content_level
Index only text content, not headings:
```json
{
  "only_content_level": true
}
```
This ignores min_indexed_level and may reduce search quality.
Advanced Features
URL Variables and Faceting
Extract variables from URLs for filtering:
```json
{
  "start_urls": [
    {
      "url": "https://example.com/docs/(?P<lang>.*?)/(?P<version>.*?)/",
      "variables": {
        "lang": ["en", "fr", "es"],
        "version": ["latest", "v2", "v1"]
      }
    }
  ]
}
```
Then filter in your frontend:
```js
docsearch({
  // ...
  algoliaOptions: {
    facetFilters: ['lang:en', 'version:latest']
  }
});
```
Add arbitrary tags to pages:
```json
{
  "start_urls": [
    {
      "url": "https://example.com/docs/concepts/",
      "tags": ["concepts", "beginner"]
    }
  ]
}
```
Filter by tags:
```js
docsearch({
  // ...
  algoliaOptions: {
    facetFilters: ['tags:concepts']
  }
});
```
Page Rank
Boost specific pages in search results:
```json
{
  "start_urls": [
    {
      "url": "https://example.com/docs/getting-started/",
      "page_rank": 10
    },
    {
      "url": "https://example.com/docs/advanced/",
      "page_rank": 1
    }
  ]
}
```
Higher values rank first. Accepts negative values.
Sitemap-Based Crawling
Use XML sitemaps instead of following links:
```json
{
  "sitemap_urls": [
    "https://example.com/sitemap.xml"
  ]
}
```
Include alternate language versions:
```json
{
  "sitemap_urls": ["https://example.com/sitemap.xml"],
  "sitemap_alternate_links": true
}
```
JavaScript Rendering
Enable Browser Emulation
For client-side rendered sites:
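Browser emulation is switched on with the js_render flag (the same flag used together with js_wait below):

```json
{
  "js_render": true
}
```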
This significantly slows crawling. We strongly recommend server-side rendering for documentation.
Wait for Content
Give slow sites time to render:
```json
{
  "js_render": true,
  "js_wait": 2
}
```
The crawler waits 2 seconds after page load before extracting content.
Hash-Based URLs
For single-page apps using URL fragments:
{
"js_render" : true ,
"use_anchors" : true
}
Custom User Agent
Override the default user agent:
```json
{
  "user_agent": "CustomBot/1.0"
}
```
Defaults:
- Without js_render: Algolia DocSearch Crawler
- With js_render: a Chrome headless user agent
Algolia Settings
Custom Index Settings
Override default Algolia settings:
```json
{
  "custom_settings": {
    "separatorsToIndex": "_/",
    "attributesToSnippet": ["content:10"]
  }
}
```
The default settings are optimized for documentation. Only change them if you have specific requirements.
Synonyms
Define term equivalencies:
```json
{
  "custom_settings": {
    "synonyms": [
      ["js", "javascript"],
      ["es6", "ECMAScript6", "ECMAScript2015"],
      ["css", "stylesheet"]
    ]
  }
}
```
Configuration Examples
Simple Documentation Site
```json
{
  "index_name": "my-docs",
  "start_urls": ["https://example.com/docs"],
  "selectors": {
    "lvl0": "header h1",
    "lvl1": "article h2",
    "lvl2": "article h3",
    "text": "article p, article li"
  },
  "selectors_exclude": [".sidebar", "footer"]
}
```
Multi-Version Documentation
```json
{
  "index_name": "versioned-docs",
  "start_urls": [
    {
      "url": "https://example.com/docs/(?P<version>.*?)/",
      "variables": {
        "version": ["v3", "v2", "v1"]
      }
    }
  ],
  "selectors": {
    "lvl0": "nav .active",
    "lvl1": "article h1",
    "lvl2": "article h2",
    "text": "article p"
  }
}
```
API Reference with Guides
```json
{
  "index_name": "api-docs",
  "start_urls": [
    {
      "url": "https://example.com/guides/",
      "selectors_key": "guides",
      "page_rank": 5
    },
    {
      "url": "https://example.com/api/",
      "selectors_key": "api",
      "page_rank": 3
    }
  ],
  "selectors": {
    "guides": {
      "lvl0": ".guide h1",
      "lvl1": ".guide h2",
      "lvl2": ".guide h3",
      "text": ".guide p"
    },
    "api": {
      "lvl0": ".api .section-name",
      "lvl1": ".api .method-name",
      "lvl2": ".api .param-name",
      "text": ".api .description"
    }
  }
}
```
Testing Your Configuration
1. Create Configuration: write your JSON configuration based on your site's HTML structure.
2. Test Selectors: use browser DevTools to verify selectors match the correct elements.
3. Run Test Crawl: if self-hosting, run a test crawl locally. For the free program, test in the Crawler Dashboard editor.
4. Verify Results: check indexed records in the Algolia Dashboard to ensure proper extraction.
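Before running a crawl, the required fields described above (index_name, start_urls, and a selectors set that includes text) can be sanity-checked with a short script. This is an illustrative sketch, not part of DocSearch itself:

```js
// Minimal structural check for a DocSearch config object.
// Returns a list of problems; an empty list means the basics are present.
function validateConfig(config) {
  const problems = [];
  if (typeof config.index_name !== "string" || !config.index_name) {
    problems.push("index_name must be a non-empty string");
  }
  if (!Array.isArray(config.start_urls) || config.start_urls.length === 0) {
    problems.push("start_urls must be a non-empty array");
  }
  const selectors = config.selectors || {};
  // With multiple selector sets, each set needs its own "text" selector.
  const sets = "default" in selectors ? Object.values(selectors) : [selectors];
  if (!sets.every((set) => set && set.text)) {
    problems.push('every selector set must define a "text" selector');
  }
  return problems;
}
```

For example, `validateConfig(JSON.parse(fs.readFileSync("config.json", "utf8")))` reports any missing required fields before you start a crawl.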
Next Steps
- Getting Started: learn how the crawler works
- Apply to DocSearch: get free hosting for your open source project
Additional Resources
- Algolia Crawler Documentation: complete Algolia Crawler reference
- Example Configurations: browse real-world DocSearch configurations