Crawler Overview
The Algolia Crawler is a web scraper that:
- Visits your documentation pages
- Extracts content using CSS selectors
- Creates structured records for Algolia
- Runs on a schedule (default: weekly)
- Respects robots.txt and crawl limits
If you’re using the DocSearch program, Algolia manages the crawler for you. For custom implementations, you configure it yourself.
How Content is Indexed
Record Structure
The crawler creates hierarchical records for each page section.
Hierarchy Levels
DocSearch organizes content into levels:
- lvl0: Top-level category (e.g., product name, section)
- lvl1: Page title or main heading (h1)
- lvl2-6: Subheadings (h2-h6)
- content: Paragraph text, list items, code snippets
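Putting the levels together, a single extracted record might look like the following sketch. The field values are illustrative, and the exact record shape can vary by crawler version:

```javascript
// Illustrative DocSearch-style record for one page section.
// Values here are made up; the real crawler fills them in per page.
const record = {
  objectID: "docs-crawler-overview-0", // unique ID per record
  url: "https://example.com/docs/crawler#record-structure",
  type: "content", // which level this record represents
  hierarchy: {
    lvl0: "Documentation",     // top-level category
    lvl1: "Crawler Overview",  // page h1
    lvl2: "Record Structure",  // nearest h2
    lvl3: null,
    lvl4: null,
    lvl5: null,
    lvl6: null,
  },
  content: "The crawler creates hierarchical records for each page section.",
};
```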
Record Extractor Configuration
The recordExtractor function defines how content is extracted from your pages.
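A minimal sketch of an action with a recordExtractor, built on the helpers.docsearch helper; the index name, URL pattern, and CSS selectors are placeholders to adapt to your site:

```javascript
new Crawler({
  // ...app credentials and other settings omitted...
  actions: [
    {
      indexName: "my_docs", // placeholder index name
      pathsToMatch: ["https://example.com/docs/**"],
      recordExtractor: ({ $, helpers }) => {
        // helpers.docsearch builds hierarchical records from these selectors
        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: ".sidebar .active-category",
              defaultValue: "Documentation",
            },
            lvl1: "article h1",
            lvl2: "article h2",
            lvl3: "article h3",
            content: "article p, article li",
          },
        });
      },
    },
  ],
});
```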
Advanced Extraction Patterns
Fallback Selectors
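A sketch of the pattern, assuming comma-separated CSS selectors so pages with different markup still produce a match; the class names are placeholders:

```javascript
// Excerpt of recordProps: list several selectors per level so that
// pages with different markup still match (placeholders below).
recordProps: {
  lvl1: "article h1, .post-title, header h1",
  lvl2: "article h2, .post-section-title",
  content: "article p, article li, .post-content p",
}
```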
Use multiple selectors as fallbacks.
Default Values
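A sketch using defaultValue; the selector and text are placeholders:

```javascript
// Excerpt of recordProps: if no element matches the selector,
// the record falls back to defaultValue.
recordProps: {
  lvl0: {
    selectors: ".sidebar .active-category",
    defaultValue: "Documentation", // used when nothing matches
  },
}
```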
Provide fallback text when selectors don’t match.
DOM Manipulation with Cheerio
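Inside recordExtractor, the Cheerio instance ($) can prune elements before extraction; the selectors below are placeholders:

```javascript
// Excerpt of an action: prune elements with Cheerio before extracting.
recordExtractor: ({ $, helpers }) => {
  // Remove navigation, footers, and other page chrome so their
  // text never reaches the index.
  $("nav, footer, .sidebar-ads, .edit-this-page").remove();

  return helpers.docsearch({
    recordProps: {
      lvl1: "article h1",
      content: "article p, article li",
    },
  });
}
```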
Remove unwanted elements before extraction.
Faceting and Filtering
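One way to add custom attributes is to read them from page metadata inside recordProps; the meta tag and attribute names here are assumptions to adapt to your markup:

```javascript
// Excerpt: custom attributes alongside the standard hierarchy levels.
recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl1: "article h1",
      content: "article p, article li",
      // Assumed markup: values pulled from meta tags on the page.
      version: $('meta[name="docsearch:version"]').attr("content"),
      language: $("html").attr("lang"),
    },
  });
}
```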
Index custom attributes to make them available as facets in your Algolia index for filtering search results.
Boosting Records with pageRank
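A sketch of boosting via pageRank in recordProps; the value shown is arbitrary:

```javascript
// Excerpt: boost every record produced by this action.
recordProps: {
  lvl1: "article h1",
  content: "article p",
  pageRank: "30", // string value; a negative string de-boosts
}
```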
Increase the ranking of important pages.
Understanding pageRank
The pageRank value is added to Algolia’s computed weight for each result. Higher values appear first. Use string values, including negative numbers, to de-boost less important content.
Reducing Record Size
Aggregate Content
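A sketch with the aggregateContent option enabled:

```javascript
// Excerpt: merge consecutive content under the same heading into
// fewer, larger records instead of one record per paragraph.
return helpers.docsearch({
  recordProps: { /* ...selectors... */ },
  aggregateContent: true,
});
```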
Combine content records to reduce total record count.
Record Version
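A sketch opting into the v3 record format:

```javascript
// Excerpt: v3 records are more compact than the legacy format.
return helpers.docsearch({
  recordProps: { /* ...selectors... */ },
  recordVersion: "v3",
});
```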
Use v3 format to reduce record size.
Crawler Configuration
Scheduling Crawls
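A sketch of a top-level schedule setting; the exact expression syntax is worth verifying against your crawler's documentation:

```javascript
new Crawler({
  // ...other settings...
  schedule: "every 1 week", // assumed schedule expression
});
```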
Configure when the crawler runs.
URL Patterns
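A sketch of URL pattern controls; the URLs and glob patterns are placeholders, and exclusionPatterns is an assumed setting name to confirm against your crawler's reference:

```javascript
new Crawler({
  startUrls: ["https://example.com/docs/"],
  actions: [
    {
      indexName: "my_docs",
      pathsToMatch: ["https://example.com/docs/**"], // crawl all docs pages
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({ recordProps: { lvl1: "h1", content: "p" } }),
    },
  ],
  // Assumed: skip generated or non-HTML content.
  exclusionPatterns: ["**/changelog/**", "**/*.pdf"],
});
```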
Control which pages are crawled.
Authentication
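One shape this can take is a login request the crawler performs before crawling, reusing the resulting session cookies; treat the parameter names below as assumptions to verify against your crawler's documentation:

```javascript
new Crawler({
  // ...other settings...
  // Assumed shape: the crawler submits this request first, then
  // crawls with the returned session cookies.
  login: {
    fetchRequest: {
      url: "https://example.com/login",
      options: {
        method: "POST",
        body: "username=crawler&password=REDACTED", // placeholders
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
      },
    },
  },
});
```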
Crawl password-protected documentation. Authentication is available on paid Algolia plans, not the free DocSearch program.
Monitoring Crawls
Access the Algolia Crawler dashboard to:
- Trigger manual crawls
- View crawl logs and errors
- Test URL extraction
- Monitor index size
- Preview search results
URL Tester
Test your record extractor on specific URLs:
- Navigate to the URL Tester in the crawler dashboard
- Enter a documentation URL
- View extracted records
- Refine your selectors
Search Preview
Test search queries against your index:
- Open Search Preview in the dashboard
- Enter search terms
- View ranked results
- Adjust relevance settings
Common Issues
Pages aren't being crawled
Check that:
- URLs match pathsToMatch patterns
- Pages aren’t blocked by robots.txt
- Authentication is configured correctly
- Pages return 200 status codes
Content is missing from results
Verify that:
- CSS selectors match your page structure
- Content isn’t inside removed elements
- Records aren’t exceeding size limits
- The crawler can access dynamic content
Duplicate results appearing
Solutions:
- Configure canonical URLs
- Add exclusion patterns for duplicates
- Use the distinct parameter in search
- Check for trailing slash variations
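For the distinct suggestion above, a sketch of a search call; the client setup and index name are placeholders, and distinct also requires attributeForDistinct to be configured on the index:

```javascript
// Assumed setup with the algoliasearch client (v4-style API).
const algoliasearch = require("algoliasearch");
const client = algoliasearch("APP_ID", "SEARCH_API_KEY"); // placeholders
const index = client.initIndex("my_docs");

// Deduplicate hits that share the same attributeForDistinct (e.g. url).
index.search("crawler", { distinct: true }).then(({ hits }) => {
  console.log(hits.length);
});
```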
Too many records error
Reduce records by:
- Setting aggregateContent: true
- Removing verbose content selectors
- Excluding code blocks or examples
- Splitting large pages into smaller sections
Best Practices
Use Semantic HTML
Structure content with proper heading hierarchy (h1 → h2 → h3) for better indexing.
Keep Selectors Simple
Use stable CSS classes rather than complex selectors that may break.
Test Regularly
Use the URL tester to verify extraction after content changes.
Monitor Index Size
Track record count and optimize extraction to stay within limits.
Next Steps
Record Extractor Reference
Complete reference for record extractor configuration
