DocSearch uses the Algolia Crawler to automatically extract content from your documentation and build a searchable index. Understanding how indexing works helps you optimize your documentation for better search results.

Crawler Overview

The Algolia Crawler is a web scraper that:
  • Visits your documentation pages
  • Extracts content using CSS selectors
  • Creates structured records for Algolia
  • Runs on a schedule (default: weekly)
  • Respects robots.txt and crawl limits
If you’re using the DocSearch program, Algolia manages the crawler for you. For custom implementations, you configure it yourself.

How Content is Indexed

Record Structure

The crawler creates hierarchical records for each page section:
{
  "objectID": "page-url#heading-id",
  "hierarchy": {
    "lvl0": "Documentation",
    "lvl1": "Getting Started",
    "lvl2": "Installation",
    "lvl3": "Using npm",
    "lvl4": null,
    "lvl5": null,
    "lvl6": null
  },
  "content": "Install DocSearch using npm or yarn package manager",
  "url": "https://example.com/docs/installation#npm",
  "anchor": "npm"
}

Hierarchy Levels

DocSearch organizes content into levels:
  • lvl0: Top-level category (e.g., product name, section)
  • lvl1: Page title or main heading (h1)
  • lvl2-6: Subheadings (h2-h6)
  • content: Paragraph text, list items, code snippets
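The levels above can be pictured as a running hierarchy that resets as new headings open new sections. A minimal sketch of that behavior (illustrative only, not the crawler's actual implementation; `buildHierarchies` is a hypothetical helper):

```javascript
// Walk a page's headings in document order and emit one hierarchy
// object per section, shaped like DocSearch records. `level` here is
// the hierarchy level (1 = lvl1), not the raw HTML tag.
function buildHierarchies(lvl0, headings) {
  const current = { lvl0, lvl1: null, lvl2: null, lvl3: null, lvl4: null, lvl5: null, lvl6: null };
  const records = [];
  for (const { level, text } of headings) {
    current[`lvl${level}`] = text;
    // A new heading closes any deeper subsections, so reset them.
    for (let l = level + 1; l <= 6; l++) current[`lvl${l}`] = null;
    records.push({ ...current });
  }
  return records;
}

const records = buildHierarchies("Documentation", [
  { level: 1, text: "Getting Started" },
  { level: 2, text: "Installation" },
  { level: 3, text: "Using npm" },
  { level: 2, text: "Configuration" },
]);
// The "Configuration" record no longer carries lvl3: "Using npm".
```

Note how the third record matches the JSON example earlier: lvl0 through lvl3 filled in, deeper levels null.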

Record Extractor Configuration

The recordExtractor defines how content is extracted from your pages:
recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}
Extraction happens in three steps:
  1. Define selectors: Specify CSS selectors for each hierarchy level and content elements.
  2. Extract records: The helper function processes the page and creates Algolia records.
  3. Index records: Records are sent to your Algolia index for searching.

Advanced Extraction Patterns

Fallback Selectors

Use multiple selectors as fallbacks:
recordProps: {
  lvl0: {
    selectors: [".page-title h1", ".header h1", "h1"],
  },
  lvl1: "article h2",
  content: [
    ".main-content p, .main-content li",
    ".content p, .content li",
    "p, li"
  ],
}
The crawler uses the first selector that returns results.
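The "first selector that returns results" rule can be sketched as follows (assumed semantics for illustration; the crawler's internals may differ):

```javascript
// Try each selector in order and keep the first one that yields
// at least one match on the page.
function firstMatching(selectors, countMatches) {
  for (const selector of selectors) {
    if (countMatches(selector) > 0) return selector;
  }
  return null;
}

// Simulated page where only a bare <h1> exists, so the two more
// specific selectors fall through to the generic one.
const matchCounts = { ".page-title h1": 0, ".header h1": 0, "h1": 1 };
const chosen = firstMatching(
  [".page-title h1", ".header h1", "h1"],
  (sel) => matchCounts[sel] ?? 0
);
```

Ordering selectors from most to least specific keeps extraction precise on well-structured pages while still covering pages with simpler markup.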

Default Values

Provide fallback text when selectors don’t match:
recordProps: {
  lvl0: {
    selectors: "header h1",
    defaultValue: "Documentation",
  },
  language: {
    selectors: "html[lang]",
    defaultValue: ["en"],
  },
}

DOM Manipulation with Cheerio

Remove unwanted elements before extraction:
recordExtractor: ({ $, helpers }) => {
  // Remove elements you don't want indexed
  $(".edit-link").remove();
  $(".footer").remove();
  $(".sidebar").remove();
  $(".code-copy-button").remove();

  return helpers.docsearch({
    recordProps: {
      lvl0: { selectors: "header h1" },
      lvl1: "article h2",
      lvl2: "article h3",
      content: "main p, main li",
    },
  });
}

Faceting and Filtering

Index custom attributes for filtering:
recordProps: {
  lvl0: { selectors: "header h1" },
  lvl1: "article h2",
  content: "main p, main li",
  // Custom attributes for filtering
  version: {
    selectors: ".version-badge",
    defaultValue: ["latest"],
  },
  language: {
    selectors: "html[lang]",
    defaultValue: ["en"],
  },
  tags: ".article-tags .tag",
}
Then filter in your frontend:
<DocSearch
  appId="YOUR_APP_ID"
  apiKey="YOUR_SEARCH_API_KEY"
  indexName="YOUR_INDEX_NAME"
  searchParameters={{
    facetFilters: ['version:v4', 'language:en'],
  }}
/>
Custom attributes are available as facets in your Algolia index for filtering search results.
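If a custom attribute does not appear as a facet, it may need to be declared in the index settings. With the Algolia Crawler this can be done through initialIndexSettings; the excerpt below is a sketch (the "docs" key is the index name from your config, and you should verify the exact settings against your crawler configuration):

```javascript
// Crawler config excerpt: declare custom attributes as facets so
// facetFilters like 'version:v4' work at query time.
initialIndexSettings: {
  docs: {
    attributesForFaceting: ["version", "language", "tags"],
  },
},
```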

Boosting Records with pageRank

Increase the ranking of important pages:
recordExtractor: ({ $, helpers, url }) => {
  const isGettingStarted = url.pathname.includes('/getting-started');
  const isReference = url.pathname.includes('/api-reference');
  
  return helpers.docsearch({
    recordProps: {
      lvl0: { selectors: "header h1" },
      lvl1: "article h2",
      content: "main p, main li",
      // Boost getting started pages
      pageRank: isGettingStarted ? "100" : isReference ? "-50" : "0",
    },
  });
}
The pageRank value is added to Algolia's computed weight for each result, so higher values rank first. The value is passed as a string; negative values de-boost less important content.

Reducing Record Size

Aggregate Content

Combine content records to reduce total record count:
return helpers.docsearch({
  recordProps: {
    lvl0: { selectors: "header h1" },
    lvl1: "article h2",
    lvl2: "article h3",
    content: "main p, main li",
  },
  aggregateContent: true, // Groups content under headings
});
Pages generating more than 750 records will fail to index. Use aggregateContent: true to stay under the limit.

Record Version

Use v3 format to reduce record size:
return helpers.docsearch({
  recordProps: {
    lvl0: { selectors: "header h1" },
    lvl1: "article h2",
    content: "main p, main li",
  },
  recordVersion: "v3", // Removes legacy DocSearch v2 fields
});

Crawler Configuration

Scheduling Crawls

Configure when the crawler runs:
{
  "schedule": "every 1 week",
  "startUrls": [
    "https://docs.example.com/"
  ],
  "actions": [
    {
      "indexName": "docs",
      "pathsToMatch": ["https://docs.example.com/**"],
      "recordExtractor": "/* ... */"
    }
  ]
}

URL Patterns

Control which pages are crawled:
{
  "startUrls": [
    "https://docs.example.com/v4/"
  ],
  "pathsToMatch": [
    "https://docs.example.com/v4/**"
  ]
}
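The ** wildcard is commonly interpreted as "any sequence of characters, including slashes". A rough sketch of that matching logic (assumed semantics; the crawler's matcher may differ in edge cases, so treat this as an approximation):

```javascript
// Convert a pattern with '**' wildcards into an anchored regex:
// literal segments are escaped, each '**' becomes '.*'.
function matchesPattern(pattern, url) {
  const regex = new RegExp(
    "^" + pattern.split("**").map((part) =>
      part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    ).join(".*") + "$"
  );
  return regex.test(url);
}

const pattern = "https://docs.example.com/v4/**";
```

Under this reading, the pattern above matches any page under /v4/ but nothing under /v3/.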

Authentication

Crawl password-protected documentation:
{
  "auth": {
    "username": "docs",
    "password": "${env:DOCS_PASSWORD}"
  }
}
Authentication is available on paid Algolia plans, not the free DocSearch program.

Monitoring Crawls

Access the Algolia Crawler dashboard to:
  • Trigger manual crawls
  • View crawl logs and errors
  • Test URL extraction
  • Monitor index size
  • Preview search results

URL Tester

Test your record extractor on specific URLs:
  1. Navigate to the URL Tester in the crawler dashboard
  2. Enter a documentation URL
  3. View extracted records
  4. Refine your selectors

Search Preview

Test search queries against your index:
  1. Open Search Preview in the dashboard
  2. Enter search terms
  3. View ranked results
  4. Adjust relevance settings

Common Issues

Pages Not Being Indexed

Check that:
  • URLs match pathsToMatch patterns
  • Pages aren’t blocked by robots.txt
  • Authentication is configured correctly
  • Pages return 200 status codes
Content Missing from Records

Verify that:
  • CSS selectors match your page structure
  • Content isn’t inside removed elements
  • Records aren’t exceeding size limits
  • The crawler can access dynamic content
Duplicate Search Results

Solutions:
  • Configure canonical URLs
  • Add exclusion patterns for duplicates
  • Use distinct parameter in search
  • Check for trailing slash variations
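If duplicates persist after fixing URLs, Algolia's distinct feature can collapse near-identical records at query time. A sketch following the frontend example above (note: distinct also relies on an attributeForDistinct, typically url, being configured on the index; verify your index settings before relying on this):

```javascript
<DocSearch
  appId="YOUR_APP_ID"
  apiKey="YOUR_SEARCH_API_KEY"
  indexName="YOUR_INDEX_NAME"
  searchParameters={{
    distinct: true,
  }}
/>
```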
Too Many Records

Reduce records by:
  • Setting aggregateContent: true
  • Removing verbose content selectors
  • Excluding code blocks or examples
  • Splitting large pages into smaller sections

Best Practices

Use Semantic HTML

Structure content with proper heading hierarchy (h1 → h2 → h3) for better indexing.

Keep Selectors Simple

Use stable CSS classes rather than complex selectors that may break.

Test Regularly

Use the URL tester to verify extraction after content changes.

Monitor Index Size

Track record count and optimize extraction to stay within limits.

Next Steps

Record Extractor Reference

Complete reference for record extractor configuration
