Overview
The DocSearch crawler is the backbone of the DocSearch service. It automatically extracts content from your documentation website, transforms it into searchable records, and pushes them to an Algolia index. This enables fast, relevant search results for your users.

DocSearch now leverages the powerful Algolia Crawler, which offers a web interface to create, monitor, edit, and start your crawlers.
How It Works
The DocSearch crawler follows a systematic process to index your documentation:

Discover Pages
The crawler starts from your configured start_urls and recursively follows internal links to discover all documentation pages on your site.

Extract Content
Using CSS selectors (or XPath), the crawler extracts structured content from each page’s HTML markup. It identifies headings (h1, h2, h3, etc.) to build a hierarchy and extracts text content from paragraphs and lists.
Build Records
The extracted content is transformed into JSON records with a hierarchical structure:
- lvl0 - Typically the page title or h1
- lvl1 - Usually h2 headings
- lvl2 - Usually h3 headings
- lvl3, lvl4, lvl5 - Deeper heading levels
- text - Paragraph and list content
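To make the hierarchy concrete, here is a sketch of what a single record might look like. The level fields mirror the list above; the page title, headings, and URL are purely illustrative:

```json
{
  "hierarchy": {
    "lvl0": "Getting Started",
    "lvl1": "Installation",
    "lvl2": "Install with npm",
    "lvl3": null,
    "lvl4": null,
    "lvl5": null
  },
  "content": "Run the install command from your project root.",
  "url": "https://example.com/docs/getting-started/#install-with-npm"
}
```

Each heading on a page typically yields one or more such records, so search results can point users to the exact section, not just the page.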
Crawler Architecture
Legacy Python Crawler
The original DocSearch scraper is a Python-based tool inspired by the Scrapy framework. It’s open source and available in the algolia/docsearch-scraper repository.

Docker Image
The crawler is packaged as a Docker image for easy deployment.
Modern Algolia Crawler
For the free DocSearch program, Algolia now uses its modern Crawler infrastructure, which provides:
- Web Interface: Manage crawlers from the Crawler Dashboard
- Live Editor: Edit and test your configuration in real-time
- Monitoring: Track crawl statistics, errors, and performance
- Scheduling: Automatic weekly crawls by default
- Manual Triggers: Start crawls on-demand when you update your docs
Content Extraction
HTML Structure Requirements
The crawler works best with well-structured HTML markup: semantic headings (h1, h2, h3, and so on) in a consistent hierarchy, with text content in paragraphs and lists.

Selector-Based Extraction
The crawler uses CSS selectors to target specific elements. The text selector is mandatory, and we highly recommend setting at least lvl0, lvl1, and lvl2 for optimal search relevance.

Crawl Frequency
For sites enrolled in the free DocSearch program:
- Default Schedule: Crawls run once per week automatically
- Manual Crawls: Trigger on-demand from the Crawler Dashboard
- Updates: Changes to your documentation are reflected after the next crawl
JavaScript Rendering
By default, the crawler expects server-side rendered content. If your site uses client-side rendering, enable JavaScript rendering in your crawler configuration so the crawler sees the fully rendered pages.

Crawl Scope
Following Links
The crawler automatically:
- ✅ Follows all internal links within your domain
- ✅ Respects start_urls as entry points
- ❌ Does not follow external links to other domains
- ❌ Stops at URLs matching stop_urls patterns
Example Configuration
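A minimal configuration, sketched here in the legacy scraper's JSON format — the index name, URLs, and selectors are placeholders you would adapt to your own site, and js_render is only needed for client-side rendered sites:

```json
{
  "index_name": "example_docs",
  "start_urls": ["https://example.com/docs/"],
  "stop_urls": ["https://example.com/docs/changelog/"],
  "js_render": false,
  "selectors": {
    "lvl0": "article h1",
    "lvl1": "article h2",
    "lvl2": "article h3",
    "text": "article p, article li"
  }
}
```

With the modern Algolia Crawler, equivalent settings are edited in the dashboard's live editor, where you can test extraction against a live page before saving.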
Data Privacy
What Gets Indexed
The crawler extracts:
- Text content from your documentation pages
- Heading structure and hierarchy
- URL paths and anchor links
- Custom metadata (if configured)

The crawler does not index:
- Images (only their alt text, if specified)
- CSS or JavaScript code
- Forms or interactive elements
- Content behind authentication (unless credentials are provided)
Data Storage
All indexed data is:
- Stored on Algolia’s global infrastructure
- Replicated across regions for performance
- Subject to Algolia’s privacy policy
Next Steps
- Configuration: Learn how to configure the crawler for your documentation
- Apply to DocSearch: Get free crawler hosting for your open source project
Troubleshooting
No Results After Crawl
- Check that your selectors match your HTML structure
- Verify that pages are accessible (not behind authentication)
- Ensure server-side rendering is enabled
- Review crawl logs in the Crawler Dashboard
Incomplete Indexing
- Check your stop_urls configuration
- Verify all pages are linked from start_urls
- Look for broken internal links
- Confirm sitemap URLs (if using sitemap-based crawling)
Poor Search Results
- Review selector hierarchy (lvl0 through lvl5)
- Exclude irrelevant content using selectors_exclude
- Configure synonyms for common terms
- Adjust page_rank for important pages
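For instance, in the legacy JSON configuration format, excluding content and boosting pages might look like this — the URLs and selectors are illustrative. A start_urls entry can be an object carrying a page_rank boost:

```json
{
  "start_urls": [
    { "url": "https://example.com/docs/getting-started/", "page_rank": 5 },
    "https://example.com/docs/"
  ],
  "selectors_exclude": [".table-of-contents", ".deprecated-banner"]
}
```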
For technical support with the Algolia Crawler, reach out via the Algolia support page.
