Crawler Overview
The Algolia Crawler is a web scraper that:
- Visits your documentation pages
- Extracts content using CSS selectors
- Creates structured records for Algolia
- Runs on a schedule (default: weekly)
- Respects robots.txt and crawl limits
If you’re using the DocSearch program, Algolia manages the crawler for you. For custom implementations, you configure it yourself.
How Content is Indexed
Record Structure
The crawler creates hierarchical records for each page section.
Hierarchy Levels
DocSearch organizes content into levels:
- lvl0: Top-level category (e.g., product name, section)
- lvl1: Page title or main heading (h1)
- lvl2-6: Subheadings (h2-h6)
- content: Paragraph text, list items, code snippets
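Putting the levels together, a single extracted record might look like the following sketch. The field values are illustrative, and the exact record shape can vary by crawler version:

```javascript
// Illustrative DocSearch-style record for one page section.
// Values here are made up; the real crawler fills them in per page.
const record = {
  objectID: "docs-crawler-overview-0", // unique ID per record
  url: "https://example.com/docs/crawler#record-structure",
  type: "content", // which level this record represents
  hierarchy: {
    lvl0: "Documentation",     // top-level category
    lvl1: "Crawler Overview",  // page h1
    lvl2: "Record Structure",  // nearest h2
    lvl3: null,
    lvl4: null,
    lvl5: null,
    lvl6: null,
  },
  content: "The crawler creates hierarchical records for each page section.",
};
```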
Record Extractor Configuration
The recordExtractor function defines how content is extracted from your pages.
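A minimal sketch of an action with a recordExtractor, built on the helpers.docsearch helper; the index name, URL pattern, and CSS selectors are placeholders to adapt to your site:

```javascript
new Crawler({
  // ...app credentials and other settings omitted...
  actions: [
    {
      indexName: "my_docs", // placeholder index name
      pathsToMatch: ["https://example.com/docs/**"],
      recordExtractor: ({ $, helpers }) => {
        // helpers.docsearch builds hierarchical records from these selectors
        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: ".sidebar .active-category",
              defaultValue: "Documentation",
            },
            lvl1: "article h1",
            lvl2: "article h2",
            lvl3: "article h3",
            content: "article p, article li",
          },
        });
      },
    },
  ],
});
```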
Advanced Extraction Patterns
Fallback Selectors
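A sketch of the pattern, assuming comma-separated CSS selectors so pages with different markup still produce a match; the class names are placeholders:

```javascript
// Excerpt of recordProps: list several selectors per level so that
// pages with different markup still match (placeholders below).
recordProps: {
  lvl1: "article h1, .post-title, header h1",
  lvl2: "article h2, .post-section-title",
  content: "article p, article li, .post-content p",
}
```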
Use multiple selectors as fallbacks.
Default Values
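A sketch using defaultValue; the selector and text are placeholders:

```javascript
// Excerpt of recordProps: if no element matches the selector,
// the record falls back to defaultValue.
recordProps: {
  lvl0: {
    selectors: ".sidebar .active-category",
    defaultValue: "Documentation", // used when nothing matches
  },
}
```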
Provide fallback text when selectors don’t match.
DOM Manipulation with Cheerio
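Inside recordExtractor, the Cheerio instance ($) can prune elements before extraction; the selectors below are placeholders:

```javascript
// Excerpt of an action: prune elements with Cheerio before extracting.
recordExtractor: ({ $, helpers }) => {
  // Remove navigation, footers, and other page chrome so their
  // text never reaches the index.
  $("nav, footer, .sidebar-ads, .edit-this-page").remove();

  return helpers.docsearch({
    recordProps: {
      lvl1: "article h1",
      content: "article p, article li",
    },
  });
}
```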
Remove unwanted elements before extraction.
Faceting and Filtering
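One way to add custom attributes is to read them from page metadata inside recordProps; the meta tag and attribute names here are assumptions to adapt to your markup:

```javascript
// Excerpt: custom attributes alongside the standard hierarchy levels.
recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl1: "article h1",
      content: "article p, article li",
      // Assumed markup: values pulled from meta tags on the page.
      version: $('meta[name="docsearch:version"]').attr("content"),
      language: $("html").attr("lang"),
    },
  });
}
```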
Index custom attributes to make them available as facets in your Algolia index for filtering search results.
Boosting Records with pageRank
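A sketch of boosting via pageRank in recordProps; the value shown is arbitrary:

```javascript
// Excerpt: boost every record produced by this action.
recordProps: {
  lvl1: "article h1",
  content: "article p",
  pageRank: "30", // string value; a negative string de-boosts
}
```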
Increase the ranking of important pages.
Understanding pageRank
The pageRank value is added to Algolia’s computed weight for each result. Higher values appear first. Use string values, including negative numbers, to de-boost less important content.
Reducing Record Size
Aggregate Content
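A sketch with the aggregateContent option enabled:

```javascript
// Excerpt: merge consecutive content under the same heading into
// fewer, larger records instead of one record per paragraph.
return helpers.docsearch({
  recordProps: { /* ...selectors... */ },
  aggregateContent: true,
});
```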
Combine content records to reduce total record count.
Record Version
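A sketch opting into the v3 record format:

```javascript
// Excerpt: v3 records are more compact than the legacy format.
return helpers.docsearch({
  recordProps: { /* ...selectors... */ },
  recordVersion: "v3",
});
```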
Use v3 format to reduce record size.
Crawler Configuration
Scheduling Crawls
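A sketch of a top-level schedule setting; the exact expression syntax is worth verifying against your crawler's documentation:

```javascript
new Crawler({
  // ...other settings...
  schedule: "every 1 week", // assumed schedule expression
});
```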
Configure when the crawler runs.
URL Patterns
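A sketch of URL pattern controls; the URLs and glob patterns are placeholders, and exclusionPatterns is an assumed setting name to confirm against your crawler's reference:

```javascript
new Crawler({
  startUrls: ["https://example.com/docs/"],
  actions: [
    {
      indexName: "my_docs",
      pathsToMatch: ["https://example.com/docs/**"], // crawl all docs pages
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({ recordProps: { lvl1: "h1", content: "p" } }),
    },
  ],
  // Assumed: skip generated or non-HTML content.
  exclusionPatterns: ["**/changelog/**", "**/*.pdf"],
});
```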
Control which pages are crawled.
Authentication
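One shape this can take is a login request the crawler performs before crawling, reusing the resulting session cookies; treat the parameter names below as assumptions to verify against your crawler's documentation:

```javascript
new Crawler({
  // ...other settings...
  // Assumed shape: the crawler submits this request first, then
  // crawls with the returned session cookies.
  login: {
    fetchRequest: {
      url: "https://example.com/login",
      options: {
        method: "POST",
        body: "username=crawler&password=REDACTED", // placeholders
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
      },
    },
  },
});
```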
Crawl password-protected documentation. Authentication is available on paid Algolia plans, not the free DocSearch program.
Monitoring Crawls
Access the Algolia Crawler dashboard to:
- Trigger manual crawls
- View crawl logs and errors
- Test URL extraction
- Monitor index size
- Preview search results
URL Tester
Test your record extractor on specific URLs:
- Navigate to the URL Tester in the crawler dashboard
- Enter a documentation URL
- View extracted records
- Refine your selectors
Search Preview
Test search queries against your index:
- Open Search Preview in the dashboard
- Enter search terms
- View ranked results
- Adjust relevance settings
Common Issues
Pages aren't being crawled
Check that:
- URLs match pathsToMatch patterns
- Pages aren’t blocked by robots.txt
- Authentication is configured correctly
- Pages return 200 status codes
Content is missing from results
Verify that:
- CSS selectors match your page structure
- Content isn’t inside removed elements
- Records aren’t exceeding size limits
- The crawler can access dynamic content
Duplicate results appearing
Solutions:
- Configure canonical URLs
- Add exclusion patterns for duplicates
- Use the distinct parameter in search
- Check for trailing slash variations
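For the distinct suggestion above, a sketch of a search call; the client setup and index name are placeholders, and distinct also requires attributeForDistinct to be configured on the index:

```javascript
// Assumed setup with the algoliasearch client (v4-style API).
const algoliasearch = require("algoliasearch");
const client = algoliasearch("APP_ID", "SEARCH_API_KEY"); // placeholders
const index = client.initIndex("my_docs");

// Deduplicate hits that share the same attributeForDistinct (e.g. url).
index.search("crawler", { distinct: true }).then(({ hits }) => {
  console.log(hits.length);
});
```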
Too many records error
Reduce records by:
- Setting aggregateContent: true
- Removing verbose content selectors
- Excluding code blocks or examples
- Splitting large pages into smaller sections
Best Practices
Use Semantic HTML
Structure content with proper heading hierarchy (h1 → h2 → h3) for better indexing.
Keep Selectors Simple
Use stable CSS classes rather than complex selectors that may break.
Test Regularly
Use the URL tester to verify extraction after content changes.
Monitor Index Size
Track record count and optimize extraction to stay within limits.
Next Steps
Record Extractor Reference
Complete reference for record extractor configuration
