Search types
The site supports two search modes:

General search
Returns docs pages matching the query, sorted by popularity. Served from the /api/search/v1 endpoint. Example: query clone returns URLs to docs pages about cloning repositories.

AI search autocomplete
Returns human-readable full-sentence questions that best match the query. Based on previous searches and popular pages. Served from the /api/search/ai-search-autocomplete/v1 endpoint. Example: query How do I clone returns How do I clone a repository?
- VERSION: a numbered GHES version (e.g. 3.12), ghec, or dotcom
- LANGUAGE: one of es, ja, pt, zh, ru, fr, ko, de
- QUERY: any alphanumeric string
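As a rough sketch, a client might assemble a request URL from the values listed above along these lines. The query parameter names (version, language, query) are assumptions inferred from the placeholders, and buildSearchUrl and isValidVersion are hypothetical helpers, not part of the docs codebase.

```typescript
// Hypothetical helpers: build a search URL from the values listed above.
// The parameter names (version, language, query) are assumptions.
const LANGUAGES = ["es", "ja", "pt", "zh", "ru", "fr", "ko", "de"] as const;
type Language = (typeof LANGUAGES)[number];

type Endpoint = "/api/search/v1" | "/api/search/ai-search-autocomplete/v1";

function isValidVersion(version: string): boolean {
  // Either a numbered GHES version such as "3.12", or "ghec" / "dotcom".
  return /^\d+\.\d+$/.test(version) || version === "ghec" || version === "dotcom";
}

function buildSearchUrl(
  endpoint: Endpoint,
  version: string,
  language: Language,
  query: string,
): string {
  if (!isValidVersion(version)) {
    throw new Error(`Unsupported version: ${version}`);
  }
  // URLSearchParams handles escaping of the query string.
  const params = new URLSearchParams({ version, language, query });
  return `${endpoint}?${params.toString()}`;
}

console.log(buildSearchUrl("/api/search/v1", "dotcom", "ja", "clone"));
// /api/search/v1?version=dotcom&language=ja&query=clone
```

Validating the version before building the URL keeps malformed requests from reaching the server; whether the real endpoints reject such values is not stated here.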
Architecture
Elasticsearch stores pre-built indexes that the server queries at runtime. Indexes are populated through a two-step pipeline:
- Scrape: fetch each page’s content via the Article API and write structured JSON records to disk
- Index: upload those JSON records into Elasticsearch

The scrape step calls the Article API (/api/article?pathname=<path>) on a locally running server for each indexable page. Each record includes title, intro, breadcrumbs, headings, content (plain text, not HTML), and a unique objectID (the page permalink).
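The record described above, and the effect of keying it on objectID, can be sketched as follows. The field names come from the text; the field types, the example permalink, and the upsertRecords helper are assumptions for illustration, and a Map stands in for the real index.

```typescript
// Sketch of the record shape. Field names come from the text; the types
// are assumptions.
interface SearchRecord {
  objectID: string; // the page permalink, used as the unique key
  title: string;
  intro: string;
  breadcrumbs: string;
  headings: string;
  content: string; // plain text, not HTML
}

// Stand-in for an index: a Map keyed by objectID, so writing a record with
// an existing objectID replaces the old entry instead of adding a second one.
function upsertRecords(
  store: Map<string, SearchRecord>,
  records: SearchRecord[],
): Map<string, SearchRecord> {
  for (const record of records) {
    store.set(record.objectID, record);
  }
  return store;
}

const store = new Map<string, SearchRecord>();
const page: SearchRecord = {
  objectID: "/en/example-page", // hypothetical permalink
  title: "Example page",
  intro: "A short introduction.",
  breadcrumbs: "Example / Page",
  headings: "First heading",
  content: "Body text of the page.",
};

upsertRecords(store, [page]);
// A second run with the same objectID overwrites the first record.
upsertRecords(store, [{ ...page, title: "Example page (updated)" }]);
console.log(store.size); // 1
```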
The objectID is set explicitly to the page permalink. This guarantees that subsequent indexing runs overwrite existing records rather than creating duplicates.

Environment configuration
| Variable | Description |
|---|---|
| ELASTICSEARCH_URL | URL of the Elasticsearch cluster. Required for search tests and manual indexing. Example: http://localhost:9200/ |
For local development, set this variable in a .env file.
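A script that depends on this variable could check it up front, roughly as below. Only the variable name and the example value come from this document; getElasticsearchUrl is a hypothetical helper.

```typescript
// Hypothetical helper: resolves the Elasticsearch URL from an environment
// object. Only the variable name and example value come from the text.
function getElasticsearchUrl(env: Record<string, string | undefined>): string {
  const raw = env.ELASTICSEARCH_URL;
  if (!raw) {
    throw new Error("ELASTICSEARCH_URL is not set");
  }
  // new URL() throws on malformed input, so bad values fail early.
  return new URL(raw).toString();
}

const elasticsearchUrl = getElasticsearchUrl({
  ELASTICSEARCH_URL: "http://localhost:9200/",
});
console.log(elasticsearchUrl); // http://localhost:9200/
```

In a Node script the argument would typically be process.env. Taking the environment as a parameter keeps the helper easy to test.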
Running the pipeline manually
General search
Run the scrape and index steps separately, or together using the combined command.

Start the scrape server
The scrape server is a production-mode instance of the docs app running on port 4002 with minimal rendering enabled. The start command sets MINIMAL_RENDER=true and CHANGELOG_DISABLED=true to reduce memory usage during scraping.

Scrape page content
In a separate terminal, run the scrape script against the running server. The script can be limited to a specific language and version. It writes one JSON file per page into the target directory.
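The request the scrape step makes for a single page might look roughly like this. The port and the Article API path come from this document; the response body shape is not specified here, so the sketch returns it untyped. articleUrl, fetchArticle, and FetchLike are hypothetical names, not the project's own code.

```typescript
// Sketch of the scrape request for a single page. Host and port match the
// scrape server described above; the response shape is left as unknown
// because this document does not specify it.
const SCRAPE_SERVER = "http://localhost:4002";

function articleUrl(pathname: string): string {
  const url = new URL("/api/article", SCRAPE_SERVER);
  // searchParams.set() percent-encodes the pathname value.
  url.searchParams.set("pathname", pathname);
  return url.toString();
}

// Minimal subset of the fetch API that this sketch relies on.
type FetchLike = (
  url: string,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function fetchArticle(
  pathname: string,
  fetchImpl: FetchLike,
): Promise<unknown> {
  const response = await fetchImpl(articleUrl(pathname));
  if (!response.ok) {
    throw new Error(`Article API returned ${response.status} for ${pathname}`);
  }
  return response.json();
}

console.log(articleUrl("/en/get-started"));
// http://localhost:4002/api/article?pathname=%2Fen%2Fget-started
```

Passing the fetch implementation as an argument lets the function run against the global fetch in recent Node versions or against a stub in tests.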
AI search autocomplete
AI search autocomplete data comes from an internal data repository, not from scraping. Clone github/docs-internal-data to the root of the docs directory, then run the autocomplete indexing script from src/search/scripts/index/.
Text analysis
To analyze how Elasticsearch processes text (useful for debugging relevance issues), use the text analysis utility, src/search/scripts/analyze-text.ts.

Running search tests
Search tests require a running Elasticsearch instance. The test script sets ELASTICSEARCH_URL=http://localhost:9200/ automatically.
Language tests that involve search also need the variable set.
Production workflow
In production, search indexes are rebuilt automatically by GitHub Actions:
| Workflow | Schedule | Scope |
|---|---|---|
| index-general-search.yml | Every 4 hours | All versions and languages |
| index-autocomplete-search.yml | Daily | AI autocomplete data |
To reindex sooner after a change merges to main, trigger index-general-search.yml with a specific version and language to reduce run time (a single version/language takes 5–10 minutes versus ~40 minutes for all versions and languages).
Key files
| Path | Description |
|---|---|
| src/search/components/Search.tsx | Browser-side search input component |
| src/search/components/SearchResults.tsx | Browser-side search results rendering |
| src/search/middleware/general-search-middleware.ts | Server-side entrypoint for /search page |
| src/search/middleware/search-routes/ | API route handlers for search endpoints |
| src/search/scripts/scrape/ | Scrape scripts and lib/build-records-from-api.ts |
| src/search/scripts/index/ | Indexing scripts for general search and autocomplete |
| src/search/scripts/analyze-text.ts | Text analysis utility |
| src/search/tests/ | Search tests (require ELASTICSEARCH_URL) |
Search features
- Typo tolerance: Elasticsearch returns results even for misspelled queries.
- Advanced query syntax: Supports exact matching with quotes ("exact phrase") and term exclusion with a minus sign (-excluded). Enabled in the browser client.
- Multilingual: Indexes exist for each supported language. Search respects the language of the current docs URL.
- Weighted attributes: Title is ranked higher than body content.
- Version-scoped: Each query targets the index for the requested GitHub product version.
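Two of the features above, typo tolerance and title weighting, correspond to standard Elasticsearch query options. The sketch below shows one common way to express them with a multi_match query. The boost value, the field list, and the buildQueryBody helper are assumptions for illustration, not the site's actual query.

```typescript
// Illustrative Elasticsearch request body: fuzzy matching for typos plus a
// higher weight on the title field. Boost value and fields are assumptions.
interface MultiMatchBody {
  query: {
    multi_match: {
      query: string;
      fields: string[];
      fuzziness: "AUTO";
    };
  };
}

function buildQueryBody(userQuery: string, titleBoost = 2): MultiMatchBody {
  return {
    query: {
      multi_match: {
        query: userQuery,
        // "title^2" is Elasticsearch syntax for boosting a field's score.
        fields: [`title^${titleBoost}`, "content"],
        // "AUTO" lets Elasticsearch pick an edit distance based on term length.
        fuzziness: "AUTO",
      },
    },
  };
}

console.log(JSON.stringify(buildQueryBody("clne repository"), null, 2));
```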
There is a lag of up to 4 hours between content changes merging to main and those changes appearing in search results, due to the indexing schedule.