Kosh’s search engine runs entirely in the browser using WebAssembly, providing instant full-text search without server infrastructure. Built with Go and compiled to WASM, it delivers industry-standard relevance ranking with advanced features like typo tolerance and phrase matching.
Architecture
The search engine consists of three core components:
```
builder/search/
├── engine.go   # BM25 scoring and search orchestration
├── analyzer.go # Text tokenization and normalization
├── stemmer.go  # Porter stemmer for English
└── fuzzy.go    # Levenshtein distance and phrase parsing
```
At build time, Kosh indexes all content and generates a compact binary index (search.bin). The WASM module (search.wasm) loads this index in the browser and performs all searches client-side.
BM25 Relevance Scoring
Kosh uses BM25 (Best Matching 25), the industry-standard probabilistic ranking function used by Elasticsearch and Lucene. Unlike simple keyword matching, BM25 considers:
- Term Frequency (TF): How often a word appears in a document
- Inverse Document Frequency (IDF): How rare a word is across all documents
- Document Length Normalization: Prevents longer documents from dominating results
Implementation
From builder/search/engine.go:93-115:
```go
k1 := 1.2  // Term frequency saturation parameter
b := 0.75  // Document length normalization

for _, term := range queryTerms {
	if posts, ok := index.Inverted[term]; ok {
		df := len(posts)
		idf := math.Log(1 + (float64(index.TotalDocs)-float64(df)+0.5)/(float64(df)+0.5))
		for postID, freq := range posts {
			docLen := float64(index.DocLens[postID])
			score := idf * (float64(freq) * (k1 + 1)) /
				(float64(freq) + k1*(1-b+b*(docLen/index.AvgDocLen)))
			scores[postID] += score
		}
	}
}
```
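To see how the two parameters interact, the formula above can be evaluated standalone. The sketch below reuses the same math with made-up index statistics (document counts and lengths here are illustrative, not real Kosh data):

```go
package main

import (
	"fmt"
	"math"
)

// bm25Score computes the BM25 contribution of a single term for one document.
// Parameter names mirror the engine snippet; the inputs are hypothetical.
func bm25Score(freq, df, totalDocs int, docLen, avgDocLen float64) float64 {
	k1 := 1.2 // term frequency saturation
	b := 0.75 // document length normalization
	idf := math.Log(1 + (float64(totalDocs)-float64(df)+0.5)/(float64(df)+0.5))
	return idf * (float64(freq) * (k1 + 1)) /
		(float64(freq) + k1*(1-b+b*(docLen/avgDocLen)))
}

func main() {
	// A rare term (in 2 of 100 docs) scores high thanks to IDF...
	fmt.Printf("rare term:   %.3f\n", bm25Score(3, 2, 100, 200, 200))
	// ...while a common term (in 90 of 100 docs) scores far lower.
	fmt.Printf("common term: %.3f\n", bm25Score(3, 90, 100, 200, 200))
}
```

The rare term dominates even at identical term frequency, which is exactly the behavior IDF is there to produce.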
Scoring Boosts
| Match Type | Score Bonus | Use Case |
|---|---|---|
| Phrase match in title | +30.0 | "machine learning" in title |
| Phrase match in content | +15.0 | Exact phrase anywhere |
| Title match | +10.0 | Query word in title |
| Tag match | +5.0 | Query matches tag |
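A simplified sketch of how such boosts stack on top of a base BM25 score (this covers only the phrase and tag bonuses; the function name and matching logic are illustrative assumptions, not Kosh's exact code):

```go
package main

import (
	"fmt"
	"strings"
)

// applyBoosts adds the bonuses from the table above to a base BM25 score.
// Substring matching stands in for real phrase/position matching.
func applyBoosts(base float64, title, content, query string, tags []string) float64 {
	q := strings.ToLower(query)
	score := base
	if strings.Contains(strings.ToLower(title), q) {
		score += 30.0 // phrase match in title
	} else if strings.Contains(strings.ToLower(content), q) {
		score += 15.0 // phrase match in content
	}
	for _, tag := range tags {
		if strings.EqualFold(tag, q) {
			score += 5.0 // tag match
		}
	}
	return score
}

func main() {
	score := applyBoosts(2.0, "Machine Learning Basics", "", "machine learning", nil)
	fmt.Printf("%.1f\n", score) // → 32.0
}
```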
Text Analysis Pipeline
Before indexing or searching, all text passes through a standardized analysis pipeline:
1. Tokenization
From builder/search/analyzer.go:105-133:
```go
func TokenizeWithUnicode(text string) []string {
	tokens := make([]string, 0, len(text)/5)
	var buf strings.Builder
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			buf.WriteRune(r)
		} else if buf.Len() > 0 {
			tokens = append(tokens, buf.String())
			buf.Reset()
		}
	}
	if buf.Len() > 0 {
		// Flush the final token when the text ends mid-word
		tokens = append(tokens, buf.String())
	}
	return tokens
}
```
This tokenizer:
- Handles full Unicode (supports non-Latin scripts)
- Preserves numbers (“HTTP2”, “Base64”)
- Splits on punctuation and whitespace
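A quick runnable demo of this behavior on mixed input (the tokenizer body is adapted from the snippet above):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenizeWithUnicode splits text into tokens of letters and digits,
// treating everything else as a separator.
func TokenizeWithUnicode(text string) []string {
	tokens := make([]string, 0, len(text)/5)
	var buf strings.Builder
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			buf.WriteRune(r)
		} else if buf.Len() > 0 {
			tokens = append(tokens, buf.String())
			buf.Reset()
		}
	}
	if buf.Len() > 0 {
		tokens = append(tokens, buf.String()) // flush the final token
	}
	return tokens
}

func main() {
	fmt.Println(TokenizeWithUnicode("HTTP2, Base64 -- and Unicode: 日本語!"))
	// → [HTTP2 Base64 and Unicode 日本語]
}
```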
2. Stop Word Filtering
Common words like “the”, “and”, “is” are removed to improve relevance. The filter includes 115+ English stop words from builder/search/analyzer.go:9-37.
```go
var stopWords = map[string]bool{
	"a": true, "an": true, "and": true, "are": true,
	"the": true, "their": true, "this": true,
	// ... 115+ total words
}
```
Stop words are filtered during indexing and search. This means searching for “the quick brown fox” only matches “quick brown fox”.
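A minimal sketch of the filtering step (the stop-word set here is truncated to a few entries, and the helper name is an assumption):

```go
package main

import "fmt"

// A tiny excerpt of the stop-word set for demonstration.
var stopWords = map[string]bool{
	"a": true, "an": true, "and": true, "are": true,
	"the": true, "their": true, "this": true, "is": true,
}

// removeStopWords drops tokens found in the stop-word set,
// preserving the order of the remaining terms.
func removeStopWords(tokens []string) []string {
	var kept []string
	for _, t := range tokens {
		if !stopWords[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	fmt.Println(removeStopWords([]string{"the", "quick", "brown", "fox"}))
	// → [quick brown fox]
}
```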
3. Porter Stemming
The Porter Stemming Algorithm reduces words to their root form, so “running”, “runs”, and “runner” all match “run”.
From builder/search/stemmer.go:12-27:
```go
var stemCache sync.Map // Cache for performance

func StemCached(word string) string {
	if cached, ok := stemCache.Load(word); ok {
		return cached.(string)
	}
	result := stem(word) // Apply Porter algorithm
	stemCache.Store(word, result)
	return result
}
```
Stemming examples:
```
"running"   → "run"
"argued"    → "argu"
"effective" → "effect"
```
Stemming is cached using sync.Map for ~76x speedup on repeated words. Most documents reuse common words extensively.
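The memoization pattern itself is easy to reproduce. In the runnable sketch below, naiveStem is a stand-in that only strips a few suffixes; it is not the real Porter algorithm, which is considerably more involved:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

var stemCache sync.Map // word → stem; safe for concurrent use

// naiveStem is a deliberately simplistic stand-in for the Porter
// algorithm, here only to demonstrate the cache wrapped around it.
func naiveStem(word string) string {
	for _, suffix := range []string{"ning", "ing", "ed", "s"} {
		if strings.HasSuffix(word, suffix) && len(word) > len(suffix)+2 {
			return strings.TrimSuffix(word, suffix)
		}
	}
	return word
}

// StemCached memoizes stems so repeated words skip the stemmer entirely.
func StemCached(word string) string {
	if cached, ok := stemCache.Load(word); ok {
		return cached.(string)
	}
	result := naiveStem(word)
	stemCache.Store(word, result)
	return result
}

func main() {
	fmt.Println(StemCached("running"), StemCached("running")) // → run run
}
```

sync.Map suits this workload well: once the common vocabulary is cached, lookups are lock-free reads.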
Fuzzy Matching
Fuzzy search corrects typos using Levenshtein edit distance (the minimum number of single-character edits to transform one word into another).
Configuration
From builder/search/fuzzy.go:8:
```go
const MaxEditDistance = 2 // Allow up to 2 character differences
```
Example matches:
```
"transformr" → "transformer" (1 insertion)
"machien"    → "machine"     (1 transposition, counted as 2 edits)
"learninng"  → "learning"    (1 deletion)
```

Note that a swapped pair of characters costs 2 under plain Levenshtein distance (one deletion plus one insertion), which still fits within the limit of 2.
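Kosh's exact FuzzyMatch implementation isn't shown here, but the standard two-row dynamic-programming Levenshtein computation it relies on looks like this:

```go
package main

import "fmt"

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // cost of building b[:j] from an empty string
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // cost of deleting a[:i]
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("transformr", "transformer")) // → 1
	fmt.Println(levenshtein("machien", "machine"))        // → 2
}
```

Keeping only two rows instead of the full matrix reduces memory from O(m·n) to O(n), which matters when checking many candidate terms.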
Trigram Optimization
For large indexes, scanning every term is slow. Kosh pre-builds a trigram (3-character n-gram) index to quickly find fuzzy candidates.
From builder/search/fuzzy.go:89-116:
```go
func FuzzyExpandWithNgrams(term string, ngramIndex map[string][]string, maxDist int) []string {
	trigrams := generateTrigrams(term) // "cat" → ["cat"]
	// "search" → ["sea", "ear", "arc", "rch"]
	candidateScores := make(map[string]int)
	for _, tg := range trigrams {
		if candidates, ok := ngramIndex[tg]; ok {
			for _, cand := range candidates {
				candidateScores[cand]++ // Count shared trigrams
			}
		}
	}
	// Only check edit distance for candidates with sufficient overlap
	var results []string
	for cand, score := range candidateScores {
		if score >= len(trigrams)/2 {
			if FuzzyMatch(term, cand, maxDist) {
				results = append(results, cand)
			}
		}
	}
	return results
}
```
Performance: ~20% faster than brute-force fuzzy matching.
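generateTrigrams itself isn't shown above; a sliding-window sketch consistent with the comments ("search" → ["sea", "ear", "arc", "rch"], "cat" → ["cat"]) would be:

```go
package main

import "fmt"

// generateTrigrams slides a 3-rune window over the term.
// Terms of 3 runes or fewer yield the term itself as a single n-gram.
func generateTrigrams(term string) []string {
	runes := []rune(term)
	if len(runes) <= 3 {
		return []string{term}
	}
	grams := make([]string, 0, len(runes)-2)
	for i := 0; i+3 <= len(runes); i++ {
		grams = append(grams, string(runes[i:i+3]))
	}
	return grams
}

func main() {
	fmt.Println(generateTrigrams("search")) // → [sea ear arc rch]
	fmt.Println(generateTrigrams("cat"))    // → [cat]
}
```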
Fuzzy Score Penalty
Fuzzy matches receive a 0.7x score multiplier to rank exact matches higher (from engine.go:42).
Phrase Search
Enclose queries in quotes for exact phrase matching:
```
"machine learning"   # Must appear exactly as written
```
From builder/search/fuzzy.go:173-208:
```go
func ParseQuery(query string) ParsedQuery {
	var result ParsedQuery
	var phraseBuf strings.Builder
	inPhrase := false
	for _, r := range query {
		if r == '"' {
			if inPhrase {
				phrase := strings.TrimSpace(phraseBuf.String())
				result.Phrases = append(result.Phrases, strings.ToLower(phrase))
				phraseBuf.Reset()
			}
			inPhrase = !inPhrase
		} else if inPhrase {
			phraseBuf.WriteRune(r)
		}
		// (handling of unquoted terms elided)
	}
	return result
}
```
Phrase matches receive 2x higher scores than regular matches.
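The matching side isn't shown above. Real engines typically verify phrases against token positions; a simplified case-insensitive substring check (an illustrative stand-in, not Kosh's actual code) captures the idea:

```go
package main

import (
	"fmt"
	"strings"
)

// containsPhrase reports whether the exact (case-insensitive) phrase
// occurs in the document text, preserving word order and adjacency.
func containsPhrase(content, phrase string) bool {
	return strings.Contains(strings.ToLower(content), strings.ToLower(phrase))
}

func main() {
	doc := "An Introduction to Machine Learning Systems"
	fmt.Println(containsPhrase(doc, "machine learning")) // → true
	fmt.Println(containsPhrase(doc, "learning machine")) // → false
}
```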
Query Syntax
Kosh supports multiple query types:
| Query | Behavior |
|---|---|
| `machine learning` | Terms search (stemmed, fuzzy-tolerant) |
| `"machine learning"` | Exact phrase match |
| `tag:transformer` | Filter by tag |
| `tag:nlp attention` | Tag filter + terms search |
Index Encoding
The search index uses msgpack + gzip for optimal size and speed.
| Format | Size | Decode Speed |
|---|---|---|
| GOB | 285 KB | 1.0x (baseline) |
| msgpack | ~200 KB | 2.5x faster |
Benefits of msgpack:
- 30% smaller than GOB
- Language-agnostic (can be read by JS, Python, Rust)
- Faster deserialization in WASM
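The msgpack decoding step depends on the library in use, but the gzip layer is plain standard library. A round-trip sketch of that layer (the payload and function names are illustrative):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compressIndex gzips a payload, standing in for the msgpack-encoded index.
func compressIndex(payload []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		panic(err)
	}
	zw.Close() // flush buffered data and write the gzip trailer
	return buf.Bytes()
}

// decompressIndex reverses compressIndex, as the WASM module would
// when loading search.bin before msgpack decoding.
func decompressIndex(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	data := compressIndex([]byte("msgpack bytes would go here"))
	raw, err := decompressIndex(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // → msgpack bytes would go here
}
```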
From AGENTS.md:
Output Files:
| File | Location | Size |
|------|----------|------|
| WASM source | internal/build/wasm/search.wasm | ~4.2 MB |
| Deployed WASM | public/static/wasm/search.wasm | Extracted from CLI |
| Search index | public/search.bin | ~200 KB (msgpack + gzip) |
Recompiling the WASM Module
If you modify the search engine, you must recompile the WASM module:
```sh
# Step 1: Compile WASM
GOOS=js GOARCH=wasm go build -o internal/build/wasm/search.wasm ./cmd/search

# Step 2: Rebuild CLI (embeds WASM)
go build -ldflags="-s -w" -o kosh ./cmd/kosh

# Step 3: Clear cache and rebuild
kosh clean --cache
kosh build
```
When to Recompile
✅ Recompile needed:
- Changes to builder/search/*.go
- Changes to cmd/search/main.go
- Changes to the SearchIndex struct
- Msgpack version update
❌ No recompile needed:
- Content changes
- Theme updates
- Configuration changes
Performance

| Metric | Value |
|---|---|
| Index build time | ~50ms for 100 posts |
| WASM load time | ~100ms (first visit) |
| Search latency | <10ms for 1000 posts |
| Memory usage | ~5MB (WASM + index) |
The entire search stack runs client-side with zero server load. Perfect for static hosting on GitHub Pages, Netlify, or Cloudflare Pages.
Version Scoping
In versioned documentation sites, search results are automatically filtered to the current version:
```go
if versionFilter != "all" && post.Version != versionFilter {
	continue // Skip posts from other versions
}
```
Users can search across all versions by changing the version selector to “All Versions”.
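The filter above can be sketched over a post slice; the Post type and field names here are assumptions based on the snippet, not Kosh's actual struct:

```go
package main

import "fmt"

// Post is a minimal stand-in for Kosh's indexed document type.
type Post struct {
	Title   string
	Version string
}

// filterByVersion keeps posts matching versionFilter; "all" disables scoping.
func filterByVersion(posts []Post, versionFilter string) []Post {
	var kept []Post
	for _, post := range posts {
		if versionFilter != "all" && post.Version != versionFilter {
			continue // skip posts from other versions
		}
		kept = append(kept, post)
	}
	return kept
}

func main() {
	posts := []Post{{"Intro", "v1"}, {"Intro", "v2"}, {"Guide", "v2"}}
	fmt.Println(len(filterByVersion(posts, "v2")))  // → 2
	fmt.Println(len(filterByVersion(posts, "all"))) // → 3
}
```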