Skip to main content
Kosh’s search engine runs entirely in the browser using WebAssembly, providing instant full-text search without server infrastructure. Built with Go and compiled to WASM, it delivers industry-standard relevance ranking with advanced features like typo tolerance and phrase matching.

Architecture

The search engine consists of three core components:
builder/search/
├── engine.go       # BM25 scoring and search orchestration
├── analyzer.go     # Text tokenization and normalization
├── stemmer.go      # Porter stemmer for English
└── fuzzy.go        # Levenshtein distance and phrase parsing
At build time, Kosh indexes all content and generates a compact binary index (search.bin). The WASM module (search.wasm) loads this index in the browser and performs all searches client-side.

BM25 Relevance Scoring

Kosh uses BM25 (Best Matching 25), the industry-standard probabilistic ranking function used by Elasticsearch and Lucene. Unlike simple keyword matching, BM25 considers:
  • Term Frequency (TF): How often a word appears in a document
  • Inverse Document Frequency (IDF): How rare a word is across all documents
  • Document Length Normalization: Prevents longer documents from dominating results

Implementation

From builder/search/engine.go:93-115:
k1 := 1.2  // Term frequency saturation parameter
b := 0.75  // Document length normalization

for _, term := range queryTerms {
    if posts, ok := index.Inverted[term]; ok {
        df := len(posts)
        idf := math.Log(1 + (float64(index.TotalDocs)-float64(df)+0.5)/(float64(df)+0.5))

        for postID, freq := range posts {
            docLen := float64(index.DocLens[postID])
            score := idf * (float64(freq) * (k1 + 1)) / 
                    (float64(freq) + k1*(1-b+b*(docLen/index.AvgDocLen)))
            scores[postID] += score
        }
    }
}

Scoring Boosts

Match TypeScore BonusUse Case
Phrase match in title+30.0”machine learning” in title
Phrase match in content+15.0Exact phrase anywhere
Title match+10.0Query word in title
Tag match+5.0Query matches tag

Text Analysis Pipeline

Before indexing or searching, all text passes through a standardized analysis pipeline:

1. Tokenization

From builder/search/analyzer.go:105-133:
func TokenizeWithUnicode(text string) []string {
    tokens := make([]string, 0, len(text)/5)
    var buf strings.Builder

    for _, r := range text {
        if unicode.IsLetter(r) || unicode.IsNumber(r) {
            buf.WriteRune(r)
        } else if buf.Len() > 0 {
            tokens = append(tokens, buf.String())
            buf.Reset()
        }
    }
    return tokens
}
This tokenizer:
  • Handles full Unicode (supports non-Latin scripts)
  • Preserves numbers (“HTTP2”, “Base64”)
  • Splits on punctuation and whitespace

2. Stop Word Filtering

Common words like “the”, “and”, “is” are removed to improve relevance. The filter includes 115+ English stop words from builder/search/analyzer.go:9-37.
var stopWords = map[string]bool{
    "a": true, "an": true, "and": true, "are": true,
    "the": true, "their": true, "this": true,
    // ... 115+ total words
}
Stop words are filtered during indexing and search. This means searching for “the quick brown fox” only matches “quick brown fox”.

3. Porter Stemming

The Porter Stemming Algorithm reduces words to their root form, so “running”, “runs”, and “runner” all match “run”. From builder/search/stemmer.go:12-27:
var stemCache sync.Map  // Cache for performance

func StemCached(word string) string {
    if cached, ok := stemCache.Load(word); ok {
        return cached.(string)
    }
    result := stem(word)  // Apply Porter algorithm
    stemCache.Store(word, result)
    return result
}
Stemming examples:
  • "running""run"
  • "argued""argu"
  • "effective""effect"
Stemming is cached using sync.Map for ~76x speedup on repeated words. Most documents reuse common words extensively.

Fuzzy Matching

Fuzzy search corrects typos using Levenshtein edit distance (the minimum number of single-character edits to transform one word into another).

Configuration

From builder/search/fuzzy.go:8:
const MaxEditDistance = 2  // Allow up to 2 character differences
Example matches:
  • "transformr""transformer" (1 deletion)
  • "machien""machine" (1 substitution)
  • "learninng""learning" (1 deletion)

Trigram Optimization

For large indexes, scanning every term is slow. Kosh pre-builds a trigram (3-character n-gram) index to quickly find fuzzy candidates. From builder/search/fuzzy.go:89-116:
func FuzzyExpandWithNgrams(term string, ngramIndex map[string][]string, maxDist int) []string {
    trigrams := generateTrigrams(term)  // "cat" → ["cat"]
                                        // "search" → ["sea", "ear", "arc", "rch"]
    
    candidateScores := make(map[string]int)
    for _, tg := range trigrams {
        if candidates, ok := ngramIndex[tg]; ok {
            for _, cand := range candidates {
                candidateScores[cand]++  // Count shared trigrams
            }
        }
    }

    // Only check edit distance for candidates with sufficient overlap
    for cand, score := range candidateScores {
        if score >= len(trigrams)/2 {
            if FuzzyMatch(term, cand, maxDist) {
                results = append(results, cand)
            }
        }
    }
}
Performance: ~20% faster than brute-force fuzzy matching.

Fuzzy Score Penalty

Fuzzy matches receive a 0.7x score multiplier to rank exact matches higher (from engine.go:42). Enclose queries in quotes for exact phrase matching:
"machine learning"  # Must appear exactly as written
From builder/search/fuzzy.go:173-208:
func ParseQuery(query string) ParsedQuery {
    var phraseBuf strings.Builder
    inPhrase := false

    for _, r := range query {
        if r == '"' {
            if inPhrase {
                phrase := strings.TrimSpace(phraseBuf.String())
                result.Phrases = append(result.Phrases, strings.ToLower(phrase))
                phraseBuf.Reset()
            }
            inPhrase = !inPhrase
        } else if inPhrase {
            phraseBuf.WriteRune(r)
        }
    }
}
Phrase matches receive 2x higher scores than regular matches.

Query Syntax

Kosh supports multiple query types:
QueryBehavior
machine learningTerms search (stemmed, fuzzy-tolerant)
"machine learning"Exact phrase match
tag:transformerFilter by tag
tag:nlp attentionTag filter + terms search

Index Encoding

The search index uses msgpack + gzip for optimal size and speed.

Format Comparison

FormatSizeDecode Speed
GOB285 KB1.0x (baseline)
msgpack~200 KB2.5x faster
Benefits of msgpack:
  • 30% smaller than GOB
  • Language-agnostic (can be read by JS, Python, Rust)
  • Faster deserialization in WASM
From AGENTS.md:
Output Files:
| File | Location | Size |
|------|----------|------|
| WASM source | internal/build/wasm/search.wasm | ~4.2 MB |
| Deployed WASM | public/static/wasm/search.wasm | Extracted from CLI |
| Search index | public/search.bin | ~200 KB (msgpack + gzip) |

Recompiling the WASM Module

If you modify the search engine, you must recompile the WASM module:
# Step 1: Compile WASM
GOOS=js GOARCH=wasm go build -o internal/build/wasm/search.wasm ./cmd/search

# Step 2: Rebuild CLI (embeds WASM)
go build -ldflags="-s -w" -o kosh ./cmd/kosh

# Step 3: Clear cache and rebuild
kosh clean --cache
kosh build

When to Recompile

Recompile needed:
  • Changes to builder/search/*.go
  • Changes to cmd/search/main.go
  • Changes to SearchIndex struct
  • Msgpack version update
No recompile needed:
  • Content changes
  • Theme updates
  • Configuration changes

Performance Characteristics

MetricValue
Index build time~50ms for 100 posts
WASM load time~100ms (first visit)
Search latency<10ms for 1000 posts
Memory usage~5MB (WASM + index)
The entire search stack runs client-side with zero server load. Perfect for static hosting on GitHub Pages, Netlify, or Cloudflare Pages.

Version Scoping

In versioned documentation sites, search results are automatically filtered to the current version:
if versionFilter != "all" && post.Version != versionFilter {
    continue  // Skip posts from other versions
}
Users can search across all versions by changing the version selector to “All Versions”.

Build docs developers (and LLMs) love