Kosh’s search engine runs entirely in the browser using WebAssembly, providing instant full-text search without server infrastructure. Built with Go and compiled to WASM, it delivers industry-standard relevance ranking with advanced features like typo tolerance and phrase matching.
Architecture
The search engine consists of three core components:
```
builder/search/
├── engine.go   # BM25 scoring and search orchestration
├── analyzer.go # Text tokenization and normalization
├── stemmer.go  # Porter stemmer for English
└── fuzzy.go    # Levenshtein distance and phrase parsing
```
At build time, Kosh indexes all content and generates a compact binary index (search.bin). The WASM module (search.wasm) loads this index in the browser and performs all searches client-side.
BM25 Relevance Scoring
Kosh uses BM25 (Best Matching 25), the industry-standard probabilistic ranking function used by Elasticsearch and Lucene. Unlike simple keyword matching, BM25 considers:
- Term Frequency (TF): How often a word appears in a document
- Inverse Document Frequency (IDF): How rare a word is across all documents
- Document Length Normalization: Prevents longer documents from dominating results
Implementation
From builder/search/engine.go:93-115:
```go
k1 := 1.2  // Term frequency saturation parameter
b := 0.75  // Document length normalization

for _, term := range queryTerms {
	if posts, ok := index.Inverted[term]; ok {
		df := len(posts)
		idf := math.Log(1 + (float64(index.TotalDocs)-float64(df)+0.5)/(float64(df)+0.5))
		for postID, freq := range posts {
			docLen := float64(index.DocLens[postID])
			score := idf * (float64(freq) * (k1 + 1)) /
				(float64(freq) + k1*(1-b+b*(docLen/index.AvgDocLen)))
			scores[postID] += score
		}
	}
}
```
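To see how the two parameters interact, the formula above can be evaluated standalone. The sketch below reuses the same math with made-up index statistics (document counts and lengths here are illustrative, not real Kosh data):

```go
package main

import (
	"fmt"
	"math"
)

// bm25Score computes the BM25 contribution of a single term for one document.
// Parameter names mirror the engine snippet; the inputs are hypothetical.
func bm25Score(freq, df, totalDocs int, docLen, avgDocLen float64) float64 {
	k1 := 1.2 // term frequency saturation
	b := 0.75 // document length normalization
	idf := math.Log(1 + (float64(totalDocs)-float64(df)+0.5)/(float64(df)+0.5))
	return idf * (float64(freq) * (k1 + 1)) /
		(float64(freq) + k1*(1-b+b*(docLen/avgDocLen)))
}

func main() {
	// A rare term (in 2 of 100 docs) scores high thanks to IDF...
	fmt.Printf("rare term:   %.3f\n", bm25Score(3, 2, 100, 200, 200))
	// ...while a common term (in 90 of 100 docs) scores far lower.
	fmt.Printf("common term: %.3f\n", bm25Score(3, 90, 100, 200, 200))
}
```

The rare term dominates even at identical term frequency, which is exactly the behavior IDF is there to produce.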
Scoring Boosts
| Match Type | Score Bonus | Use Case |
|---|---|---|
| Phrase match in title | +30.0 | "machine learning" in title |
| Phrase match in content | +15.0 | Exact phrase anywhere |
| Title match | +10.0 | Query word in title |
| Tag match | +5.0 | Query matches tag |
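A simplified sketch of how such boosts stack on top of a base BM25 score (this covers only the phrase and tag bonuses; the function name and matching logic are illustrative assumptions, not Kosh's exact code):

```go
package main

import (
	"fmt"
	"strings"
)

// applyBoosts adds the bonuses from the table above to a base BM25 score.
// Substring matching stands in for real phrase/position matching.
func applyBoosts(base float64, title, content, query string, tags []string) float64 {
	q := strings.ToLower(query)
	score := base
	if strings.Contains(strings.ToLower(title), q) {
		score += 30.0 // phrase match in title
	} else if strings.Contains(strings.ToLower(content), q) {
		score += 15.0 // phrase match in content
	}
	for _, tag := range tags {
		if strings.EqualFold(tag, q) {
			score += 5.0 // tag match
		}
	}
	return score
}

func main() {
	score := applyBoosts(2.0, "Machine Learning Basics", "", "machine learning", nil)
	fmt.Printf("%.1f\n", score) // → 32.0
}
```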
Text Analysis Pipeline
Before indexing or searching, all text passes through a standardized analysis pipeline:
1. Tokenization
From builder/search/analyzer.go:105-133:
```go
func TokenizeWithUnicode(text string) []string {
	tokens := make([]string, 0, len(text)/5)
	var buf strings.Builder
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			buf.WriteRune(r)
		} else if buf.Len() > 0 {
			tokens = append(tokens, buf.String())
			buf.Reset()
		}
	}
	if buf.Len() > 0 {
		// Flush the final token when the text ends mid-word
		tokens = append(tokens, buf.String())
	}
	return tokens
}
```
This tokenizer:
- Handles full Unicode (supports non-Latin scripts)
- Preserves numbers (“HTTP2”, “Base64”)
- Splits on punctuation and whitespace
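A quick runnable demo of this behavior on mixed input (the tokenizer body is adapted from the snippet above):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenizeWithUnicode splits text into tokens of letters and digits,
// treating everything else as a separator.
func TokenizeWithUnicode(text string) []string {
	tokens := make([]string, 0, len(text)/5)
	var buf strings.Builder
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			buf.WriteRune(r)
		} else if buf.Len() > 0 {
			tokens = append(tokens, buf.String())
			buf.Reset()
		}
	}
	if buf.Len() > 0 {
		tokens = append(tokens, buf.String()) // flush the final token
	}
	return tokens
}

func main() {
	fmt.Println(TokenizeWithUnicode("HTTP2, Base64 -- and Unicode: 日本語!"))
	// → [HTTP2 Base64 and Unicode 日本語]
}
```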
2. Stop Word Filtering
Common words like “the”, “and”, “is” are removed to improve relevance. The filter includes 115+ English stop words from builder/search/analyzer.go:9-37.
```go
var stopWords = map[string]bool{
	"a": true, "an": true, "and": true, "are": true,
	"the": true, "their": true, "this": true,
	// ... 115+ total words
}
```
Stop words are filtered during indexing and search. This means searching for “the quick brown fox” only matches “quick brown fox”.
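A minimal sketch of the filtering step (the stop-word set here is truncated to a few entries, and the helper name is an assumption):

```go
package main

import "fmt"

// A tiny excerpt of the stop-word set for demonstration.
var stopWords = map[string]bool{
	"a": true, "an": true, "and": true, "are": true,
	"the": true, "their": true, "this": true, "is": true,
}

// removeStopWords drops tokens found in the stop-word set,
// preserving the order of the remaining terms.
func removeStopWords(tokens []string) []string {
	var kept []string
	for _, t := range tokens {
		if !stopWords[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	fmt.Println(removeStopWords([]string{"the", "quick", "brown", "fox"}))
	// → [quick brown fox]
}
```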
3. Porter Stemming
The Porter Stemming Algorithm reduces words to their root form, so “running”, “runs”, and “runner” all match “run”.
From builder/search/stemmer.go:12-27:
```go
var stemCache sync.Map // Cache for performance

func StemCached(word string) string {
	if cached, ok := stemCache.Load(word); ok {
		return cached.(string)
	}
	result := stem(word) // Apply Porter algorithm
	stemCache.Store(word, result)
	return result
}
```
Stemming examples:
```
"running"   → "run"
"argued"    → "argu"
"effective" → "effect"
```
Stemming is cached using sync.Map for ~76x speedup on repeated words. Most documents reuse common words extensively.
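The memoization pattern itself is easy to reproduce. In the runnable sketch below, naiveStem is a stand-in that only strips a few suffixes; it is not the real Porter algorithm, which is considerably more involved:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

var stemCache sync.Map // word → stem; safe for concurrent use

// naiveStem is a deliberately simplistic stand-in for the Porter
// algorithm, here only to demonstrate the cache wrapped around it.
func naiveStem(word string) string {
	for _, suffix := range []string{"ning", "ing", "ed", "s"} {
		if strings.HasSuffix(word, suffix) && len(word) > len(suffix)+2 {
			return strings.TrimSuffix(word, suffix)
		}
	}
	return word
}

// StemCached memoizes stems so repeated words skip the stemmer entirely.
func StemCached(word string) string {
	if cached, ok := stemCache.Load(word); ok {
		return cached.(string)
	}
	result := naiveStem(word)
	stemCache.Store(word, result)
	return result
}

func main() {
	fmt.Println(StemCached("running"), StemCached("running")) // → run run
}
```

sync.Map suits this workload well: once the common vocabulary is cached, lookups are lock-free reads.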
Fuzzy Matching
Fuzzy search corrects typos using Levenshtein edit distance (the minimum number of single-character edits to transform one word into another).
Configuration
From builder/search/fuzzy.go:8:
```go
const MaxEditDistance = 2 // Allow up to 2 character differences
```
Example matches:
```
"transformr" → "transformer" (1 insertion)
"machien"    → "machine"     (1 transposition, counted as 2 edits)
"learninng"  → "learning"    (1 deletion)
```

Note that a swapped pair of characters costs 2 under plain Levenshtein distance (one deletion plus one insertion), which still fits within the limit of 2.
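Kosh's exact FuzzyMatch implementation isn't shown here, but the standard two-row dynamic-programming Levenshtein computation it relies on looks like this:

```go
package main

import "fmt"

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // cost of building b[:j] from an empty string
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // cost of deleting a[:i]
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("transformr", "transformer")) // → 1
	fmt.Println(levenshtein("machien", "machine"))        // → 2
}
```

Keeping only two rows instead of the full matrix reduces memory from O(m·n) to O(n), which matters when checking many candidate terms.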
Trigram Optimization
For large indexes, scanning every term is slow. Kosh pre-builds a trigram (3-character n-gram) index to quickly find fuzzy candidates.
From builder/search/fuzzy.go:89-116:
```go
func FuzzyExpandWithNgrams(term string, ngramIndex map[string][]string, maxDist int) []string {
	trigrams := generateTrigrams(term) // "cat" → ["cat"]
	// "search" → ["sea", "ear", "arc", "rch"]
	candidateScores := make(map[string]int)
	for _, tg := range trigrams {
		if candidates, ok := ngramIndex[tg]; ok {
			for _, cand := range candidates {
				candidateScores[cand]++ // Count shared trigrams
			}
		}
	}
	// Only check edit distance for candidates with sufficient overlap
	var results []string
	for cand, score := range candidateScores {
		if score >= len(trigrams)/2 {
			if FuzzyMatch(term, cand, maxDist) {
				results = append(results, cand)
			}
		}
	}
	return results
}
```
Performance: ~20% faster than brute-force fuzzy matching.
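generateTrigrams itself isn't shown above; a sliding-window sketch consistent with the comments ("search" → ["sea", "ear", "arc", "rch"], "cat" → ["cat"]) would be:

```go
package main

import "fmt"

// generateTrigrams slides a 3-rune window over the term.
// Terms of 3 runes or fewer yield the term itself as a single n-gram.
func generateTrigrams(term string) []string {
	runes := []rune(term)
	if len(runes) <= 3 {
		return []string{term}
	}
	grams := make([]string, 0, len(runes)-2)
	for i := 0; i+3 <= len(runes); i++ {
		grams = append(grams, string(runes[i:i+3]))
	}
	return grams
}

func main() {
	fmt.Println(generateTrigrams("search")) // → [sea ear arc rch]
	fmt.Println(generateTrigrams("cat"))    // → [cat]
}
```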
Fuzzy Score Penalty
Fuzzy matches receive a 0.7x score multiplier to rank exact matches higher (from engine.go:42).
Phrase Search
Enclose queries in quotes for exact phrase matching:
```
"machine learning"   # Must appear exactly as written
```
From builder/search/fuzzy.go:173-208:
```go
func ParseQuery(query string) ParsedQuery {
	var result ParsedQuery
	var phraseBuf strings.Builder
	inPhrase := false
	for _, r := range query {
		if r == '"' {
			if inPhrase {
				phrase := strings.TrimSpace(phraseBuf.String())
				result.Phrases = append(result.Phrases, strings.ToLower(phrase))
				phraseBuf.Reset()
			}
			inPhrase = !inPhrase
		} else if inPhrase {
			phraseBuf.WriteRune(r)
		}
		// (handling of unquoted terms elided)
	}
	return result
}
```
Phrase matches receive 2x higher scores than regular matches.
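The matching side isn't shown above. Real engines typically verify phrases against token positions; a simplified case-insensitive substring check (an illustrative stand-in, not Kosh's actual code) captures the idea:

```go
package main

import (
	"fmt"
	"strings"
)

// containsPhrase reports whether the exact (case-insensitive) phrase
// occurs in the document text, preserving word order and adjacency.
func containsPhrase(content, phrase string) bool {
	return strings.Contains(strings.ToLower(content), strings.ToLower(phrase))
}

func main() {
	doc := "An Introduction to Machine Learning Systems"
	fmt.Println(containsPhrase(doc, "machine learning")) // → true
	fmt.Println(containsPhrase(doc, "learning machine")) // → false
}
```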
Query Syntax
Kosh supports multiple query types:
| Query | Behavior |
|---|---|
| `machine learning` | Terms search (stemmed, fuzzy-tolerant) |
| `"machine learning"` | Exact phrase match |
| `tag:transformer` | Filter by tag |
| `tag:nlp attention` | Tag filter + terms search |
Index Encoding
The search index uses msgpack + gzip for optimal size and speed.
| Format | Size | Decode Speed |
|---|---|---|
| GOB | 285 KB | 1.0x (baseline) |
| msgpack | ~200 KB | 2.5x faster |
Benefits of msgpack:
- 30% smaller than GOB
- Language-agnostic (can be read by JS, Python, Rust)
- Faster deserialization in WASM
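The msgpack decoding step depends on the library in use, but the gzip layer is plain standard library. A round-trip sketch of that layer (the payload and function names are illustrative):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compressIndex gzips a payload, standing in for the msgpack-encoded index.
func compressIndex(payload []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		panic(err)
	}
	zw.Close() // flush buffered data and write the gzip trailer
	return buf.Bytes()
}

// decompressIndex reverses compressIndex, as the WASM module would
// when loading search.bin before msgpack decoding.
func decompressIndex(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	data := compressIndex([]byte("msgpack bytes would go here"))
	raw, err := decompressIndex(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // → msgpack bytes would go here
}
```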
From AGENTS.md:
Output Files:
| File | Location | Size |
|------|----------|------|
| WASM source | internal/build/wasm/search.wasm | ~4.2 MB |
| Deployed WASM | public/static/wasm/search.wasm | Extracted from CLI |
| Search index | public/search.bin | ~200 KB (msgpack + gzip) |
Recompiling the WASM Module
If you modify the search engine, you must recompile the WASM module:
```sh
# Step 1: Compile WASM
GOOS=js GOARCH=wasm go build -o internal/build/wasm/search.wasm ./cmd/search

# Step 2: Rebuild CLI (embeds WASM)
go build -ldflags="-s -w" -o kosh ./cmd/kosh

# Step 3: Clear cache and rebuild
kosh clean --cache
kosh build
```
When to Recompile
✅ Recompile needed:
- Changes to builder/search/*.go
- Changes to cmd/search/main.go
- Changes to the SearchIndex struct
- Msgpack version update
❌ No recompile needed:
- Content changes
- Theme updates
- Configuration changes
Performance

| Metric | Value |
|---|---|
| Index build time | ~50ms for 100 posts |
| WASM load time | ~100ms (first visit) |
| Search latency | <10ms for 1000 posts |
| Memory usage | ~5MB (WASM + index) |
The entire search stack runs client-side with zero server load. Perfect for static hosting on GitHub Pages, Netlify, or Cloudflare Pages.
Version Scoping
In versioned documentation sites, search results are automatically filtered to the current version:
```go
if versionFilter != "all" && post.Version != versionFilter {
	continue // Skip posts from other versions
}
```
Users can search across all versions by changing the version selector to “All Versions”.
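The filter above can be sketched over a post slice; the Post type and field names here are assumptions based on the snippet, not Kosh's actual struct:

```go
package main

import "fmt"

// Post is a minimal stand-in for Kosh's indexed document type.
type Post struct {
	Title   string
	Version string
}

// filterByVersion keeps posts matching versionFilter; "all" disables scoping.
func filterByVersion(posts []Post, versionFilter string) []Post {
	var kept []Post
	for _, post := range posts {
		if versionFilter != "all" && post.Version != versionFilter {
			continue // skip posts from other versions
		}
		kept = append(kept, post)
	}
	return kept
}

func main() {
	posts := []Post{{"Intro", "v1"}, {"Intro", "v2"}, {"Guide", "v2"}}
	fmt.Println(len(filterByVersion(posts, "v2")))  // → 2
	fmt.Println(len(filterByVersion(posts, "all"))) // → 3
}
```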