Quick Start
import { porterStemAscii } from "bun_nltk";
const word = "running";
const stem = porterStemAscii(word);
console.log(stem); // "run"
Porter Stemmer
The Porter stemmer is the most widely used stemming algorithm for English.
Single Word Stemming
import { porterStemAscii } from "bun_nltk";
const examples = [
"running", // → "run"
"flies", // → "fli"
"happily", // → "happili"
"organization", // → "organ"
"studies", // → "studi"
];
for (const word of examples) {
const stem = porterStemAscii(word);
console.log(`${word} → ${stem}`);
}
function porterStemAscii(token: string): string
Output: Stemmed form of the word
Performance: Uses optimized native implementation
Stem Verbs
import { porterStemAscii } from "bun_nltk";
const verbs = [
"running", // → "run"
"walked", // → "walk"
"flies", // → "fli"
"swimming", // → "swim"
"played", // → "play"
];
const stems = verbs.map(porterStemAscii);
console.log(stems);
Stem Nouns
import { porterStemAscii } from "bun_nltk";
const nouns = [
"cats", // → "cat"
"ponies", // → "poni"
"cities", // → "citi"
"churches", // → "church"
];
const stems = nouns.map(porterStemAscii);
console.log(stems);
Batch Operations
Process multiple words efficiently.
Stem Token Array
import { porterStemAsciiTokens } from "bun_nltk";
const tokens = ["running", "jumps", "played", "swimming"];
const stems = porterStemAsciiTokens(tokens);
console.log(stems);
// ["run", "jump", "play", "swim"]
function porterStemAsciiTokens(tokens: string[]): string[]
Pipeline with Tokenization
import { tokenizeAsciiNative, porterStemAsciiTokens } from "bun_nltk";
const text = "The cats are running and jumping";
// Tokenize
const tokens = tokenizeAsciiNative(text);
// ["the", "cats", "are", "running", "and", "jumping"]
// Stem
const stems = porterStemAsciiTokens(tokens);
// ["the", "cat", "are", "run", "and", "jump"]
Process Documents
import { tokenizeAsciiNative, porterStemAsciiTokens } from "bun_nltk";
const documents = [
"The dogs are running in circles",
"Cats jump over fences",
"Birds fly through the skies"
];
const stemmedDocs = documents.map(doc => {
const tokens = tokenizeAsciiNative(doc);
const stems = porterStemAsciiTokens(tokens);
return stems.join(" ");
});
console.log(stemmedDocs);
// [
// "the dog are run in circl",
// "cat jump over fenc",
// "bird fli through the ski"
// ]
Common Use Cases
Search Query Normalization
import { tokenizeAsciiNative, porterStemAsciiTokens } from "bun_nltk";
function normalizeQuery(query: string): string[] {
const tokens = tokenizeAsciiNative(query);
return porterStemAsciiTokens(tokens);
}
const userQuery = "running shoes for athletes";
const normalized = normalizeQuery(userQuery);
console.log(normalized);
// ["run", "shoe", "for", "athlet"]
// Index documents with the same stemming
const document = "Best running shoe for professional athletes";
const docTerms = normalizeQuery(document);
// ["best", "run", "shoe", "for", "profession", "athlet"]
// Both share the content stems ["run", "shoe", "athlet"] (plus the stopword "for")
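A crude relevance signal is the set of stems shared between query and document, ignoring stopwords. A self-contained sketch with the stem arrays hardcoded; `sharedContentStems` and `STOPWORDS` are hypothetical helpers, not part of bun_nltk:

```typescript
// Small illustrative stopword set; a real application would use a fuller list.
const STOPWORDS = new Set(["for", "the", "and", "a", "of"]);

// Stems present in both the query and the document, minus stopwords.
function sharedContentStems(query: string[], doc: string[]): string[] {
  const docSet = new Set(doc);
  return [...new Set(query)].filter(s => docSet.has(s) && !STOPWORDS.has(s));
}

const queryStems = ["run", "shoe", "for", "athlet"];
const docStems = ["best", "run", "shoe", "for", "profession", "athlet"];
console.log(sharedContentStems(queryStems, docStems)); // ["run", "shoe", "athlet"]
```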
Text Clustering
import { tokenizeAsciiNative, porterStemAsciiTokens } from "bun_nltk";
const documents = [
"machine learning algorithms",
"learning machines and algorithms",
"algorithmic machine learners"
];
const normalized = documents.map(doc => {
const tokens = tokenizeAsciiNative(doc);
const stems = porterStemAsciiTokens(tokens);
return new Set(stems);
});
// Docs 1-2 share {"machin", "learn", "algorithm"}; doc 3 yields
// {"algorithm", "machin", "learner"} ("learners" stems to "learner")
// The documents overlap heavily despite different word forms
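The overlap between stem sets can be scored with Jaccard similarity, |A ∩ B| / |A ∪ B|. A self-contained sketch with hardcoded stem sets; `jaccard` is a hypothetical helper, not a bun_nltk export:

```typescript
// Jaccard similarity between two stem sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const intersection = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

const docA = new Set(["machin", "learn", "algorithm"]);
const docB = new Set(["learn", "machin", "and", "algorithm"]);
console.log(jaccard(docA, docB)); // 0.75
```

Documents above a similarity threshold can then be placed in the same cluster.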
Term Frequency Analysis
import { tokenizeAsciiNative, porterStemAsciiTokens } from "bun_nltk";
const text = `
The runner ran quickly. Running is healthy.
Runners often run marathons. Many people enjoy running.
`;
const tokens = tokenizeAsciiNative(text);
const stems = porterStemAsciiTokens(tokens);
// Count stem frequencies
const freq = new Map<string, number>();
for (const stem of stems) {
freq.set(stem, (freq.get(stem) || 0) + 1);
}
console.log(freq.get("run")); // 3 ("Running", "run", "running")
// "runner" and "runners" stem to "runner"; the irregular "ran" is unchanged
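Once frequencies are counted, sorting the map surfaces the dominant stems. A self-contained sketch with a hardcoded map; `topStems` is a hypothetical helper:

```typescript
// Return the n most frequent stems, highest count first.
function topStems(counts: Map<string, number>, n: number): [string, number][] {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

const counts = new Map<string, number>([["run", 3], ["runner", 2], ["healthi", 1]]);
console.log(topStems(counts, 2)); // [["run", 3], ["runner", 2]]
```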
Information Retrieval
import { tokenizeAsciiNative, porterStemAsciiTokens } from "bun_nltk";
function createInvertedIndex(documents: string[]) {
const index = new Map<string, Set<number>>();
for (const [docId, doc] of documents.entries()) {
const tokens = tokenizeAsciiNative(doc);
const stems = porterStemAsciiTokens(tokens);
for (const stem of new Set(stems)) {
if (!index.has(stem)) {
index.set(stem, new Set());
}
index.get(stem)!.add(docId);
}
}
return index;
}
const docs = [
"cats and dogs playing",
"the cat plays with toys",
"dogs play in the park"
];
const index = createInvertedIndex(docs);
console.log(index.get("play")); // Set { 0, 1, 2 }
// Matches: "playing", "plays", "play"
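Answering a multi-term query then amounts to intersecting the posting sets of each stemmed term. A self-contained sketch; `queryAnd` is a hypothetical helper and the index is hardcoded to mimic the one above:

```typescript
// AND query: return the doc IDs present in every term's posting set.
function queryAnd(index: Map<string, Set<number>>, terms: string[]): Set<number> {
  let result: Set<number> | null = null;
  for (const term of terms) {
    const postings = index.get(term) ?? new Set<number>();
    result = result === null
      ? new Set(postings)
      : new Set([...result].filter(id => postings.has(id)));
  }
  return result ?? new Set<number>();
}

const miniIndex = new Map<string, Set<number>>([
  ["cat", new Set([0, 1])],
  ["dog", new Set([0, 2])],
  ["play", new Set([0, 1, 2])],
]);
console.log(queryAnd(miniIndex, ["cat", "play"])); // Set { 0, 1 }
```

Query terms should be stemmed with the same function used at index time, so "cats playing" matches documents containing "cat plays".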
Stemming Rules
The Porter algorithm applies transformations in multiple steps:
Step 1: Plurals and -ed/-ing
// Examples
"caresses" → "caress" // -es removal
"ponies" → "poni" // -ies → -i
"cats" → "cat" // -s removal
"agreed" → "agree" // -ed removal
"running" → "run" // -ing removal
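As an illustration, the plural rules of step 1a can be sketched in a few lines of plain TypeScript. This is a simplified sketch of the published algorithm, not bun_nltk's native implementation:

```typescript
// Porter step 1a: plural suffix handling, applied in order.
function step1a(word: string): string {
  if (word.endsWith("sses")) return word.slice(0, -2); // caresses → caress
  if (word.endsWith("ies")) return word.slice(0, -2);  // ponies → poni
  if (word.endsWith("ss")) return word;                // caress → caress
  if (word.endsWith("s")) return word.slice(0, -1);    // cats → cat
  return word;
}

console.log(step1a("caresses")); // "caress"
console.log(step1a("ponies"));   // "poni"
console.log(step1a("cats"));     // "cat"
```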
Steps 2-4: Derivational Suffixes
"relational" → "relate" // -ational → -ate
"conditional" → "condition" // -ional → -ion
"national" → "nation" // -al removal
"activate" → "activ" // -ate removal
Step 5: Final Cleanup
"probate" → "probat" // -e removal
"rate" → "rate" // -e kept on short stems
"cease" → "ceas" // -e removal
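Whether the final -e is removed depends on Porter's "measure" m, the number of vowel-consonant sequences in the stem. A simplified TypeScript sketch of m and step 5a, for illustration only (bun_nltk's native code implements the full algorithm):

```typescript
// Treat a/e/i/o/u as vowels, plus y after a consonant.
function isVowel(w: string, i: number): boolean {
  const c = w[i];
  if ("aeiou".includes(c)) return true;
  return c === "y" && i > 0 && !isVowel(w, i - 1);
}

// Porter's measure m: the number of vowel→consonant transitions.
function measure(stem: string): number {
  let m = 0;
  let prevVowel = false;
  for (let i = 0; i < stem.length; i++) {
    const v = isVowel(stem, i);
    if (prevVowel && !v) m++;
    prevVowel = v;
  }
  return m;
}

// The *o condition: stem ends consonant-vowel-consonant, final consonant not w/x/y.
function endsCvcNotWxy(stem: string): boolean {
  const n = stem.length;
  if (n < 3) return false;
  return !isVowel(stem, n - 3) && isVowel(stem, n - 2) &&
    !isVowel(stem, n - 1) && !"wxy".includes(stem[n - 1]);
}

// Step 5a: drop final -e if m > 1, or if m = 1 and the stem is not *o.
function step5a(word: string): string {
  if (!word.endsWith("e")) return word;
  const stem = word.slice(0, -1);
  const m = measure(stem);
  if (m > 1 || (m === 1 && !endsCvcNotWxy(stem))) return stem;
  return word;
}

console.log(step5a("probate")); // "probat"
console.log(step5a("rate"));    // "rate" (short *o stem keeps -e)
console.log(step5a("cease"));   // "ceas"
```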
The Porter stemmer may produce stems that are not valid English words (e.g., “happili”, “fli”). This is expected behavior.
Integration with Other Features
With Text Normalization
import { normalizeTokens, porterStemAsciiTokens } from "bun_nltk";
const text = "The running cats and jumping dogs";
// Normalize (tokenize + remove stopwords)
const normalized = normalizeTokens(text, {
removeStopwords: true,
stem: false // Don't stem yet
});
// ["running", "cats", "jumping", "dogs"]
// Then stem
const stems = porterStemAsciiTokens(normalized);
// ["run", "cat", "jump", "dog"]
// Or use built-in stemming option
const combined = normalizeTokens(text, {
removeStopwords: true,
stem: true // Normalize + stem in one step
});
// ["run", "cat", "jump", "dog"]
With POS Tagging
import { posTagAsciiNative, porterStemAscii } from "bun_nltk";
const text = "The cats are running";
const tags = posTagAsciiNative(text);
// Stem only nouns and verbs
const stemmed = tags.map(tag => {
if (tag.tag.startsWith("NN") || tag.tag.startsWith("VB")) {
return porterStemAscii(tag.token.toLowerCase());
}
return tag.token.toLowerCase();
});
console.log(stemmed);
// ["the", "cat", "are", "run"]
Performance Tips
Use Batch Function
// Faster: a single native call handles the whole array
const stems = porterStemAsciiTokens(tokens);
// Slower: one native call per word adds function-call overhead
const stemsViaMap = tokens.map(porterStemAscii);
Lowercase Input
// Porter expects lowercase
const text = "Running QUICKLY";
const tokens = tokenizeAsciiNative(text); // Already lowercases
const stems = porterStemAsciiTokens(tokens);
Limitations
Stemming is aggressive and can cause over-stemming:
- “organization” → “organ” (loses meaning)
- “university” → “univers” (not a word)
- “better” → “better” (irregular forms not handled)
Over-stemming Examples
import { porterStemAscii } from "bun_nltk";
const words = [
"universal", // → "univers"
"university", // → "univers" (same stem, different meanings)
"organization", // → "organ" (very different meaning)
"news", // → "new" (loses plural sense)
];
for (const word of words) {
console.log(`${word} → ${porterStemAscii(word)}`);
}
Under-stemming Examples
import { porterStemAscii } from "bun_nltk";
const words = [
"good", // → "good"
"better", // → "better" (should match "good")
"best", // → "best" (should match "good")
];
for (const word of words) {
console.log(`${word} → ${porterStemAscii(word)}`);
}
// Irregular forms require lemmatization rather than stemming
For search applications, stemming is usually good enough and much faster than lemmatization. For linguistic analysis, consider lemmatization.