porterStemAscii
Reduces a single word to its stem using the Porter stemming algorithm.
Parameter: single word token to stem (ASCII text)
Returns: stemmed form of the input token
import { porterStemAscii } from 'bun_nltk';
const stem1 = porterStemAscii("running");
// Returns: "run"
const stem2 = porterStemAscii("flies");
// Returns: "fli"
const stem3 = porterStemAscii("generalization");
// Returns: "gener"
const stem4 = porterStemAscii("connection");
// Returns: "connect"
The Porter stemmer is a rule-based algorithm that strips common English morphological suffixes in a fixed sequence of steps. It is fast, but the resulting stems are not always dictionary words: "flies" becomes "fli", for example.
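To make "rule-based" concrete, here is a minimal sketch of just the first rule group of the published Porter algorithm (step 1a, which handles plural endings). It is not the bun_nltk implementation, and the function name `porterStep1a` is hypothetical. The full algorithm applies several further steps after this one.

```typescript
// Illustrative sketch of Porter step 1a only (plural endings).
// Hypothetical helper, not part of bun_nltk; rules are checked longest suffix first.
function porterStep1a(word: string): string {
  if (word.endsWith("sses")) return word.slice(0, -2); // caresses -> caress
  if (word.endsWith("ies")) return word.slice(0, -2);  // flies -> fli
  if (word.endsWith("ss")) return word;                // caress -> caress (unchanged)
  if (word.endsWith("s")) return word.slice(0, -1);    // cats -> cat
  return word;                                         // no plural ending matched
}

const step1aExamples = ["caresses", "flies", "caress", "cats", "run"].map(porterStep1a);
console.log(step1aExamples);
```

Because the first matching rule wins, "flies" loses only "es" and ends up as the non-word "fli", which is the behavior described above.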
porterStemAsciiTokens
Applies Porter stemming to an array of tokens.
Parameter: array of word tokens to stem
Returns: array of stemmed tokens in the same order
import { porterStemAsciiTokens } from 'bun_nltk';
const tokens = ["running", "flies", "happily", "connected"];
const stems = porterStemAsciiTokens(tokens);
// Returns: ["run", "fli", "happili", "connect"]
Common Use Cases
Document preprocessing for search:
import { tokenizeAsciiNative, porterStemAsciiTokens } from 'bun_nltk';
const document = "The runners were running quickly through the connected pathways";
const tokens = tokenizeAsciiNative(document);
const stems = porterStemAsciiTokens(tokens);
// Process stems for indexing
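One way the "process stems for indexing" step might look is a small inverted index keyed by stem. This is a sketch under assumptions: `buildStemIndex` is a hypothetical helper, not a bun_nltk function, and it expects tokens that have already been stemmed (for example by `porterStemAsciiTokens`). The stems below are written by hand so the block runs on its own.

```typescript
// Hypothetical helper: maps each stem to the set of document ids that contain it.
// Input is assumed to be already stemmed.
function buildStemIndex(docs: Record<string, string[]>): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const [docId, stems] of Object.entries(docs)) {
    for (const stem of stems) {
      let ids = index.get(stem);
      if (!ids) {
        ids = new Set<string>();
        index.set(stem, ids);
      }
      ids.add(docId); // a Set ignores repeated stems within one document
    }
  }
  return index;
}

// Hand-written stems for illustration only
const stemIndex = buildStemIndex({
  a: ["run", "connect"],
  b: ["run", "fli"],
});
console.log(stemIndex.get("run")); // the ids of documents containing the stem "run"
```

At query time the search terms would be stemmed the same way before lookup, so that "running" and "run" hit the same index entry.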
Feature extraction for classification:
import { normalizeTokensAsciiNative, porterStemAsciiTokens } from 'bun_nltk';
const text = "Machine learning algorithms are learning from data";
const normalized = normalizeTokensAsciiNative(text);
const stemmed = porterStemAsciiTokens(normalized);
// Returns: ["machin", "learn", "algorithm", "learn", "data"]
// Note: both occurrences of "learning" reduce to the same stem, "learn"
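For classification, stemmed tokens are often turned into term counts. The following is a minimal sketch of that step; `countStems` is a hypothetical helper rather than part of bun_nltk, and the input array is the stemmed output shown above, copied by hand.

```typescript
// Hypothetical helper: counts how often each stem occurs (a simple bag-of-words feature).
function countStems(stems: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const stem of stems) {
    counts.set(stem, (counts.get(stem) ?? 0) + 1);
  }
  return counts;
}

const stemCounts = countStems(["machin", "learn", "algorithm", "learn", "data"]);
console.log(stemCounts.get("learn")); // 2
```

Counting after stemming means inflected forms of a word contribute to one feature instead of several sparse ones.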
Porter stemming is aggressive and often produces stems that are not valid English words, such as "happili" and "machin" in the examples above. For applications that need real words in their output, consider lemmatization instead.