
normalizeTokens

Tokenizes text and returns normalized lowercase tokens, with optional stopword removal.
text (string, required): Input text to tokenize and normalize
removeStopwords (boolean, default: true): Whether to filter out common English stopwords
Returns (string[]): Array of normalized lowercase tokens
import { normalizeTokens } from 'bun_nltk';

const text = "The quick brown fox jumps over the lazy dog";
const tokens = normalizeTokens(text);
// Returns: ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

const withStopwords = normalizeTokens(text, false);
// Returns: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
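To make the behavior above concrete, here is a minimal sketch of a lowercase, tokenize, and filter pipeline. It is not bun_nltk's implementation: `sketchNormalizeTokens`, its regular expression, and the tiny `SKETCH_STOPWORDS` set are hypothetical stand-ins chosen for illustration, and the library's real tokenization rules and stopword list may differ.

```typescript
// Hypothetical stopword set for illustration only; not the list bun_nltk uses.
const SKETCH_STOPWORDS = new Set(["the", "a", "an", "is", "and", "of", "to", "in"]);

// Hypothetical sketch of a normalizer with the same call shape as normalizeTokens.
function sketchNormalizeTokens(text: string, removeStopwords = true): string[] {
  // Lowercase first, then extract runs of ASCII letters, digits, and apostrophes.
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  // Optionally drop any token found in the stopword set.
  return removeStopwords ? tokens.filter((t) => !SKETCH_STOPWORDS.has(t)) : tokens;
}

const sample = "The quick brown fox jumps over the lazy dog";

console.log(sketchNormalizeTokens(sample));
// ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

console.log(sketchNormalizeTokens(sample, false));
// ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

The second argument mirrors `removeStopwords`: passing `false` keeps every token, including both occurrences of "the".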

normalizeTokensAsciiNative

High-performance native implementation for ASCII text normalization.
text (string, required): ASCII input text to tokenize and normalize
removeStopwords (boolean, default: true): Whether to filter out common English stopwords
Returns (string[]): Array of normalized lowercase tokens, produced with SIMD-accelerated processing
import { normalizeTokensAsciiNative } from 'bun_nltk';

const text = "Machine learning is transforming data science";
const tokens = normalizeTokensAsciiNative(text);
// Returns: ["machine", "learning", "transforming", "data", "science"]
// (stopwords "is" removed by default)
The native implementation uses SIMD vectorization, which makes it significantly faster than the standard implementation on large ASCII texts.
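Because the native variant expects ASCII input, a caller may want to check the text before choosing it. The sketch below shows one way to do that. `isAscii` and `pickNormalizer` are hypothetical helpers, not part of bun_nltk, and this page does not say how `normalizeTokensAsciiNative` behaves on non-ASCII input, so treat the fallback as a precaution rather than a documented requirement.

```typescript
// Shape shared by the normalizers on this page: (text, removeStopwords?) => tokens.
type Normalizer = (text: string, removeStopwords?: boolean) => string[];

// Hypothetical helper: true when every UTF-16 code unit is in the 7-bit ASCII range.
function isAscii(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

// Hypothetical helper: choose the ASCII implementation only for pure-ASCII input.
function pickNormalizer(
  text: string,
  asciiImpl: Normalizer,
  unicodeImpl: Normalizer,
): Normalizer {
  return isAscii(text) ? asciiImpl : unicodeImpl;
}

console.log(isAscii("Machine learning is transforming data science")); // true
console.log(isAscii("Café résumé")); // false
```

With the library imported, a call could look like `pickNormalizer(text, normalizeTokensAsciiNative, normalizeTokensUnicode)(text)`. The scan is linear in the input length, so it adds a small fixed cost per call.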

normalizeTokensUnicode

Unicode-aware normalization supporting international text and diacritics.
text (string, required): Unicode text to tokenize and normalize (supports all Unicode scripts)
removeStopwords (boolean, default: true): Whether to filter out common English stopwords
Returns (string[]): Array of normalized tokens with NFKC normalization applied
import { normalizeTokensUnicode } from 'bun_nltk';

const text = "Café résumé naïve coöperate";
const tokens = normalizeTokensUnicode(text);
// Returns: ["café", "résumé", "naïve", "coöperate"]

const multilingual = "机器学习 and 深度学习";
const tokens2 = normalizeTokensUnicode(multilingual);
// Returns: ["机器学习", "深度学习"]
Uses NFKC normalization to handle combining characters and compatibility variants.
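To show what NFKC does to combining characters and compatibility variants, the following sketch uses the standard JavaScript `String.prototype.normalize` method rather than bun_nltk itself. The sample strings are illustrative, and the snippet makes no claim about what else `normalizeTokensUnicode` does internally.

```typescript
// "e" followed by U+0301 (combining acute accent): two code points for one glyph.
const decomposed = "cafe\u0301";
// U+FB01 is the "fi" ligature, a compatibility character.
const ligature = "\uFB01le";
// U+FF21 and U+FF22 are fullwidth forms of "A" and "B".
const fullwidth = "\uFF21\uFF22";

// NFKC composes combining sequences into precomposed characters...
const composed = decomposed.normalize("NFKC");
console.log(composed === "caf\u00E9"); // true
console.log(decomposed.length, composed.length); // 5 4

// ...and replaces compatibility variants with their plain equivalents.
console.log(ligature.normalize("NFKC")); // "file"
console.log(fullwidth.normalize("NFKC")); // "AB"
```

After normalization, visually identical inputs compare equal as strings, so they map to the same token regardless of how the source text encoded them.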
