normalizeTokens

Normalizes tokens from text, with optional stopword removal.

Parameters:
1. Input text to tokenize and normalize (string)
2. Whether to filter out common English stopwords (boolean, optional; stopwords are removed by default)

Returns: Array of normalized lowercase tokens

Example:

```typescript
import { normalizeTokens } from 'bun_nltk';

const text = "The quick brown fox jumps over the lazy dog";

const tokens = normalizeTokens(text);
// Returns: ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

const withStopwords = normalizeTokens(text, false);
// Returns: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```
normalizeTokensAsciiNative

High-performance native implementation for ASCII text normalization.

Parameters:
1. ASCII input text to tokenize and normalize (string)
2. Whether to filter out common English stopwords (boolean, optional; stopwords are removed by default)

Returns: Array of normalized lowercase tokens, produced with SIMD-accelerated processing

Example:

```typescript
import { normalizeTokensAsciiNative } from 'bun_nltk';

const text = "Machine learning is transforming data science";

const tokens = normalizeTokensAsciiNative(text);
// Returns: ["machine", "learning", "transforming", "data", "science"]
// (the stopword "is" is removed by default)
```

Note: The native implementation uses SIMD vectorization and is significantly faster on large texts.
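Since this function expects ASCII input, a caller may want to verify that a string is pure ASCII before choosing the native path. A minimal sketch of such a guard (the `isAscii` helper is hypothetical, not part of bun_nltk):

```typescript
// Hypothetical guard: returns true only if every code unit is 7-bit ASCII.
function isAscii(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

isAscii("Machine learning is transforming data science"); // → true
isAscii("Café résumé");                                   // → false
```

With such a check, a caller could dispatch ASCII input to normalizeTokensAsciiNative and fall back to normalizeTokensUnicode otherwise.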
normalizeTokensUnicode

Unicode-aware normalization supporting international text and diacritics.

Parameters:
1. Unicode text to tokenize and normalize (string; supports all Unicode scripts)
2. Whether to filter out common English stopwords (boolean, optional)

Returns: Array of normalized tokens with NFKC normalization applied

Example:

```typescript
import { normalizeTokensUnicode } from 'bun_nltk';

const text = "Café résumé naïve coöperate";
const tokens = normalizeTokensUnicode(text);
// Returns: ["café", "résumé", "naïve", "coöperate"]

const multilingual = "机器学习 and 深度学习";
const tokens2 = normalizeTokensUnicode(multilingual);
// Returns: ["机器学习", "深度学习"]
```

Note: Uses NFKC normalization to handle combining characters and compatibility variants.
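JavaScript's standard library can illustrate what NFKC normalization does here. The sketch below combines `String.prototype.normalize("NFKC")` with a Unicode-property regex for tokenization; it omits stopword filtering and only approximates the behavior described above (`normalizeTokensUnicodeSketch` is a hypothetical name, not bun_nltk's implementation).

```typescript
// Illustrative sketch: NFKC-fold, lowercase, then extract runs of
// Unicode letters/digits. Stopword filtering is omitted for brevity.
function normalizeTokensUnicodeSketch(text: string): string[] {
  return (
    text
      .normalize("NFKC") // folds compatibility variants, e.g. the "ﬁ" ligature → "fi"
      .toLowerCase()
      .match(/[\p{L}\p{N}]+/gu) ?? []
  );
}

normalizeTokensUnicodeSketch("Café ﬁle"); // → ["café", "file"]
```

The `\p{L}` property class matches letters in any script, which is why CJK text like "机器学习" survives tokenization intact.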