
Overview

After initializing a WasmNltk instance, you can use the methods below for high-performance text processing.

Tokenization

countTokensAscii()

Counts the number of tokens in ASCII text.
countTokensAscii(text: string): number

Example

const wasm = await WasmNltk.init();
const count = wasm.countTokensAscii("Hello, world! How are you?");
console.log(count); // 5

tokenizeAscii()

Tokenizes ASCII text into an array of lowercase tokens.
tokenizeAscii(text: string): string[]

Example

const tokens = wasm.tokenizeAscii("Hello, world!");
console.log(tokens); // ["hello", "world"]

tokenOffsetsAscii()

Returns token offsets and lengths for zero-copy tokenization.
tokenOffsetsAscii(text: string): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}

Example

const result = wasm.tokenOffsetsAscii("Hello world");
console.log(result.total); // 2
console.log(result.offsets); // Uint32Array [0, 6]
console.log(result.lengths); // Uint32Array [5, 5]

normalizeTokensAscii()

Normalizes tokens: lowercases, removes punctuation, and optionally filters stopwords.
normalizeTokensAscii(text: string, removeStopwords?: boolean): string[]

Parameters

  • text (string, required): Input text to normalize
  • removeStopwords (boolean, default: true): Whether to remove common English stopwords

Example

const normalized = wasm.normalizeTokensAscii("The quick brown fox", true);
console.log(normalized); // ["quick", "brown", "fox"] ("the" removed)

normalizedTokenOffsetsAscii()

Returns normalized token offsets for zero-copy processing.
normalizedTokenOffsetsAscii(text: string, removeStopwords?: boolean): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}
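The shape of this result can be illustrated in plain TypeScript. The sketch below is not the WASM implementation: it scans ASCII letter runs, skips a tiny sample stopword list (the real list is larger), and records byte offsets into the encoded input.

```typescript
// Illustrative sketch of the { total, offsets, lengths, input } shape.
// STOPWORDS here is a stand-in; the WASM module ships its own list.
const STOPWORDS = new Set(["the", "a", "an", "and", "of", "to"]);

function normalizedTokenOffsets(text: string, removeStopwords = true) {
  const input = new TextEncoder().encode(text);
  const offsets: number[] = [];
  const lengths: number[] = [];
  const re = /[A-Za-z]+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const token = m[0].toLowerCase();
    if (removeStopwords && STOPWORDS.has(token)) continue;
    offsets.push(m.index); // ASCII, so char offset == byte offset
    lengths.push(m[0].length);
  }
  return {
    total: offsets.length,
    offsets: Uint32Array.from(offsets),
    lengths: Uint32Array.from(lengths),
    input,
  };
}

const r = normalizedTokenOffsets("The quick brown fox");
console.log(r.total);   // 3
console.log(r.offsets); // Uint32Array [4, 10, 16]
console.log(r.lengths); // Uint32Array [5, 5, 3]
```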

Sentence Tokenization

sentenceTokenizePunktAscii()

Tokenizes text into sentences using the Punkt algorithm.
sentenceTokenizePunktAscii(text: string): string[]

Example

const sentences = wasm.sentenceTokenizePunktAscii(
  "Hello world. How are you? I'm fine."
);
console.log(sentences);
// ["Hello world.", "How are you?", "I'm fine."]

N-grams & Metrics

countNgramsAscii()

Counts the total number of n-grams in the text.
countNgramsAscii(text: string, n: number): number

Example

const count = wasm.countNgramsAscii("one two three four", 2);
console.log(count); // 3 ("one two", "two three", "three four")

computeAsciiMetrics()

Computes comprehensive text metrics in a single pass.
computeAsciiMetrics(text: string, n: number): AsciiMetrics

AsciiMetrics Type

type AsciiMetrics = {
  tokens: number;         // Total token count
  uniqueTokens: number;   // Unique token count
  ngrams: number;         // Total n-gram count
  uniqueNgrams: number;   // Unique n-gram count
};

Example

const metrics = wasm.computeAsciiMetrics("one two three two one", 2);
console.log(metrics);
// {
//   tokens: 5,
//   uniqueTokens: 3,
//   ngrams: 4,
//   uniqueNgrams: 4
// }

Machine Learning

perceptronPredictBatch()

Performs batch prediction using a perceptron model.
perceptronPredictBatch(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array,
  weights: Float32Array,
  modelFeatureCount: number,
  tagCount: number
): Uint16Array

Parameters

  • featureIds (Uint32Array, required): Feature IDs for all tokens (flat array)
  • tokenOffsets (Uint32Array, required): Offsets into featureIds for each token (length = tokenCount + 1)
  • weights (Float32Array, required): Model weights (shape: [modelFeatureCount, tagCount])
  • modelFeatureCount (number, required): Number of features in the model
  • tagCount (number, required): Number of possible tags

Returns

Returns a Uint16Array of predicted tag IDs, one per token (length = tokenCount).
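The scoring this call performs can be sketched in plain TypeScript (hypothetical reference logic, not the WASM source): for each token, sum the weight rows of its active features and take the argmax tag.

```typescript
// Reference sketch of averaged-perceptron batch scoring over flat arrays.
function perceptronPredictBatchSketch(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array,
  weights: Float32Array,   // flattened [modelFeatureCount, tagCount]
  tagCount: number
): Uint16Array {
  const tokenCount = tokenOffsets.length - 1;
  const out = new Uint16Array(tokenCount);
  const scores = new Float64Array(tagCount);
  for (let t = 0; t < tokenCount; t++) {
    scores.fill(0);
    // Accumulate the weight row of every feature this token fires.
    for (let f = tokenOffsets[t]; f < tokenOffsets[t + 1]; f++) {
      const row = featureIds[f] * tagCount;
      for (let tag = 0; tag < tagCount; tag++) scores[tag] += weights[row + tag];
    }
    let best = 0;
    for (let tag = 1; tag < tagCount; tag++) if (scores[tag] > scores[best]) best = tag;
    out[t] = best;
  }
  return out;
}

// Two tokens, two tags, two features:
const preds = perceptronPredictBatchSketch(
  Uint32Array.of(0, 1),    // token 0 fires feature 0, token 1 fires feature 1
  Uint32Array.of(0, 1, 2), // token 0 -> featureIds[0..1), token 1 -> [1..2)
  Float32Array.of(1, 0,    // feature 0 favors tag 0
                  0, 1),   // feature 1 favors tag 1
  2
);
console.log(preds); // Uint16Array [0, 1]
```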

naiveBayesLogScoresIds()

Computes Naive Bayes log-probability scores for classification.
naiveBayesLogScoresIds(input: {
  docTokenIds: Uint32Array;
  vocabSize: number;
  tokenCountsMatrix: Uint32Array;
  labelDocCounts: Uint32Array;
  labelTokenTotals: Uint32Array;
  totalDocs: number;
  smoothing: number;
}): Float64Array

Returns

Returns a Float64Array of log scores, one per label.
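The math behind these scores can be sketched in plain TypeScript: multinomial Naive Bayes with Lidstone smoothing, matching the input shape above. This is a hypothetical reference implementation for intuition, not the WASM source.

```typescript
// log P(label) + sum over doc tokens of log P(token | label), with
// add-`smoothing` (Lidstone) smoothing over a vocabulary of vocabSize.
function naiveBayesLogScoresSketch(input: {
  docTokenIds: Uint32Array;
  vocabSize: number;
  tokenCountsMatrix: Uint32Array; // flattened [labelCount, vocabSize]
  labelDocCounts: Uint32Array;
  labelTokenTotals: Uint32Array;
  totalDocs: number;
  smoothing: number;
}): Float64Array {
  const labelCount = input.labelDocCounts.length;
  const scores = new Float64Array(labelCount);
  for (let l = 0; l < labelCount; l++) {
    let s = Math.log(input.labelDocCounts[l] / input.totalDocs); // prior
    const denom = input.labelTokenTotals[l] + input.smoothing * input.vocabSize;
    for (const tok of input.docTokenIds) {
      const count = input.tokenCountsMatrix[l * input.vocabSize + tok];
      s += Math.log((count + input.smoothing) / denom); // likelihood
    }
    scores[l] = s;
  }
  return scores;
}

const scores = naiveBayesLogScoresSketch({
  docTokenIds: Uint32Array.of(0, 0),      // document = [token 0, token 0]
  vocabSize: 2,
  tokenCountsMatrix: Uint32Array.of(3, 1, // label 0 saw token 0 often
                                    1, 3), // label 1 saw token 1 often
  labelDocCounts: Uint32Array.of(1, 1),
  labelTokenTotals: Uint32Array.of(4, 4),
  totalDocs: 2,
  smoothing: 1,
});
console.log(scores[0] > scores[1]); // true: label 0 is more likely
```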

Language Modeling

evaluateLanguageModelIds()

Evaluates an n-gram language model.
evaluateLanguageModelIds(input: {
  tokenIds: Uint32Array;
  sentenceOffsets: Uint32Array;
  order: number;
  model: WasmLmModelType;
  gamma: number;
  discount: number;
  vocabSize: number;
  probeContextFlat: Uint32Array;
  probeContextLens: Uint32Array;
  probeWordIds: Uint32Array;
  perplexityTokenIds: Uint32Array;
  prefixTokenIds: Uint32Array;
}): { scores: Float64Array; perplexity: number }

WasmLmModelType

type WasmLmModelType = "mle" | "lidstone" | "kneser_ney_interpolated";

Parameters

  • order (number, required): N-gram order (e.g., 2 for bigram, 3 for trigram)
  • model (WasmLmModelType, required): Language model type
  • gamma (number, required): Smoothing parameter (used by "lidstone")
  • discount (number, required): Discount parameter (used by "kneser_ney_interpolated")

Returns

Returns an object with:
  • scores: Float64Array of probability scores for probe words
  • perplexity: Overall perplexity score
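To make the perplexity figure concrete, here is a tiny plain-TypeScript bigram MLE model (the "mle" setting) evaluated on its own training data. This is illustrative math only, not the WASM implementation, and it uses plain arrays rather than the flat typed-array inputs above.

```typescript
// Perplexity = exp(-mean log P(w_i | w_{i-1})) under a bigram MLE model.
function bigramMlePerplexity(train: number[][], test: number[]): number {
  const bigram = new Map<string, number>();
  const unigram = new Map<number, number>();
  for (const sent of train) {
    for (let i = 0; i + 1 < sent.length; i++) {
      const key = `${sent[i]},${sent[i + 1]}`;
      bigram.set(key, (bigram.get(key) ?? 0) + 1);
      unigram.set(sent[i], (unigram.get(sent[i]) ?? 0) + 1);
    }
  }
  let logSum = 0, n = 0;
  for (let i = 0; i + 1 < test.length; i++) {
    const p = (bigram.get(`${test[i]},${test[i + 1]}`) ?? 0) /
              (unigram.get(test[i]) ?? 1);
    logSum += Math.log(p);
    n++;
  }
  return Math.exp(-logSum / n);
}

// Every test bigram was seen exactly once in training, so each conditional
// probability is 1 and perplexity is exactly 1 (its minimum).
console.log(bigramMlePerplexity([[1, 2, 3]], [1, 2, 3])); // 1
```

Lower perplexity means the model is less "surprised" by the evaluation tokens; the Lidstone and Kneser-Ney settings change how probability mass is assigned to unseen n-grams.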

Parsing & Chunking

chunkIobIds()

Performs IOB chunking on tagged tokens.
chunkIobIds(input: {
  tokenTagIds: Uint16Array;
  atomAllowedOffsets: Uint32Array;
  atomAllowedLengths: Uint32Array;
  atomAllowedFlat: Uint16Array;
  atomMins: Uint8Array;
  atomMaxs: Uint8Array;
  ruleAtomOffsets: Uint32Array;
  ruleAtomCounts: Uint32Array;
  ruleLabelIds: Uint16Array;
}): { labelIds: Uint16Array; begins: Uint8Array }
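The returned labelIds and begins arrays can be decoded back into labeled spans. A minimal sketch, assuming begins[i] = 1 marks a B- tag and labelIds[i] = 0 means "outside" (both are assumptions about the encoding, not confirmed by the API):

```typescript
// Group IOB output into { label, start, end } spans (end exclusive).
function decodeChunks(labelIds: Uint16Array, begins: Uint8Array) {
  const chunks: { label: number; start: number; end: number }[] = [];
  for (let i = 0; i < labelIds.length; i++) {
    if (labelIds[i] === 0) continue; // assumed O tag
    const last = chunks[chunks.length - 1];
    if (begins[i] || !last || last.end !== i || last.label !== labelIds[i]) {
      chunks.push({ label: labelIds[i], start: i, end: i + 1 }); // new chunk
    } else {
      last.end = i + 1; // I- tag extends the open chunk
    }
  }
  return chunks;
}

// Tags B-1 I-1 O B-2 decode to two chunks:
const chunks = decodeChunks(Uint16Array.of(1, 1, 0, 2), Uint8Array.of(1, 0, 0, 1));
console.log(chunks);
// [ { label: 1, start: 0, end: 2 }, { label: 2, start: 3, end: 4 } ]
```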

cykRecognizeIds()

Performs CYK parsing recognition on a token sequence.
cykRecognizeIds(input: {
  tokenBits: BigUint64Array;
  binaryLeft: Uint16Array;
  binaryRight: Uint16Array;
  binaryParent: Uint16Array;
  unaryChild: Uint16Array;
  unaryParent: Uint16Array;
  startSymbol: number;
}): boolean

Returns

Returns true if the input is recognized by the grammar, false otherwise.
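For intuition, here is a compact plain-TypeScript CYK recognizer over a grammar in Chomsky normal form. It is illustrative only: the WASM version packs symbol sets into BigUint64Array bitmasks and flat rule arrays for speed, while this sketch uses plain sets.

```typescript
type BinaryRule = { parent: number; left: number; right: number };

// chart[i][span-1] holds the nonterminals that derive tokens i..i+span-1.
function cykRecognize(
  tokenSymbols: number[][], // per token: nonterminals that can yield it
  rules: BinaryRule[],
  startSymbol: number
): boolean {
  const n = tokenSymbols.length;
  const chart: Set<number>[][] = Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) =>
      j === 0 ? new Set(tokenSymbols[i]) : new Set<number>()
    )
  );
  for (let span = 2; span <= n; span++) {
    for (let i = 0; i + span <= n; i++) {
      for (let split = 1; split < span; split++) {
        for (const r of rules) {
          if (chart[i][split - 1].has(r.left) &&
              chart[i + split][span - split - 1].has(r.right)) {
            chart[i][span - 1].add(r.parent);
          }
        }
      }
    }
  }
  return chart[0][n - 1].has(startSymbol);
}

// Grammar S -> A B (S=0, A=1, B=2); the tokens yield A then B.
const ok = cykRecognize([[1], [2]], [{ parent: 0, left: 1, right: 2 }], 0);
console.log(ok); // true
```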

WordNet

wordnetMorphyAscii()

Finds the morphological root form of a word.
wordnetMorphyAscii(word: string, pos?: "n" | "v" | "a" | "r"): string

Parameters

  • word (string, required): Word to find the root form of
  • pos ("n" | "v" | "a" | "r", optional): Part of speech: noun (n), verb (v), adjective (a), or adverb (r)

Example

const root = wasm.wordnetMorphyAscii("running", "v");
console.log(root); // "run"

const root2 = wasm.wordnetMorphyAscii("geese", "n");
console.log(root2); // "goose"

Performance Considerations

Memory Reuse

The WASM runtime reuses memory blocks across operations, so repeated calls to the same method benefit from this optimization:
// Efficient: Memory blocks reused
for (const text of texts) {
  const tokens = wasm.tokenizeAscii(text);
  processTokens(tokens);
}

Zero-Copy Operations

Use offset-based methods for zero-copy processing:
const { total, offsets, lengths, input } = wasm.tokenOffsetsAscii(text);

// Process tokens without creating intermediate strings; iterate up to
// `total`, since the typed arrays may be views over reusable buffers
for (let i = 0; i < total; i++) {
  const start = offsets[i];
  const len = lengths[i];
  // Work directly with input.subarray(start, start + len)
}
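When a string is eventually needed for a single token, decode only that byte slice with TextDecoder. The snippet below is self-contained, with hand-built offsets standing in for a tokenOffsetsAscii() result of the same shape:

```typescript
// "Hello world": token 0 at bytes [0, 5), token 1 at bytes [6, 11).
const input = new TextEncoder().encode("Hello world");
const offsets = Uint32Array.of(0, 6);
const lengths = Uint32Array.of(5, 5);

// Decode just the second token's bytes; the rest of `input` is untouched.
const decoder = new TextDecoder();
const second = decoder.decode(input.subarray(offsets[1], offsets[1] + lengths[1]));
console.log(second); // "world"
```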
