
Overview

After initializing a WasmNltk instance, you can use the methods below for high-performance text processing.

Tokenization

countTokensAscii()

Counts the number of tokens in ASCII text.
countTokensAscii(text: string): number

Example

const wasm = await WasmNltk.init();
const count = wasm.countTokensAscii("Hello, world! How are you?");
console.log(count); // 5

tokenizeAscii()

Tokenizes ASCII text into an array of lowercase tokens.
tokenizeAscii(text: string): string[]

Example

const tokens = wasm.tokenizeAscii("Hello, world!");
console.log(tokens); // ["hello", "world"]

tokenOffsetsAscii()

Returns token offsets and lengths for zero-copy tokenization.
tokenOffsetsAscii(text: string): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}

Example

const result = wasm.tokenOffsetsAscii("Hello world");
console.log(result.total); // 2
console.log(result.offsets); // Uint32Array [0, 6]
console.log(result.lengths); // Uint32Array [5, 5]

normalizeTokensAscii()

Normalizes tokens: lowercases, removes punctuation, and optionally filters stopwords.
normalizeTokensAscii(text: string, removeStopwords?: boolean): string[]

Parameters

  • text (string, required): Input text to normalize
  • removeStopwords (boolean, default: true): Whether to remove common English stopwords

Example

const normalized = wasm.normalizeTokensAscii("The quick brown fox", true);
console.log(normalized); // ["quick", "brown", "fox"] ("the" removed)

normalizedTokenOffsetsAscii()

Returns normalized token offsets for zero-copy processing.
normalizedTokenOffsetsAscii(text: string, removeStopwords?: boolean): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}
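The shape of this result can be illustrated in plain TypeScript. The sketch below is not the WASM implementation: it scans ASCII letter runs, skips a tiny sample stopword list (the real list is larger), and records byte offsets into the encoded input.

```typescript
// Illustrative sketch of the { total, offsets, lengths, input } shape.
// STOPWORDS here is a stand-in; the WASM module ships its own list.
const STOPWORDS = new Set(["the", "a", "an", "and", "of", "to"]);

function normalizedTokenOffsets(text: string, removeStopwords = true) {
  const input = new TextEncoder().encode(text);
  const offsets: number[] = [];
  const lengths: number[] = [];
  const re = /[A-Za-z]+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const token = m[0].toLowerCase();
    if (removeStopwords && STOPWORDS.has(token)) continue;
    offsets.push(m.index); // ASCII, so char offset == byte offset
    lengths.push(m[0].length);
  }
  return {
    total: offsets.length,
    offsets: Uint32Array.from(offsets),
    lengths: Uint32Array.from(lengths),
    input,
  };
}

const r = normalizedTokenOffsets("The quick brown fox");
console.log(r.total);   // 3
console.log(r.offsets); // Uint32Array [4, 10, 16]
console.log(r.lengths); // Uint32Array [5, 5, 3]
```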

Sentence Tokenization

sentenceTokenizePunktAscii()

Tokenizes text into sentences using the Punkt algorithm.
sentenceTokenizePunktAscii(text: string): string[]

Example

const sentences = wasm.sentenceTokenizePunktAscii(
  "Hello world. How are you? I'm fine."
);
console.log(sentences);
// ["Hello world.", "How are you?", "I'm fine."]

N-grams & Metrics

countNgramsAscii()

Counts the total number of n-grams in the text.
countNgramsAscii(text: string, n: number): number

Example

const count = wasm.countNgramsAscii("one two three four", 2);
console.log(count); // 3 ("one two", "two three", "three four")

computeAsciiMetrics()

Computes comprehensive text metrics in a single pass.
computeAsciiMetrics(text: string, n: number): AsciiMetrics

AsciiMetrics Type

type AsciiMetrics = {
  tokens: number;         // Total token count
  uniqueTokens: number;   // Unique token count
  ngrams: number;         // Total n-gram count
  uniqueNgrams: number;   // Unique n-gram count
};

Example

const metrics = wasm.computeAsciiMetrics("one two three two one", 2);
console.log(metrics);
// {
//   tokens: 5,
//   uniqueTokens: 3,
//   ngrams: 4,
//   uniqueNgrams: 4
// }

Machine Learning

perceptronPredictBatch()

Performs batch prediction using a perceptron model.
perceptronPredictBatch(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array,
  weights: Float32Array,
  modelFeatureCount: number,
  tagCount: number
): Uint16Array

Parameters

  • featureIds (Uint32Array, required): Feature IDs for all tokens (flat array)
  • tokenOffsets (Uint32Array, required): Offsets into featureIds for each token (length = tokenCount + 1)
  • weights (Float32Array, required): Model weights (shape: [modelFeatureCount, tagCount])
  • modelFeatureCount (number, required): Number of features in the model
  • tagCount (number, required): Number of possible tags

Returns

Returns a Uint16Array of predicted tag IDs, one per token (length = tokenCount).
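The scoring this call performs can be sketched in plain TypeScript (hypothetical reference logic, not the WASM source): for each token, sum the weight rows of its active features and take the argmax tag.

```typescript
// Reference sketch of averaged-perceptron batch scoring over flat arrays.
function perceptronPredictBatchSketch(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array,
  weights: Float32Array,   // flattened [modelFeatureCount, tagCount]
  tagCount: number
): Uint16Array {
  const tokenCount = tokenOffsets.length - 1;
  const out = new Uint16Array(tokenCount);
  const scores = new Float64Array(tagCount);
  for (let t = 0; t < tokenCount; t++) {
    scores.fill(0);
    // Accumulate the weight row of every feature this token fires.
    for (let f = tokenOffsets[t]; f < tokenOffsets[t + 1]; f++) {
      const row = featureIds[f] * tagCount;
      for (let tag = 0; tag < tagCount; tag++) scores[tag] += weights[row + tag];
    }
    let best = 0;
    for (let tag = 1; tag < tagCount; tag++) if (scores[tag] > scores[best]) best = tag;
    out[t] = best;
  }
  return out;
}

// Two tokens, two tags, two features:
const preds = perceptronPredictBatchSketch(
  Uint32Array.of(0, 1),    // token 0 fires feature 0, token 1 fires feature 1
  Uint32Array.of(0, 1, 2), // token 0 -> featureIds[0..1), token 1 -> [1..2)
  Float32Array.of(1, 0,    // feature 0 favors tag 0
                  0, 1),   // feature 1 favors tag 1
  2
);
console.log(preds); // Uint16Array [0, 1]
```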

naiveBayesLogScoresIds()

Computes Naive Bayes log-probability scores for classification.
naiveBayesLogScoresIds(input: {
  docTokenIds: Uint32Array;
  vocabSize: number;
  tokenCountsMatrix: Uint32Array;
  labelDocCounts: Uint32Array;
  labelTokenTotals: Uint32Array;
  totalDocs: number;
  smoothing: number;
}): Float64Array

Returns

Returns a Float64Array of log scores, one per label.
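The math behind these scores can be sketched in plain TypeScript: multinomial Naive Bayes with Lidstone smoothing, matching the input shape above. This is a hypothetical reference implementation for intuition, not the WASM source.

```typescript
// log P(label) + sum over doc tokens of log P(token | label), with
// add-`smoothing` (Lidstone) smoothing over a vocabulary of vocabSize.
function naiveBayesLogScoresSketch(input: {
  docTokenIds: Uint32Array;
  vocabSize: number;
  tokenCountsMatrix: Uint32Array; // flattened [labelCount, vocabSize]
  labelDocCounts: Uint32Array;
  labelTokenTotals: Uint32Array;
  totalDocs: number;
  smoothing: number;
}): Float64Array {
  const labelCount = input.labelDocCounts.length;
  const scores = new Float64Array(labelCount);
  for (let l = 0; l < labelCount; l++) {
    let s = Math.log(input.labelDocCounts[l] / input.totalDocs); // prior
    const denom = input.labelTokenTotals[l] + input.smoothing * input.vocabSize;
    for (const tok of input.docTokenIds) {
      const count = input.tokenCountsMatrix[l * input.vocabSize + tok];
      s += Math.log((count + input.smoothing) / denom); // likelihood
    }
    scores[l] = s;
  }
  return scores;
}

const scores = naiveBayesLogScoresSketch({
  docTokenIds: Uint32Array.of(0, 0),      // document = [token 0, token 0]
  vocabSize: 2,
  tokenCountsMatrix: Uint32Array.of(3, 1, // label 0 saw token 0 often
                                    1, 3), // label 1 saw token 1 often
  labelDocCounts: Uint32Array.of(1, 1),
  labelTokenTotals: Uint32Array.of(4, 4),
  totalDocs: 2,
  smoothing: 1,
});
console.log(scores[0] > scores[1]); // true: label 0 is more likely
```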

Language Modeling

evaluateLanguageModelIds()

Evaluates an n-gram language model.
evaluateLanguageModelIds(input: {
  tokenIds: Uint32Array;
  sentenceOffsets: Uint32Array;
  order: number;
  model: WasmLmModelType;
  gamma: number;
  discount: number;
  vocabSize: number;
  probeContextFlat: Uint32Array;
  probeContextLens: Uint32Array;
  probeWordIds: Uint32Array;
  perplexityTokenIds: Uint32Array;
  prefixTokenIds: Uint32Array;
}): { scores: Float64Array; perplexity: number }

WasmLmModelType

type WasmLmModelType = "mle" | "lidstone" | "kneser_ney_interpolated";

Parameters

  • order (number, required): N-gram order (e.g., 2 for bigram, 3 for trigram)
  • model (WasmLmModelType, required): Language model type
  • gamma (number, required): Smoothing parameter (used by "lidstone")
  • discount (number, required): Discount parameter (used by "kneser_ney_interpolated")

Returns

Returns an object with:
  • scores: Float64Array of probability scores for probe words
  • perplexity: Overall perplexity score
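To make the perplexity figure concrete, here is a tiny plain-TypeScript bigram MLE model (the "mle" setting) evaluated on its own training data. This is illustrative math only, not the WASM implementation, and it uses plain arrays rather than the flat typed-array inputs above.

```typescript
// Perplexity = exp(-mean log P(w_i | w_{i-1})) under a bigram MLE model.
function bigramMlePerplexity(train: number[][], test: number[]): number {
  const bigram = new Map<string, number>();
  const unigram = new Map<number, number>();
  for (const sent of train) {
    for (let i = 0; i + 1 < sent.length; i++) {
      const key = `${sent[i]},${sent[i + 1]}`;
      bigram.set(key, (bigram.get(key) ?? 0) + 1);
      unigram.set(sent[i], (unigram.get(sent[i]) ?? 0) + 1);
    }
  }
  let logSum = 0, n = 0;
  for (let i = 0; i + 1 < test.length; i++) {
    const p = (bigram.get(`${test[i]},${test[i + 1]}`) ?? 0) /
              (unigram.get(test[i]) ?? 1);
    logSum += Math.log(p);
    n++;
  }
  return Math.exp(-logSum / n);
}

// Every test bigram was seen exactly once in training, so each conditional
// probability is 1 and perplexity is exactly 1 (its minimum).
console.log(bigramMlePerplexity([[1, 2, 3]], [1, 2, 3])); // 1
```

Lower perplexity means the model is less "surprised" by the evaluation tokens; the Lidstone and Kneser-Ney settings change how probability mass is assigned to unseen n-grams.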

Parsing & Chunking

chunkIobIds()

Performs IOB chunking on tagged tokens.
chunkIobIds(input: {
  tokenTagIds: Uint16Array;
  atomAllowedOffsets: Uint32Array;
  atomAllowedLengths: Uint32Array;
  atomAllowedFlat: Uint16Array;
  atomMins: Uint8Array;
  atomMaxs: Uint8Array;
  ruleAtomOffsets: Uint32Array;
  ruleAtomCounts: Uint32Array;
  ruleLabelIds: Uint16Array;
}): { labelIds: Uint16Array; begins: Uint8Array }
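The returned labelIds and begins arrays can be decoded back into labeled spans. A minimal sketch, assuming begins[i] = 1 marks a B- tag and labelIds[i] = 0 means "outside" (both are assumptions about the encoding, not confirmed by the API):

```typescript
// Group IOB output into { label, start, end } spans (end exclusive).
function decodeChunks(labelIds: Uint16Array, begins: Uint8Array) {
  const chunks: { label: number; start: number; end: number }[] = [];
  for (let i = 0; i < labelIds.length; i++) {
    if (labelIds[i] === 0) continue; // assumed O tag
    const last = chunks[chunks.length - 1];
    if (begins[i] || !last || last.end !== i || last.label !== labelIds[i]) {
      chunks.push({ label: labelIds[i], start: i, end: i + 1 }); // new chunk
    } else {
      last.end = i + 1; // I- tag extends the open chunk
    }
  }
  return chunks;
}

// Tags B-1 I-1 O B-2 decode to two chunks:
const chunks = decodeChunks(Uint16Array.of(1, 1, 0, 2), Uint8Array.of(1, 0, 0, 1));
console.log(chunks);
// [ { label: 1, start: 0, end: 2 }, { label: 2, start: 3, end: 4 } ]
```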

cykRecognizeIds()

Performs CYK parsing recognition on a token sequence.
cykRecognizeIds(input: {
  tokenBits: BigUint64Array;
  binaryLeft: Uint16Array;
  binaryRight: Uint16Array;
  binaryParent: Uint16Array;
  unaryChild: Uint16Array;
  unaryParent: Uint16Array;
  startSymbol: number;
}): boolean

Returns

Returns true if the input is recognized by the grammar, false otherwise.
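For intuition, here is a compact plain-TypeScript CYK recognizer over a grammar in Chomsky normal form. It is illustrative only: the WASM version packs symbol sets into BigUint64Array bitmasks and flat rule arrays for speed, while this sketch uses plain sets.

```typescript
type BinaryRule = { parent: number; left: number; right: number };

// chart[i][span-1] holds the nonterminals that derive tokens i..i+span-1.
function cykRecognize(
  tokenSymbols: number[][], // per token: nonterminals that can yield it
  rules: BinaryRule[],
  startSymbol: number
): boolean {
  const n = tokenSymbols.length;
  const chart: Set<number>[][] = Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) =>
      j === 0 ? new Set(tokenSymbols[i]) : new Set<number>()
    )
  );
  for (let span = 2; span <= n; span++) {
    for (let i = 0; i + span <= n; i++) {
      for (let split = 1; split < span; split++) {
        for (const r of rules) {
          if (chart[i][split - 1].has(r.left) &&
              chart[i + split][span - split - 1].has(r.right)) {
            chart[i][span - 1].add(r.parent);
          }
        }
      }
    }
  }
  return chart[0][n - 1].has(startSymbol);
}

// Grammar S -> A B (S=0, A=1, B=2); the tokens yield A then B.
const ok = cykRecognize([[1], [2]], [{ parent: 0, left: 1, right: 2 }], 0);
console.log(ok); // true
```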

WordNet

wordnetMorphyAscii()

Finds the morphological root form of a word.
wordnetMorphyAscii(word: string, pos?: "n" | "v" | "a" | "r"): string

Parameters

  • word (string, required): Word to find the root form of
  • pos ("n" | "v" | "a" | "r", optional): Part of speech: noun (n), verb (v), adjective (a), or adverb (r)

Example

const root = wasm.wordnetMorphyAscii("running", "v");
console.log(root); // "run"

const root2 = wasm.wordnetMorphyAscii("geese", "n");
console.log(root2); // "goose"

Performance Considerations

Memory Reuse

The WASM runtime reuses memory blocks across operations, so repeated calls to the same method benefit from this optimization:
// Efficient: Memory blocks reused
for (const text of texts) {
  const tokens = wasm.tokenizeAscii(text);
  processTokens(tokens);
}

Zero-Copy Operations

Use offset-based methods for zero-copy processing:
const { total, offsets, lengths, input } = wasm.tokenOffsetsAscii(text);

// Process tokens without creating intermediate strings; iterate up to
// `total`, since the typed arrays may be views over reusable buffers
for (let i = 0; i < total; i++) {
  const start = offsets[i];
  const len = lengths[i];
  // Work directly with input.subarray(start, start + len)
}
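When a string is eventually needed for a single token, decode only that byte slice with TextDecoder. The snippet below is self-contained, with hand-built offsets standing in for a tokenOffsetsAscii() result of the same shape:

```typescript
// "Hello world": token 0 at bytes [0, 5), token 1 at bytes [6, 11).
const input = new TextEncoder().encode("Hello world");
const offsets = Uint32Array.of(0, 6);
const lengths = Uint32Array.of(5, 5);

// Decode just the second token's bytes; the rest of `input` is untouched.
const decoder = new TextDecoder();
const second = decoder.decode(input.subarray(offsets[1], offsets[1] + lengths[1]));
console.log(second); // "world"
```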
