Overview
After initializing a WasmNltk instance, you can call the methods below for high-performance text processing.
Tokenization
countTokensAscii()
Counts the number of tokens in ASCII text.
countTokensAscii(text: string): number
Example
const wasm = await WasmNltk.init();
const count = wasm.countTokensAscii("Hello, world! How are you?");
console.log(count); // 5
tokenizeAscii()
Tokenizes ASCII text into an array of lowercase tokens.
tokenizeAscii(text: string): string[]
Example
const tokens = wasm.tokenizeAscii("Hello, world!");
console.log(tokens); // ["hello", "world"]
tokenOffsetsAscii()
Returns token offsets and lengths for zero-copy tokenization.
tokenOffsetsAscii(text: string): {
total: number;
offsets: Uint32Array;
lengths: Uint32Array;
input: Uint8Array;
}
Example
const result = wasm.tokenOffsetsAscii("Hello world");
console.log(result.total); // 2
console.log(result.offsets); // Uint32Array [0, 6]
console.log(result.lengths); // Uint32Array [5, 5]
normalizeTokensAscii()
Normalizes tokens: lowercases, removes punctuation, and optionally filters stopwords.
normalizeTokensAscii(text: string, removeStopwords?: boolean): string[]
Parameters
removeStopwords (optional): Whether to remove common English stopwords
Example
const normalized = wasm.normalizeTokensAscii("The quick brown fox", true);
console.log(normalized); // ["quick", "brown", "fox"] ("the" removed)
normalizedTokenOffsetsAscii()
Returns normalized token offsets for zero-copy processing.
normalizedTokenOffsetsAscii(text: string, removeStopwords?: boolean): {
total: number;
offsets: Uint32Array;
lengths: Uint32Array;
input: Uint8Array;
}
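The sketch below shows one way to consume the documented return shape, materializing token strings only when needed. The result object is hand-built here to mirror the shape above; the offsets and lengths are illustrative, not produced by the library.

```typescript
// Hypothetical result mirroring the documented return shape of
// normalizedTokenOffsetsAscii("The quick brown fox", true).
const input = new TextEncoder().encode("the quick brown fox");
const result = {
  total: 3,
  offsets: new Uint32Array([4, 10, 16]),
  lengths: new Uint32Array([5, 5, 3]),
  input,
};

// Materialize strings lazily with TextDecoder.
const decoder = new TextDecoder();
const tokens: string[] = [];
for (let i = 0; i < result.total; i++) {
  const start = result.offsets[i];
  tokens.push(decoder.decode(result.input.subarray(start, start + result.lengths[i])));
}
// tokens: ["quick", "brown", "fox"]
```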
Sentence Tokenization
sentenceTokenizePunktAscii()
Tokenizes text into sentences using the Punkt algorithm.
sentenceTokenizePunktAscii(text: string): string[]
Example
const sentences = wasm.sentenceTokenizePunktAscii(
"Hello world. How are you? I'm fine."
);
console.log(sentences);
// ["Hello world.", "How are you?", "I'm fine."]
N-grams & Metrics
countNgramsAscii()
Counts the total number of n-grams in the text.
countNgramsAscii(text: string, n: number): number
Example
const count = wasm.countNgramsAscii("one two three four", 2);
console.log(count); // 3 ("one two", "two three", "three four")
computeAsciiMetrics()
Computes comprehensive text metrics in a single pass.
computeAsciiMetrics(text: string, n: number): AsciiMetrics
AsciiMetrics Type
type AsciiMetrics = {
tokens: number; // Total token count
uniqueTokens: number; // Unique token count
ngrams: number; // Total n-gram count
uniqueNgrams: number; // Unique n-gram count
};
Example
const metrics = wasm.computeAsciiMetrics("one two three two one", 2);
console.log(metrics);
// {
// tokens: 5,
// uniqueTokens: 3,
// ngrams: 4,
// uniqueNgrams: 4
// }
Machine Learning
perceptronPredictBatch()
Performs batch prediction using a perceptron model.
perceptronPredictBatch(
featureIds: Uint32Array,
tokenOffsets: Uint32Array,
weights: Float32Array,
modelFeatureCount: number,
tagCount: number
): Uint16Array
Parameters
featureIds: Feature IDs for all tokens (flat array)
tokenOffsets: Offsets into featureIds for each token (length = tokenCount + 1)
weights: Model weights (shape: [modelFeatureCount, tagCount])
modelFeatureCount: Number of features in the model
tagCount: Number of possible tags
Returns
Returns Uint16Array of predicted tag IDs (length = tokenCount).
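The flat-array layout is easiest to see in a small reference sketch. The code below mirrors the assumed scoring semantics (sum the weight rows of each token's features, take the argmax tag) with the weight layout assumed row-major as `weights[featureId * tagCount + tag]`; it is not the library's implementation.

```typescript
// Reference sketch of the assumed perceptron scoring, not the WASM code.
function perceptronPredictBatchRef(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array, // length = tokenCount + 1
  weights: Float32Array,     // assumed row-major [modelFeatureCount, tagCount]
  tagCount: number
): Uint16Array {
  const tokenCount = tokenOffsets.length - 1;
  const out = new Uint16Array(tokenCount);
  const scores = new Float32Array(tagCount);
  for (let t = 0; t < tokenCount; t++) {
    scores.fill(0);
    // Sum the weight rows of every feature fired by this token.
    for (let f = tokenOffsets[t]; f < tokenOffsets[t + 1]; f++) {
      const base = featureIds[f] * tagCount;
      for (let tag = 0; tag < tagCount; tag++) scores[tag] += weights[base + tag];
    }
    // Predict the argmax tag.
    let best = 0;
    for (let tag = 1; tag < tagCount; tag++) if (scores[tag] > scores[best]) best = tag;
    out[t] = best;
  }
  return out;
}

// Two tokens, two tags, two model features.
const predWeights = new Float32Array([1, 0,  0, 2]); // feature 0 favors tag 0, feature 1 favors tag 1
const predTags = perceptronPredictBatchRef(
  new Uint32Array([0, 1]),    // token 0 fires feature 0, token 1 fires feature 1
  new Uint32Array([0, 1, 2]), // token 0 → features [0,1), token 1 → features [1,2)
  predWeights,
  2
);
// predTags: Uint16Array [0, 1]
```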
naiveBayesLogScoresIds()
Computes Naive Bayes log-probability scores for classification.
naiveBayesLogScoresIds(input: {
docTokenIds: Uint32Array;
vocabSize: number;
tokenCountsMatrix: Uint32Array;
labelDocCounts: Uint32Array;
labelTokenTotals: Uint32Array;
totalDocs: number;
smoothing: number;
}): Float64Array
Returns
Returns Float64Array of log scores for each label.
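As a sketch of the assumed computation (standard multinomial Naive Bayes with additive smoothing), the reference below shows how the flat inputs plausibly fit together; `tokenCountsMatrix` is assumed row-major `[labelCount, vocabSize]`, and none of this is confirmed library behavior.

```typescript
// Reference sketch: log prior from document frequencies plus smoothed
// per-token log likelihoods. Not the WASM implementation.
function naiveBayesLogScoresRef(input: {
  docTokenIds: Uint32Array;
  vocabSize: number;
  tokenCountsMatrix: Uint32Array; // assumed row-major [labelCount, vocabSize]
  labelDocCounts: Uint32Array;
  labelTokenTotals: Uint32Array;
  totalDocs: number;
  smoothing: number;
}): Float64Array {
  const labelCount = input.labelDocCounts.length;
  const out = new Float64Array(labelCount);
  for (let l = 0; l < labelCount; l++) {
    let s = Math.log(input.labelDocCounts[l] / input.totalDocs); // log prior
    const denom = input.labelTokenTotals[l] + input.smoothing * input.vocabSize;
    for (let t = 0; t < input.docTokenIds.length; t++) {
      const count = input.tokenCountsMatrix[l * input.vocabSize + input.docTokenIds[t]];
      s += Math.log((count + input.smoothing) / denom); // smoothed likelihood
    }
    out[l] = s;
  }
  return out;
}

// Two labels over a vocab of 3; the document repeats token 0, which label 0 favors.
const nbScores = naiveBayesLogScoresRef({
  docTokenIds: new Uint32Array([0, 0]),
  vocabSize: 3,
  tokenCountsMatrix: new Uint32Array([8, 1, 1,  1, 8, 1]),
  labelDocCounts: new Uint32Array([5, 5]),
  labelTokenTotals: new Uint32Array([10, 10]),
  totalDocs: 10,
  smoothing: 1,
});
// nbScores[0] > nbScores[1]: label 0 is the better fit
```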
Language Modeling
evaluateLanguageModelIds()
Evaluates an n-gram language model: scores probe word/context queries and computes perplexity.
evaluateLanguageModelIds(input: {
tokenIds: Uint32Array;
sentenceOffsets: Uint32Array;
order: number;
model: WasmLmModelType;
gamma: number;
discount: number;
vocabSize: number;
probeContextFlat: Uint32Array;
probeContextLens: Uint32Array;
probeWordIds: Uint32Array;
perplexityTokenIds: Uint32Array;
prefixTokenIds: Uint32Array;
}): { scores: Float64Array; perplexity: number }
WasmLmModelType
type WasmLmModelType = "mle" | "lidstone" | "kneser_ney_interpolated";
Parameters
order: N-gram order (e.g., 2 for bigram, 3 for trigram)
gamma: Smoothing parameter (used by the Lidstone model)
discount: Discount parameter (used by the Kneser-Ney model)
Returns
Returns an object with:
scores: Float64Array of probability scores for probe words
perplexity: Overall perplexity score
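The flat fields can be packed as sketched below. This layout is inferred from the field names and lengths, and is an assumption rather than confirmed behavior:

```typescript
// Two training sentences, "a b c" and "b c", with vocabulary ids a=0, b=1, c=2.
const lmTokenIds = new Uint32Array([0, 1, 2,  1, 2]);
// Sentence i spans [sentenceOffsets[i], sentenceOffsets[i+1]) in lmTokenIds.
const lmSentenceOffsets = new Uint32Array([0, 3, 5]);

// Two probe queries for a bigram model (order = 2): P(c | b) and P(b | a).
// Contexts are concatenated into one flat array, with per-query lengths alongside.
const probeContextFlat = new Uint32Array([1, 0]); // contexts: [b], [a]
const probeContextLens = new Uint32Array([1, 1]);
const probeWordIds     = new Uint32Array([2, 1]); // words: c, b
```

With this packing, the last sentence offset equals the token-array length, the context lengths sum to the flat context array's length, and there is one probe word per context.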
Parsing & Chunking
chunkIobIds()
Performs IOB chunking on tagged tokens.
chunkIobIds(input: {
tokenTagIds: Uint16Array;
atomAllowedOffsets: Uint32Array;
atomAllowedLengths: Uint32Array;
atomAllowedFlat: Uint16Array;
atomMins: Uint8Array;
atomMaxs: Uint8Array;
ruleAtomOffsets: Uint32Array;
ruleAtomCounts: Uint32Array;
ruleLabelIds: Uint16Array;
}): { labelIds: Uint16Array; begins: Uint8Array }
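The sketch below shows one plausible encoding inferred from the field names; the real layout is not confirmed here. It encodes a single chunk rule, NP → DT? JJ* NN+, where each "atom" is one allowed tag set with a min/max repetition count; tag ids DT=0, JJ=1, NN=2 and label id NP=0 are hypothetical.

```typescript
// Hypothetical encoding of one rule: NP → DT? JJ* NN+
const atomAllowedFlat    = new Uint16Array([0, 1, 2]); // DT, JJ, NN
const atomAllowedOffsets = new Uint32Array([0, 1, 2]); // atom i's tag set starts here
const atomAllowedLengths = new Uint32Array([1, 1, 1]); // one allowed tag per atom
const atomMins = new Uint8Array([0, 0, 1]);            // DT?  JJ*  NN+
const atomMaxs = new Uint8Array([255, 255, 255]);      // 255 as "unbounded" is an assumption

const ruleAtomOffsets = new Uint32Array([0]); // rule 0's atoms start at atom 0
const ruleAtomCounts  = new Uint32Array([3]); // rule 0 has three atoms
const ruleLabelIds    = new Uint16Array([0]); // rule 0 emits label NP
```

Under this layout, the last atom's offset plus length equals the flat tag array's length, and each rule's atom range stays inside the atom arrays.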
cykRecognizeIds()
Performs CYK parsing recognition on a token sequence.
cykRecognizeIds(input: {
tokenBits: BigUint64Array;
binaryLeft: Uint16Array;
binaryRight: Uint16Array;
binaryParent: Uint16Array;
unaryChild: Uint16Array;
unaryParent: Uint16Array;
startSymbol: number;
}): boolean
Returns
Returns true if the input is recognized by the grammar, false otherwise.
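A reference sketch helps clarify the assumed encoding: `tokenBits[i]` is read here as a 64-bit mask of the nonterminals that can yield token i, binary rules are parallel arrays (parent → left right), and unary rules (parent → child) are closed over each chart cell. This mirrors textbook CYK and is not the library's internals.

```typescript
// Reference CYK recognizer over the assumed bitmask encoding.
function cykRecognizeRef(input: {
  tokenBits: BigUint64Array;
  binaryLeft: Uint16Array;
  binaryRight: Uint16Array;
  binaryParent: Uint16Array;
  unaryChild: Uint16Array;
  unaryParent: Uint16Array;
  startSymbol: number;
}): boolean {
  const n = input.tokenBits.length;
  // chart[i][len-1] = bitmask of symbols spanning tokens i..i+len-1
  const chart: bigint[][] = Array.from({ length: n }, () => new Array<bigint>(n).fill(0n));

  // Repeatedly apply unary rules until the cell's mask stops growing.
  const closeUnary = (mask: bigint): bigint => {
    let changed = true;
    while (changed) {
      changed = false;
      for (let r = 0; r < input.unaryChild.length; r++) {
        const parentBit = 1n << BigInt(input.unaryParent[r]);
        if ((mask & (1n << BigInt(input.unaryChild[r]))) !== 0n && (mask & parentBit) === 0n) {
          mask |= parentBit;
          changed = true;
        }
      }
    }
    return mask;
  };

  for (let i = 0; i < n; i++) chart[i][0] = closeUnary(input.tokenBits[i]);
  for (let len = 2; len <= n; len++) {
    for (let i = 0; i + len <= n; i++) {
      let mask = 0n;
      for (let split = 1; split < len; split++) {
        const left = chart[i][split - 1];
        const right = chart[i + split][len - split - 1];
        for (let r = 0; r < input.binaryParent.length; r++) {
          if ((left & (1n << BigInt(input.binaryLeft[r]))) !== 0n &&
              (right & (1n << BigInt(input.binaryRight[r]))) !== 0n) {
            mask |= 1n << BigInt(input.binaryParent[r]);
          }
        }
      }
      chart[i][len - 1] = closeUnary(mask);
    }
  }
  return (chart[0][n - 1] & (1n << BigInt(input.startSymbol))) !== 0n;
}

// Grammar S → NP VP with symbols S=0, NP=1, VP=2; two tokens: an NP then a VP.
const recognized = cykRecognizeRef({
  tokenBits: new BigUint64Array([1n << 1n, 1n << 2n]),
  binaryLeft: new Uint16Array([1]),
  binaryRight: new Uint16Array([2]),
  binaryParent: new Uint16Array([0]),
  unaryChild: new Uint16Array([]),
  unaryParent: new Uint16Array([]),
  startSymbol: 0,
});
// recognized === true
```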
WordNet
wordnetMorphyAscii()
Finds the morphological root form of a word.
wordnetMorphyAscii(word: string, pos?: "n" | "v" | "a" | "r"): string
Parameters
word: Word to find the root form of
pos (optional): Part of speech: noun ("n"), verb ("v"), adjective ("a"), or adverb ("r")
Example
const root = wasm.wordnetMorphyAscii("running", "v");
console.log(root); // "run"
const root2 = wasm.wordnetMorphyAscii("geese", "n");
console.log(root2); // "goose"
Memory Reuse
The WASM runtime reuses memory blocks across operations, so repeated calls to the same method benefit from this optimization:
// Efficient: Memory blocks reused
for (const text of texts) {
const tokens = wasm.tokenizeAscii(text);
processTokens(tokens);
}
Zero-Copy Operations
Use offset-based methods for zero-copy processing:
const { total, offsets, lengths, input } = wasm.tokenOffsetsAscii(text);
// Process tokens without creating strings. Loop over `total`, not
// `offsets.length`: reused buffers may be larger than the token count.
for (let i = 0; i < total; i++) {
  const start = offsets[i];
  const len = lengths[i];
  // Work directly with input.subarray(start, start + len)
}
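For byte-level comparisons, no decoding is needed at all. The sketch below hand-builds data in the documented result shape and tests a token against a keyword without allocating any strings:

```typescript
// Hand-built stand-in for a tokenOffsetsAscii("hello world") result.
const bytes = new TextEncoder().encode("hello world");
const tokOffsets = new Uint32Array([0, 6]);
const tokLengths = new Uint32Array([5, 5]);

const target = new TextEncoder().encode("world");

// Compare token i to `target` byte by byte, with no string allocation.
function tokenEquals(i: number): boolean {
  if (tokLengths[i] !== target.length) return false;
  for (let b = 0; b < target.length; b++) {
    if (bytes[tokOffsets[i] + b] !== target[b]) return false;
  }
  return true;
}
// tokenEquals(0) === false, tokenEquals(1) === true
```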
See Also