# WasmNltk Class

The `WasmNltk` class provides a high-level interface to the `bun_nltk` WebAssembly module.
## Initialization

### WasmNltk.init()

```typescript
static async init(init?: WasmNltkInit): Promise<WasmNltk>
```

Initializes a new WASM instance.

**Parameters:**

- `init.wasmBytes` (optional): Raw WASM module bytes. Use this when loading from a URL or bundled asset.
- `init.wasmPath` (optional): Path to the WASM file. Defaults to `../native/bun_nltk.wasm` relative to the module.

**Returns:** `Promise<WasmNltk>` - Initialized WASM instance

**Example:**

```typescript
import { WasmNltk } from 'bun_nltk/wasm';

// From URL
const response = await fetch('/bun_nltk.wasm');
const wasmBytes = new Uint8Array(await response.arrayBuffer());
const nltk = await WasmNltk.init({ wasmBytes });

// From path (Node.js/Bun)
const nltkFromPath = await WasmNltk.init({ wasmPath: './bun_nltk.wasm' });

// Default path
const nltkDefault = await WasmNltk.init();
```
## Memory Management

### dispose()

Frees all allocated memory blocks. Call this when you're done using the instance.

**Example:**

```typescript
const nltk = await WasmNltk.init({ wasmBytes });
// Use nltk...
nltk.dispose();
```
## Tokenization Methods

### tokenizeAscii()

```typescript
tokenizeAscii(text: string): string[]
```

Tokenizes ASCII text into lowercase words.

**Parameters:**

- `text`: Input text to tokenize (ASCII only)

**Returns:** `string[]` - Array of lowercase tokens

**Example:**

```typescript
const tokens = nltk.tokenizeAscii('Hello, World! This is a test.');
// ['hello', 'world', 'this', 'is', 'a', 'test']
```
### tokenOffsetsAscii()

```typescript
tokenOffsetsAscii(text: string): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}
```

Returns token offsets and lengths for zero-copy processing.

**Parameters:**

- `text`: Input text to tokenize (ASCII only)

**Returns:** Object with:

- `total`: Number of tokens found
- `offsets`: Byte offsets for each token in the input
- `lengths`: Byte lengths for each token
- `input`: UTF-8 encoded input bytes

**Example:**

```typescript
const { total, offsets, lengths, input } = nltk.tokenOffsetsAscii('Hello World');
const decoder = new TextDecoder();
for (let i = 0; i < total; i++) {
  const start = offsets[i];
  const len = lengths[i];
  const token = decoder.decode(input.subarray(start, start + len));
  console.log(token);
}
```
### normalizeTokensAscii()

```typescript
normalizeTokensAscii(text: string, removeStopwords?: boolean): string[]
```

Tokenizes and normalizes text, optionally removing stopwords.

**Parameters:**

- `text`: Input text to tokenize and normalize
- `removeStopwords` (optional): Whether to remove stopwords (a, an, the, etc.)

**Returns:** `string[]` - Array of normalized tokens

**Example:**

```typescript
const normalized = nltk.normalizeTokensAscii('The quick brown fox', true);
// ['quick', 'brown', 'fox'] - 'the' removed

const withStops = nltk.normalizeTokensAscii('The quick brown fox', false);
// ['the', 'quick', 'brown', 'fox']
```
### normalizedTokenOffsetsAscii()

```typescript
normalizedTokenOffsetsAscii(text: string, removeStopwords?: boolean): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}
```

Returns normalized token offsets for zero-copy processing.

**Parameters:**

- `text`: Input text to tokenize and normalize
- `removeStopwords` (optional): Whether to remove stopwords

**Returns:** Same structure as `tokenOffsetsAscii()`
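The result can be decoded with the same zero-copy loop shown for `tokenOffsetsAscii()`. A minimal sketch of that decoding pattern, using illustrative offsets for `'the quick brown fox'` with stopwords removed (real values come from the WASM call):

```typescript
// Illustrative result shape; in practice these values come from
// nltk.normalizedTokenOffsetsAscii('The quick brown fox', true).
const input = new TextEncoder().encode('the quick brown fox');
const result = {
  total: 3,
  offsets: new Uint32Array([4, 10, 16]), // 'quick', 'brown', 'fox'
  lengths: new Uint32Array([5, 5, 3]),
  input,
};

// Decode each token without copying the underlying buffer.
const decoder = new TextDecoder();
const tokens: string[] = [];
for (let i = 0; i < result.total; i++) {
  const start = result.offsets[i];
  tokens.push(decoder.decode(result.input.subarray(start, start + result.lengths[i])));
}
// tokens → ['quick', 'brown', 'fox']
```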
## Sentence Tokenization

### sentenceTokenizePunktAscii()

```typescript
sentenceTokenizePunktAscii(text: string): string[]
```

Splits text into sentences using the Punkt algorithm.

**Parameters:**

- `text`: Input text to split into sentences

**Returns:** `string[]` - Array of sentences

**Example:**

```typescript
const text = 'Hello world! How are you? I am fine.';
const sentences = nltk.sentenceTokenizePunktAscii(text);
// ['Hello world!', 'How are you?', 'I am fine.']
```
## Metrics & Counting

### countTokensAscii()

```typescript
countTokensAscii(text: string): number
```

Counts the number of tokens in text.

**Parameters:**

- `text`: Input text to count tokens in

**Returns:** `number` - Token count

**Example:**

```typescript
const count = nltk.countTokensAscii('Hello world from bun_nltk');
// 4
```
### countNgramsAscii()

```typescript
countNgramsAscii(text: string, n: number): number
```

Counts the number of n-grams in text.

**Parameters:**

- `text`: Input text to analyze
- `n`: N-gram size (e.g., 2 for bigrams, 3 for trigrams)

**Returns:** `number` - N-gram count

**Example:**

```typescript
const bigrams = nltk.countNgramsAscii('the quick brown fox', 2);
// 3 ("the quick", "quick brown", "brown fox")

const trigrams = nltk.countNgramsAscii('the quick brown fox', 3);
// 2 ("the quick brown", "quick brown fox")
```
### computeAsciiMetrics()

```typescript
computeAsciiMetrics(text: string, n: number): AsciiMetrics
```

Computes comprehensive text metrics including token and n-gram statistics.

**Parameters:**

- `text`: Input text to analyze
- `n`: N-gram size

**Returns:** `AsciiMetrics` object with:

- `tokens`: Total token count
- `uniqueTokens`: Unique token count
- `ngrams`: Total n-gram count
- `uniqueNgrams`: Unique n-gram count

**Example:**

```typescript
const metrics = nltk.computeAsciiMetrics('the cat and the dog', 2);
// {
//   tokens: 5,
//   uniqueTokens: 4, // 'the' appears twice
//   ngrams: 4,
//   uniqueNgrams: 4
// }
```
## WordNet

### wordnetMorphyAscii()

```typescript
wordnetMorphyAscii(word: string, pos?: 'n' | 'v' | 'a' | 'r'): string
```

Finds the base/lemma form of a word using WordNet's morphological analysis.

**Parameters:**

- `word`: Word to look up
- `pos` (optional): Part of speech: `'n'` (noun), `'v'` (verb), `'a'` (adjective), `'r'` (adverb)

**Returns:** `string` - Base form, or an empty string if not found

**Example:**

```typescript
const base1 = nltk.wordnetMorphyAscii('running', 'v');
// 'run'

const base2 = nltk.wordnetMorphyAscii('geese', 'n');
// 'goose'

const base3 = nltk.wordnetMorphyAscii('better', 'a');
// 'good'
```
## Part-of-Speech Tagging

### perceptronPredictBatch()

```typescript
perceptronPredictBatch(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array,
  weights: Float32Array,
  modelFeatureCount: number,
  tagCount: number
): Uint16Array
```

Performs batch POS tag prediction using a perceptron model.

**Parameters:**

- `featureIds`: Flattened array of feature IDs for all tokens
- `tokenOffsets`: Offsets into `featureIds` for each token (length = token count + 1)
- `weights`: Model weights (shape: `[modelFeatureCount * tagCount]`)
- `modelFeatureCount`: Number of features in the model
- `tagCount`: Number of distinct tags

**Returns:** `Uint16Array` - Predicted tag IDs for each token

**Example:**

```typescript
const featureIds = new Uint32Array([0, 1, 2, 3, 4, 5]);
const tokenOffsets = new Uint32Array([0, 2, 4, 6]); // 3 tokens
const weights = new Float32Array(1000); // Pre-trained weights
const tagIds = nltk.perceptronPredictBatch(
  featureIds,
  tokenOffsets,
  weights,
  100, // feature count
  10   // tag count
);
// Uint16Array of length 3 - one predicted tag ID per token
```
## Language Models

### evaluateLanguageModelIds()

```typescript
evaluateLanguageModelIds(input: {
  tokenIds: Uint32Array;
  sentenceOffsets: Uint32Array;
  order: number;
  model: WasmLmModelType;
  gamma: number;
  discount: number;
  vocabSize: number;
  probeContextFlat: Uint32Array;
  probeContextLens: Uint32Array;
  probeWordIds: Uint32Array;
  perplexityTokenIds: Uint32Array;
  prefixTokenIds: Uint32Array;
}): { scores: Float64Array; perplexity: number }
```

Evaluates a language model and computes scores and perplexity.

**Parameters:**

- `input.tokenIds`: Flattened token IDs of the training sentences
- `input.sentenceOffsets`: Sentence boundary offsets
- `input.order`: N-gram order (e.g., 3 for trigram model)
- `input.model` (required): Language model type: `'mle' | 'lidstone' | 'kneser_ney_interpolated'`
- `input.gamma`: Smoothing parameter (for Lidstone)
- `input.discount`: Discount parameter (for Kneser-Ney)
- `input.vocabSize`: Vocabulary size
- `input.probeContextFlat`: Flattened probe context token IDs
- `input.probeContextLens`: Length of each probe context
- `input.probeWordIds`: Words to score in each context
- `input.perplexityTokenIds`: Token IDs for perplexity calculation
- `input.prefixTokenIds`: Prefix tokens to condition on

**Returns:** Object with:

- `scores`: Log probability scores for probe words
- `perplexity`: Model perplexity on the test tokens
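As a sanity check on the returned `perplexity`, perplexity is conventionally the exponential of the negative mean log probability of the test tokens. A self-contained sketch of that relationship (assuming natural-log scores; the log base the module uses internally is an assumption here):

```typescript
// Hypothetical per-token log probabilities (natural log) for a
// 4-token test sequence; real values would come from the model.
const logProbs = new Float64Array([
  Math.log(0.25), Math.log(0.5), Math.log(0.25), Math.log(0.5),
]);

// Perplexity = exp(-(1/N) * sum(log p_i))
const avgNegLogProb = -logProbs.reduce((s, lp) => s + lp, 0) / logProbs.length;
const perplexity = Math.exp(avgNegLogProb);
// perplexity ≈ 2.8284 (the geometric mean of the inverse probabilities)
```

Lower perplexity means the model assigns higher probability to the test tokens.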
## Chunking & Parsing

### chunkIobIds()

```typescript
chunkIobIds(input: {
  tokenTagIds: Uint16Array;
  atomAllowedOffsets: Uint32Array;
  atomAllowedLengths: Uint32Array;
  atomAllowedFlat: Uint16Array;
  atomMins: Uint8Array;
  atomMaxs: Uint8Array;
  ruleAtomOffsets: Uint32Array;
  ruleAtomCounts: Uint32Array;
  ruleLabelIds: Uint16Array;
}): { labelIds: Uint16Array; begins: Uint8Array }
```

Performs IOB chunking on POS-tagged tokens.

**Returns:** Object with:

- `labelIds`: Chunk label ID for each token
- `begins`: 1 if the token begins a chunk, 0 otherwise
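The `labelIds`/`begins` pair maps directly onto conventional IOB tags. A minimal sketch of that conversion, using hypothetical chunker output and illustrative label names (the real IDs depend on your `ruleLabelIds`):

```typescript
// Hypothetical output for 4 tokens; label names are illustrative.
const labelNames = ['O', 'NP', 'VP'];           // index = label ID
const labelIds = new Uint16Array([1, 1, 2, 0]); // per-token chunk label
const begins = new Uint8Array([1, 0, 1, 0]);    // 1 = token starts a chunk

// Render IOB tags: B-X at chunk starts, I-X inside a chunk, O elsewhere.
const iob = Array.from(labelIds, (id, i) => {
  if (labelNames[id] === 'O') return 'O';
  return (begins[i] ? 'B-' : 'I-') + labelNames[id];
});
// iob → ['B-NP', 'I-NP', 'B-VP', 'O']
```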
### cykRecognizeIds()

```typescript
cykRecognizeIds(input: {
  tokenBits: BigUint64Array;
  binaryLeft: Uint16Array;
  binaryRight: Uint16Array;
  binaryParent: Uint16Array;
  unaryChild: Uint16Array;
  unaryParent: Uint16Array;
  startSymbol: number;
}): boolean
```

Runs the CYK parsing algorithm to check whether the input is grammatical.

**Returns:** `boolean` - `true` if the input is recognized by the grammar
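A sketch of building `tokenBits`, assuming each 64-bit entry is a bitmask over the nonterminal IDs that can derive that token (this one-mask-per-token packing is an assumption for illustration; check the module's expected layout):

```typescript
// Hypothetical nonterminal IDs for a tiny grammar.
const DET = 0, N = 1, V = 2;

// For each token, the set of preterminals that can yield it.
const tokenPreterminals = [
  [DET],  // 'the'
  [N, V], // 'duck' (noun or verb)
  [V],    // 'swims'
];

// Pack each set into one 64-bit mask (bit i set = nonterminal i allowed).
const tokenBits = new BigUint64Array(tokenPreterminals.length);
tokenPreterminals.forEach((ids, i) => {
  let mask = 0n;
  for (const id of ids) mask |= 1n << BigInt(id);
  tokenBits[i] = mask;
});
// tokenBits → [1n, 6n, 4n]
```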
## Classification

### naiveBayesLogScoresIds()

```typescript
naiveBayesLogScoresIds(input: {
  docTokenIds: Uint32Array;
  vocabSize: number;
  tokenCountsMatrix: Uint32Array;
  labelDocCounts: Uint32Array;
  labelTokenTotals: Uint32Array;
  totalDocs: number;
  smoothing: number;
}): Float64Array
```

Computes Naive Bayes log probability scores for document classification.

**Parameters:**

- `input.docTokenIds`: Token IDs in the document to classify
- `input.vocabSize`: Vocabulary size
- `input.tokenCountsMatrix`: Token counts per label (shape: `[labelCount * vocabSize]`)
- `input.labelDocCounts`: Number of documents per label
- `input.labelTokenTotals`: Total token count per label
- `input.totalDocs`: Total number of training documents
- `input.smoothing`: Laplace smoothing parameter (typically 1.0)

**Returns:** `Float64Array` - Log probability score for each label

**Example:**

```typescript
const scores = nltk.naiveBayesLogScoresIds({
  docTokenIds: new Uint32Array([0, 5, 10, 15]),
  vocabSize: 1000,
  tokenCountsMatrix: trainingCounts,
  labelDocCounts: new Uint32Array([100, 150, 80]),
  labelTokenTotals: new Uint32Array([5000, 7500, 4000]),
  totalDocs: 330,
  smoothing: 1.0
});
// e.g. Float64Array [-2.5, -3.1, -4.2] - one log score per label

// Predict the class with the highest score
const predictedLabel = scores.indexOf(Math.max(...scores));
```
## Types

### WasmNltkInit

```typescript
type WasmNltkInit = {
  wasmBytes?: Uint8Array;
  wasmPath?: string;
};
```

### AsciiMetrics

```typescript
type AsciiMetrics = {
  tokens: number;
  uniqueTokens: number;
  ngrams: number;
  uniqueNgrams: number;
};
```

### WasmLmModelType

```typescript
type WasmLmModelType = "mle" | "lidstone" | "kneser_ney_interpolated";
```

- `mle`: Maximum Likelihood Estimation
- `lidstone`: Lidstone smoothing (add-k)
- `kneser_ney_interpolated`: Kneser-Ney interpolated smoothing
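For reference, Lidstone (add-k) smoothing estimates the conditional probability as `P(w | h) = (c(h, w) + γ) / (c(h) + γ · V)`, where `γ` corresponds to the `gamma` parameter and `V` to `vocabSize`. A self-contained sketch of the formula (not the module's internal implementation):

```typescript
// Lidstone-smoothed conditional probability estimate:
// (n-gram count + gamma) / (context count + gamma * vocabulary size)
function lidstone(
  ngramCount: number,
  contextCount: number,
  gamma: number,
  vocabSize: number,
): number {
  return (ngramCount + gamma) / (contextCount + gamma * vocabSize);
}

// With gamma = 1 this reduces to Laplace smoothing:
const p = lidstone(2, 10, 1.0, 90);
// (2 + 1) / (10 + 90) = 0.03
```

Setting `gamma = 0` recovers the unsmoothed MLE estimate.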
## Memory Pool Details

The `WasmNltk` class maintains an internal memory pool to optimize allocations. The pool uses the following strategy:

- **Block keys**: Each operation type has a unique key (e.g., "offsets", "metrics")
- **Reuse**: If a block exists and is large enough, it is reused
- **Reallocation**: If a larger block is needed, the old one is freed first
- **Cleanup**: All blocks are freed when `dispose()` is called

```typescript
// Internal pool structure (for reference)
class WasmNltk {
  private readonly blocks = new Map<string, PoolBlock>();

  private ensureBlock(key: string, bytes: number): PoolBlock {
    const existing = this.blocks.get(key);
    if (existing && existing.bytes >= bytes) return existing;
    // Reallocate if needed
    if (existing) {
      this.exports.bunnltk_wasm_free(existing.ptr, existing.bytes);
    }
    const ptr = this.exports.bunnltk_wasm_alloc(bytes);
    const block = { ptr, bytes };
    this.blocks.set(key, block);
    return block;
  }
}
```
## Next Steps