# WasmNltk Class

The `WasmNltk` class provides a high-level interface to the `bun_nltk` WebAssembly module.
## Initialization

### WasmNltk.init()

```typescript
static async init(init?: WasmNltkInit): Promise<WasmNltk>
```

Initializes a new WASM instance.

**Parameters:**

- `init.wasmBytes` (optional): Raw WASM module bytes. Use this when loading from a URL or bundled asset.
- `init.wasmPath` (optional): Path to the WASM file. Defaults to `../native/bun_nltk.wasm` relative to the module.

**Returns:** `Promise<WasmNltk>` - Initialized WASM instance

**Example:**

```typescript
import { WasmNltk } from 'bun_nltk/wasm';

// From URL
const response = await fetch('/bun_nltk.wasm');
const wasmBytes = new Uint8Array(await response.arrayBuffer());
const nltk = await WasmNltk.init({ wasmBytes });

// From path (Node.js/Bun)
const nltkFromPath = await WasmNltk.init({ wasmPath: './bun_nltk.wasm' });

// Default path
const nltkDefault = await WasmNltk.init();
```
## Memory Management

### dispose()

Frees all allocated memory blocks. Call this when you're done using the instance.

**Example:**

```typescript
const nltk = await WasmNltk.init({ wasmBytes });
// Use nltk...
nltk.dispose();
```
## Tokenization Methods

### tokenizeAscii()

```typescript
tokenizeAscii(text: string): string[]
```

Tokenizes ASCII text into lowercase words.

**Parameters:**

- `text`: Input text to tokenize (ASCII only)

**Returns:** `string[]` - Array of lowercase tokens

**Example:**

```typescript
const tokens = nltk.tokenizeAscii('Hello, World! This is a test.');
// ['hello', 'world', 'this', 'is', 'a', 'test']
```
### tokenOffsetsAscii()

```typescript
tokenOffsetsAscii(text: string): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}
```

Returns token offsets and lengths for zero-copy processing.

**Parameters:**

- `text`: Input text to tokenize (ASCII only)

**Returns:** Object with:

- `total`: Number of tokens found
- `offsets`: Byte offsets for each token in the input
- `lengths`: Byte lengths for each token
- `input`: UTF-8 encoded input bytes

**Example:**

```typescript
const { total, offsets, lengths, input } = nltk.tokenOffsetsAscii('Hello World');
const decoder = new TextDecoder();
for (let i = 0; i < total; i++) {
  const start = offsets[i];
  const len = lengths[i];
  const token = decoder.decode(input.subarray(start, start + len));
  console.log(token);
}
```
### normalizeTokensAscii()

```typescript
normalizeTokensAscii(text: string, removeStopwords?: boolean): string[]
```

Tokenizes and normalizes text, optionally removing stopwords.

**Parameters:**

- `text`: Input text to tokenize and normalize
- `removeStopwords` (optional): Whether to remove stopwords (a, an, the, etc.)

**Returns:** `string[]` - Array of normalized tokens

**Example:**

```typescript
const normalized = nltk.normalizeTokensAscii('The quick brown fox', true);
// ['quick', 'brown', 'fox'] - 'the' removed

const withStops = nltk.normalizeTokensAscii('The quick brown fox', false);
// ['the', 'quick', 'brown', 'fox']
```
### normalizedTokenOffsetsAscii()

```typescript
normalizedTokenOffsetsAscii(text: string, removeStopwords?: boolean): {
  total: number;
  offsets: Uint32Array;
  lengths: Uint32Array;
  input: Uint8Array;
}
```

Returns normalized token offsets for zero-copy processing.

**Parameters:**

- `text`: Input text to tokenize and normalize
- `removeStopwords` (optional): Whether to remove stopwords

**Returns:** Same structure as `tokenOffsetsAscii()`
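The result can be decoded with the same zero-copy loop shown for `tokenOffsetsAscii()`. A minimal sketch of that decoding pattern, using illustrative offsets for `'the quick brown fox'` with stopwords removed (real values come from the WASM call):

```typescript
// Illustrative result shape; in practice these values come from
// nltk.normalizedTokenOffsetsAscii('The quick brown fox', true).
const input = new TextEncoder().encode('the quick brown fox');
const result = {
  total: 3,
  offsets: new Uint32Array([4, 10, 16]), // 'quick', 'brown', 'fox'
  lengths: new Uint32Array([5, 5, 3]),
  input,
};

// Decode each token without copying the underlying buffer.
const decoder = new TextDecoder();
const tokens: string[] = [];
for (let i = 0; i < result.total; i++) {
  const start = result.offsets[i];
  tokens.push(decoder.decode(result.input.subarray(start, start + result.lengths[i])));
}
// tokens → ['quick', 'brown', 'fox']
```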
## Sentence Tokenization

### sentenceTokenizePunktAscii()

```typescript
sentenceTokenizePunktAscii(text: string): string[]
```

Splits text into sentences using the Punkt algorithm.

**Parameters:**

- `text`: Input text to split into sentences

**Returns:** `string[]` - Array of sentences

**Example:**

```typescript
const text = 'Hello world! How are you? I am fine.';
const sentences = nltk.sentenceTokenizePunktAscii(text);
// ['Hello world!', 'How are you?', 'I am fine.']
```
## Metrics & Counting

### countTokensAscii()

```typescript
countTokensAscii(text: string): number
```

Counts the number of tokens in text.

**Parameters:**

- `text`: Input text to count tokens in

**Returns:** `number` - Token count

**Example:**

```typescript
const count = nltk.countTokensAscii('Hello world from bun_nltk');
// 4
```
### countNgramsAscii()

```typescript
countNgramsAscii(text: string, n: number): number
```

Counts the number of n-grams in text.

**Parameters:**

- `text`: Input text to analyze
- `n`: N-gram size (e.g., 2 for bigrams, 3 for trigrams)

**Returns:** `number` - N-gram count

**Example:**

```typescript
const bigrams = nltk.countNgramsAscii('the quick brown fox', 2);
// 3 ("the quick", "quick brown", "brown fox")

const trigrams = nltk.countNgramsAscii('the quick brown fox', 3);
// 2 ("the quick brown", "quick brown fox")
```
### computeAsciiMetrics()

```typescript
computeAsciiMetrics(text: string, n: number): AsciiMetrics
```

Computes comprehensive text metrics including token and n-gram statistics.

**Parameters:**

- `text`: Input text to analyze
- `n`: N-gram size

**Returns:** `AsciiMetrics` object with:

- `tokens`: Total token count
- `uniqueTokens`: Unique token count
- `ngrams`: Total n-gram count
- `uniqueNgrams`: Unique n-gram count

**Example:**

```typescript
const metrics = nltk.computeAsciiMetrics('the cat and the dog', 2);
// {
//   tokens: 5,
//   uniqueTokens: 4, // 'the' appears twice
//   ngrams: 4,
//   uniqueNgrams: 4
// }
```
## WordNet

### wordnetMorphyAscii()

```typescript
wordnetMorphyAscii(word: string, pos?: 'n' | 'v' | 'a' | 'r'): string
```

Finds the base/lemma form of a word using WordNet's morphological analysis.

**Parameters:**

- `word`: Word to look up
- `pos` (optional): Part of speech: `'n'` (noun), `'v'` (verb), `'a'` (adjective), `'r'` (adverb)

**Returns:** `string` - Base form, or an empty string if not found

**Example:**

```typescript
const base1 = nltk.wordnetMorphyAscii('running', 'v');
// 'run'

const base2 = nltk.wordnetMorphyAscii('geese', 'n');
// 'goose'

const base3 = nltk.wordnetMorphyAscii('better', 'a');
// 'good'
```
## Part-of-Speech Tagging

### perceptronPredictBatch()

```typescript
perceptronPredictBatch(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array,
  weights: Float32Array,
  modelFeatureCount: number,
  tagCount: number
): Uint16Array
```

Performs batch POS tag prediction using a perceptron model.

**Parameters:**

- `featureIds`: Flattened array of feature IDs for all tokens
- `tokenOffsets`: Offsets into `featureIds` for each token (length = token count + 1)
- `weights`: Model weights (shape: `[modelFeatureCount * tagCount]`)
- `modelFeatureCount`: Number of features in the model
- `tagCount`: Number of distinct tags

**Returns:** `Uint16Array` - Predicted tag IDs for each token

**Example:**

```typescript
const featureIds = new Uint32Array([0, 1, 2, 3, 4, 5]);
const tokenOffsets = new Uint32Array([0, 2, 4, 6]); // 3 tokens
const weights = new Float32Array(1000); // Pre-trained weights
const tagIds = nltk.perceptronPredictBatch(
  featureIds,
  tokenOffsets,
  weights,
  100, // feature count
  10   // tag count
);
// Uint16Array of length 3 - one predicted tag ID per token
```
## Language Models

### evaluateLanguageModelIds()

```typescript
evaluateLanguageModelIds(input: {
  tokenIds: Uint32Array;
  sentenceOffsets: Uint32Array;
  order: number;
  model: WasmLmModelType;
  gamma: number;
  discount: number;
  vocabSize: number;
  probeContextFlat: Uint32Array;
  probeContextLens: Uint32Array;
  probeWordIds: Uint32Array;
  perplexityTokenIds: Uint32Array;
  prefixTokenIds: Uint32Array;
}): { scores: Float64Array; perplexity: number }
```

Evaluates a language model and computes scores and perplexity.

**Parameters:**

- `input.tokenIds`: Flattened token IDs of the training sentences
- `input.sentenceOffsets`: Sentence boundary offsets
- `input.order`: N-gram order (e.g., 3 for trigram model)
- `input.model` (required): Language model type: `'mle' | 'lidstone' | 'kneser_ney_interpolated'`
- `input.gamma`: Smoothing parameter (for Lidstone)
- `input.discount`: Discount parameter (for Kneser-Ney)
- `input.vocabSize`: Vocabulary size
- `input.probeContextFlat`: Flattened probe context token IDs
- `input.probeContextLens`: Length of each probe context
- `input.probeWordIds`: Words to score in each context
- `input.perplexityTokenIds`: Token IDs for perplexity calculation
- `input.prefixTokenIds`: Prefix tokens to condition on

**Returns:** Object with:

- `scores`: Log probability scores for probe words
- `perplexity`: Model perplexity on the test tokens
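As a sanity check on the returned `perplexity`, perplexity is conventionally the exponential of the negative mean log probability of the test tokens. A self-contained sketch of that relationship (assuming natural-log scores; the log base the module uses internally is an assumption here):

```typescript
// Hypothetical per-token log probabilities (natural log) for a
// 4-token test sequence; real values would come from the model.
const logProbs = new Float64Array([
  Math.log(0.25), Math.log(0.5), Math.log(0.25), Math.log(0.5),
]);

// Perplexity = exp(-(1/N) * sum(log p_i))
const avgNegLogProb = -logProbs.reduce((s, lp) => s + lp, 0) / logProbs.length;
const perplexity = Math.exp(avgNegLogProb);
// perplexity ≈ 2.8284 (the geometric mean of the inverse probabilities)
```

Lower perplexity means the model assigns higher probability to the test tokens.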
## Chunking & Parsing

### chunkIobIds()

```typescript
chunkIobIds(input: {
  tokenTagIds: Uint16Array;
  atomAllowedOffsets: Uint32Array;
  atomAllowedLengths: Uint32Array;
  atomAllowedFlat: Uint16Array;
  atomMins: Uint8Array;
  atomMaxs: Uint8Array;
  ruleAtomOffsets: Uint32Array;
  ruleAtomCounts: Uint32Array;
  ruleLabelIds: Uint16Array;
}): { labelIds: Uint16Array; begins: Uint8Array }
```

Performs IOB chunking on POS-tagged tokens.

**Returns:** Object with:

- `labelIds`: Chunk label ID for each token
- `begins`: 1 if the token begins a chunk, 0 otherwise
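The `labelIds`/`begins` pair maps directly onto conventional IOB tags. A minimal sketch of that conversion, using hypothetical chunker output and illustrative label names (the real IDs depend on your `ruleLabelIds`):

```typescript
// Hypothetical output for 4 tokens; label names are illustrative.
const labelNames = ['O', 'NP', 'VP'];           // index = label ID
const labelIds = new Uint16Array([1, 1, 2, 0]); // per-token chunk label
const begins = new Uint8Array([1, 0, 1, 0]);    // 1 = token starts a chunk

// Render IOB tags: B-X at chunk starts, I-X inside a chunk, O elsewhere.
const iob = Array.from(labelIds, (id, i) => {
  if (labelNames[id] === 'O') return 'O';
  return (begins[i] ? 'B-' : 'I-') + labelNames[id];
});
// iob → ['B-NP', 'I-NP', 'B-VP', 'O']
```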
### cykRecognizeIds()

```typescript
cykRecognizeIds(input: {
  tokenBits: BigUint64Array;
  binaryLeft: Uint16Array;
  binaryRight: Uint16Array;
  binaryParent: Uint16Array;
  unaryChild: Uint16Array;
  unaryParent: Uint16Array;
  startSymbol: number;
}): boolean
```

Runs the CYK parsing algorithm to check whether the input is grammatical.

**Returns:** `boolean` - `true` if the input is recognized by the grammar
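A sketch of building `tokenBits`, assuming each 64-bit entry is a bitmask over the nonterminal IDs that can derive that token (this one-mask-per-token packing is an assumption for illustration; check the module's expected layout):

```typescript
// Hypothetical nonterminal IDs for a tiny grammar.
const DET = 0, N = 1, V = 2;

// For each token, the set of preterminals that can yield it.
const tokenPreterminals = [
  [DET],  // 'the'
  [N, V], // 'duck' (noun or verb)
  [V],    // 'swims'
];

// Pack each set into one 64-bit mask (bit i set = nonterminal i allowed).
const tokenBits = new BigUint64Array(tokenPreterminals.length);
tokenPreterminals.forEach((ids, i) => {
  let mask = 0n;
  for (const id of ids) mask |= 1n << BigInt(id);
  tokenBits[i] = mask;
});
// tokenBits → [1n, 6n, 4n]
```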
## Classification

### naiveBayesLogScoresIds()

```typescript
naiveBayesLogScoresIds(input: {
  docTokenIds: Uint32Array;
  vocabSize: number;
  tokenCountsMatrix: Uint32Array;
  labelDocCounts: Uint32Array;
  labelTokenTotals: Uint32Array;
  totalDocs: number;
  smoothing: number;
}): Float64Array
```

Computes Naive Bayes log probability scores for document classification.

**Parameters:**

- `input.docTokenIds`: Token IDs in the document to classify
- `input.vocabSize`: Vocabulary size
- `input.tokenCountsMatrix`: Token counts per label (shape: `[labelCount * vocabSize]`)
- `input.labelDocCounts`: Number of documents per label
- `input.labelTokenTotals`: Total token count per label
- `input.totalDocs`: Total number of training documents
- `input.smoothing`: Laplace smoothing parameter (typically 1.0)

**Returns:** `Float64Array` - Log probability score for each label

**Example:**

```typescript
const scores = nltk.naiveBayesLogScoresIds({
  docTokenIds: new Uint32Array([0, 5, 10, 15]),
  vocabSize: 1000,
  tokenCountsMatrix: trainingCounts,
  labelDocCounts: new Uint32Array([100, 150, 80]),
  labelTokenTotals: new Uint32Array([5000, 7500, 4000]),
  totalDocs: 330,
  smoothing: 1.0
});
// e.g. Float64Array [-2.5, -3.1, -4.2] - one log score per label

// Predict the class with the highest score
const predictedLabel = scores.indexOf(Math.max(...scores));
```
## Types

### WasmNltkInit

```typescript
type WasmNltkInit = {
  wasmBytes?: Uint8Array;
  wasmPath?: string;
};
```

### AsciiMetrics

```typescript
type AsciiMetrics = {
  tokens: number;
  uniqueTokens: number;
  ngrams: number;
  uniqueNgrams: number;
};
```

### WasmLmModelType

```typescript
type WasmLmModelType = "mle" | "lidstone" | "kneser_ney_interpolated";
```

- `mle`: Maximum Likelihood Estimation
- `lidstone`: Lidstone smoothing (add-k)
- `kneser_ney_interpolated`: Kneser-Ney interpolated smoothing
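For reference, Lidstone (add-k) smoothing estimates the conditional probability as `P(w | h) = (c(h, w) + γ) / (c(h) + γ · V)`, where `γ` corresponds to the `gamma` parameter and `V` to `vocabSize`. A self-contained sketch of the formula (not the module's internal implementation):

```typescript
// Lidstone-smoothed conditional probability estimate:
// (n-gram count + gamma) / (context count + gamma * vocabulary size)
function lidstone(
  ngramCount: number,
  contextCount: number,
  gamma: number,
  vocabSize: number,
): number {
  return (ngramCount + gamma) / (contextCount + gamma * vocabSize);
}

// With gamma = 1 this reduces to Laplace smoothing:
const p = lidstone(2, 10, 1.0, 90);
// (2 + 1) / (10 + 90) = 0.03
```

Setting `gamma = 0` recovers the unsmoothed MLE estimate.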
## Memory Pool Details

The `WasmNltk` class maintains an internal memory pool to optimize allocations. The pool uses the following strategy:

- **Block keys**: Each operation type has a unique key (e.g., "offsets", "metrics")
- **Reuse**: If a block exists and is large enough, it is reused
- **Reallocation**: If a larger block is needed, the old one is freed first
- **Cleanup**: All blocks are freed when `dispose()` is called

```typescript
// Internal pool structure (for reference)
class WasmNltk {
  private readonly blocks = new Map<string, PoolBlock>();

  private ensureBlock(key: string, bytes: number): PoolBlock {
    const existing = this.blocks.get(key);
    if (existing && existing.bytes >= bytes) return existing;
    // Reallocate if needed
    if (existing) {
      this.exports.bunnltk_wasm_free(existing.ptr, existing.bytes);
    }
    const ptr = this.exports.bunnltk_wasm_alloc(bytes);
    const block = { ptr, bytes };
    this.blocks.set(key, block);
    return block;
  }
}
```
## Next Steps