This guide walks you through the most common bun_nltk operations. You’ll learn how to tokenize text, count tokens and n-grams, perform POS tagging, and classify text.
Tokenization is the foundation of most NLP tasks. bun_nltk provides several tokenization methods:
1. Simple ASCII tokenization
Use tokenizeAsciiNative for fast, case-insensitive tokenization:
```typescript
import { tokenizeAsciiNative } from "bun_nltk";

const text = "Hello world! This is a test.";
const tokens = tokenizeAsciiNative(text);
console.log(tokens);
// Output: ["hello", "world", "this", "is", "a", "test"]
```
ASCII tokenization matches the pattern `[A-Za-z0-9']+` and normalizes every token to lowercase; this deliberately simple scheme is what keeps it fast.
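The rule above can be sketched in plain TypeScript: the regular expression is the documented pattern, and the `map` mirrors the lowercase normalization. This is an illustration of the behavior, not bun_nltk's actual implementation:

```typescript
// Sketch of the documented ASCII tokenization rule (not bun_nltk's real code):
// match runs of letters, digits, and apostrophes, then lowercase each token.
function asciiTokenizeSketch(text: string): string[] {
  return (text.match(/[A-Za-z0-9']+/g) ?? []).map((t) => t.toLowerCase());
}

console.log(asciiTokenizeSketch("Don't STOP now!"));
// ["don't", "stop", "now"]
```

Note that the apostrophe is part of the character class, so contractions stay whole at this stage.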
2. Word tokenization with contractions
Use wordTokenizeSubset for PTB-style contraction handling:
```typescript
import { wordTokenizeSubset } from "bun_nltk";

const text = "John's big idea isn't all that bad.";
const tokens = wordTokenizeSubset(text);
console.log(tokens);
// Output: ["John", "'s", "big", "idea", "is", "n't", "all", "that", "bad", "."]
```
This tokenizer properly splits contractions like "isn't" into ["is", "n't"] and possessives like "John's" into ["John", "'s"].
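To make the splitting rule concrete, here is a simplified sketch of PTB-style contraction handling. The suffix list is an assumption for illustration; the library's actual rules may cover more cases:

```typescript
// Simplified PTB-style contraction split (illustrative, not bun_nltk's code):
// peel a known clitic suffix off the end of a token, if one is present.
function splitContraction(token: string): string[] {
  const m = token.match(/^(.+?)(n't|'s|'re|'ve|'ll|'d|'m)$/i);
  return m ? [m[1], m[2]] : [token];
}

console.log(splitContraction("isn't"));  // ["is", "n't"]
console.log(splitContraction("John's")); // ["John", "'s"]
console.log(splitContraction("idea"));   // ["idea"]
```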
3. Tweet tokenization
Use tweetTokenizeSubset for social media text:
```typescript
import { tweetTokenizeSubset } from "bun_nltk";

const text = "@user Let's test waaaayyy too much!!!!!! #nlp";
const tokens = tweetTokenizeSubset(text, {
  stripHandles: true,
  reduceLen: true,
});
console.log(tokens);
// Output: ["Let's", "test", "waaayyy", "too", "much", "!", "!", "!", "!", "!", "!", "#nlp"]
```
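The `reduceLen` option is what turns "waaaayyy" into "waaayyy" in the output above. Its effect can be sketched as capping any run of repeated characters at three (an illustration of the behavior, not the library's implementation):

```typescript
// Sketch of length reduction: collapse runs of 3+ identical characters to exactly 3.
function reduceLengthening(token: string): string {
  return token.replace(/(.)\1{2,}/g, "$1$1$1");
}

console.log(reduceLengthening("waaaayyy")); // "waaayyy"
console.log(reduceLengthening("soooo"));    // "sooo"
```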
4. Sentence tokenization

Split text into sentences using the Punkt algorithm:
```typescript
import {
  trainPunktModel,
  sentenceTokenizePunkt,
  defaultPunktModel,
} from "bun_nltk";

// Option 1: Use the default model
const text = "Dr. Smith went home. He stayed there. Mr. Jones left early.";
const sentences = sentenceTokenizePunkt(text, defaultPunktModel());
console.log(sentences);
// Output: [
//   "Dr. Smith went home.",
//   "He stayed there.",
//   "Mr. Jones left early."
// ]

// Option 2: Train a custom model on your corpus
const trainingText =
  "Dr. Adams wrote a paper. Dr. Brown reviewed it. The U.S. team won.";
const customModel = trainPunktModel(trainingText);

const newText = "Dr. Adams arrived yesterday. He presented the paper.";
const customSentences = sentenceTokenizePunkt(newText, customModel);
console.log(customSentences);
// Output: [
//   "Dr. Adams arrived yesterday.",
//   "He presented the paper."
// ]
```
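To see why a trained model matters, compare a naive period split, which breaks on abbreviations like "Dr." that Punkt learns to treat as sentence-internal. This contrast sketch is plain TypeScript, not part of bun_nltk:

```typescript
// Naive sentence splitting: break after ., !, or ? followed by whitespace.
// Abbreviations cause spurious breaks, which is the problem Punkt solves.
function naiveSentenceSplit(text: string): string[] {
  return text.split(/(?<=[.!?])\s+/);
}

console.log(naiveSentenceSplit("Dr. Smith went home. He stayed there."));
// ["Dr.", "Smith went home.", "He stayed there."]
```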
5. WASM runtime

For platforms without native binaries, or for browser environments, use the WASM runtime:
```typescript
import { WasmNltk } from "bun_nltk";

// Initialize the WASM runtime
const wasm = new WasmNltk();
await wasm.init();

// Use WASM methods (same API as native)
const text = "Hello world from WASM!";
const tokens = wasm.tokenizeAscii(text);
const count = wasm.countTokensAscii(text);

console.log(`Tokens: ${tokens.join(", ")}`);
console.log(`Count: ${count}`);
```
The WASM runtime provides the same API as native functions but works in any environment, including browsers.
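A common pattern is to prefer the native functions when the platform ships them and fall back to WASM otherwise. The sketch below is a hypothetical selector, kept generic so it is not tied to any particular detection mechanism; only `WasmNltk` in the usage note comes from this guide:

```typescript
// Hypothetical native-or-WASM selection; the function names are stand-ins.
type Tokenize = (text: string) => string[];

async function pickTokenizer(
  native: Tokenize | null,
  initWasm: () => Promise<Tokenize>,
): Promise<Tokenize> {
  // Prefer the native binding when the platform provides one...
  if (native) return native;
  // ...otherwise initialize the WASM runtime once and reuse it.
  return initWasm();
}
```

With bun_nltk, the WASM branch might be `async () => { const w = new WasmNltk(); await w.init(); return (t) => w.tokenizeAscii(t); }`.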