This guide walks you through the most common bun_nltk operations. You’ll learn how to tokenize text, count tokens and n-grams, perform POS tagging, and classify text.
Tokenization is the foundation of most NLP tasks. bun_nltk provides several tokenization methods:
1. Simple ASCII tokenization
Use tokenizeAsciiNative for fast, case-insensitive tokenization:
```typescript
import { tokenizeAsciiNative } from "bun_nltk";

const text = "Hello world! This is a test.";
const tokens = tokenizeAsciiNative(text);
console.log(tokens);
// Output: ["hello", "world", "this", "is", "a", "test"]
```
ASCII tokenization matches the pattern `[A-Za-z0-9']+` and normalizes every token to lowercase; this deliberately simple scheme is what keeps it fast.
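The rule above can be sketched in plain TypeScript: the regular expression is the documented pattern, and the `map` mirrors the lowercase normalization. This is an illustration of the behavior, not bun_nltk's actual implementation:

```typescript
// Sketch of the documented ASCII tokenization rule (not bun_nltk's real code):
// match runs of letters, digits, and apostrophes, then lowercase each token.
function asciiTokenizeSketch(text: string): string[] {
  return (text.match(/[A-Za-z0-9']+/g) ?? []).map((t) => t.toLowerCase());
}

console.log(asciiTokenizeSketch("Don't STOP now!"));
// ["don't", "stop", "now"]
```

Note that the apostrophe is part of the character class, so contractions stay whole at this stage.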
2. Word tokenization with contractions
Use wordTokenizeSubset for PTB-style contraction handling:
```typescript
import { wordTokenizeSubset } from "bun_nltk";

const text = "John's big idea isn't all that bad.";
const tokens = wordTokenizeSubset(text);
console.log(tokens);
// Output: ["John", "'s", "big", "idea", "is", "n't", "all", "that", "bad", "."]
```
This tokenizer properly splits contractions like "isn't" into ["is", "n't"] and possessives like "John's" into ["John", "'s"].
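To make the splitting rule concrete, here is a simplified sketch of PTB-style contraction handling. The suffix list is an assumption for illustration; the library's actual rules may cover more cases:

```typescript
// Simplified PTB-style contraction split (illustrative, not bun_nltk's code):
// peel a known clitic suffix off the end of a token, if one is present.
function splitContraction(token: string): string[] {
  const m = token.match(/^(.+?)(n't|'s|'re|'ve|'ll|'d|'m)$/i);
  return m ? [m[1], m[2]] : [token];
}

console.log(splitContraction("isn't"));  // ["is", "n't"]
console.log(splitContraction("John's")); // ["John", "'s"]
console.log(splitContraction("idea"));   // ["idea"]
```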
3. Tweet tokenization
Use tweetTokenizeSubset for social media text:
```typescript
import { tweetTokenizeSubset } from "bun_nltk";

const text = "@user Let's test waaaayyy too much!!!!!! #nlp";
const tokens = tweetTokenizeSubset(text, {
  stripHandles: true,
  reduceLen: true,
});
console.log(tokens);
// Output: ["Let's", "test", "waaayyy", "too", "much", "!", "!", "!", "!", "!", "!", "#nlp"]
```
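The `reduceLen` option is what turns "waaaayyy" into "waaayyy" in the output above. Its effect can be sketched as capping any run of repeated characters at three (an illustration of the behavior, not the library's implementation):

```typescript
// Sketch of length reduction: collapse runs of 3+ identical characters to exactly 3.
function reduceLengthening(token: string): string {
  return token.replace(/(.)\1{2,}/g, "$1$1$1");
}

console.log(reduceLengthening("waaaayyy")); // "waaayyy"
console.log(reduceLengthening("soooo"));    // "sooo"
```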
4. Sentence tokenization

Split text into sentences using the Punkt algorithm:
```typescript
import {
  trainPunktModel,
  sentenceTokenizePunkt,
  defaultPunktModel,
} from "bun_nltk";

// Option 1: Use the default model
const text = "Dr. Smith went home. He stayed there. Mr. Jones left early.";
const sentences = sentenceTokenizePunkt(text, defaultPunktModel());
console.log(sentences);
// Output: [
//   "Dr. Smith went home.",
//   "He stayed there.",
//   "Mr. Jones left early."
// ]

// Option 2: Train a custom model on your corpus
const trainingText =
  "Dr. Adams wrote a paper. Dr. Brown reviewed it. The U.S. team won.";
const customModel = trainPunktModel(trainingText);

const newText = "Dr. Adams arrived yesterday. He presented the paper.";
const customSentences = sentenceTokenizePunkt(newText, customModel);
console.log(customSentences);
// Output: [
//   "Dr. Adams arrived yesterday.",
//   "He presented the paper."
// ]
```
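To see why a trained model matters, compare a naive period split, which breaks on abbreviations like "Dr." that Punkt learns to treat as sentence-internal. This contrast sketch is plain TypeScript, not part of bun_nltk:

```typescript
// Naive sentence splitting: break after ., !, or ? followed by whitespace.
// Abbreviations cause spurious breaks, which is the problem Punkt solves.
function naiveSentenceSplit(text: string): string[] {
  return text.split(/(?<=[.!?])\s+/);
}

console.log(naiveSentenceSplit("Dr. Smith went home. He stayed there."));
// ["Dr.", "Smith went home.", "He stayed there."]
```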
5. WASM runtime

For platforms without native binaries, or for browser environments, use the WASM runtime:
```typescript
import { WasmNltk } from "bun_nltk";

// Initialize the WASM runtime
const wasm = new WasmNltk();
await wasm.init();

// Use WASM methods (same API as native)
const text = "Hello world from WASM!";
const tokens = wasm.tokenizeAscii(text);
const count = wasm.countTokensAscii(text);

console.log(`Tokens: ${tokens.join(", ")}`);
console.log(`Count: ${count}`);
```
The WASM runtime provides the same API as native functions but works in any environment, including browsers.
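A common pattern is to prefer the native functions when the platform ships them and fall back to WASM otherwise. The sketch below is a hypothetical selector, kept generic so it is not tied to any particular detection mechanism; only `WasmNltk` in the usage note comes from this guide:

```typescript
// Hypothetical native-or-WASM selection; the function names are stand-ins.
type Tokenize = (text: string) => string[];

async function pickTokenizer(
  native: Tokenize | null,
  initWasm: () => Promise<Tokenize>,
): Promise<Tokenize> {
  // Prefer the native binding when the platform provides one...
  if (native) return native;
  // ...otherwise initialize the WASM runtime once and reuse it.
  return initWasm();
}
```

With bun_nltk, the WASM branch might be `async () => { const w = new WasmNltk(); await w.init(); return (t) => w.tokenizeAscii(t); }`.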