Tokenization is the process of breaking text into individual tokens (words, numbers, or symbols). bun_nltk provides several tokenizers optimized for different use cases.

Available Tokenizers

ASCII Tokenizer

The fastest tokenizer for ASCII text. Automatically lowercases tokens.
import { tokenizeAsciiNative } from "bun_nltk";

const text = "Hello World! This is a test.";
const tokens = tokenizeAsciiNative(text);
// ["hello", "world", "this", "is", "a", "test"]
Key Features:
  • High-performance SIMD implementation
  • Matches pattern: [A-Za-z0-9']+
  • Automatically converts to lowercase
  • Returns string[]
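If you want to reason about what this tokenizer emits without running it, the documented behavior can be sketched in plain TypeScript (an illustration of the pattern above, not the SIMD implementation):

```typescript
// Sketch of the documented behavior: match runs of [A-Za-z0-9']+
// and lowercase each token. Not the actual SIMD implementation.
function tokenizeAsciiSketch(text: string): string[] {
  return (text.match(/[A-Za-z0-9']+/g) ?? []).map((t) => t.toLowerCase());
}

console.log(tokenizeAsciiSketch("Hello World! This is a test."));
// ["hello", "world", "this", "is", "a", "test"]
```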

Word Tokenizer

Handles contractions and clitics like "n't", "'s", "'ll", etc.
import { wordTokenizeSubset } from "bun_nltk";

const text = "I can't believe it's working!";
const tokens = wordTokenizeSubset(text);
// ["I", "ca", "n't", "believe", "it", "'s", "working", "!"]
Signature:
function wordTokenizeSubset(text: string): string[]
Contraction Handling:
  • n't → Splits as separate token (e.g., "can't" → "ca", "n't")
  • 's, 'm, 'd, 're, 've, 'll → Splits and lowercases (e.g., "it's" → "it", "'s")
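The two rules above can be sketched as a small TypeScript function (an illustration of the documented splitting, not the library's implementation):

```typescript
// Illustration of the two contraction rules (not the library's code):
// 1) "n't" is detached as its own token: "can't" -> "ca" + "n't"
// 2) other clitics ('s, 'm, 'd, 're, 've, 'll) are detached and lowercased
function splitCliticsSketch(word: string): string[] {
  const nt = word.match(/^(.+)(n't)$/);
  if (nt) return [nt[1], nt[2]];
  const clitic = word.match(/^(.+)('(?:s|m|d|re|ve|ll))$/i);
  if (clitic) return [clitic[1], clitic[2].toLowerCase()];
  return [word];
}

console.log(splitCliticsSketch("can't"));   // ["ca", "n't"]
console.log(splitCliticsSketch("They'll")); // ["They", "'ll"]
```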

Basic Usage

import { wordTokenizeSubset } from "bun_nltk";

const text = "They'll be there soon.";
const tokens = wordTokenizeSubset(text);
console.log(tokens);
// ["They", "'ll", "be", "there", "soon", "."]

Handling Contractions

const examples = [
  "I'm happy",      // ["I", "'m", "happy"]
  "don't worry",    // ["do", "n't", "worry"]
  "we've done it",  // ["we", "'ve", "done", "it"]
];

for (const text of examples) {
  console.log(wordTokenizeSubset(text));
}

Tweet Tokenizer

Specialized tokenizer for social media text with support for hashtags, mentions, URLs, and emojis.
import { tweetTokenizeSubset } from "bun_nltk";

const tweet = "Check out @bun_nltk! #NLP https://example.com 😊";
const tokens = tweetTokenizeSubset(tweet);
// ["Check", "out", "@bun_nltk", "!", "#NLP", "https://example.com", "😊"]
Signature:
type TweetTokenizerOptions = {
  stripHandles?: boolean;        // Remove @mentions (default: false)
  reduceLen?: boolean;           // Reduce repeated characters (default: false)
  matchPhoneNumbers?: boolean;   // Match phone numbers (default: true)
};

function tweetTokenizeSubset(
  text: string, 
  options?: TweetTokenizerOptions
): string[]

Default Tweet Tokenization

import { tweetTokenizeSubset } from "bun_nltk";

const tweet = "@user Check this out! #awesome https://example.com";
const tokens = tweetTokenizeSubset(tweet);
console.log(tokens);
// ["@user", "Check", "this", "out", "!", "#awesome", "https://example.com"]

Strip Mentions

const tweet = "@alice @bob Hello everyone!";
const tokens = tweetTokenizeSubset(tweet, { 
  stripHandles: true 
});
console.log(tokens);
// ["Hello", "everyone", "!"]

Reduce Character Repetition

const tweet = "Sooooo cooool!!!";
const tokens = tweetTokenizeSubset(tweet, { 
  reduceLen: true 
});
console.log(tokens);
// ["Sooo", "coool", "!!!"]
// Note: Reduces to maximum 3 repetitions for alphabetic characters
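The reduction rule noted above can be sketched as a one-line regex replacement (an illustration, not the library code):

```typescript
// Illustration of the reduceLen rule: cap runs of the same alphabetic
// character at three repetitions. Non-alphabetic runs (e.g. "!!!") are kept.
function reduceLenSketch(text: string): string {
  return text.replace(/([A-Za-z])\1{2,}/g, "$1$1$1");
}

console.log(reduceLenSketch("Sooooo cooool!!!")); // "Sooo coool!!!"
```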

Match Phone Numbers

const text = "Call me at 555-123-4567";
const tokens = tweetTokenizeSubset(text, { 
  matchPhoneNumbers: true 
});
console.log(tokens);
// ["Call", "me", "at", "555-123-4567"]

Emoji Support

const tweet = "Great news! 🎉🎊 #celebration";
const tokens = tweetTokenizeSubset(tweet);
console.log(tokens);
// ["Great", "news", "!", "🎉", "🎊", "#celebration"]

Pattern Matching

Tweet Tokenizer Patterns:
  • URLs: https?://\S+
  • Hashtags: #[\w_]+
  • Mentions: @[\w_]+
  • Words: [A-Za-z0-9]+(?:'[A-Za-z0-9]+)?
  • Phone Numbers: Various formats (when enabled)
  • Emojis: Full Unicode emoji sequences
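These patterns can be combined into a single simplified regex to see how the alternation plays out (a sketch only: punctuation, emoji, and phone-number handling are omitted):

```typescript
// Simplified combination of the documented patterns: URLs, hashtags,
// mentions, then words. Punctuation, emoji, and phone numbers are omitted,
// so this is not equivalent to tweetTokenizeSubset.
const TWEET_PATTERN = /https?:\/\/\S+|#[\w_]+|@[\w_]+|[A-Za-z0-9]+(?:'[A-Za-z0-9]+)?/g;

function tweetSketch(text: string): string[] {
  return text.match(TWEET_PATTERN) ?? [];
}

console.log(tweetSketch("Check out @bun_nltk! #NLP https://example.com"));
// ["Check", "out", "@bun_nltk", "#NLP", "https://example.com"]
```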

Common Use Cases

Processing Social Media Data

import { tweetTokenizeSubset } from "bun_nltk";

const tweets = [
  "Loving #bunjs! So fast ⚡",
  "@team Check out this amazing library",
  "Download now: https://bun.sh"
];

const processedTweets = tweets.map(tweet => 
  tweetTokenizeSubset(tweet, { stripHandles: true })
);
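Downstream steps work on plain token arrays. For example, extracting hashtags after tokenization needs no bun_nltk API (standard TypeScript, shown with a hardcoded token array):

```typescript
// Downstream step (plain TypeScript, not a bun_nltk API): collect hashtags
// from already-tokenized tweets.
function extractHashtags(tokenized: string[][]): string[] {
  return tokenized.flat().filter((token) => token.startsWith("#"));
}

const tokenizedTweets = [
  ["Loving", "#bunjs", "!", "So", "fast", "⚡"],
  ["Check", "out", "this", "amazing", "library"],
];
console.log(extractHashtags(tokenizedTweets)); // ["#bunjs"]
```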

Handling Contractions

import { wordTokenizeSubset } from "bun_nltk";

const sentences = [
  "I'm going to the store.",
  "They've been waiting for hours.",
  "It's a beautiful day."
];

const tokenized = sentences.map(wordTokenizeSubset);

High-Performance Batch Processing

import { tokenizeAsciiNative } from "bun_nltk";

const documents = [
  "First document text",
  "Second document text",
  // ... thousands more
];

// Fastest option for ASCII text
const allTokens = documents.map(tokenizeAsciiNative);
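A common next step after batch tokenization is building a term-frequency map, which is plain TypeScript over the resulting string[][] (shown here with hardcoded tokens):

```typescript
// Plain-TypeScript follow-up to batch tokenization: count how often each
// token appears across all documents.
function countTokens(allTokens: string[][]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tokens of allTokens) {
    for (const token of tokens) {
      counts.set(token, (counts.get(token) ?? 0) + 1);
    }
  }
  return counts;
}

const counts = countTokens([
  ["first", "document", "text"],
  ["second", "document", "text"],
]);
console.log(counts.get("text")); // 2
```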

Performance Comparison

  • tokenizeAsciiNative: Fastest, uses SIMD optimizations
  • wordTokenizeSubset: Moderate speed, handles contractions
  • tweetTokenizeSubset: Slower but feature-rich for social media
Use tokenizeAsciiNative when you need maximum speed and don't require special handling of contractions or social media features.
All tokenizers automatically handle edge cases like multiple spaces, punctuation, and empty strings.
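To check this ranking on your own corpus, a minimal timing harness can wrap any of the tokenizers above (performance.now is assumed available, as it is in Bun and modern Node):

```typescript
// Minimal timing harness to compare tokenizers on your own corpus.
// Pass any of the tokenizer functions above; the result is milliseconds.
function timeTokenizer(
  tokenize: (text: string) => string[],
  corpus: string[],
  iterations = 100
): number {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    for (const doc of corpus) tokenize(doc);
  }
  return performance.now() - start;
}
```

For example, timeTokenizer(tokenizeAsciiNative, documents) can be compared against timeTokenizer(wordTokenizeSubset, documents) on the same corpus.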
