Word Tokenization

wordTokenizeSubset

Tokenize text into words with support for common English contractions.

text

string

required

The text to tokenize

tokens

string[]

Array of word tokens with contractions split appropriately

import { wordTokenizeSubset } from 'bun_nltk';

const text = "I can't believe it's working!";
const tokens = wordTokenizeSubset(text);
console.log(tokens);
// ["I", "ca", "n't", "believe", "it", "'s", "working", "!"]

Contraction Handling

The tokenizer splits common contractions:

n't → separate token (e.g., “can’t” → “ca”, “n’t”)
's, 'm, 'd, 're, 've, 'll → separate tokens (e.g., “it’s” → “it”, “’s”)
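
The rules above can be sketched for a single word as follows. This is a minimal illustration, not the library's implementation: `splitContraction` and `CONTRACTION` are hypothetical names, and the assumption that typographic apostrophes are normalized to straight ones is ours, not something the documentation states.

```typescript
// Hypothetical helper, not part of bun_nltk: split one word at a known
// English contraction suffix, following the two rules listed above.
const CONTRACTION = /^(.+?)(n't|'ll|'re|'ve|'s|'m|'d)$/i;

function splitContraction(word: string): string[] {
  // Assumption: treat the typographic apostrophe (U+2019) like a straight one.
  const normalized = word.replace(/\u2019/g, "'");
  const match = CONTRACTION.exec(normalized);
  // No contraction suffix: return the word unchanged as a single token.
  return match ? [match[1], match[2]] : [normalized];
}

console.log(splitContraction("can't"));   // ["ca", "n't"]
console.log(splitContraction("it's"));    // ["it", "'s"]
console.log(splitContraction("working")); // ["working"]
```

Listing `n't` first in the alternation keeps the "n" attached to the suffix, which is why "can't" yields "ca" rather than "can".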

tweetTokenizeSubset

Tokenize social media text with support for URLs, hashtags, mentions, and emojis.

text

string

required

The social media text to tokenize

options

TweetTokenizerOptions

Optional configuration for tweet tokenization

options.stripHandles

boolean

default: false

Remove @mentions from the output

options.reduceLen

boolean

default: false

Cap any run of a repeated character at three (e.g., “coooool” → “coool”)

options.matchPhoneNumbers

boolean

default: true

Detect and tokenize phone numbers as single tokens

tokens

string[]

Array of tokens including URLs, hashtags, mentions, and emojis

import { tweetTokenizeSubset } from 'bun_nltk';

const tweet = "Check out https://example.com! #NLP @user 🚀";
const tokens = tweetTokenizeSubset(tweet);
console.log(tokens);
// ["Check", "out", "https://example.com", "!", "#NLP", "@user", "🚀"]

// Strip mentions
const noMentions = tweetTokenizeSubset(tweet, { stripHandles: true });
console.log(noMentions);
// ["Check", "out", "https://example.com", "!", "#NLP", "🚀"]

// Reduce repeated characters
const text = "I loooove this sooooo much!!!";
const reduced = tweetTokenizeSubset(text, { reduceLen: true });
console.log(reduced);
// ["I", "looove", "this", "sooo", "much", "!", "!", "!"]
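
The effect of `reduceLen` and `stripHandles` can be approximated with two small helpers. This is a sketch under stated assumptions: `reduceLengthening` and `stripHandleTokens` are hypothetical names, and the library may apply these steps to the raw text before tokenizing rather than to individual tokens.

```typescript
// Hypothetical helper: cap any run of one repeated character at three,
// matching the documented reduceLen behavior ("coooool" -> "coool").
function reduceLengthening(token: string): string {
  return token.replace(/(.)\1{2,}/g, "$1$1$1");
}

// Hypothetical helper: drop tokens that look like @mentions.
// Assumption: a handle is "@" followed by letters, digits, or underscores.
function stripHandleTokens(tokens: string[]): string[] {
  return tokens.filter((token) => !/^@\w+$/.test(token));
}

console.log(reduceLengthening("coooool")); // "coool"
console.log(reduceLengthening("sooooo"));  // "sooo"
console.log(stripHandleTokens(["Check", "out", "#NLP", "@user"]));
// ["Check", "out", "#NLP"]
```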

Detected Patterns

URLs: http:// and https:// links
Hashtags: #tag format
Mentions: @username format
Emojis: Unicode emoji sequences
Phone numbers: Various formats (on by default; disable with matchPhoneNumbers: false)
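
To make the pattern categories concrete, the sketch below labels already-split tokens by kind. `TokenKind` and `classifyToken` are hypothetical names, and the regular expressions are simplified assumptions rather than the patterns the library uses. In particular, the emoji check only recognizes code points in the U+1F000 to U+1FBFF range, and phone numbers are not covered.

```typescript
// Hypothetical types and helper for illustration; not part of bun_nltk.
type TokenKind = "url" | "hashtag" | "mention" | "emoji" | "other";

function classifyToken(token: string): TokenKind {
  if (/^https?:\/\/\S+$/.test(token)) return "url";
  if (/^#\w+$/.test(token)) return "hashtag";
  if (/^@\w+$/.test(token)) return "mention";
  // Rough emoji check: a surrogate pair encoding U+1F000..U+1FBFF.
  if (/^[\uD83C-\uD83E][\uDC00-\uDFFF]/.test(token)) return "emoji";
  return "other";
}

const sample = ["Check", "out", "https://example.com", "!", "#NLP", "@user", "🚀"];
console.log(sample.map(classifyToken));
// ["other", "other", "url", "other", "hashtag", "mention", "emoji"]
```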
