ASCII Tokenizer

countTokensAscii

Counts the total number of tokens in ASCII text using a SIMD-accelerated native implementation.

Parameter: text (string, required). The ASCII text to tokenize.

Returns: count (number). The total number of tokens in the text.

import { countTokensAscii } from 'bun_nltk';

const text = "Hello world! This is a test.";
const count = countTokensAscii(text);
console.log(count); // 6

tokenizeAsciiNative

Tokenizes ASCII text into an array of lowercase tokens using the native implementation.

Parameter: text (string, required). The ASCII text to tokenize.

Returns: tokens (string[]). An array of lowercase tokens extracted from the text.

import { tokenizeAsciiNative } from 'bun_nltk';

const text = "Hello World! How are you?";
const tokens = tokenizeAsciiNative(text);
console.log(tokens);
// ["hello", "world", "how", "are", "you"]
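For cross-checking results in tests, a plain TypeScript approximation of the documented behavior can be useful. The sketch below is not the library's algorithm: the function names are hypothetical, and it assumes a token is a maximal run of ASCII letters or digits, which matches the two examples on this page but may differ from the native implementation on inputs such as apostrophes or hyphens.

```typescript
// Hypothetical reference helpers; these are NOT part of bun_nltk.
// Assumption: a token is a maximal run of ASCII letters or digits,
// and every other character acts as a separator.
function referenceTokenizeAscii(text: string): string[] {
  // Lowercase first, then collect every alphanumeric run.
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function referenceCountTokensAscii(text: string): number {
  // Counting is defined here as the length of the token array.
  return referenceTokenizeAscii(text).length;
}

console.log(referenceTokenizeAscii("Hello World! How are you?"));
// ["hello", "world", "how", "are", "you"]
console.log(referenceCountTokensAscii("Hello world! This is a test."));
// 6
```

Comparing these helpers against the native functions on a small set of sample strings is one way to find out where the real token boundaries differ from this assumption.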

Notes

Tokens are automatically converted to lowercase
Uses SIMD vectorization for high performance
Optimized for ASCII text; may not handle Unicode correctly
Punctuation is typically filtered out during tokenization
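
Because the notes above warn that non-ASCII input may not be handled correctly, callers may want to check input before passing it to the native functions. The following is a minimal sketch of such a guard; `isAscii` is a hypothetical helper, not something bun_nltk is known to export.

```typescript
// Hypothetical helper; NOT part of bun_nltk.
// Returns true only if every UTF-16 code unit is in the 7-bit ASCII range.
function isAscii(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

console.log(isAscii("Hello world!")); // true
console.log(isAscii("café"));         // false
```

When the check fails, the caller can fall back to a Unicode-aware tokenizer or reject the input, depending on the application.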
