N-grams

ngramsAsciiNative

Extracts all n-grams of a specified size from text using native SIMD acceleration.

text

string

required

Input text to extract n-grams from

n

number

required

Size of n-grams to extract (must be a positive integer)

return

string[][]

Array of n-grams, where each n-gram is an array of n tokens

import { ngramsAsciiNative } from 'bun_nltk';

const text = "the quick brown fox";
const bigrams = ngramsAsciiNative(text, 2);
// Returns: [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]

const trigrams = ngramsAsciiNative(text, 3);
// Returns: [["the", "quick", "brown"], ["quick", "brown", "fox"]]

Uses a native implementation that encodes tokens as integer IDs, which keeps extraction fast on large texts.
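To make the token ID idea concrete, here is a minimal pure-TypeScript sketch of a sliding window over integer IDs. It is not the library's native code: the helper names `tokenizeAscii` and `ngramsViaTokenIds` are hypothetical, and the tokenization rule (lowercased runs of ASCII letters, digits, and apostrophes) is an assumption made only for illustration.

```typescript
// Assumed tokenization rule for this sketch; the library's rule may differ.
function tokenizeAscii(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Hypothetical reference: map each distinct token to a small integer,
// slide a window over the ID array, and decode IDs back to strings.
function ngramsViaTokenIds(text: string, n: number): string[][] {
  if (!Number.isInteger(n) || n <= 0) {
    throw new RangeError("n must be a positive integer");
  }
  const tokens = tokenizeAscii(text);
  const vocab: string[] = [];              // id -> token
  const idOf = new Map<string, number>();  // token -> id
  const ids = new Uint32Array(tokens.length);

  tokens.forEach((tok, i) => {
    let id = idOf.get(tok);
    if (id === undefined) {
      id = vocab.length;
      idOf.set(tok, id);
      vocab.push(tok);
    }
    ids[i] = id;
  });

  const out: string[][] = [];
  for (let i = 0; i + n <= ids.length; i++) {
    out.push(Array.from(ids.subarray(i, i + n), (id) => vocab[id]));
  }
  return out;
}
```

Working on a compact integer array rather than on strings is what makes the window loop cheap to scan; strings are only materialized when the result is decoded.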

everygramsAsciiNative

Extracts all n-grams within a length range (combines unigrams, bigrams, trigrams, etc.).

text

string

required

Input text to extract everygrams from

minLen

number

default: 1

Minimum n-gram length (must be a positive integer)

maxLen

number

default: Number.MAX_SAFE_INTEGER

Maximum n-gram length (must be a positive integer)

return

string[][]

Array of n-grams of varying lengths

import { everygramsAsciiNative } from 'bun_nltk';

const text = "the quick brown";
const grams = everygramsAsciiNative(text, 1, 2);
// Returns: [
//   ["the"],
//   ["the", "quick"],
//   ["quick"],
//   ["quick", "brown"],
//   ["brown"]
// ]

const text2 = "hello world";
const all = everygramsAsciiNative(text2, 1, 3);
// Returns: [["hello"], ["hello", "world"], ["world"]]
// (no trigram since only 2 tokens)

Useful for feature extraction when you want to capture patterns of multiple lengths.
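The output order in the examples above can be reproduced with a short reference loop. This is a sketch of the semantics only, assuming the grouping shown in the example comments (all lengths for one start position, then the next start position). `everygramsSketch` is a hypothetical name, and it takes pre-split tokens so that no tokenization rule has to be assumed.

```typescript
// Hypothetical reference: for each start position, emit every n-gram
// whose length lies in [minLen, maxLen] and fits in the remaining tokens.
function everygramsSketch(
  tokens: string[],
  minLen = 1,
  maxLen = Number.MAX_SAFE_INTEGER,
): string[][] {
  for (const v of [minLen, maxLen]) {
    if (!Number.isInteger(v) || v <= 0) {
      throw new RangeError("minLen and maxLen must be positive integers");
    }
  }
  const out: string[][] = [];
  for (let start = 0; start < tokens.length; start++) {
    // Cap the length so the window never runs past the last token.
    const longest = Math.min(maxLen, tokens.length - start);
    for (let len = minLen; len <= longest; len++) {
      out.push(tokens.slice(start, start + len));
    }
  }
  return out;
}
```

Calling `everygramsSketch(["hello", "world"], 1, 3)` produces no trigram for the same reason as the example above: the length is capped by the number of tokens remaining.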

skipgramsAsciiNative

Extracts skipgrams (n-grams that allow gaps between tokens).

text

string

required

Input text to extract skipgrams from

n

number

required

Number of tokens in each skipgram (must be a positive integer)

k

number

required

Maximum gap size between tokens (must be an integer >= 0)

return

string[][]

Array of skipgrams, where each skipgram has n tokens

import { skipgramsAsciiNative } from 'bun_nltk';

const text = "the quick brown fox jumps";

// Standard bigrams (k=0, no gaps)
const bigrams = skipgramsAsciiNative(text, 2, 0);
// Returns: [["the", "quick"], ["quick", "brown"], ["brown", "fox"], ["fox", "jumps"]]

// Skip-bigrams with gap up to 1 (k=1)
const skipBigrams = skipgramsAsciiNative(text, 2, 1);
// Returns: [
//   ["the", "quick"],  // no gap
//   ["the", "brown"],  // gap of 1
//   ["quick", "brown"], // no gap
//   ["quick", "fox"],   // gap of 1
//   ["brown", "fox"],   // no gap
//   ["brown", "jumps"], // gap of 1
//   ["fox", "jumps"]    // no gap
// ]

// Skip-trigrams with gap up to 2 (k=2)
const skipTrigrams = skipgramsAsciiNative(text, 3, 2);
// Generates all possible selections of 3 tokens
// within a window of 3+2=5 positions

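Large values of k can generate a very large number of skipgrams, because the count grows combinatorially with k. For n=3, k=2, each starting token can head up to 6 skipgrams (any 2 of the following 4 tokens), so a 100-token text yields several hundred skipgrams.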
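The windowing described in the example comments can be sketched in plain TypeScript. This is an illustrative reference, not the native implementation: `skipgramsSketch` is a hypothetical name, it takes pre-split tokens, and it assumes each skipgram starts at a fixed head token and draws its remaining n-1 tokens, in order, from the next n+k-1 positions.

```typescript
// Hypothetical reference for skipgram enumeration under the assumption above.
function skipgramsSketch(tokens: string[], n: number, k: number): string[][] {
  if (!Number.isInteger(n) || n <= 0) {
    throw new RangeError("n must be a positive integer");
  }
  if (!Number.isInteger(k) || k < 0) {
    throw new RangeError("k must be an integer >= 0");
  }
  const out: string[][] = [];

  // Extend the chosen positions with later positions inside the window
  // until n positions have been picked.
  const extend = (picked: number[], windowEnd: number): void => {
    if (picked.length === n) {
      out.push(picked.map((i) => tokens[i]));
      return;
    }
    const last = picked[picked.length - 1];
    for (let next = last + 1; next < windowEnd; next++) {
      extend([...picked, next], windowEnd);
    }
  };

  for (let start = 0; start < tokens.length; start++) {
    // The window covers the head token plus the next n + k - 1 positions.
    const windowEnd = Math.min(tokens.length, start + n + k);
    extend([start], windowEnd);
  }
  return out;
}
```

On the five-token example above, this sketch yields 4 skipgrams for n=2, k=0, 7 for n=2, k=1, and 10 for n=3, k=2, which shows how quickly the output grows as k increases.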

Performance Comparison

// ngramsAscii is the JS reference implementation
import { ngramsAscii, ngramsAsciiNative } from 'bun_nltk';

// Placeholder input: substitute real document text for "..."
const largeText = "...".repeat(10000);

// Native implementation (SIMD-accelerated)
console.time('native');
const result1 = ngramsAsciiNative(largeText, 3);
console.timeEnd('native'); // ~5ms

// JavaScript reference implementation
console.time('js');
const result2 = ngramsAscii(largeText, 3);
console.timeEnd('js'); // ~45ms
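Single `console.time` measurements like the ones above can be noisy, since the first call often includes warm-up cost. A minimal sketch of a steadier measurement follows; `medianMs` is a hypothetical helper (not part of bun_nltk) that relies only on the standard `performance.now()` timer.

```typescript
// Hypothetical helper: run fn a few times untimed, then return the
// median wall-clock duration in milliseconds over the timed runs.
function medianMs(fn: () => unknown, runs = 5, warmup = 2): number {
  if (!Number.isInteger(runs) || runs <= 0) {
    throw new RangeError("runs must be a positive integer");
  }
  for (let i = 0; i < warmup; i++) fn(); // warm-up calls are not timed

  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    fn();
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  return samples.length % 2 === 1
    ? samples[mid]
    : (samples[mid - 1] + samples[mid]) / 2;
}

// Usage (assumes the imports and largeText from the snippet above):
// const nativeMs = medianMs(() => ngramsAsciiNative(largeText, 3));
// const jsMs = medianMs(() => ngramsAscii(largeText, 3));
```

The median is used rather than the mean so that one slow outlier run does not skew the comparison.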
