
Overview

The native library includes performance-optimized implementations with SIMD (Single Instruction, Multiple Data) acceleration for CPU-intensive operations. Many functions have both SIMD-optimized and scalar fallback implementations.

SIMD-Optimized Functions

Token Counting

countTokensAscii() - SIMD Path

High-performance token counting with SIMD acceleration.
function countTokensAscii(text: string): number
Performance: Processes multiple bytes per CPU cycle using SIMD instructions.

countTokensAsciiScalar() - Scalar Fallback

Scalar implementation for platforms without SIMD support.
function countTokensAsciiScalar(text: string): number

Example & Benchmark

import { 
  countTokensAscii, 
  countTokensAsciiScalar 
} from "bun_nltk";

const text = "The quick brown fox jumps".repeat(1000);

// SIMD path (default)
console.time("SIMD");
const countSIMD = countTokensAscii(text);
console.timeEnd("SIMD"); // ~0.2ms

// Scalar fallback
console.time("Scalar");
const countScalar = countTokensAsciiScalar(text);
console.timeEnd("Scalar"); // ~0.8ms

console.log(countSIMD === countScalar); // true
SIMD path is typically 3-4x faster than scalar for token counting on long texts.
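To make the semantics concrete, here is a scalar reference in plain TypeScript showing what both paths compute. It assumes a token is a maximal run of non-whitespace ASCII bytes; this sketches the counting rule, not the native implementation:

```typescript
// Reference sketch: count whitespace-delimited ASCII tokens in one pass.
// A token begins wherever a non-whitespace byte follows whitespace (or start).
function countTokensAsciiRef(text: string): number {
  let count = 0;
  let inToken = false;
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    // ASCII whitespace: space (0x20), \t \n \v \f \r (0x09-0x0d)
    const isSpace = c === 0x20 || (c >= 0x09 && c <= 0x0d);
    if (!isSpace && !inToken) count++; // a new token starts here
    inToken = !isSpace;
  }
  return count;
}
```

The SIMD path applies the same rule but classifies 16-64 bytes per instruction instead of one.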

Token Normalization

countNormalizedTokensAscii() - SIMD Path

Counts normalized tokens with SIMD-accelerated character classification.
function countNormalizedTokensAscii(
  text: string, 
  removeStopwords?: boolean
): number

countNormalizedTokensAsciiScalar() - Scalar Fallback

Scalar implementation for normalized token counting.
function countNormalizedTokensAsciiScalar(
  text: string,
  removeStopwords?: boolean
): number

SIMD Optimizations

The SIMD path accelerates:
  • Lowercase conversion (vectorized case checks)
  • Whitespace detection (parallel byte comparisons)
  • Punctuation filtering (batch character class checks)
  • Stopword matching (SIMD hash lookups)

Example

const text = "The QUICK brown Fox Jumps Over the lazy DOG!".repeat(100);

// SIMD-optimized (default)
const countSIMD = countNormalizedTokensAscii(text, true);

// Scalar fallback
const countScalar = countNormalizedTokensAsciiScalar(text, true);

console.log(countSIMD === countScalar); // true
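The pipeline those bullets describe can be sketched in plain TypeScript. The stopword list below is a tiny stand-in for illustration; the native stopword set and exact punctuation rules may differ:

```typescript
// Illustrative sketch of the normalization pipeline the SIMD path accelerates:
// lowercase -> strip punctuation -> split on whitespace -> optional stopword filter.
const STOPWORDS = new Set(["the", "a", "an", "over", "and", "of", "to"]);

function countNormalizedTokensRef(text: string, removeStopwords = false): number {
  const tokens = text
    .toLowerCase()                 // lowercase conversion
    .replace(/[^a-z0-9\s]/g, " ")  // punctuation filtering
    .split(/\s+/)                  // whitespace detection
    .filter((t) => t.length > 0);
  if (!removeStopwords) return tokens.length;
  return tokens.filter((t) => !STOPWORDS.has(t)).length; // stopword matching
}
```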

Frequency Distribution Streaming

NativeFreqDistStream

Streaming frequency distribution with memory-efficient processing.
class NativeFreqDistStream {
  constructor();
  update(text: string): void;
  flush(): void;
  tokenUniqueCount(): number;
  bigramUniqueCount(): number;
  conditionalUniqueCount(): number;
  tokenFreqDistHash(): Map<bigint, number>;
  bigramFreqDistHash(): StreamBigramFreq[];
  conditionalFreqDistHash(): StreamConditionalFreq[];
  toJson(): string;
  dispose(): void;
}

Use Case: Large Document Processing

For processing large documents that don’t fit in memory:
import { NativeFreqDistStream } from "bun_nltk";

const stream = new NativeFreqDistStream();

// Process document in chunks (readFileInChunks is an application-supplied helper)
for await (const chunk of readFileInChunks("large.txt")) {
  stream.update(chunk);
}

stream.flush();

// Get frequency distributions
const tokenFreqs = stream.tokenFreqDistHash();
const bigramFreqs = stream.bigramFreqDistHash();

console.log(`Unique tokens: ${stream.tokenUniqueCount()}`);
console.log(`Unique bigrams: ${stream.bigramUniqueCount()}`);

stream.dispose();

Memory Efficiency

  • Incremental updates: Process text in chunks without loading entire document
  • Hash-based deduplication: O(1) unique token tracking
  • Native memory management: Frequency maps stored in native (Zig) memory, not on the JS heap

Example: Real-time Analysis

import { NativeFreqDistStream } from "bun_nltk";
import { writeFileSync } from "node:fs";

const stream = new NativeFreqDistStream();

// Accumulate from multiple sources
stream.update(tweet1);
stream.update(tweet2);
stream.update(tweet3);

// Get statistics anytime
console.log(`Running unique tokens: ${stream.tokenUniqueCount()}`);

// Continue updating
stream.update(tweet4);
stream.flush();

// Export to JSON
const json = stream.toJson();
writeFileSync("stats.json", json);

stream.dispose();

High-Performance APIs

computeAsciiMetrics()

Computes multiple metrics in a single pass.
function computeAsciiMetrics(text: string, n: number): AsciiMetrics

type AsciiMetrics = {
  tokens: number;
  uniqueTokens: number;
  ngrams: number;
  uniqueNgrams: number;
};

Performance Benefits

  • Single pass: One scan through text for all metrics
  • Cache-friendly: Sequential scanning minimizes cache misses
  • SIMD-accelerated: Vectorized text scanning

Example

import { readFileSync } from "node:fs";

const text = readFileSync("document.txt", "utf-8");

// Efficient: Single pass for all metrics
const metrics = computeAsciiMetrics(text, 2);

console.log(`Tokens: ${metrics.tokens}`);
console.log(`Unique: ${metrics.uniqueTokens}`);
console.log(`Bigrams: ${metrics.ngrams}`);
console.log(`Unique bigrams: ${metrics.uniqueNgrams}`);

// Inefficient alternative (multiple passes):
const tokens = countTokensAscii(text);
const uniqueTokens = countUniqueTokensAscii(text);
const ngrams = countNgramsAscii(text, 2);
const uniqueNgrams = countUniqueNgramsAscii(text, 2);
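A plain-TypeScript reference for what a single-pass metrics computation produces, under the assumption that ngrams counts all n-grams (tokens − n + 1) and uniqueNgrams deduplicates them:

```typescript
// Reference sketch of the four AsciiMetrics fields, computed in one pass
// over a whitespace tokenization. Field semantics are an assumption here.
type AsciiMetricsRef = {
  tokens: number;
  uniqueTokens: number;
  ngrams: number;
  uniqueNgrams: number;
};

function computeMetricsRef(text: string, n: number): AsciiMetricsRef {
  const toks = text.split(/\s+/).filter((t) => t.length > 0);
  const uniq = new Set<string>();
  const uniqN = new Set<string>();
  for (let i = 0; i < toks.length; i++) {
    uniq.add(toks[i]); // unique-token tracking
    if (i + n <= toks.length) uniqN.add(toks.slice(i, i + n).join(" ")); // unique n-grams
  }
  return {
    tokens: toks.length,
    uniqueTokens: uniq.size,
    ngrams: Math.max(0, toks.length - n + 1),
    uniqueNgrams: uniqN.size,
  };
}
```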

Collocation Analysis

topPmiBigramsAscii()

Finds top-scoring bigrams by Pointwise Mutual Information.
function topPmiBigramsAscii(
  text: string,
  topK: number,
  windowSize?: number
): PmiBigram[]

type PmiBigram = {
  leftHash: bigint;
  rightHash: bigint;
  score: number;
};

Performance Optimizations

  • Windowed scanning: Configurable context window
  • Top-K heap: O(n log k) selection of the highest-scoring bigrams
  • Hash-based: Avoids string allocations

Example

import { readFileSync } from "node:fs";

const text = readFileSync("corpus.txt", "utf-8");

// Find top 50 collocations with window=5
const collocations = topPmiBigramsAscii(text, 50, 5);

collocations.forEach(({ leftHash, rightHash, score }) => {
  console.log(`Hash pair: ${leftHash}-${rightHash}, PMI: ${score}`);
});
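For reference, the PMI score itself is derived from co-occurrence counts. The sketch below assumes the standard definition, PMI = log2(P(l,r) / (P(l) · P(r))); the native scoring may differ in smoothing or window normalization:

```typescript
// How a PMI score is derived from windowed counts (standard definition).
function pmi(
  pairCount: number,  // co-occurrences of (left, right) within the window
  leftCount: number,  // total occurrences of left
  rightCount: number, // total occurrences of right
  total: number       // total observations
): number {
  const pPair = pairCount / total;
  const pLeft = leftCount / total;
  const pRight = rightCount / total;
  return Math.log2(pPair / (pLeft * pRight));
}
```

A pair that co-occurs exactly as often as chance predicts scores 0; positive scores indicate collocations.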

bigramWindowStatsAscii()

Computes complete bigram statistics with windowing.
function bigramWindowStatsAscii(
  text: string,
  windowSize?: number
): BigramWindowStatToken[]

type BigramWindowStatToken = {
  left: string;
  right: string;
  leftId: number;
  rightId: number;
  count: number;
  pmi: number;
};

Example

const stats = bigramWindowStatsAscii("the quick brown fox", 2);

stats.forEach(({ left, right, count, pmi }) => {
  console.log(`${left} -> ${right}: count=${count}, pmi=${pmi.toFixed(3)}`);
});

Zero-Copy APIs

Some native functions return token IDs and hashes instead of strings to avoid allocations.

tokenFreqDistIdsAscii()

Returns frequency distribution with token IDs for zero-copy processing.
function tokenFreqDistIdsAscii(text: string): TokenFreqDistIds

type TokenFreqDistIds = {
  tokens: string[];              // Unique tokens
  counts: number[];              // Frequency counts
  tokenToId: Map<string, number>; // Token -> ID mapping
  totalTokens: number;           // Total token count
};

Example

const text = "the quick brown fox jumps over the lazy dog";
const freqDist = tokenFreqDistIdsAscii(text);

freqDist.tokens.forEach((token, id) => {
  console.log(`${token} (id=${id}): ${freqDist.counts[id]}`);
});

console.log(`Total tokens: ${freqDist.totalTokens}`);
console.log(`Unique tokens: ${freqDist.tokens.length}`);

Hash-Based APIs

tokenFreqDistHashAscii()

Returns frequency distribution using hashes (no string allocation).
function tokenFreqDistHashAscii(text: string): Map<bigint, number>

ngramFreqDistHashAscii()

Returns n-gram frequency distribution using hashes.
function ngramFreqDistHashAscii(
  text: string,
  n: number
): Map<bigint, number>

Performance Benefits

  • Zero string allocation: Uses 64-bit hashes instead of strings
  • Fast comparisons: Integer comparison vs string comparison
  • Compact storage: 8 bytes per hash vs variable string size

Example

// Hash-based (fastest)
const hashFreqs = tokenFreqDistHashAscii(text);
hashFreqs.forEach((count, hash) => {
  console.log(`Hash ${hash}: ${count}`);
});

// ID-based (with strings)
const idFreqs = tokenFreqDistIdsAscii(text);
idFreqs.tokens.forEach((token, id) => {
  console.log(`${token}: ${idFreqs.counts[id]}`);
});
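To illustrate how a hash-keyed frequency map works, here is a sketch using FNV-1a 64-bit as a stand-in hash. The library's actual hash function is not specified in these docs, so do not expect these keys to match native hashes:

```typescript
// Stand-in 64-bit hash (FNV-1a) to illustrate hash-keyed frequency maps.
function fnv1a64(s: string): bigint {
  let h = 0xcbf29ce484222325n;
  const prime = 0x100000001b3n;
  const mask = 0xffffffffffffffffn; // keep the product at 64 bits
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i) & 0xff);
    h = (h * prime) & mask;
  }
  return h;
}

// Frequency map keyed by hash: no per-token string retained after hashing.
function tokenFreqDistHashRef(text: string): Map<bigint, number> {
  const freqs = new Map<bigint, number>();
  for (const tok of text.split(/\s+/).filter((t) => t.length > 0)) {
    const h = fnv1a64(tok);
    freqs.set(h, (freqs.get(h) ?? 0) + 1);
  }
  return freqs;
}
```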

Batch Processing

perceptronPredictBatchNative()

Batch prediction for POS tagging.
function perceptronPredictBatchNative(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array,
  weights: Float32Array,
  modelFeatureCount: number,
  tagCount: number
): Uint16Array

Performance

  • Vectorized dot products: SIMD-accelerated weight × feature
  • Batch processing: Amortizes function call overhead
  • Zero-copy output: Returns typed array directly
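A scalar reference in plain TypeScript makes the batch layout concrete. The weight layout (row-major, weights[featureId * tagCount + tag]) and the offset convention (token i owns featureIds[offsets[i] .. offsets[i+1])) are assumptions for illustration, not the binding's documented contract:

```typescript
// Scalar reference for batch perceptron prediction: sum each token's
// feature weights per tag, then take the argmax tag.
function perceptronPredictBatchRef(
  featureIds: Uint32Array,
  tokenOffsets: Uint32Array, // length = tokenCount + 1
  weights: Float32Array,     // length = featureCount * tagCount, row-major by feature
  tagCount: number
): Uint16Array {
  const tokenCount = tokenOffsets.length - 1;
  const tags = new Uint16Array(tokenCount);
  const scores = new Float32Array(tagCount); // reused across tokens
  for (let t = 0; t < tokenCount; t++) {
    scores.fill(0);
    for (let f = tokenOffsets[t]; f < tokenOffsets[t + 1]; f++) {
      const base = featureIds[f] * tagCount;
      for (let tag = 0; tag < tagCount; tag++) {
        scores[tag] += weights[base + tag]; // the inner loop the SIMD path vectorizes
      }
    }
    let best = 0;
    for (let tag = 1; tag < tagCount; tag++) {
      if (scores[tag] > scores[best]) best = tag;
    }
    tags[t] = best;
  }
  return tags;
}
```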

linearScoresSparseIdsNative()

Sparse linear model scoring for classification.
function linearScoresSparseIdsNative(input: {
  docOffsets: Uint32Array;
  featureIds: Uint32Array;
  featureValues: Float64Array;
  classCount: number;
  featureCount: number;
  weights: Float64Array;
  bias: Float64Array;
}): Float64Array

Sparse Format Benefits

  • Memory efficient: Only store non-zero features
  • Fast for high-dimensional data: Skip zero-weight features
  • Cache-friendly: Sequential access patterns
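The sparse format above can be sketched as a scalar reference. The class-major weight layout (weights[c * featureCount + featureId]) is an assumption for illustration; consult the native binding for the real layout:

```typescript
// Scalar reference for sparse linear scoring:
// score[d * classCount + c] = bias[c] + sum over doc d's non-zero features
// of featureValues[i] * weights[c * featureCount + featureIds[i]].
function linearScoresSparseRef(input: {
  docOffsets: Uint32Array;   // length = docCount + 1
  featureIds: Uint32Array;
  featureValues: Float64Array;
  classCount: number;
  featureCount: number;
  weights: Float64Array;     // length = classCount * featureCount (class-major)
  bias: Float64Array;        // length = classCount
}): Float64Array {
  const { docOffsets, featureIds, featureValues, classCount, featureCount, weights, bias } = input;
  const docCount = docOffsets.length - 1;
  const scores = new Float64Array(docCount * classCount);
  for (let d = 0; d < docCount; d++) {
    for (let c = 0; c < classCount; c++) {
      let s = bias[c];
      // Only the document's non-zero features are visited.
      for (let i = docOffsets[d]; i < docOffsets[d + 1]; i++) {
        s += featureValues[i] * weights[c * featureCount + featureIds[i]];
      }
      scores[d * classCount + c] = s;
    }
  }
  return scores;
}
```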

Performance Best Practices

1. Use Batch APIs

// Good: Batch processing
const tags = perceptronPredictBatchNative(
  featureIds, tokenOffsets, weights, featureCount, tagCount
);

// Bad: Per-token processing
for (const token of tokens) {
  const tag = perceptronPredictSingle(token);
}

2. Reuse Typed Arrays

// Good: Reuse buffers
const outputBuffer = new Float64Array(classCount);
for (const doc of docs) {
  const scores = classifyWithBuffer(doc, outputBuffer);
}

// Bad: New allocation per call
for (const doc of docs) {
  const scores = new Float64Array(classify(doc));
}

3. Use Hash-Based APIs for Large Vocabularies

// Good: Hash-based for large corpus
const freqs = tokenFreqDistHashAscii(largeCorpus);

// Less efficient: String-based
const freqsStr = tokenFreqDistIdsAscii(largeCorpus);

4. Stream Large Documents

// Good: Streaming for large files
const stream = new NativeFreqDistStream();
for (const chunk of readFileInChunks("huge.txt")) {
  stream.update(chunk);
}

// Bad: Load entire file
const text = readFileSync("huge.txt", "utf-8");
const freqs = tokenFreqDistHashAscii(text);

SIMD Architecture Details

Supported Instruction Sets

  • x86_64: SSE4.2, AVX2, AVX-512 (runtime detection)
  • ARM64: NEON (always available on supported platforms)
  • Fallback: Scalar implementation for other platforms

Runtime Detection

The library automatically detects CPU capabilities and selects the best implementation:
// Automatically uses best available implementation
const count = countTokensAscii(text);
// -> Uses AVX2 on Intel Haswell+
// -> Uses NEON on ARM64
// -> Uses scalar on other platforms
