bun_nltk runs 3.64x to 840x faster than Python NLTK through a combination of:
- Zig native compilation to optimized machine code
- SIMD vectorization for hot paths (x86_64)
- Zero-copy memory access via FFI
- Efficient data structures (hash tables, arena allocators)
- Algorithmic optimizations specific to NLP tasks
This page explains the key optimizations and provides detailed benchmark data.
Benchmark Results
All benchmarks use a 64MB synthetic dataset (bench/datasets/synthetic.txt).
| Workload | Zig/Bun (sec) | Python (sec) | Speedup | Faster % |
|---|---|---|---|---|
| Token + unique + ngram + unique ngram | 2.767 | 10.071 | 3.64x | 263.93% |
| Top-K PMI collocations | 2.090 | 23.945 | 11.46x | 1045.90% |
| Porter stemming | 11.942 | 120.101 | 10.06x | 905.70% |
| WASM token/ngram path | 4.150 | 13.241 | 3.19x | 219.06% |
| Native vs Python (WASM suite) | 1.719 | 13.241 | 7.70x | 670.48% |
| Sentence tokenizer subset | 1.680 | 16.580 | 9.87x | 886.70% |
| Perceptron POS tagger | 19.880 | 82.849 | 4.17x | 316.75% |
| Streaming FreqDist + ConditionalFreqDist | 3.206 | 20.971 | 6.54x | 554.17% |
Gate dataset benchmarks for precision-critical workloads:
| Workload | Zig/Bun (sec) | Python (sec) | Speedup | Faster % |
|---|---|---|---|---|
| Punkt tokenizer default path | 0.0848 | 1.3463 | 15.87x | 1487.19% |
| N-gram LM (Kneser-Ney) | 0.1324 | 2.8661 | 21.64x | 2064.19% |
| Regexp chunk parser | 0.0024 | 1.5511 | 643x | 64208.28% |
| WordNet lookup + morphy | 0.0009 | 0.0835 | 91.55x | 9054.67% |
| CFG chart parser subset | 0.0088 | 0.3292 | 37.51x | 3651.05% |
| Naive Bayes text classifier | 0.0081 | 0.0112 | 1.38x | 38.40% |
| PCFG Viterbi chart parser | 0.0191 | 0.4153 | 21.80x | 2080.00% |
| MaxEnt text classifier | 0.0244 | 0.1824 | 7.46x | 646.00% |
| Sparse linear logits hot loop | 0.0024 | 2.0001 | 840x | 83954.04% |
| Decision tree text classifier | 0.0725 | 0.5720 | 7.89x | 688.55% |
| Earley parser workload | 0.1149 | 4.6483 | 40.47x | 3947.07% |
The sparse linear logits workload shows the extreme case: 840x faster due to cache-friendly native memory access patterns.
SIMD Acceleration Results
SIMD fast path vs scalar baseline (x86_64 only):
| Operation | SIMD (sec) | Scalar (sec) | Speedup |
|---|---|---|---|
| countTokensAscii | - | - | 1.22x |
| Normalization (no stopwords) | - | - | 2.73x |
SIMD paths are automatically enabled on x86_64 processors.
Key Optimization Techniques
1. SIMD Vectorization
SIMD (Single Instruction, Multiple Data) processes 16 bytes at once using CPU vector instructions.
Token Counting with SIMD
pub fn tokenCountAscii(input: []const u8) u64 {
if (input.len >= 64 and builtin.cpu.arch == .x86_64) {
return tokenCountAsciiSimd16(input); // SIMD path
}
return tokenCountAsciiScalar(input); // Scalar fallback
}
fn tokenCountAsciiSimd16(input: []const u8) u64 {
const lanes = 16;
const Vec = @Vector(lanes, u8);
var total: u64 = 0;
var in_token = false;
var idx: usize = 0;
// Process 16 bytes per iteration
while (idx + lanes <= input.len) : (idx += lanes) {
const vec: Vec = input[idx..][0..lanes].*;
const token_flags: [lanes]bool = tokenCharMask16(vec);
for (token_flags) |is_token| {
if (is_token) {
if (!in_token) {
total += 1;
in_token = true;
}
} else {
in_token = false;
}
}
}
// Handle remaining bytes
while (idx < input.len) : (idx += 1) {
// ... scalar path
}
return total;
}
Vectorized Character Classification
fn tokenCharMask16(chunk: @Vector(16, u8)) [16]bool {
const Vec = @Vector(16, u8);
const upper = (chunk >= @as(Vec, @splat('A'))) & (chunk <= @as(Vec, @splat('Z')));
const lower = (chunk >= @as(Vec, @splat('a'))) & (chunk <= @as(Vec, @splat('z')));
const digit = (chunk >= @as(Vec, @splat('0'))) & (chunk <= @as(Vec, @splat('9')));
const apostrophe = chunk == @as(Vec, @splat('\''));
return upper | lower | digit | apostrophe;
}
This classifies all 16 characters at once against the `[A-Za-z0-9']` token pattern.
Performance impact: 1.22x speedup for token counting, 2.73x for normalization.
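For reference, the scalar logic that the SIMD path accelerates can be sketched in TypeScript (a simplified model for illustration, not the shipped implementation):

```typescript
// A character belongs to a token if it matches [A-Za-z0-9'].
function isTokenChar(ch: string): boolean {
  return /[A-Za-z0-9']/.test(ch);
}

// Count token runs: increment only on a non-token -> token transition.
function countTokensAsciiScalar(input: string): number {
  let total = 0;
  let inToken = false;
  for (const ch of input) {
    if (isTokenChar(ch)) {
      if (!inToken) {
        total += 1;
        inToken = true;
      }
    } else {
      inToken = false;
    }
  }
  return total;
}
```

The SIMD version computes the same token/non-token flags for 16 bytes per instruction instead of one.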
2. Zero-Copy FFI
The native runtime avoids redundant copies: the input is UTF-8 encoded once, and the resulting buffer is passed to Zig by pointer rather than serialized:
function toBuffer(text: string): Uint8Array {
return new TextEncoder().encode(text);
}
export function countTokensAscii(text: string): number {
const bytes = toBuffer(text);
if (bytes.length === 0) return 0;
// ptr() passes direct pointer to bytes.buffer - zero copy
const value = lib.symbols.bunnltk_count_tokens_ascii(
ptr(bytes),
bytes.length
);
return toNumber(value);
}
Zig receives the pointer and accesses JavaScript memory directly:
export fn bunnltk_count_tokens_ascii(
input_ptr: [*]const u8,
input_len: usize
) u64 {
const input = input_ptr[0..input_len]; // Zero-copy slice
return ascii.tokenCountAscii(input);
}
Performance impact: Eliminates serialization and extra copies at the FFI boundary for all input data.
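One detail worth noting at this boundary: the length passed to Zig is the UTF-8 byte length, which can differ from the JavaScript string length:

```typescript
const text = "café";
const bytes = new TextEncoder().encode(text);

// "é" encodes to two UTF-8 bytes, so the byte length exceeds the string length.
console.log(text.length);  // 4 UTF-16 code units
console.log(bytes.length); // 5 UTF-8 bytes
```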
3. Efficient Hash Tables
Frequency distributions use FNV-1a hashing for fast token deduplication:
pub const FNV_OFFSET_BASIS: u64 = 14695981039346656037;
pub const FNV_PRIME: u64 = 1099511628211;
pub fn tokenHashUpdate(hash: u64, ch: u8) u64 {
var next = hash;
next ^= @as(u64, asciiLower(ch)); // XOR with lowercased char
next *%= FNV_PRIME; // Wrapping multiply
return next;
}
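For illustration, the same FNV-1a update can be modeled in TypeScript with BigInt arithmetic (the `hashToken` helper is hypothetical; the native code hashes bytes in place while scanning):

```typescript
// 64-bit FNV-1a constants, matching the Zig code above.
const FNV_OFFSET_BASIS = 14695981039346656037n;
const FNV_PRIME = 1099511628211n;
const MASK64 = (1n << 64n) - 1n;

// One FNV-1a step: XOR in the (lowercased) byte, then wrapping-multiply by the prime.
function tokenHashUpdate(hash: bigint, byte: number): bigint {
  const lower = byte >= 0x41 && byte <= 0x5a ? byte + 32 : byte;
  return ((hash ^ BigInt(lower)) * FNV_PRIME) & MASK64;
}

function hashToken(token: string): bigint {
  let hash = FNV_OFFSET_BASIS;
  for (const byte of new TextEncoder().encode(token)) {
    hash = tokenHashUpdate(hash, byte);
  }
  return hash;
}
```

Because bytes are lowercased before hashing, `"Hello"` and `"hello"` collapse to the same bucket.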
Tokens are hashed as they’re scanned, avoiding string allocation:
pub fn tokenFreqDistHashAscii(
input: []const u8,
allocator: std.mem.Allocator
) !std.AutoHashMap(u64, u64) {
var map = std.AutoHashMap(u64, u64).init(allocator);
var in_token = false;
var token_hash: u64 = FNV_OFFSET_BASIS;
for (input) |ch| {
if (ascii.isTokenChar(ch)) {
if (!in_token) {
in_token = true;
token_hash = FNV_OFFSET_BASIS;
}
token_hash = ascii.tokenHashUpdate(token_hash, ch);
} else if (in_token) {
const entry = try map.getOrPut(token_hash);
if (!entry.found_existing) {
entry.value_ptr.* = 0;
}
entry.value_ptr.* += 1;
in_token = false;
}
}
// Flush a trailing token that ends at end-of-input
if (in_token) {
const entry = try map.getOrPut(token_hash);
if (!entry.found_existing) entry.value_ptr.* = 0;
entry.value_ptr.* += 1;
}
return map;
}
Performance impact: Single-pass tokenization + frequency counting.
4. Arena Allocators
Temporary allocations use arena allocators that free in bulk:
pub fn computeNgramStats(
input: []const u8,
n: u32,
arena: *std.heap.ArenaAllocator
) !NgramStats {
const allocator = arena.allocator();
// All allocations use arena
const hashes = try collectTokenHashesAscii(input, allocator);
const unique_map = try buildNgramMap(hashes, n, allocator);
// Arena frees everything at once - no individual frees
return stats;
}
Performance impact: Reduces allocation overhead, eliminates fragmentation.
5. Inline Functions
Hot path functions are marked inline for zero-cost abstraction:
pub inline fn isTokenChar(ch: u8) bool {
return std.ascii.isAlphanumeric(ch) or ch == '\'';
}
pub inline fn asciiLower(ch: u8) u8 {
if (ch >= 'A' and ch <= 'Z') {
return ch + 32;
}
return ch;
}
Performance impact: Eliminates function call overhead in tight loops.
6. Memory Layout Optimization
Data structures are packed for cache efficiency:
const BigramStat = struct {
left_id: u32, // 4 bytes
right_id: u32, // 4 bytes
count: u64, // 8 bytes
pmi: f64, // 8 bytes
// Total: 24 bytes (fits within a single 64-byte cache line)
};
Compare to Python’s object overhead (48+ bytes per object).
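A rough TypeScript analogue of this layout uses parallel typed arrays (a struct-of-arrays sketch; `allocBigramStats` is illustrative, not part of the API), which reproduces the 24-bytes-per-record figure:

```typescript
// Parallel typed arrays: one contiguous buffer per field (struct-of-arrays).
function allocBigramStats(n: number) {
  return {
    leftId: new Uint32Array(n),   // 4 bytes per record
    rightId: new Uint32Array(n),  // 4 bytes per record
    count: new BigUint64Array(n), // 8 bytes per record
    pmi: new Float64Array(n),     // 8 bytes per record
  };
}

function bytesPerRecord(stats: ReturnType<typeof allocBigramStats>): number {
  const total =
    stats.leftId.byteLength +
    stats.rightId.byteLength +
    stats.count.byteLength +
    stats.pmi.byteLength;
  return total / stats.leftId.length;
}
```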
Memory Efficiency
Memory Usage Comparison
Processing 64MB text corpus:
| Runtime | Peak Memory | Description |
|---|---|---|
| Python NLTK | ~450 MB | Object overhead, GC pressure |
| bun_nltk native | ~120 MB | Compact native (Zig) structures |
| bun_nltk WASM | ~150 MB | WASM linear memory pool |
Memory Reuse in WASM
The WASM runtime reuses memory blocks across calls:
private ensureBlock(key: string, bytes: number): PoolBlock {
const existing = this.blocks.get(key);
if (existing && existing.bytes >= bytes) {
return existing; // Reuse existing block
}
if (existing) {
// Free old block
this.exports.bunnltk_wasm_free(existing.ptr, existing.bytes);
}
// Allocate larger block
const ptr = this.exports.bunnltk_wasm_alloc(bytes);
const block = { ptr, bytes };
this.blocks.set(key, block);
return block;
}
Example:
const wasm = await WasmNltk.init();
// Allocates "offsets" block (40KB)
wasm.tokenizeAscii(text1); // 10,000 tokens
// Reuses same block
wasm.tokenizeAscii(text2); // 8,000 tokens
// Reallocates larger block (80KB)
wasm.tokenizeAscii(text3); // 20,000 tokens
Algorithm-Specific Optimizations
Porter Stemmer
In-place string modification avoids allocations:
pub fn porterStem(input: []const u8, out: []u8) u32 {
// Copy to output buffer
const len = @min(input.len, out.len);
@memcpy(out[0..len], input[0..len]);
var m = Stem{ .b = out, .j = len };
// Modify in-place (no allocations)
m.step1ab();
m.step1c();
m.step2();
m.step3();
m.step4();
m.step5();
return m.j; // Final length
}
Result: 10.06x faster than Python NLTK’s Porter stemmer.
Punkt Sentence Tokenizer
Single-pass state machine with heuristics:
pub fn countSentencesPunktAscii(input: []const u8) u64 {
var count: u64 = 0;
var i: usize = 0;
while (i < input.len) {
const ch = input[i];
if (ch == '.' or ch == '!' or ch == '?') {
// Check for sentence boundary heuristics
if (isSentenceBoundary(input, i)) {
count += 1;
}
}
i += 1;
}
return count;
}
Result: 15.87x faster than Python NLTK’s Punkt tokenizer.
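`isSentenceBoundary` is not shown above; a deliberately simplified stand-in (far weaker than the real Punkt heuristics, with a toy abbreviation list assumed for illustration) might look like:

```typescript
const ABBREVIATIONS = new Set(["mr", "mrs", "dr", "etc", "e.g", "i.e"]);

// Simplified boundary check: terminator followed by whitespace + uppercase
// (or end of input), and the preceding word is not a known abbreviation.
function isSentenceBoundary(text: string, i: number): boolean {
  const next = text.slice(i + 1).match(/^\s+([A-Z])/);
  const atEnd = i === text.length - 1;
  if (!next && !atEnd) return false;
  // Find the word immediately before the terminator.
  const before = text.slice(0, i).match(/([A-Za-z.]+)$/);
  const word = before ? before[1].toLowerCase().replace(/\.$/, "") : "";
  return !ABBREVIATIONS.has(word);
}

function countSentences(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if ((ch === "." || ch === "!" || ch === "?") && isSentenceBoundary(text, i)) {
      count++;
    }
  }
  return count;
}
```

Real Punkt additionally learns abbreviations and collocations from the corpus rather than using a fixed list.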
Sparse Linear Scorer
Cache-friendly sparse matrix multiplication:
pub fn linearScoresSparseIds(
doc_offsets: []const u32,
feature_ids: []const u32,
feature_values: []const f64,
weights: []const f64,
bias: []const f64,
class_count: u32,
feature_count: u32,
out_scores: []f64
) void {
const doc_count = doc_offsets.len - 1;
for (0..doc_count) |doc_idx| {
const start = doc_offsets[doc_idx];
const end = doc_offsets[doc_idx + 1];
for (0..class_count) |class_idx| {
var score = bias[class_idx];
// Sparse dot product
for (start..end) |feat_idx| {
const fid = feature_ids[feat_idx];
const fval = feature_values[feat_idx];
const weight_idx = class_idx * feature_count + fid;
score += weights[weight_idx] * fval;
}
out_scores[doc_idx * class_count + class_idx] = score;
}
}
}
Result: 840x faster than Python scikit-learn for sparse scoring.
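The same CSR-style scoring loop can be sketched in TypeScript (illustrative only; the production path runs in native code):

```typescript
// CSR layout: docOffsets[d]..docOffsets[d+1] indexes this document's features.
function linearScoresSparse(
  docOffsets: Uint32Array,
  featureIds: Uint32Array,
  featureValues: Float64Array,
  weights: Float64Array, // row-major [classCount x featureCount]
  bias: Float64Array,
  classCount: number,
  featureCount: number,
): Float64Array {
  const docCount = docOffsets.length - 1;
  const out = new Float64Array(docCount * classCount);
  for (let d = 0; d < docCount; d++) {
    for (let c = 0; c < classCount; c++) {
      let score = bias[c];
      // Sparse dot product over this document's nonzero features.
      for (let f = docOffsets[d]; f < docOffsets[d + 1]; f++) {
        score += weights[c * featureCount + featureIds[f]] * featureValues[f];
      }
      out[d * classCount + c] = score;
    }
  }
  return out;
}
```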
Collocation Finder
Top-K PMI with single-pass counting:
pub fn topPmiBigramsWindow(
input: []const u8,
window_size: u32,
top_k: u32,
allocator: std.mem.Allocator
) ![]BigramStat {
// Collect token IDs (first pass)
const token_ids = try collectTokenIds(input, allocator);
// Count windowed bigrams (second pass)
var bigram_counts = std.AutoHashMap(BigramKey, u64).init(allocator);
for (0..token_ids.len) |i| {
const end = @min(i + window_size, token_ids.len);
for (i + 1..end) |j| {
const key = BigramKey{
.left = token_ids[i],
.right = token_ids[j]
};
const entry = try bigram_counts.getOrPut(key);
if (!entry.found_existing) entry.value_ptr.* = 0;
entry.value_ptr.* += 1;
}
}
// Compute PMI scores
const stats = try computePmiScores(bigram_counts, token_ids, allocator);
// Full sort by PMI, then slice the top-K
std.sort.pdq(BigramStat, stats, {}, comparePmi);
return stats[0..@min(top_k, stats.len)];
}
Result: 11.46x faster than Python NLTK’s collocation finder.
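The PMI score itself is standard pointwise mutual information; under the usual definition it can be computed from the raw counts as:

```typescript
// PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
//           = log2( (count(x, y) * total) / (count(x) * count(y)) )
function pmi(countXY: number, countX: number, countY: number, total: number): number {
  return Math.log2((countXY * total) / (countX * countY));
}
```

A PMI of 0 means the pair co-occurs exactly as often as chance predicts; positive values indicate a genuine collocation.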
Performance Best Practices
1. Batch Operations
Avoid repeated FFI calls by batching:
// ❌ Slow: 10,000 FFI calls
const stems = tokens.map(token => porterStemAscii(token));
// ✅ Fast: 1 FFI call
const stems = porterStemAsciiTokens(tokens);
2. Use Combined Metrics
Get multiple counts in one pass:
// ❌ Slow: 4 separate FFI calls
const tokenCount = countTokensAscii(text);
const uniqueTokens = countUniqueTokensAscii(text);
const ngramCount = countNgramsAscii(text, 2);
const uniqueNgrams = countUniqueNgramsAscii(text, 2);
// ✅ Fast: 1 FFI call for all 4 metrics
const metrics = computeAsciiMetrics(text, 2);
// => { tokens, uniqueTokens, ngrams, uniqueNgrams }
3. Pre-size Output Buffers
// ❌ Wasteful: Over-allocated buffer
const offsets = new Uint32Array(1000000); // Too large
// ✅ Optimal: Exact size
const capacity = countTokensAscii(text);
const offsets = new Uint32Array(capacity);
const lengths = new Uint32Array(capacity);
4. Reuse WASM Instances
// ❌ Slow: Initialize for each request
app.post('/analyze', async (req) => {
const wasm = await WasmNltk.init(); // 5-10ms overhead
const result = wasm.tokenizeAscii(req.body.text);
wasm.dispose();
return result;
});
// ✅ Fast: Initialize once
const wasm = await WasmNltk.init();
app.post('/analyze', async (req) => {
const result = wasm.tokenizeAscii(req.body.text);
return result;
});
5. Use Native Runtime for Large Data
// For datasets > 10MB, use the native runtime
import fs from 'node:fs';
import { tokenizeAsciiNative } from 'bun_nltk/src/native';
const largeCorpus = fs.readFileSync('large.txt', 'utf-8');
const tokens = tokenizeAsciiNative(largeCorpus); // 2-3x faster than WASM
Profiling and Benchmarking
Run Built-in Benchmarks
# Generate 64MB synthetic dataset
bun run bench:generate
# Compare against Python NLTK
bun run bench:compare
bun run bench:compare:collocations
bun run bench:compare:porter
bun run bench:compare:punkt
bun run bench:compare:lm
# Native vs WASM comparison
bun run bench:compare:wasm
# SIMD vs scalar comparison
bun run bench:compare:simd
Custom Benchmarks
import fs from 'node:fs';
import { countTokensAscii } from 'bun_nltk/src/native';
const text = fs.readFileSync('corpus.txt', 'utf-8');
const iterations = 100;
const start = performance.now();
for (let i = 0; i < iterations; i++) {
countTokensAscii(text);
}
const elapsed = performance.now() - start;
const perIteration = elapsed / iterations;
console.log(`Avg time: ${perIteration.toFixed(2)}ms`);
Performance Gates
bun_nltk includes automated performance gates in CI:
# Run performance regression check
bun run bench:gate
# Run SLA gate (p95 latency + memory)
bun run sla:gate
SLA thresholds are defined in scripts/sla-gate.ts.
Case Study: Text Classification Pipeline
Task: Classify 100,000 documents (avg 500 tokens each)
| Implementation | Time | Memory |
|---|---|---|
| Python NLTK + scikit-learn | 145 sec | 1.2 GB |
| bun_nltk native + JS classifier | 18 sec | 320 MB |
| Improvement | 8x faster | 3.75x less |
Pipeline:
- Tokenization
- Stopword removal
- Porter stemming
- TF-IDF vectorization
- Logistic regression classification
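The stages above can be sketched end to end (a toy stopword list and suffix stripper stand in for the real stopword corpus and Porter stemmer; this is a sketch, not the benchmarked pipeline):

```typescript
const STOPWORDS = new Set(["the", "a", "is", "and", "of"]);

// Toy stemmer: strips a few common suffixes; not the real Porter algorithm.
function toyStem(token: string): string {
  return token.replace(/(ing|ed|s)$/, "");
}

// Tokenize, drop stopwords, stem.
function preprocess(doc: string): string[] {
  return (doc.toLowerCase().match(/[a-z0-9']+/g) ?? [])
    .filter((t) => !STOPWORDS.has(t))
    .map(toyStem);
}

// Term-frequency counts: the raw input to TF-IDF weighting.
function termFrequencies(tokens: string[]): Map<string, number> {
  const tf = new Map<string, number>();
  for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
  return tf;
}
```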
Case Study: Real-Time Sentence Segmentation
Task: Segment streaming news articles (avg 2KB each)
| Implementation | p50 Latency | p95 Latency | Throughput |
|---|---|---|---|
| Python NLTK | 45ms | 120ms | 22 docs/sec |
| bun_nltk native | 3ms | 8ms | 333 docs/sec |
| Improvement | 15x faster | 15x faster | 15x more |
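Percentile latencies like these can be measured with a small nearest-rank helper over collected samples (an illustrative utility, not part of the library):

```typescript
// Nearest-rank percentile: p in [0, 100] over a list of latency samples (ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```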
Summary
bun_nltk’s performance advantages come from:
- Native compilation: Zig compiles to optimized machine code
- SIMD acceleration: 16-byte parallel processing on x86_64
- Zero-copy FFI: Direct memory access, no serialization
- Efficient algorithms: Single-pass processing, cache-friendly data structures
- Memory efficiency: Arena allocators, memory pooling, compact layouts
For detailed API usage, see the API Reference. For architecture details, see Architecture.