bun_nltk runs 3.64x to 840x faster than Python NLTK through a combination of:
- Zig native compilation to optimized machine code
- SIMD vectorization for hot paths (x86_64)
- Zero-copy memory access via FFI
- Efficient data structures (hash tables, arena allocators)
- Algorithmic optimizations specific to NLP tasks
This page explains the key optimizations and provides detailed benchmark data.
Benchmark Results
All benchmarks use a 64MB synthetic dataset (bench/datasets/synthetic.txt).
| Workload | Zig/Bun (sec) | Python (sec) | Speedup | Faster % |
|---|---|---|---|---|
| Token + unique + ngram + unique ngram | 2.767 | 10.071 | 3.64x | 263.93% |
| Top-K PMI collocations | 2.090 | 23.945 | 11.46x | 1045.90% |
| Porter stemming | 11.942 | 120.101 | 10.06x | 905.70% |
| WASM token/ngram path | 4.150 | 13.241 | 3.19x | 219.06% |
| Native vs Python (WASM suite) | 1.719 | 13.241 | 7.70x | 670.48% |
| Sentence tokenizer subset | 1.680 | 16.580 | 9.87x | 886.70% |
| Perceptron POS tagger | 19.880 | 82.849 | 4.17x | 316.75% |
| Streaming FreqDist + ConditionalFreqDist | 3.206 | 20.971 | 6.54x | 554.17% |
Gate dataset benchmarks for precision-critical workloads:
| Workload | Zig/Bun (sec) | Python (sec) | Speedup | Faster % |
|---|---|---|---|---|
| Punkt tokenizer default path | 0.0848 | 1.3463 | 15.87x | 1487.19% |
| N-gram LM (Kneser-Ney) | 0.1324 | 2.8661 | 21.64x | 2064.19% |
| Regexp chunk parser | 0.0024 | 1.5511 | 643x | 64208.28% |
| WordNet lookup + morphy | 0.0009 | 0.0835 | 91.55x | 9054.67% |
| CFG chart parser subset | 0.0088 | 0.3292 | 37.51x | 3651.05% |
| Naive Bayes text classifier | 0.0081 | 0.0112 | 1.38x | 38.40% |
| PCFG Viterbi chart parser | 0.0191 | 0.4153 | 21.80x | 2080.00% |
| MaxEnt text classifier | 0.0244 | 0.1824 | 7.46x | 646.00% |
| Sparse linear logits hot loop | 0.0024 | 2.0001 | 840x | 83954.04% |
| Decision tree text classifier | 0.0725 | 0.5720 | 7.89x | 688.55% |
| Earley parser workload | 0.1149 | 4.6483 | 40.47x | 3947.07% |
The sparse linear logits workload shows the extreme case: 840x faster due to cache-friendly native memory access patterns.
SIMD Acceleration Results
SIMD fast path vs scalar baseline (x86_64 only):
| Operation | SIMD (sec) | Scalar (sec) | Speedup |
|---|---|---|---|
| countTokensAscii | - | - | 1.22x |
| Normalization (no stopwords) | - | - | 2.73x |
SIMD paths are automatically enabled on x86_64 processors.
Key Optimization Techniques
1. SIMD Vectorization
SIMD (Single Instruction, Multiple Data) processes 16 bytes at once using CPU vector instructions.
Token Counting with SIMD
pub fn tokenCountAscii(input: []const u8) u64 {
if (input.len >= 64 and builtin.cpu.arch == .x86_64) {
return tokenCountAsciiSimd16(input); // SIMD path
}
return tokenCountAsciiScalar(input); // Scalar fallback
}
fn tokenCountAsciiSimd16(input: []const u8) u64 {
const lanes = 16;
const Vec = @Vector(lanes, u8);
var total: u64 = 0;
var in_token = false;
var idx: usize = 0;
// Process 16 bytes per iteration
while (idx + lanes <= input.len) : (idx += lanes) {
const vec: Vec = input[idx..][0..lanes].*;
const token_flags: [lanes]bool = tokenCharMask16(vec);
for (token_flags) |is_token| {
if (is_token) {
if (!in_token) {
total += 1;
in_token = true;
}
} else {
in_token = false;
}
}
}
// Handle remaining bytes
while (idx < input.len) : (idx += 1) {
// ... scalar path
}
return total;
}
Vectorized Character Classification
fn tokenCharMask16(chunk: @Vector(16, u8)) [16]bool {
const Vec = @Vector(16, u8);
const upper = (chunk >= @as(Vec, @splat('A'))) & (chunk <= @as(Vec, @splat('Z')));
const lower = (chunk >= @as(Vec, @splat('a'))) & (chunk <= @as(Vec, @splat('z')));
const digit = (chunk >= @as(Vec, @splat('0'))) & (chunk <= @as(Vec, @splat('9')));
const apostrophe = chunk == @as(Vec, @splat('\''));
return upper | lower | digit | apostrophe;
}
This classifies all 16 characters at once against the `[A-Za-z0-9']` token pattern.
Performance impact: 1.22x speedup for token counting, 2.73x for normalization.
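For reference, the scalar logic that the SIMD path accelerates can be sketched in TypeScript (a simplified model for illustration, not the shipped implementation):

```typescript
// A character belongs to a token if it matches [A-Za-z0-9'].
function isTokenChar(ch: string): boolean {
  return /[A-Za-z0-9']/.test(ch);
}

// Count token runs: increment only on a non-token -> token transition.
function countTokensAsciiScalar(input: string): number {
  let total = 0;
  let inToken = false;
  for (const ch of input) {
    if (isTokenChar(ch)) {
      if (!inToken) {
        total += 1;
        inToken = true;
      }
    } else {
      inToken = false;
    }
  }
  return total;
}
```

The SIMD version computes the same token/non-token flags for 16 bytes per instruction instead of one.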
2. Zero-Copy FFI
The native runtime avoids redundant copies: the input is UTF-8 encoded once, and the resulting buffer is passed to Zig by pointer rather than serialized:
function toBuffer(text: string): Uint8Array {
return new TextEncoder().encode(text);
}
export function countTokensAscii(text: string): number {
const bytes = toBuffer(text);
if (bytes.length === 0) return 0;
// ptr() passes direct pointer to bytes.buffer - zero copy
const value = lib.symbols.bunnltk_count_tokens_ascii(
ptr(bytes),
bytes.length
);
return toNumber(value);
}
Zig receives the pointer and accesses JavaScript memory directly:
export fn bunnltk_count_tokens_ascii(
input_ptr: [*]const u8,
input_len: usize
) u64 {
const input = input_ptr[0..input_len]; // Zero-copy slice
return ascii.tokenCountAscii(input);
}
Performance impact: Eliminates serialization and extra copies at the FFI boundary for all input data.
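One detail worth noting at this boundary: the length passed to Zig is the UTF-8 byte length, which can differ from the JavaScript string length:

```typescript
const text = "café";
const bytes = new TextEncoder().encode(text);

// "é" encodes to two UTF-8 bytes, so the byte length exceeds the string length.
console.log(text.length);  // 4 UTF-16 code units
console.log(bytes.length); // 5 UTF-8 bytes
```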
3. Efficient Hash Tables
Frequency distributions use FNV-1a hashing for fast token deduplication:
pub const FNV_OFFSET_BASIS: u64 = 14695981039346656037;
pub const FNV_PRIME: u64 = 1099511628211;
pub fn tokenHashUpdate(hash: u64, ch: u8) u64 {
var next = hash;
next ^= @as(u64, asciiLower(ch)); // XOR with lowercased char
next *%= FNV_PRIME; // Wrapping multiply
return next;
}
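For illustration, the same FNV-1a update can be modeled in TypeScript with BigInt arithmetic (the `hashToken` helper is hypothetical; the native code hashes bytes in place while scanning):

```typescript
// 64-bit FNV-1a constants, matching the Zig code above.
const FNV_OFFSET_BASIS = 14695981039346656037n;
const FNV_PRIME = 1099511628211n;
const MASK64 = (1n << 64n) - 1n;

// One FNV-1a step: XOR in the (lowercased) byte, then wrapping-multiply by the prime.
function tokenHashUpdate(hash: bigint, byte: number): bigint {
  const lower = byte >= 0x41 && byte <= 0x5a ? byte + 32 : byte;
  return ((hash ^ BigInt(lower)) * FNV_PRIME) & MASK64;
}

function hashToken(token: string): bigint {
  let hash = FNV_OFFSET_BASIS;
  for (const byte of new TextEncoder().encode(token)) {
    hash = tokenHashUpdate(hash, byte);
  }
  return hash;
}
```

Because bytes are lowercased before hashing, `"Hello"` and `"hello"` collapse to the same bucket.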
Tokens are hashed as they’re scanned, avoiding string allocation:
pub fn tokenFreqDistHashAscii(
input: []const u8,
allocator: std.mem.Allocator
) !std.AutoHashMap(u64, u64) {
var map = std.AutoHashMap(u64, u64).init(allocator);
var in_token = false;
var token_hash: u64 = FNV_OFFSET_BASIS;
for (input) |ch| {
if (ascii.isTokenChar(ch)) {
if (!in_token) {
in_token = true;
token_hash = FNV_OFFSET_BASIS;
}
token_hash = ascii.tokenHashUpdate(token_hash, ch);
} else if (in_token) {
const entry = try map.getOrPut(token_hash);
if (!entry.found_existing) {
entry.value_ptr.* = 0;
}
entry.value_ptr.* += 1;
in_token = false;
}
}
// Flush a trailing token that ends at end-of-input
if (in_token) {
const entry = try map.getOrPut(token_hash);
if (!entry.found_existing) entry.value_ptr.* = 0;
entry.value_ptr.* += 1;
}
return map;
}
Performance impact: Single-pass tokenization + frequency counting.
4. Arena Allocators
Temporary allocations use arena allocators that free in bulk:
pub fn computeNgramStats(
input: []const u8,
n: u32,
arena: *std.heap.ArenaAllocator
) !NgramStats {
const allocator = arena.allocator();
// All allocations use arena
const hashes = try collectTokenHashesAscii(input, allocator);
const unique_map = try buildNgramMap(hashes, n, allocator);
// Arena frees everything at once - no individual frees
return stats;
}
Performance impact: Reduces allocation overhead, eliminates fragmentation.
5. Inline Functions
Hot path functions are marked inline for zero-cost abstraction:
pub inline fn isTokenChar(ch: u8) bool {
return std.ascii.isAlphanumeric(ch) or ch == '\'';
}
pub inline fn asciiLower(ch: u8) u8 {
if (ch >= 'A' and ch <= 'Z') {
return ch + 32;
}
return ch;
}
Performance impact: Eliminates function call overhead in tight loops.
6. Memory Layout Optimization
Data structures are packed for cache efficiency:
const BigramStat = struct {
left_id: u32, // 4 bytes
right_id: u32, // 4 bytes
count: u64, // 8 bytes
pmi: f64, // 8 bytes
// Total: 24 bytes (fits within a single 64-byte cache line)
};
Compare to Python’s object overhead (48+ bytes per object).
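A rough TypeScript analogue of this layout uses parallel typed arrays (a struct-of-arrays sketch; `allocBigramStats` is illustrative, not part of the API), which reproduces the 24-bytes-per-record figure:

```typescript
// Parallel typed arrays: one contiguous buffer per field (struct-of-arrays).
function allocBigramStats(n: number) {
  return {
    leftId: new Uint32Array(n),   // 4 bytes per record
    rightId: new Uint32Array(n),  // 4 bytes per record
    count: new BigUint64Array(n), // 8 bytes per record
    pmi: new Float64Array(n),     // 8 bytes per record
  };
}

function bytesPerRecord(stats: ReturnType<typeof allocBigramStats>): number {
  const total =
    stats.leftId.byteLength +
    stats.rightId.byteLength +
    stats.count.byteLength +
    stats.pmi.byteLength;
  return total / stats.leftId.length;
}
```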
Memory Efficiency
Memory Usage Comparison
Processing 64MB text corpus:
| Runtime | Peak Memory | Description |
|---|---|---|
| Python NLTK | ~450 MB | Object overhead, GC pressure |
| bun_nltk native | ~120 MB | Compact native (Zig) structures |
| bun_nltk WASM | ~150 MB | WASM linear memory pool |
Memory Reuse in WASM
The WASM runtime reuses memory blocks across calls:
private ensureBlock(key: string, bytes: number): PoolBlock {
const existing = this.blocks.get(key);
if (existing && existing.bytes >= bytes) {
return existing; // Reuse existing block
}
if (existing) {
// Free old block
this.exports.bunnltk_wasm_free(existing.ptr, existing.bytes);
}
// Allocate larger block
const ptr = this.exports.bunnltk_wasm_alloc(bytes);
const block = { ptr, bytes };
this.blocks.set(key, block);
return block;
}
Example:
const wasm = await WasmNltk.init();
// Allocates "offsets" block (40KB)
wasm.tokenizeAscii(text1); // 10,000 tokens
// Reuses same block
wasm.tokenizeAscii(text2); // 8,000 tokens
// Reallocates larger block (80KB)
wasm.tokenizeAscii(text3); // 20,000 tokens
Algorithm-Specific Optimizations
Porter Stemmer
In-place string modification avoids allocations:
pub fn porterStem(input: []const u8, out: []u8) u32 {
// Copy to output buffer
const len = @min(input.len, out.len);
@memcpy(out[0..len], input[0..len]);
var m = Stem{ .b = out, .j = len };
// Modify in-place (no allocations)
m.step1ab();
m.step1c();
m.step2();
m.step3();
m.step4();
m.step5();
return m.j; // Final length
}
Result: 10.06x faster than Python NLTK’s Porter stemmer.
Punkt Sentence Tokenizer
Single-pass state machine with heuristics:
pub fn countSentencesPunktAscii(input: []const u8) u64 {
var count: u64 = 0;
var i: usize = 0;
while (i < input.len) {
const ch = input[i];
if (ch == '.' or ch == '!' or ch == '?') {
// Check for sentence boundary heuristics
if (isSentenceBoundary(input, i)) {
count += 1;
}
}
i += 1;
}
return count;
}
Result: 15.87x faster than Python NLTK’s Punkt tokenizer.
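`isSentenceBoundary` is not shown above; a deliberately simplified stand-in (far weaker than the real Punkt heuristics, with a toy abbreviation list assumed for illustration) might look like:

```typescript
const ABBREVIATIONS = new Set(["mr", "mrs", "dr", "etc", "e.g", "i.e"]);

// Simplified boundary check: terminator followed by whitespace + uppercase
// (or end of input), and the preceding word is not a known abbreviation.
function isSentenceBoundary(text: string, i: number): boolean {
  const next = text.slice(i + 1).match(/^\s+([A-Z])/);
  const atEnd = i === text.length - 1;
  if (!next && !atEnd) return false;
  // Find the word immediately before the terminator.
  const before = text.slice(0, i).match(/([A-Za-z.]+)$/);
  const word = before ? before[1].toLowerCase().replace(/\.$/, "") : "";
  return !ABBREVIATIONS.has(word);
}

function countSentences(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if ((ch === "." || ch === "!" || ch === "?") && isSentenceBoundary(text, i)) {
      count++;
    }
  }
  return count;
}
```

Real Punkt additionally learns abbreviations and collocations from the corpus rather than using a fixed list.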
Sparse Linear Scorer
Cache-friendly sparse matrix multiplication:
pub fn linearScoresSparseIds(
doc_offsets: []const u32,
feature_ids: []const u32,
feature_values: []const f64,
weights: []const f64,
bias: []const f64,
class_count: u32,
feature_count: u32,
out_scores: []f64
) void {
const doc_count = doc_offsets.len - 1;
for (0..doc_count) |doc_idx| {
const start = doc_offsets[doc_idx];
const end = doc_offsets[doc_idx + 1];
for (0..class_count) |class_idx| {
var score = bias[class_idx];
// Sparse dot product
for (start..end) |feat_idx| {
const fid = feature_ids[feat_idx];
const fval = feature_values[feat_idx];
const weight_idx = class_idx * feature_count + fid;
score += weights[weight_idx] * fval;
}
out_scores[doc_idx * class_count + class_idx] = score;
}
}
}
Result: 840x faster than Python scikit-learn for sparse scoring.
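The same CSR-style scoring loop can be sketched in TypeScript (illustrative only; the production path runs in native code):

```typescript
// CSR layout: docOffsets[d]..docOffsets[d+1] indexes this document's features.
function linearScoresSparse(
  docOffsets: Uint32Array,
  featureIds: Uint32Array,
  featureValues: Float64Array,
  weights: Float64Array, // row-major [classCount x featureCount]
  bias: Float64Array,
  classCount: number,
  featureCount: number,
): Float64Array {
  const docCount = docOffsets.length - 1;
  const out = new Float64Array(docCount * classCount);
  for (let d = 0; d < docCount; d++) {
    for (let c = 0; c < classCount; c++) {
      let score = bias[c];
      // Sparse dot product over this document's nonzero features.
      for (let f = docOffsets[d]; f < docOffsets[d + 1]; f++) {
        score += weights[c * featureCount + featureIds[f]] * featureValues[f];
      }
      out[d * classCount + c] = score;
    }
  }
  return out;
}
```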
Collocation Finder
Top-K PMI with single-pass counting:
pub fn topPmiBigramsWindow(
input: []const u8,
window_size: u32,
top_k: u32,
allocator: std.mem.Allocator
) ![]BigramStat {
// Collect token IDs (first pass)
const token_ids = try collectTokenIds(input, allocator);
// Count windowed bigrams (second pass)
var bigram_counts = std.AutoHashMap(BigramKey, u64).init(allocator);
for (0..token_ids.len) |i| {
const end = @min(i + window_size, token_ids.len);
for (i + 1..end) |j| {
const key = BigramKey{
.left = token_ids[i],
.right = token_ids[j]
};
const entry = try bigram_counts.getOrPut(key);
if (!entry.found_existing) entry.value_ptr.* = 0;
entry.value_ptr.* += 1;
}
}
// Compute PMI scores
const stats = try computePmiScores(bigram_counts, token_ids, allocator);
// Full sort by PMI, then slice the top-K
std.sort.pdq(BigramStat, stats, {}, comparePmi);
return stats[0..@min(top_k, stats.len)];
}
Result: 11.46x faster than Python NLTK’s collocation finder.
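The PMI score itself is standard pointwise mutual information; under the usual definition it can be computed from the raw counts as:

```typescript
// PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
//           = log2( (count(x, y) * total) / (count(x) * count(y)) )
function pmi(countXY: number, countX: number, countY: number, total: number): number {
  return Math.log2((countXY * total) / (countX * countY));
}
```

A PMI of 0 means the pair co-occurs exactly as often as chance predicts; positive values indicate a genuine collocation.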
Performance Best Practices
1. Batch Operations
Avoid repeated FFI calls by batching:
// ❌ Slow: 10,000 FFI calls
const stems = tokens.map(token => porterStemAscii(token));
// ✅ Fast: 1 FFI call
const stems = porterStemAsciiTokens(tokens);
2. Use Combined Metrics
Get multiple counts in one pass:
// ❌ Slow: 4 separate FFI calls
const tokenCount = countTokensAscii(text);
const uniqueTokens = countUniqueTokensAscii(text);
const ngramCount = countNgramsAscii(text, 2);
const uniqueNgrams = countUniqueNgramsAscii(text, 2);
// ✅ Fast: 1 FFI call for all 4 metrics
const metrics = computeAsciiMetrics(text, 2);
// => { tokens, uniqueTokens, ngrams, uniqueNgrams }
3. Pre-size Output Buffers
// ❌ Wasteful: Over-allocated buffer
const offsets = new Uint32Array(1000000); // Too large
// ✅ Optimal: Exact size
const capacity = countTokensAscii(text);
const offsets = new Uint32Array(capacity);
const lengths = new Uint32Array(capacity);
4. Reuse WASM Instances
// ❌ Slow: Initialize for each request
app.post('/analyze', async (req) => {
const wasm = await WasmNltk.init(); // 5-10ms overhead
const result = wasm.tokenizeAscii(req.body.text);
wasm.dispose();
return result;
});
// ✅ Fast: Initialize once
const wasm = await WasmNltk.init();
app.post('/analyze', async (req) => {
const result = wasm.tokenizeAscii(req.body.text);
return result;
});
5. Use Native Runtime for Large Data
// For datasets > 10MB, use the native runtime
import fs from 'node:fs';
import { tokenizeAsciiNative } from 'bun_nltk/src/native';
const largeCorpus = fs.readFileSync('large.txt', 'utf-8');
const tokens = tokenizeAsciiNative(largeCorpus); // 2-3x faster than WASM
Profiling and Benchmarking
Run Built-in Benchmarks
# Generate 64MB synthetic dataset
bun run bench:generate
# Compare against Python NLTK
bun run bench:compare
bun run bench:compare:collocations
bun run bench:compare:porter
bun run bench:compare:punkt
bun run bench:compare:lm
# Native vs WASM comparison
bun run bench:compare:wasm
# SIMD vs scalar comparison
bun run bench:compare:simd
Custom Benchmarks
import fs from 'node:fs';
import { countTokensAscii } from 'bun_nltk/src/native';
const text = fs.readFileSync('corpus.txt', 'utf-8');
const iterations = 100;
const start = performance.now();
for (let i = 0; i < iterations; i++) {
countTokensAscii(text);
}
const elapsed = performance.now() - start;
const perIteration = elapsed / iterations;
console.log(`Avg time: ${perIteration.toFixed(2)}ms`);
Performance Gates
bun_nltk includes automated performance gates in CI:
# Run performance regression check
bun run bench:gate
# Run SLA gate (p95 latency + memory)
bun run sla:gate
SLA thresholds are defined in scripts/sla-gate.ts.
Case Study: Text Classification Pipeline
Task: Classify 100,000 documents (avg 500 tokens each)
| Implementation | Time | Memory |
|---|---|---|
| Python NLTK + scikit-learn | 145 sec | 1.2 GB |
| bun_nltk native + JS classifier | 18 sec | 320 MB |
| Improvement | 8x faster | 3.75x less |
Pipeline:
- Tokenization
- Stopword removal
- Porter stemming
- TF-IDF vectorization
- Logistic regression classification
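The stages above can be sketched end to end (a toy stopword list and suffix stripper stand in for the real stopword corpus and Porter stemmer; this is a sketch, not the benchmarked pipeline):

```typescript
const STOPWORDS = new Set(["the", "a", "is", "and", "of"]);

// Toy stemmer: strips a few common suffixes; not the real Porter algorithm.
function toyStem(token: string): string {
  return token.replace(/(ing|ed|s)$/, "");
}

// Tokenize, drop stopwords, stem.
function preprocess(doc: string): string[] {
  return (doc.toLowerCase().match(/[a-z0-9']+/g) ?? [])
    .filter((t) => !STOPWORDS.has(t))
    .map(toyStem);
}

// Term-frequency counts: the raw input to TF-IDF weighting.
function termFrequencies(tokens: string[]): Map<string, number> {
  const tf = new Map<string, number>();
  for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
  return tf;
}
```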
Case Study: Real-Time Sentence Segmentation
Task: Segment streaming news articles (avg 2KB each)
| Implementation | p50 Latency | p95 Latency | Throughput |
|---|---|---|---|
| Python NLTK | 45ms | 120ms | 22 docs/sec |
| bun_nltk native | 3ms | 8ms | 333 docs/sec |
| Improvement | 15x faster | 15x faster | 15x more |
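Percentile latencies like these can be measured with a small nearest-rank helper over collected samples (an illustrative utility, not part of the library):

```typescript
// Nearest-rank percentile: p in [0, 100] over a list of latency samples (ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```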
Summary
bun_nltk’s performance advantages come from:
- Native compilation: Zig compiles to optimized machine code
- SIMD acceleration: 16-byte parallel processing on x86_64
- Zero-copy FFI: Direct memory access, no serialization
- Efficient algorithms: Single-pass processing, cache-friendly data structures
- Memory efficiency: Arena allocators, memory pooling, compact layouts
For detailed API usage, see the API Reference. For architecture details, see Architecture.