
Overview

All benchmarks compare bun_nltk’s Zig native implementation against Python NLTK on identical datasets and workloads.

Core Operations (64MB Dataset)

Benchmarks run against bench/datasets/synthetic.txt, a 64MB synthetic text corpus.
| Workload | Zig/Bun median (s) | Python (s) | Faster side | Speedup | Percent faster |
|---|---|---|---|---|---|
| Token + unique + ngram + unique ngram (bench:compare) | 2.767 | 10.071 | Zig native | 3.64x | 263.93% |
| Top-K PMI collocations (bench:compare:collocations) | 2.090 | 23.945 | Zig native | 11.46x | 1045.90% |
| Porter stemming (bench:compare:porter) | 11.942 | 120.101 | Zig native | 10.06x | 905.70% |
| WASM token/ngram path (bench:compare:wasm) | 4.150 | 13.241 | Zig WASM | 3.19x | 219.06% |
| Native vs Python in wasm suite (bench:compare:wasm) | 1.719 | 13.241 | Zig native | 7.70x | 670.48% |
| Sentence tokenizer subset (bench:compare:sentence) | 1.680 | 16.580 | Zig/Bun subset | 9.87x | 886.70% |
| Perceptron POS tagger (bench:compare:tagger) | 19.880 | 82.849 | Zig native | 4.17x | 316.75% |
| Streaming FreqDist + ConditionalFreqDist (bench:compare:freqdist) | 3.206 | 20.971 | Zig native | 6.54x | 554.17% |

Key Takeaways

Collocation Detection - 11.46x speedup
  • PMI-based bigram collocation scoring shows the largest performance gain
  • Windowed bigram statistics computed in native Zig with minimal allocations
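The PMI scoring behind this workload can be sketched in plain TypeScript. This is an illustrative standalone version, not the bun_nltk API; function names and the adjacent-bigram simplification (no window) are assumptions for clarity:

```typescript
// Pointwise Mutual Information for a bigram (x, y):
//   PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
function pmi(bigramCount: number, countX: number, countY: number, totalTokens: number): number {
  const pXY = bigramCount / totalTokens;
  const pX = countX / totalTokens;
  const pY = countY / totalTokens;
  return Math.log2(pXY / (pX * pY));
}

// Score adjacent bigrams and return the k highest-PMI collocations.
function topKCollocations(tokens: string[], k: number): [string, number][] {
  const unigrams = new Map<string, number>();
  const bigrams = new Map<string, number>();
  for (let i = 0; i < tokens.length; i++) {
    unigrams.set(tokens[i], (unigrams.get(tokens[i]) ?? 0) + 1);
    if (i + 1 < tokens.length) {
      const key = tokens[i] + " " + tokens[i + 1];
      bigrams.set(key, (bigrams.get(key) ?? 0) + 1);
    }
  }
  const scored: [string, number][] = [];
  for (const [key, n] of bigrams) {
    const [x, y] = key.split(" ");
    scored.push([key, pmi(n, unigrams.get(x)!, unigrams.get(y)!, tokens.length)]);
  }
  return scored.sort((a, b) => b[1] - a[1]).slice(0, k);
}
```

The native implementation does the same counting with windowed bigram statistics and preallocated buffers, which is where the speedup comes from.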
Stemming - 10.06x speedup
  • Porter stemmer implementation benefits from ASCII fast paths
  • Native string manipulation avoids Python interpreter overhead
Sentence Tokenization - 9.87x speedup
  • Punkt-compatible subset with abbreviation learning
  • Native implementation with orthographic heuristics
Frequency Distributions - 6.54x speedup
  • Streaming FreqDist and ConditionalFreqDist builders
  • Native hash tables with collision-free token ID mapping
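Conceptually, the streaming builders accumulate counts token by token. A minimal sketch with plain Maps (the class names here are hypothetical; the native version uses hash tables with token ID mapping instead):

```typescript
// Streaming frequency distribution: counts are updated one token at a time.
class FreqDist {
  private counts = new Map<string, number>();
  private total = 0;
  add(token: string): void {
    this.counts.set(token, (this.counts.get(token) ?? 0) + 1);
    this.total += 1;
  }
  count(token: string): number { return this.counts.get(token) ?? 0; }
  freq(token: string): number { return this.total === 0 ? 0 : this.count(token) / this.total; }
  get N(): number { return this.total; }
}

// Conditional variant: one FreqDist per condition (e.g. per POS tag).
class ConditionalFreqDist {
  private conds = new Map<string, FreqDist>();
  add(condition: string, token: string): void {
    let fd = this.conds.get(condition);
    if (!fd) { fd = new FreqDist(); this.conds.set(condition, fd); }
    fd.add(token);
  }
  get(condition: string): FreqDist | undefined { return this.conds.get(condition); }
}
```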
POS Tagging - 4.17x speedup
  • Averaged perceptron tagger with native inference
  • Batch prediction with feature vector precomputation
Core Token Operations - 3.64x speedup
  • Combined token counting, unique tokens, n-grams, and unique n-grams
  • SIMD fast path for ASCII token counting (x86_64)
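The combined core workload can be sketched as a single pass over a token array. This is an illustrative sketch (the function name and shape are assumptions, not the library's API):

```typescript
// Combined core stats: token count, unique tokens, n-grams, unique n-grams.
function coreStats(tokens: string[], n: number) {
  const unique = new Set(tokens);
  const ngrams: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    ngrams.push(tokens.slice(i, i + n).join(" "));
  }
  return {
    tokenCount: tokens.length,
    uniqueTokens: unique.size,
    ngramCount: ngrams.length,
    uniqueNgrams: new Set(ngrams).size,
  };
}
```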

Extended Workloads (8MB Dataset)

Specialized benchmarks for more complex operations, run against an 8MB gate dataset.
| Workload | Zig/Bun median (s) | Python (s) | Faster side | Speedup | Percent faster |
|---|---|---|---|---|---|
| Punkt tokenizer default path (bench:compare:punkt) | 0.0848 | 1.3463 | Zig native | 15.87x | 1487.19% |
| N-gram LM (Kneser-Ney) score+perplexity (bench:compare:lm) | 0.1324 | 2.8661 | Zig/Bun | 21.64x | 2064.19% |
| Regexp chunk parser (bench:compare:chunk) | 0.0024 | 1.5511 | Zig/Bun | 643.08x | 64208.28% |
| WordNet lookup + morphy workload (bench:compare:wordnet) | 0.0009 | 0.0835 | Zig/Bun | 91.55x | 9054.67% |
| CFG chart parser subset (bench:compare:parser) | 0.0088 | 0.3292 | Zig/Bun | 37.51x | 3651.05% |
| Naive Bayes text classifier (bench:compare:classifier) | 0.0081 | 0.0112 | Zig/Bun | 1.38x | 38.40% |
| PCFG Viterbi chart parser (bench:compare:pcfg) | 0.0191 | 0.4153 | Zig/Bun | 21.80x | 2080.00% |
| MaxEnt text classifier (bench:compare:maxent) | 0.0244 | 0.1824 | Zig/Bun | 7.46x | 646.00% |
| Sparse linear logits hot loop (bench:compare:linear) | 0.0024 | 2.0001 | Zig native | 840.54x | 83954.04% |
| Decision tree text classifier (bench:compare:decision-tree) | 0.0725 | 0.5720 | Zig/Bun | 7.89x | 688.55% |
| Earley parser workload (bench:compare:earley) | 0.1149 | 4.6483 | Zig/Bun | 40.47x | 3947.07% |

Key Takeaways

Sparse Linear Scoring - 840.54x speedup
  • Native Zig hot loop for sparse matrix operations
  • Critical for training linear models (Logistic, SVM)
  • Minimal allocations with pre-flattened sparse batches
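The hot loop itself is just a sparse dot product per class. A hypothetical sketch of the shape of that loop (the signature and the dense weight layout are assumptions for illustration):

```typescript
// Logits for one sample stored as sparse (index, value) pairs, against a
// dense [numClasses * numFeatures] weight matrix.
function sparseLogits(
  indices: Int32Array,   // feature indices with nonzero values
  values: Float64Array,  // matching feature values
  weights: Float64Array, // row-major [numClasses * numFeatures]
  numClasses: number,
  numFeatures: number,
): Float64Array {
  const logits = new Float64Array(numClasses);
  for (let c = 0; c < numClasses; c++) {
    const base = c * numFeatures;
    let acc = 0;
    for (let k = 0; k < indices.length; k++) {
      acc += weights[base + indices[k]] * values[k];
    }
    logits[c] = acc;
  }
  return logits;
}
```

Because only nonzero features are visited, work scales with the number of active features rather than the vocabulary size; the native version additionally avoids per-sample allocation by pre-flattening batches.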
Chunk Parser - 643.08x speedup
  • Regexp-based IOB chunk tagging
  • Native compiled grammar matching
  • Native/WASM chunk IOB hot loop
WordNet Operations - 91.55x speedup
  • Synset lookups with packed binary format
  • Native morphy inflection recovery
  • Relation traversal (hypernyms, hyponyms, antonyms)
Earley Parser - 40.47x speedup
  • Recognition and parsing for arbitrary CFG grammars
  • Non-CNF grammar support
  • Chart-based parsing with native data structures
CFG Chart Parser - 37.51x speedup
  • Bottom-up chart parsing
  • Native production rule matching
  • Parse tree reconstruction
PCFG Viterbi Parser - 21.80x speedup
  • Probabilistic context-free grammar
  • Viterbi algorithm for best parse
  • Native probability computations
Language Models - 21.64x speedup
  • Kneser-Ney interpolated smoothing
  • Native ID-based evaluation hot loop
  • Batch scoring and perplexity computation
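The perplexity half of this workload reduces to a simple formula over per-token log probabilities: PP = 2^(-(1/N) Σ log2 p_i). The smoothing method (Kneser-Ney here) determines the p_i; this sketch only shows the evaluation step:

```typescript
// Perplexity from base-2 log probabilities of each token in the test text.
function perplexity(log2Probs: number[]): number {
  const avg = log2Probs.reduce((a, b) => a + b, 0) / log2Probs.length;
  return Math.pow(2, -avg);
}
```

For example, a model that assigns every token probability 0.5 has perplexity 2.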
Punkt Tokenizer - 15.87x speedup
  • Full trainable Punkt model
  • Native sentence splitting fast path
  • Abbreviation and collocation handling
Decision Tree Classifier - 7.89x speedup
  • Text classification with decision trees
  • Native tree traversal and splitting
  • N-gram feature extraction
MaxEnt Classifier - 7.46x speedup
  • Maximum entropy text classification
  • Iterative parameter estimation
  • Native sparse feature scoring
Naive Bayes Classifier - 1.38x speedup
  • Probabilistic text classification
  • Laplace smoothing
  • Modest speedup due to simpler algorithm
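The Laplace-smoothed scoring that dominates the Naive Bayes workload is cheap enough that interpreter overhead matters less, which is why the gap is small. A minimal sketch of the scoring step (names and signatures are illustrative, not the library's classifier API):

```typescript
// Add-one (Laplace) smoothed log P(token | class).
function laplaceLogProb(tokenCount: number, classTotal: number, vocabSize: number): number {
  return Math.log((tokenCount + 1) / (classTotal + vocabSize));
}

// log P(class) + sum of per-token smoothed log likelihoods.
function scoreClass(
  docTokens: string[],
  classCounts: Map<string, number>,
  classTotal: number,
  vocabSize: number,
  logPrior: number,
): number {
  let score = logPrior;
  for (const t of docTokens) {
    score += laplaceLogProb(classCounts.get(t) ?? 0, classTotal, vocabSize);
  }
  return score;
}
```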

SIMD Fast Path Comparison

Comparison of SIMD-accelerated paths vs scalar baseline:
bun run bench:compare:simd
Results:
  • countTokensAscii: 1.22x speedup (SIMD vs scalar)
  • Normalization (no stopwords): 2.73x speedup (fast path vs standard)
SIMD Optimization:
  • x86_64 vectorized token counting
  • Scalar fallback for other architectures
  • Automatic runtime detection
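The scalar baseline that the SIMD path accelerates is a whitespace-transition count. A sketch of that scalar logic (illustrative; the actual native routine operates on raw bytes in Zig):

```typescript
// Count ASCII tokens by counting transitions from whitespace to non-whitespace.
// The SIMD path performs the same byte classification 16+ bytes at a time.
function countTokensAscii(bytes: Uint8Array): number {
  let count = 0;
  let inToken = false;
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    const isSpace = b === 0x20 || (b >= 0x09 && b <= 0x0d); // space, \t, \n, \v, \f, \r
    if (!isSpace && !inToken) count++;
    inToken = !isSpace;
  }
  return count;
}
```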

Running Benchmarks

Single Workload

# Core operations
bun run bench:compare

# Collocations
bun run bench:compare:collocations

# Porter stemmer
bun run bench:compare:porter

# Sentence tokenizer
bun run bench:compare:sentence

# POS tagger
bun run bench:compare:tagger

# Streaming FreqDist
bun run bench:compare:freqdist

Extended Workloads

# Language models
bun run bench:compare:lm

# Parsers
bun run bench:compare:parser
bun run bench:compare:earley

# Classifiers
bun run bench:compare:classifier
bun run bench:compare:decision-tree

# WordNet
bun run bench:compare:wordnet

# Chunk parser
bun run bench:compare:chunk

# Sparse linear scorer
bun run bench:compare:linear

Performance Notes

Sentence Tokenizer: This is a Punkt-compatible subset, not full Punkt parity on arbitrary corpora. The full Punkt tokenizer with trainable models shows 15.87x speedup in bench:compare:punkt.
WordNet: Full WordNet corpus is not bundled by default. A mini WordNet dataset is included. Full corpus can be packed from upstream with bun run wordnet:pack:official.
SIMD: Token counting uses x86_64 SIMD fast path with scalar fallback. Run bench:compare:simd to measure SIMD impact on your hardware.

Next Steps

WASM Performance

Compare WASM vs native vs Python performance

Benchmark Overview

Learn about benchmark methodology
