Overview

All notable changes to bun_nltk are documented here. The format is based on Keep a Changelog, and this project follows Semantic Versioning.

[Unreleased]

Added

  • (no entries yet)

[0.9.0] - 2026-02-28

Added

Parsing
  • Earley recognizer/parser APIs for CFG grammars (earleyRecognize, earleyParse, parseTextWithEarley)
  • Dependency parser APIs (dependencyParse, dependencyParseText) for lightweight arc generation
Classification
  • Sparse text feature vectorizer (TextFeatureVectorizer) and sparse-batch flattener (flattenSparseBatch)
  • Decision tree text classifier APIs (DecisionTreeTextClassifier)
  • Linear text classifier APIs (LogisticTextClassifier, LinearSvmTextClassifier)
  • Native Zig sparse linear scoring hot loop (bunnltk_linear_scores_sparse_ids) with Bun binding (linearScoresSparseIdsNative)
Corpora
  • Corpus registry manifest loader/downloader with SHA256 validation (loadCorpusRegistryManifest, downloadCorpusRegistry)
  • Imported corpus subset fixture pipeline from NLTK Brown/Treebank (corpus_subsets_fixture.json) and parity check (bench:parity:corpus-imported)
Benchmarks & Testing
  • Python-vs-native sparse linear scorer benchmark (bench:compare:linear, python_linear_scores_baseline.py)
  • Earley parser parity/benchmark harnesses (bench:parity:earley, bench:compare:earley) with Python NLTK baseline
  • Decision tree parity/benchmark harnesses (bench:parity:decision-tree, bench:compare:decision-tree) with Python NLTK baseline
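
The native sparse scorer's hot loop is, conceptually, a per-class dot product restricted to the active feature ids. A minimal TypeScript sketch of that idea (names and shapes here are illustrative, not the actual `linearScoresSparseIdsNative` signature):

```typescript
// Illustrative sparse linear scoring: for each class, accumulate
// weight * value over only the active (non-zero) feature ids.
type SparseDoc = Array<[id: number, value: number]>;

function scoreSparse(
  classWeights: number[][], // dense per-class weight rows
  bias: number[],           // per-class bias terms
  doc: SparseDoc,
): number[] {
  const scores = bias.slice();
  for (const [id, value] of doc) {
    for (let c = 0; c < classWeights.length; c++) {
      scores[c] += classWeights[c][id] * value;
    }
  }
  return scores;
}

// Two classes over four features; the document activates ids 0 and 3.
const exampleWeights = [
  [1.0, 0.0, 0.0, 2.0],
  [0.5, 0.0, 0.0, -1.0],
];
const exampleDoc: SparseDoc = [[0, 1], [3, 2]];
console.log(scoreSparse(exampleWeights, [0, 0], exampleDoc)); // [5, -1.5]
```

Skipping zero features is what makes the native hot loop cheap on high-dimensional text vectors, where each document touches only a tiny slice of the vocabulary.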

Changed

  • CI/release Python dependencies now include numpy to keep MaxEnt parity baselines stable
  • Linear model training now uses native sparse scoring in batch SGD loops (with JS fallback toggle useNativeScoring)
  • Benchmark gates/trend checks now read shared config (bench/trend-config.json) and include linear scorer thresholds
  • Dashboard artifacts now include linear, decision-tree, and earley benchmark tracks plus new parity checks

[0.8.0] - 2026-02-27

Added

Native Hot Loops
  • Native Zig LM ID-evaluation hot loop (bunnltk_lm_eval_ids) with Bun bindings (evaluateLanguageModelIdsNative) and WASM equivalent
  • Native Zig chunk IOB hot loop (bunnltk_chunk_iob_ids) with Bun bindings (chunkIobIdsNative) and WASM equivalent
Parsing
  • CFG parser and chart parser subset APIs (parseCfgGrammar, chartParse, parseTextWithCfg) with Python parity tests and benchmarks
Classification
  • Naive Bayes text classifier APIs (NaiveBayesTextClassifier) with train/predict/evaluate/serialize support and Python parity tests/benchmarks
WordNet
  • Packed WordNet corpus pipeline script (wordnet:pack) and packed bundle loader (loadWordNetPacked)
  • Official WordNet deterministic pack workflow (wordnet:pack:official) with SHA256 manifest and verification script (wordnet:verify:pack)
Testing
  • Global Python parity suite (bench:parity:all) covering tokenizer, punkt, lm, chunk, wordnet, parser, classifier, and tagger
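
The Naive Bayes train/predict flow reduces to counting tokens per class during training and scoring with Laplace-smoothed log probabilities at prediction time. An illustrative toy, not the actual `NaiveBayesTextClassifier` API:

```typescript
// Toy multinomial Naive Bayes over pre-tokenized documents.
type Label = string;

function trainNb(docs: Array<{ tokens: string[]; label: Label }>) {
  const classCounts = new Map<Label, number>();
  const tokenCounts = new Map<Label, Map<string, number>>();
  const vocab = new Set<string>();
  for (const { tokens, label } of docs) {
    classCounts.set(label, (classCounts.get(label) ?? 0) + 1);
    const tc = tokenCounts.get(label) ?? new Map<string, number>();
    tokenCounts.set(label, tc);
    for (const t of tokens) {
      tc.set(t, (tc.get(t) ?? 0) + 1);
      vocab.add(t);
    }
  }
  return { classCounts, tokenCounts, vocab, total: docs.length };
}

function predictNb(model: ReturnType<typeof trainNb>, tokens: string[]): Label {
  let best: Label = "";
  let bestScore = -Infinity;
  for (const [label, count] of model.classCounts) {
    const tc = model.tokenCounts.get(label)!;
    let tokenTotal = 0;
    for (const n of tc.values()) tokenTotal += n;
    // Log prior plus Laplace-smoothed log likelihoods.
    let score = Math.log(count / model.total);
    for (const t of tokens) {
      score += Math.log(((tc.get(t) ?? 0) + 1) / (tokenTotal + model.vocab.size));
    }
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
}

const nbModel = trainNb([
  { tokens: ["great", "fun"], label: "pos" },
  { tokens: ["boring", "slow"], label: "neg" },
]);
console.log(predictNb(nbModel, ["great"])); // "pos"
```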

Changed

  • Browser WASM benchmark expanded for Punkt, LM, chunk, and WordNet workloads plus per-workload browser thresholds
  • Cross-feature SLA gate (sla:gate) is now part of bench:gate
  • CI now runs global parity suite and uploads official WordNet packed artifacts for validation

[0.7.0] - 2026-02-27

Added

Sentence Tokenization
  • Trainable Punkt tokenizer APIs with model serialization/parsing support
  • Zig native Punkt sentence splitting exports + WASM Punkt sentence splitting exports
WordNet
  • Mini WordNet dataset and lookup API (synsets, morphy, relation traversal)
  • Zig native WordNet morphy exports + WASM WordNet morphy exports
  • Extended WordNet bundle (models/wordnet_extended.json) and loader (loadWordNetExtended)
Language Models
  • N-gram language model stack with MLE, Lidstone, and interpolated Kneser-Ney
Chunking
  • Regexp chunk parser primitives with IOB conversion helper
Corpora
  • Corpus reader framework with bundled mini corpora (news, science, fiction)
  • Optional external corpus bundle loader (loadCorpusBundleFromIndex)
  • Tagged/chunked corpus format parsers (parseConllTagged, parseBrownTagged, parseConllChunked)
Testing
  • Python baseline harnesses for Punkt, LM, and chunk parser parity checks
  • Benchmark compare scripts for Punkt, LM, chunk parser, and WordNet
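
The simplest member of the n-gram LM stack, MLE, estimates P(w2 | w1) as count(w1 w2) / count(w1). A toy bigram sketch of that estimator (not the library's LM API):

```typescript
// Toy MLE bigram model: count unigrams and bigrams in one pass,
// then return a conditional-probability lookup.
function bigramMle(tokens: string[]) {
  const unigram = new Map<string, number>();
  const bigram = new Map<string, number>();
  for (let i = 0; i < tokens.length; i++) {
    unigram.set(tokens[i], (unigram.get(tokens[i]) ?? 0) + 1);
    if (i + 1 < tokens.length) {
      const key = tokens[i] + " " + tokens[i + 1];
      bigram.set(key, (bigram.get(key) ?? 0) + 1);
    }
  }
  return (w1: string, w2: string): number =>
    (bigram.get(w1 + " " + w2) ?? 0) / (unigram.get(w1) ?? 1);
}

const p = bigramMle(["the", "cat", "sat", "on", "the", "mat"]);
console.log(p("the", "cat")); // 0.5
```

Lidstone and interpolated Kneser-Ney refine exactly this estimate: Lidstone adds a constant to every count, and Kneser-Ney discounts observed bigrams while backing off to continuation counts.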

[0.6.2] - 2026-02-27

Added

Frequency Distributions
  • Native streaming FreqDist/ConditionalFreqDist builder APIs with JSON export
  • Python comparison benchmark for streaming distributions (bench:compare:freqdist)
Performance
  • SIMD/scalar comparison benchmark for tokenizer and normalization fast paths (bench:compare:simd)
  • Shared Zig perceptron inference core reused by native and WASM runtimes
Testing & Quality
  • NLTK coverage-slice fixture suite and parity report generator (parity:report)
  • Browser WASM benchmark harness and WASM size budget check scripts

Changed

Optimizations
  • countTokensAscii now uses an x86_64 SIMD fast path with scalar fallback
  • countNormalizedTokensAscii(..., false) now uses a direct token-count/offset fast path
  • posTagPerceptronAscii now uses native Zig inference by default (JS path retained via useNative: false)
Build & CI
  • CI now uploads parity and browser-WASM benchmark artifacts and enforces WASM size budget
  • WASM build uses ReleaseSmall + stripped output for browser-focused footprint

[0.6.1] - 2026-02-27

Added

Prebuilt Binaries
  • npm package now ships prebuilt native binaries for:
    • linux-x64 (native/prebuilt/linux-x64/bun_nltk.so)
    • win32-x64 (native/prebuilt/win32-x64/bun_nltk.dll)
  • Added cross-target prebuilt build script: bun run build:prebuilt
  • Added package payload verification script: bun run pack:verify:prebuilt
  • Added a post-publish smoke-test workflow matrix (Linux + Windows) that validates npm package install and runtime behavior without any build step

Changed

Native Runtime
  • Native runtime now resolves packaged prebuilt binary by platform/arch first
  • Native runtime no longer falls back implicitly to local build outputs
  • Release and CI workflows now build/verify prebuilt binaries as part of pipeline
  • npm package file allowlist now includes only required prebuilt binaries and wasm file
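
The platform/arch-first resolution described above can be sketched as follows. The helper name is hypothetical, but the prebuilt paths match the 0.6.1 package layout, and returning null (rather than probing local build outputs) mirrors the removed implicit fallback:

```typescript
// Hypothetical sketch of prebuilt-binary resolution by platform/arch.
function prebuiltPath(platform: string, arch: string): string | null {
  // Only the targets shipped in the npm package (per 0.6.1).
  const ext: Record<string, string> = { linux: "so", win32: "dll" };
  const supported = new Set(["linux-x64", "win32-x64"]);
  const key = `${platform}-${arch}`;
  // No implicit fallback to local build outputs.
  if (!supported.has(key)) return null;
  return `native/prebuilt/${key}/bun_nltk.${ext[platform]}`;
}

console.log(prebuiltPath("linux", "x64"));    // "native/prebuilt/linux-x64/bun_nltk.so"
console.log(prebuiltPath("darwin", "arm64")); // null
```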

[0.6.0] - 2026-02-27

Added

Release Infrastructure
  • First stable npm release line for bun_nltk
  • Automated tag-based CI + release workflow with provenance publishing and benchmark dashboard artifacts

Changed

  • Package metadata aligned with npm provenance validation (repository, homepage, bugs)

[0.5.1-beta.2] - 2026-02-27

Changed

  • Added npm provenance-required package metadata (repository, homepage, bugs) to enable GitHub Actions publish with --provenance

[0.5.1-beta.1] - 2026-02-27

Added

Release Automation
  • Tag-based npm publish workflow with prerelease channel mapping (alpha, beta, rc, next)
  • Release metadata validator script (release:validate) that checks semver, tag/version match, and changelog section presence
  • Manual workflow_dispatch trigger for CI workflow

Changed

  • CI and Release workflows now use a reliable Zig setup action
  • Publishing and versioning docs updated with automated release flow details

[0.5.0] - 2026-02-27

Added

POS Tagging
  • Trained averaged perceptron POS tagger with generated model artifact
  • JS and WASM perceptron inference paths with batch prediction support
  • Perceptron parity and benchmark harnesses against Python baseline
Sentence Tokenization
  • Sentence tokenizer improvements: abbreviation learning and orthographic heuristics
Benchmarks
  • Benchmark dashboard generator (JSON + Markdown artifacts) with throughput and memory metrics
  • CI artifact upload for benchmark dashboard
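
Averaged-perceptron inference reduces to summing per-tag weights over the active features and taking the argmax. A toy sketch of that scoring step (feature names and weights are invented for illustration):

```typescript
// Toy averaged-perceptron inference: score = sum of weights for the
// active features, per candidate tag; return the highest-scoring tag.
function perceptronTag(
  tagWeights: Map<string, Map<string, number>>, // feature -> tag -> weight
  features: string[],
  tags: string[],
): string {
  let best = tags[0];
  let bestScore = -Infinity;
  for (const tag of tags) {
    let score = 0;
    for (const f of features) {
      score += tagWeights.get(f)?.get(tag) ?? 0;
    }
    if (score > bestScore) { bestScore = score; best = tag; }
  }
  return best;
}

const tagWeights = new Map([
  ["suffix=ing", new Map([["VBG", 2.5], ["NN", 0.3]])],
  ["prev=is", new Map([["VBG", 1.1]])],
]);
console.log(perceptronTag(tagWeights, ["suffix=ing", "prev=is"], ["NN", "VBG"])); // "VBG"
```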

Changed

  • bench:compare:tagger now benchmarks the trained perceptron path
  • Package metadata now includes semantic version, publish fields, and release check script

[0.4.0] - 2026-02-27

Added

Tokenization
  • Sentence tokenizer subset and parity fixtures
Normalization
  • Normalization pipeline (ASCII fast path + Unicode fallback) with optional stopword removal
POS Tagging
  • Rule-based POS tagger baseline and parity tests
WASM
  • Browser-focused WASM wrapper with pooled memory blocks
Performance
  • Performance gate script and CI workflow integration
  • Benchmark results table in README
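
The normalization pipeline's ASCII fast path / Unicode fallback split can be sketched like this (the function name and the stopword list are illustrative):

```typescript
// Illustrative normalizer: cheap lowercase on the ASCII fast path,
// full locale-aware case folding otherwise, optional stopword removal.
function normalizeTokens(tokens: string[], removeStopwords = false): string[] {
  const stopwords = new Set(["the", "a", "an"]); // toy stopword list
  const out: string[] = [];
  for (const t of tokens) {
    const isAscii = /^[\x00-\x7F]*$/.test(t);
    const norm = isAscii ? t.toLowerCase() : t.toLocaleLowerCase();
    if (removeStopwords && stopwords.has(norm)) continue;
    out.push(norm);
  }
  return out;
}

console.log(normalizeTokens(["The", "Cat"], true)); // ["cat"]
```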

[0.3.0] - 2026-02-27

Added

N-grams
  • Native everygrams/skipgrams APIs
  • Batch ASCII metrics API (tokens, uniqueTokens, ngrams, uniqueNgrams)
Testing
  • Fixture-driven parity tests for tokenizers, collocations, and Porter stemming
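
everygrams — every n-gram for each n in a [minN, maxN] range — amounts to two nested loops. An illustrative sketch, not the native API's signature:

```typescript
// Toy everygrams: emit all n-grams for each order n in [minN, maxN].
function everygrams(tokens: string[], minN: number, maxN: number): string[][] {
  const out: string[][] = [];
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      out.push(tokens.slice(i, i + n));
    }
  }
  return out;
}

// 3 unigrams + 2 bigrams = 5 grams total.
console.log(everygrams(["a", "b", "c"], 1, 2).length); // 5
```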

[0.2.0] - 2026-02-27

Added

Collocations
  • Windowed collocation scoring with PMI
  • Collision-free token-id frequency distribution APIs
Stemming
  • Native Porter stemmer
Tokenization
  • Tokenizer subset APIs
WASM
  • WASM build and comparison benchmarks
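
PMI scores a bigram by how much more often the pair co-occurs than chance would predict: log2(P(w1, w2) / (P(w1) * P(w2))). A toy sketch over raw counts:

```typescript
// Pointwise mutual information from raw counts: pairCount occurrences
// of the pair, c1/c2 occurrences of each word, n total observations.
function pmi(pairCount: number, c1: number, c2: number, n: number): number {
  return Math.log2((pairCount / n) / ((c1 / n) * (c2 / n)));
}

// The pair co-occurs 4x more often than independence predicts.
console.log(pmi(4, 4, 4, 16)); // 2
```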

[0.1.0] - 2026-02-27

Added

Core Primitives
  • Zig native token and n-gram counting primitives
  • Unique token/ngram counting and hashed frequency distributions
  • Native token/ngram materialization APIs
Benchmarks
  • Python comparison benchmarks and synthetic dataset generation
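
Unique n-gram counting, one of the core primitives, amounts to hashing each gram into a set. A scalar TypeScript sketch of what the Zig primitive does natively (joining with a sentinel byte so token boundaries cannot collide):

```typescript
// Count distinct n-grams by inserting each gram, joined on a sentinel
// character that cannot appear inside a token, into a set.
function countUniqueNgrams(tokens: string[], n: number): number {
  const seen = new Set<string>();
  for (let i = 0; i + n <= tokens.length; i++) {
    seen.add(tokens.slice(i, i + n).join("\u0001"));
  }
  return seen.size;
}

// Bigrams: "a b", "b a", "a b", "b c" -> 3 unique.
console.log(countUniqueNgrams(["a", "b", "a", "b", "c"], 2)); // 3
```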

Next Steps

  • Versioning Policy: learn about semantic versioning and the release process
  • Migration Guide: migrate from Python NLTK to bun_nltk