Overview
The native library (bun_nltk) provides high-performance NLP operations through Zig-compiled binaries accessed via FFI (Foreign Function Interface). The native APIs offer significant performance improvements over WASM, especially for CPU-intensive operations.
Architecture
Native Library Stack
Prebuilt Binaries
The package ships with prebuilt binaries for common platforms:

| Platform | Architecture | Library File | Path |
|---|---|---|---|
| Linux | x64 | bun_nltk.so | native/prebuilt/linux-x64/ |
| Windows | x64 | bun_nltk.dll | native/prebuilt/win32-x64/ |
| macOS | x64/arm64 | bun_nltk.dylib | native/prebuilt/darwin-*/ |
Loading Mechanism
The library automatically loads the appropriate prebuilt binary for the current platform and architecture.
Custom Library Path
You can override the library path by setting the BUN_NLTK_NATIVE_LIB environment variable.
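The lookup described above can be sketched as a small path resolver. This is a minimal illustration, not the library's actual loader: the function name `resolveNativeLibPath`, its parameters, and the order of checks are assumptions, while the file names and directory layout come from the Prebuilt Binaries table.

```typescript
// Hypothetical helper: decides which shared library file to load.
// File names and directory layout follow the Prebuilt Binaries table;
// the function itself and its lookup order are assumptions.
const LIBRARY_FILES: Record<string, string> = {
  linux: "bun_nltk.so",
  win32: "bun_nltk.dll",
  darwin: "bun_nltk.dylib",
};

function resolveNativeLibPath(
  packageRoot: string,
  env: Record<string, string | undefined> = process.env,
  platform: string = process.platform,
  arch: string = process.arch,
): string {
  // An explicit override takes priority over any prebuilt binary.
  const override = env.BUN_NLTK_NATIVE_LIB;
  if (override) return override;

  const file = LIBRARY_FILES[platform];
  if (!file) {
    throw new Error(`No prebuilt binary for ${platform}-${arch}`);
  }
  return `${packageRoot}/native/prebuilt/${platform}-${arch}/${file}`;
}
```

Taking `env`, `platform`, and `arch` as parameters keeps the resolver easy to exercise for platforms other than the one it runs on.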
FFI Bindings
Library Loading
The native library is loaded using Bun’s dlopen from the bun:ffi module.
Type Mappings
FFI type mappings between TypeScript and Zig:

| TypeScript Type | FFI Type | Zig Type | Description |
|---|---|---|---|
| Uint8Array | ptr | [*]const u8 | Byte pointer |
| number | usize | usize | Size/length |
| number | u32 | u32 | 32-bit unsigned |
| bigint | u64 | u64 | 64-bit unsigned |
| Float32Array | ptr | [*]const f32 | Float pointer |
| Float64Array | ptr | [*]const f64 | Double pointer |
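To make the table concrete, the sketch below writes two symbol declarations in the args/returns shape that Bun's dlopen accepts for its symbol table. The symbol names `example_count_tokens` and `example_sum_f64` and their signatures are hypothetical, chosen only to exercise the mappings; the library's real exports may differ.

```typescript
// FFI type names taken from the mapping table above.
type FfiTypeName = "ptr" | "usize" | "u32" | "u64";

interface SymbolSignature {
  args: readonly FfiTypeName[];
  returns: FfiTypeName;
}

// Hypothetical exported symbols: names and signatures are illustrative
// assumptions, not the library's actual exports.
const exampleSymbols: Record<string, SymbolSignature> = {
  // Zig side (assumed): fn example_count_tokens(input: [*]const u8, len: usize) u64
  // JS side: pass a Uint8Array and a number, receive a bigint.
  example_count_tokens: { args: ["ptr", "usize"], returns: "u64" },

  // Zig side (assumed): fn example_sum_f64(values: [*]const f64, len: usize) u32
  // JS side: pass a Float64Array and a number, receive a number.
  example_sum_f64: { args: ["ptr", "usize"], returns: "u32" },
};
```

Note that a typed array always maps to ptr regardless of element type, so the element width is part of the contract between caller and native function rather than something the FFI layer checks.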
Memory Management
Memory is managed through typed arrays, which are passed to native functions as pointers.
Error Handling
The native library uses error codes for failure reporting.
Error Code Convention
- 0: Success (no error)
- 1+: Error occurred (specific codes defined in Zig implementation)
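The typed-array memory model and the error code convention can be sketched together. The example below uses a pure TypeScript stand-in (`fakeNativeTokenOffsets`) in place of a real native call so that it runs without the shared library; its contract, including the use of code 1 for "output buffer too small", is an assumption made for illustration.

```typescript
// Stand-in for a native call so the sketch runs without the shared library.
// Hypothetical contract: writes the byte offset of each space-separated
// token into `outOffsets`, stores the token count in `outCount[0]`, and
// returns 0 on success or 1 when `outOffsets` is too small.
function fakeNativeTokenOffsets(
  input: Uint8Array,
  outOffsets: Uint32Array,
  outCount: Uint32Array,
): number {
  let count = 0;
  let inToken = false;
  for (let i = 0; i < input.length; i++) {
    const isSpace = input[i] === 0x20;
    if (!isSpace && !inToken) {
      if (count >= outOffsets.length) return 1; // output buffer too small
      outOffsets[count++] = i;
    }
    inToken = !isSpace;
  }
  outCount[0] = count;
  return 0;
}

// Caller side: allocate typed arrays, call, then check the status code.
function tokenOffsets(text: string, capacity: number): number[] {
  const input = new TextEncoder().encode(text);
  const outOffsets = new Uint32Array(capacity);
  const outCount = new Uint32Array(1);

  const status = fakeNativeTokenOffsets(input, outOffsets, outCount);
  if (status !== 0) {
    throw new Error(`native call failed with error code ${status}`);
  }
  // Copy only the filled prefix out of the preallocated buffer.
  return Array.from(outOffsets.subarray(0, outCount[0]));
}
```

Because all buffers are owned by JavaScript in this pattern, they are reclaimed by the garbage collector and the native side never needs a separate free call.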
API Categories
The native library provides functions in these categories:
1. Tokenization
- countTokensAscii() - Count tokens
- tokenizeAsciiNative() - Extract tokens
- normalizeTokensAsciiNative() - Normalize and filter
- tokenFreqDistIdsAscii() - Token frequency distribution
2. Sentence Segmentation
- sentenceTokenizePunktAsciiNative() - Punkt sentence tokenizer
3. N-grams & Metrics
- countNgramsAscii() - Count n-grams
- ngramsAsciiNative() - Extract n-grams
- everygramsAsciiNative() - Extract variable-length n-grams
- skipgramsAsciiNative() - Extract skip-grams
- computeAsciiMetrics() - Comprehensive metrics
4. Collocations
- topPmiBigramsAscii() - Top PMI bigrams
- bigramWindowStatsAscii() - Windowed bigram statistics
5. Part-of-Speech Tagging
- posTagAsciiNative() - POS tagging
- perceptronPredictBatchNative() - Perceptron model inference
6. Stemming & Lemmatization
- porterStemAscii() - Porter stemmer
- wordnetMorphyAsciiNative() - WordNet lemmatizer
7. Machine Learning
- naiveBayesLogScoresIdsNative() - Naive Bayes scoring
- linearScoresSparseIdsNative() - Linear model scoring
- evaluateLanguageModelIdsNative() - Language model evaluation
8. Parsing & Chunking
- chunkIobIdsNative() - IOB chunking
- cykRecognizeIdsNative() - CYK parsing
9. Streaming
- NativeFreqDistStream - Streaming frequency distribution
Performance Characteristics
Native vs WASM vs JavaScript
| Operation | Native | WASM | JavaScript |
|---|---|---|---|
| Token counting | 100% | 85% | 30% |
| Tokenization | 100% | 80% | 35% |
| N-gram extraction | 100% | 75% | 25% |
| POS tagging | 100% | 70% | 20% |
Percentages are relative to native performance (native = 100%). In this comparison native is the fastest option for every listed operation, which the library attributes to SIMD optimizations and low-overhead FFI calls.
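The table does not state what the percentages measure. Assuming they are throughput relative to native, they can be converted into an approximate running-time multiple as in the sketch below; `relativeTime` is a hypothetical helper written for this illustration.

```typescript
// Assumption: each percentage is throughput relative to native (native = 100),
// so running time scales with the inverse of the percentage.
function relativeTime(percentOfNative: number): number {
  if (percentOfNative <= 0) {
    throw new RangeError("percentage must be positive");
  }
  return 100 / percentOfNative;
}

// Values copied from the "Token counting" row of the table.
const tokenCounting = { native: 100, wasm: 85, javascript: 30 };

for (const [runtime, percent] of Object.entries(tokenCounting)) {
  console.log(`${runtime}: ${relativeTime(percent).toFixed(2)}x native running time`);
}
```

Under that reading, a runtime listed at 25% would take about four times as long as native for the same input.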
SIMD Optimizations
The native library uses SIMD instructions for:
- Token scanning: Parallel byte comparisons for whitespace/punctuation
- Normalization: Vectorized character class checks
- Stopword filtering: Batch hash lookups
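For orientation, here is a scalar sketch of the kind of per-byte character class check that token scanning performs. It is not SIMD code and not the library's implementation: it classifies one byte per iteration, whereas a vectorized version would compare many bytes per instruction. The token definition used here (a maximal run of ASCII letters or digits) is an assumption.

```typescript
// 256-entry lookup table: 1 if the byte can appear inside a token.
// Assumption: tokens are maximal runs of ASCII letters or digits.
const IS_TOKEN_BYTE = new Uint8Array(256);
for (let b = 0; b < 256; b++) {
  const isDigit = b >= 0x30 && b <= 0x39;
  const isUpper = b >= 0x41 && b <= 0x5a;
  const isLower = b >= 0x61 && b <= 0x7a;
  IS_TOKEN_BYTE[b] = isDigit || isUpper || isLower ? 1 : 0;
}

// Scalar token counter: one table lookup per byte.
function countTokensScalar(bytes: Uint8Array): number {
  let count = 0;
  let prevClass = 0;
  for (let i = 0; i < bytes.length; i++) {
    const cls = IS_TOKEN_BYTE[bytes[i]];
    // A token starts wherever the class flips from 0 to 1.
    if (cls === 1 && prevClass === 0) count++;
    prevClass = cls;
  }
  return count;
}
```

A call such as `countTokensScalar(new TextEncoder().encode("Hello, world! 42"))` counts three tokens under this definition.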
Platform Support
Supported Platforms
Linux x64 - Full support with prebuilt binaries
Windows x64 - Full support with prebuilt binaries
macOS x64/ARM64 - Build from source (prebuilts planned)
Unsupported Platforms
For platforms without prebuilt binaries, you can:
- Use WASM fallback: Import from WasmNltk
- Build from source: Use bun run build:zig
- Set custom path: Use the BUN_NLTK_NATIVE_LIB environment variable
Utility Functions
nativeLibraryPath()
Returns the path to the loaded native library.
Building from Source
To build the native library, run bun run build:zig. The compiled library is placed under native/prebuilt/<platform>-<arch>/.
See Also
- Performance APIs - SIMD optimizations and scalar fallbacks
- WASM Runtime - WebAssembly alternative