Overview
bun_nltk provides efficient N-gram language models for estimating word probabilities and evaluating text. The implementation supports multiple smoothing techniques and leverages native code for performance.

Model Types
Maximum Likelihood Estimation (MLE)
Basic probability estimation using raw frequency counts.

Lidstone Smoothing
Adds a constant gamma to all counts to handle unseen N-grams.

Kneser-Ney Interpolation
Sophisticated smoothing that considers continuation probabilities.

Training Options
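As an illustration only, a training configuration might look like the sketch below. The field names (`order`, `smoothing`, `gamma`) are assumptions inferred from the model types described above, not confirmed `NgramLanguageModelOptions` fields.

```typescript
// Hypothetical shape of the options object; the real
// NgramLanguageModelOptions fields may differ.
interface NgramLanguageModelOptions {
  order: number;                                  // N-gram order, e.g. 2 for bigrams
  smoothing?: "mle" | "lidstone" | "kneser-ney";  // smoothing technique
  gamma?: number;                                 // Lidstone additive constant
}

const options: NgramLanguageModelOptions = {
  order: 3,
  smoothing: "lidstone",
  gamma: 0.1,
};

console.log(options.order); // 3
```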
Scoring and Evaluation
Probability Scoring
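A minimal self-contained sketch of what `score()` and `logScore()` compute for an MLE bigram model: P(word | context) as count(context, word) / count(context), and its base-2 logarithm. This illustrates the math, not bun_nltk's implementation.

```typescript
// MLE bigram probability: how often `word` follows `context`,
// divided by how often `context` starts a bigram.
function mleScore(sentences: string[][], word: string, context: string): number {
  let contextCount = 0;
  let pairCount = 0;
  for (const sent of sentences) {
    for (let i = 0; i < sent.length - 1; i++) {
      if (sent[i] === context) {
        contextCount++;
        if (sent[i + 1] === word) pairCount++;
      }
    }
  }
  return contextCount === 0 ? 0 : pairCount / contextCount;
}

const corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]];
const p = mleScore(corpus, "cat", "the");
console.log(p);            // 0.5  ("cat" follows "the" in 1 of 2 cases)
console.log(Math.log2(p)); // -1   (the logScore analogue)
```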
Perplexity Calculation
Measure how well the model predicts a sequence.

Batch Evaluation
Evaluate multiple probes efficiently using native code.

Advanced Usage
Custom Padding
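The sketch below shows the standard padding scheme N-gram models use: add order − 1 start and end symbols so boundary words get full contexts. The symbols `<s>` and `</s>` follow NLTK's convention; bun_nltk's defaults and parameter names may differ.

```typescript
// Pad a tokenized sentence on both ends so every real token can
// appear with a full (order - 1)-token context.
function padSentence(
  tokens: string[],
  order: number,
  left = "<s>",
  right = "</s>",
): string[] {
  const pad = order - 1;
  return [...Array(pad).fill(left), ...tokens, ...Array(pad).fill(right)];
}

console.log(padSentence(["hello", "world"], 3));
// ["<s>", "<s>", "hello", "world", "</s>", "</s>"]
```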
Accessing Model Properties
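To illustrate the kind of state a trained model exposes, the sketch below computes an order, a vocabulary, and raw N-gram counts from a corpus. The property names here are chosen for illustration and are not confirmed bun_nltk API.

```typescript
// Build the basic state an N-gram model carries: its order,
// the set of seen tokens, and counts of each N-gram.
function summarize(sentences: string[][], order: number) {
  const vocabulary = new Set<string>();
  const counts = new Map<string, number>();
  for (const sent of sentences) {
    for (const tok of sent) vocabulary.add(tok);
    for (let i = 0; i + order <= sent.length; i++) {
      const gram = sent.slice(i, i + order).join(" ");
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  return { order, vocabulary, counts };
}

const summary = summarize([["a", "b", "a", "b"]], 2);
console.log(summary.vocabulary.size);   // 2
console.log(summary.counts.get("a b")); // 2
```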
Working with Large Corpora
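For corpora too large to hold in memory at once, sentences can be streamed in fixed-size batches. The generator below is a sketch of that pattern; it assumes the caller trains or updates counts one batch at a time.

```typescript
// Yield items in fixed-size batches; the final batch may be smaller.
function* batches<T>(items: Iterable<T>, size: number): Generator<T[]> {
  let batch: T[] = [];
  for (const item of items) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch;
}

const sents = [["a"], ["b"], ["c"], ["d"], ["e"]];
const out = [...batches(sents, 2)];
console.log(out.length); // 3 batches of sizes 2, 2, 1
```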
Performance Optimization
The implementation uses native code for:

- N-gram counting and indexing (up to trigrams)
- Batch probability evaluation
- Perplexity calculation
Example: Language Model Comparison
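As a self-contained sketch of the comparison this section describes, the code below contrasts MLE and Lidstone bigram estimates on an unseen bigram, using the standard formulas MLE P(w|c) = C(c,w) / C(c) and Lidstone P(w|c) = (C(c,w) + gamma) / (C(c) + gamma·|V|). It illustrates the behavior, not bun_nltk's implementation.

```typescript
// Collect bigram counts, context counts, and the vocabulary.
function bigramCounts(sentences: string[][]) {
  const pair = new Map<string, number>();
  const ctx = new Map<string, number>();
  const vocab = new Set<string>();
  for (const s of sentences) {
    for (const t of s) vocab.add(t);
    for (let i = 0; i < s.length - 1; i++) {
      const key = s[i] + " " + s[i + 1];
      pair.set(key, (pair.get(key) ?? 0) + 1);
      ctx.set(s[i], (ctx.get(s[i]) ?? 0) + 1);
    }
  }
  return { pair, ctx, vocab };
}

// MLE: zero probability for any unseen bigram.
function mle(c: ReturnType<typeof bigramCounts>, w: string, ctx: string) {
  const denom = c.ctx.get(ctx) ?? 0;
  return denom === 0 ? 0 : (c.pair.get(ctx + " " + w) ?? 0) / denom;
}

// Lidstone: add gamma to every count, so unseen bigrams get mass.
function lidstone(
  c: ReturnType<typeof bigramCounts>,
  w: string,
  ctx: string,
  gamma: number,
) {
  const denom = (c.ctx.get(ctx) ?? 0) + gamma * c.vocab.size;
  return ((c.pair.get(ctx + " " + w) ?? 0) + gamma) / denom;
}

const counts = bigramCounts([["the", "cat", "sat"], ["the", "dog", "sat"]]);
// Unseen bigram "the fish": MLE assigns 0, Lidstone a small mass.
console.log(mle(counts, "fish", "the"));           // 0
console.log(lidstone(counts, "fish", "the", 0.1)); // small but > 0
```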
API Reference
trainNgramLanguageModel(sentences, options)
Creates and trains an N-gram language model.
Parameters:
- sentences: string[][] - Array of tokenized sentences
- options: NgramLanguageModelOptions - Model configuration
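The sketch below is a minimal stand-in with the documented signature, limited to MLE estimation, to show how the pieces fit together. The single `order` option and the returned `score` method mirror this reference, but the internals are illustrative only.

```typescript
// Minimal, illustrative stand-in for trainNgramLanguageModel:
// count N-grams and their (order - 1)-token contexts, then score
// by maximum likelihood. Only the `order` option is modeled here.
interface NgramLanguageModelOptions {
  order: number;
}

function trainNgramLanguageModel(
  sentences: string[][],
  options: NgramLanguageModelOptions,
) {
  const { order } = options;
  const gram = new Map<string, number>();
  const ctx = new Map<string, number>();
  for (const s of sentences) {
    for (let i = 0; i + order <= s.length; i++) {
      const g = s.slice(i, i + order).join(" ");
      const c = s.slice(i, i + order - 1).join(" ");
      gram.set(g, (gram.get(g) ?? 0) + 1);
      ctx.set(c, (ctx.get(c) ?? 0) + 1);
    }
  }
  return {
    score(word: string, context: string[] = []): number {
      const denom = ctx.get(context.join(" ")) ?? 0;
      const num = gram.get([...context, word].join(" ")) ?? 0;
      return denom === 0 ? 0 : num / denom;
    },
  };
}

const model = trainNgramLanguageModel(
  [["the", "cat", "sat"], ["the", "dog", "sat"]],
  { order: 2 },
);
console.log(model.score("cat", ["the"])); // 0.5
```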
NgramLanguageModel
NgramLanguageModel Methods
- score(word, context?) - Get probability P(word | context)
- logScore(word, context?) - Get log₂ probability
- perplexity(tokens) - Calculate perplexity on a sequence
- evaluateBatch(probes, perplexityTokens) - Batch evaluation with native optimization
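The perplexity the methods above expose follows the standard definition: perplexity(tokens) = 2^(−(1/N) · Σ log₂ P(wᵢ | context)). The sketch below computes it from a caller-supplied scoring function standing in for a trained model's `score()`.

```typescript
// Perplexity over a token sequence: the geometric-mean inverse
// probability, computed via average log2 probability of each token
// given its one-token context.
function perplexity(
  tokens: string[],
  score: (word: string, context: string) => number,
): number {
  let logSum = 0;
  for (let i = 1; i < tokens.length; i++) {
    logSum += Math.log2(score(tokens[i], tokens[i - 1]));
  }
  const n = tokens.length - 1;
  return 2 ** (-logSum / n);
}

// A uniform model over 4 words assigns P = 0.25 everywhere, so any
// sequence has perplexity exactly 4.
const uniform = () => 0.25;
console.log(perplexity(["a", "b", "c"], uniform)); // 4
```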