Overview

bun_nltk provides efficient N-gram language models for estimating word probabilities and evaluating text. The implementation supports multiple smoothing techniques and leverages native code for optimal performance.

Model Types

Maximum Likelihood Estimation (MLE)

Basic probability estimation using raw frequency counts:
import { trainNgramLanguageModel } from "bun_nltk";

const sentences = [
  ["the", "cat", "sat"],
  ["the", "dog", "ran"],
  ["the", "cat", "ran"]
];

const lm = trainNgramLanguageModel(sentences, {
  order: 2,
  model: "mle"
});

const prob = lm.score("cat", ["the"]);
console.log(prob); // P(cat | the)
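To see what this number should be, the same estimate can be computed by hand. The sketch below is plain TypeScript with no bun_nltk dependency; it reproduces the MLE formula count(context, word) / count(context) on the training data above, ignoring pad symbols, which do not affect this particular probability:

```typescript
// Hand-computed MLE check (a sketch of the math, not bun_nltk's internals).
// "the cat" occurs 2x and "the" appears as a bigram context 3x, so
// P(cat | the) = 2/3.
const train = [
  ["the", "cat", "sat"],
  ["the", "dog", "ran"],
  ["the", "cat", "ran"],
];

function mleBigramProb(sents: string[][], word: string, context: string): number {
  let contextCount = 0;
  let bigramCount = 0;
  for (const sent of sents) {
    for (let i = 0; i < sent.length - 1; i++) {
      if (sent[i] === context) {
        contextCount++;
        if (sent[i + 1] === word) bigramCount++;
      }
    }
  }
  return contextCount === 0 ? 0 : bigramCount / contextCount;
}

console.log(mleBigramProb(train, "cat", "the")); // 2/3 ≈ 0.667
```

Note that MLE assigns probability 0 to any bigram absent from training data, which is why the smoothed models below exist.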

Lidstone Smoothing

Adds a constant gamma to all counts to handle unseen N-grams:
const lm = trainNgramLanguageModel(sentences, {
  order: 3,
  model: "lidstone",
  gamma: 0.1  // Default: 0.1
});

const prob = lm.score("unknown", ["the"]);
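The Lidstone estimate follows the formula P(w | ctx) = (c(ctx, w) + γ) / (c(ctx) + γ·V), where V is the vocabulary size. The sketch below illustrates the formula itself, not bun_nltk's internals; the library may additionally count pad and unknown-word symbols in V, so its exact values can differ:

```typescript
// Lidstone-smoothed bigram estimate (formula sketch only).
// Unseen pairs receive a small non-zero probability mass.
function lidstoneBigramProb(
  sents: string[][],
  word: string,
  context: string,
  gamma: number,
): number {
  const vocab = new Set<string>();
  let contextCount = 0;
  let bigramCount = 0;
  for (const sent of sents) {
    for (let i = 0; i < sent.length; i++) {
      vocab.add(sent[i]);
      if (i < sent.length - 1 && sent[i] === context) {
        contextCount++;
        if (sent[i + 1] === word) bigramCount++;
      }
    }
  }
  return (bigramCount + gamma) / (contextCount + gamma * vocab.size);
}

const train = [
  ["the", "cat", "sat"],
  ["the", "dog", "ran"],
  ["the", "cat", "ran"],
];
// "unknown" never follows "the": MLE gives 0, but Lidstone gives
// (0 + 0.1) / (3 + 0.1 * 5) ≈ 0.0286 with this 5-word vocabulary.
console.log(lidstoneBigramProb(train, "unknown", "the", 0.1));
```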

Kneser-Ney Interpolation

Sophisticated smoothing that considers continuation probabilities:
const lm = trainNgramLanguageModel(sentences, {
  order: 3,
  model: "kneser_ney_interpolated",
  discount: 0.75  // Default: 0.75
});
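The key idea behind Kneser-Ney is that the lower-order fallback weights a word by how many distinct contexts it follows, rather than by raw frequency. The sketch below computes this continuation probability in isolation to illustrate the concept; it is not bun_nltk's implementation, which also applies discounting and interpolation across orders:

```typescript
// Continuation probability sketch: P_cont(w) = (# distinct bigram types
// ending in w) / (total # distinct bigram types). A frequent word that
// follows only a few contexts gets little continuation mass.
function continuationProb(sents: string[][], word: string): number {
  const bigramTypes = new Set<string>();
  for (const sent of sents) {
    for (let i = 0; i < sent.length - 1; i++) {
      bigramTypes.add(`${sent[i]}\u0000${sent[i + 1]}`);
    }
  }
  let endingInWord = 0;
  for (const bg of bigramTypes) {
    if (bg.split("\u0000")[1] === word) endingInWord++;
  }
  return endingInWord / bigramTypes.size;
}

const train = [
  ["the", "cat", "sat"],
  ["the", "dog", "ran"],
  ["the", "cat", "ran"],
];
// Distinct bigram types: the-cat, cat-sat, the-dog, dog-ran, cat-ran (5).
// "ran" ends two of them, so P_cont(ran) = 2/5.
console.log(continuationProb(train, "ran")); // 0.4
```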

Training Options

export type NgramLanguageModelOptions = {
  order: number;           // N-gram size (1=unigram, 2=bigram, 3=trigram)
  model?: LanguageModelType; // "mle" | "lidstone" | "kneser_ney_interpolated"
  gamma?: number;          // Lidstone smoothing parameter (default: 0.1)
  discount?: number;       // Kneser-Ney discount (default: 0.75)
  padLeft?: boolean;       // Add start tokens (default: true)
  padRight?: boolean;      // Add end tokens (default: true)
  startToken?: string;     // Start symbol (default: "<s>")
  endToken?: string;       // End symbol (default: "</s>")
};
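When padding is enabled, each training sentence is wrapped in start and end symbols so that n-grams at sentence boundaries have full contexts. The sketch below assumes the common NLTK-style convention of n−1 pad symbols on each side; bun_nltk's exact scheme may differ:

```typescript
// Illustrative padding helper (assumes n-1 pad symbols per side; this is
// an assumption about the convention, not bun_nltk's verified behavior).
function padSequence(
  tokens: string[],
  order: number,
  startToken = "<s>",
  endToken = "</s>",
): string[] {
  const pad = order - 1; // order 1 (unigram) needs no padding
  return [
    ...Array(pad).fill(startToken),
    ...tokens,
    ...Array(pad).fill(endToken),
  ];
}

console.log(padSequence(["the", "cat", "sat"], 3));
// ["<s>", "<s>", "the", "cat", "sat", "</s>", "</s>"]
```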

Scoring and Evaluation

Probability Scoring

const lm = trainNgramLanguageModel(sentences, {
  order: 2,
  model: "kneser_ney_interpolated"
});

// Get probability
const prob = lm.score("cat", ["the"]);

// Get log probability (base 2)
const logProb = lm.logScore("cat", ["the"]);

Perplexity Calculation

Measure how well the model predicts a sequence:
const testTokens = ["the", "cat", "sat"];
const perplexity = lm.perplexity(testTokens);
console.log(`Perplexity: ${perplexity.toFixed(2)}`);
// Lower perplexity = better model
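Perplexity follows directly from the per-token log probabilities: PP = 2^(−(1/N)·Σ log₂ P(wᵢ | contextᵢ)). The sketch below computes it from a list of probabilities; the values are made-up inputs, not model output:

```typescript
// Perplexity from per-token probabilities (the standard definition).
function perplexityFromProbs(probs: number[]): number {
  const avgLog2 = probs.reduce((s, p) => s + Math.log2(p), 0) / probs.length;
  return 2 ** -avgLog2;
}

// A model that assigns 0.5 to every token has perplexity 2: it is, on
// average, as uncertain as a fair coin flip at each step.
console.log(perplexityFromProbs([0.5, 0.5, 0.5])); // 2
```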

Batch Evaluation

Evaluate multiple probes efficiently using native code:
const probes = [
  { word: "cat", context: ["the"] },
  { word: "dog", context: ["the"] },
  { word: "ran", context: ["cat"] }
];

const testSequence = ["the", "cat", "sat"];

const results = lm.evaluateBatch(probes, testSequence);
console.log("Scores:", results.scores);
console.log("Perplexity:", results.perplexity);
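Conceptually, batch evaluation is equivalent to scoring each probe in a loop plus one perplexity pass; the native path simply does this in a single call. The sketch below shows that equivalence with a pluggable score function (illustrative only; `evaluateBatchSketch` and the uniform toy model are not part of bun_nltk):

```typescript
// What batch evaluation computes, as a plain loop over a score function.
type Probe = { word: string; context: string[] };

function evaluateBatchSketch(
  score: (word: string, context: string[]) => number,
  probes: Probe[],
  testSequence: string[],
): { scores: number[]; perplexity: number } {
  const scores = probes.map((p) => score(p.word, p.context));
  // Bigram-style perplexity over the test sequence; the first token is
  // scored against an empty context.
  let logSum = 0;
  for (let i = 0; i < testSequence.length; i++) {
    const ctx = i === 0 ? [] : [testSequence[i - 1]];
    logSum += Math.log2(score(testSequence[i], ctx));
  }
  return { scores, perplexity: 2 ** (-logSum / testSequence.length) };
}

// Toy score function: uniform probability 0.25 regardless of input.
const uniform = () => 0.25;
const out = evaluateBatchSketch(
  uniform,
  [{ word: "cat", context: ["the"] }],
  ["the", "cat", "sat"],
);
console.log(out.scores, out.perplexity); // [0.25], 4
```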

Advanced Usage

Custom Padding

const lm = trainNgramLanguageModel(sentences, {
  order: 3,
  padLeft: true,
  padRight: true,
  startToken: "<BOS>",
  endToken: "<EOS>"
});

Accessing Model Properties

console.log("Model order:", lm.order);
console.log("Model type:", lm.model);
console.log("Vocabulary:", lm.vocabulary);

Working with Large Corpora

import { loadBundledMiniCorpus } from "bun_nltk";

const corpus = loadBundledMiniCorpus();
// Lowercase each sentence and split on whitespace to get token arrays
const sentences = corpus.sents().map(sent =>
  sent.toLowerCase().split(/\s+/)
);

const lm = trainNgramLanguageModel(sentences, {
  order: 3,
  model: "kneser_ney_interpolated"
});

Performance Optimization

The implementation uses native code for:
  • N-gram counting and indexing (up to trigrams)
  • Batch probability evaluation
  • Perplexity calculation
For orders ≤ 3, models automatically use optimized native paths:
const lm = trainNgramLanguageModel(sentences, {
  order: 3,  // Will use native optimization
  model: "kneser_ney_interpolated"
});

// evaluateBatch automatically uses native code when available
const results = lm.evaluateBatch(probes, testSequence);

Example: Language Model Comparison

import { trainNgramLanguageModel } from "bun_nltk";

const trainSentences = [
  ["the", "cat", "sat", "on", "the", "mat"],
  ["the", "dog", "sat", "on", "the", "floor"],
  ["a", "cat", "ran", "fast"]
];

const testSentence = ["the", "cat", "ran"];

const models = ["mle", "lidstone", "kneser_ney_interpolated"] as const;

for (const modelType of models) {
  const lm = trainNgramLanguageModel(trainSentences, {
    order: 2,
    model: modelType
  });
  
  const ppl = lm.perplexity(testSentence);
  console.log(`${modelType}: perplexity = ${ppl.toFixed(2)}`);
}

API Reference

trainNgramLanguageModel(sentences, options)

Creates and trains an N-gram language model. Parameters:
  • sentences: string[][] - Array of tokenized sentences
  • options: NgramLanguageModelOptions - Model configuration
Returns: NgramLanguageModel

NgramLanguageModel Methods

  • score(word, context?) - Get probability P(word | context)
  • logScore(word, context?) - Get log₂ probability
  • perplexity(tokens) - Calculate perplexity on a sequence
  • evaluateBatch(probes, perplexityTokens) - Batch evaluation with native optimization