Punkt Model

trainPunktModel

Train a Punkt sentence segmentation model on text data.

text

string

required

Training text to learn abbreviations, collocations, and sentence starters

options

PunktTrainingOptions

Optional training configuration

options.minAbbrevCount

number

default: 2

Minimum occurrences for a token to be considered an abbreviation

options.minCollocationCount

number

default: 2

Minimum occurrences for a token pair to be considered a collocation

options.minSentenceStarterCount

number

default: 2

Minimum occurrences for a word to be considered a sentence starter

model

PunktModelSerialized

Trained Punkt model containing learned patterns

import { trainPunktModel } from 'bun_nltk';

const trainingCorpus = `
  Dr. Smith works at the hospital. He specializes in cardiology.
  Prof. Johnson teaches at the university. She published many papers.
  The N.A.S.A. mission was successful. U.S. officials celebrated.
`;

const model = trainPunktModel(trainingCorpus);
console.log(model);
// Illustrative output (exact contents depend on the corpus and thresholds):
// {
//   version: 1,
//   abbreviations: ["dr", "n.a.s.a", "prof", "u.s"],
//   collocations: [["dr", "smith"], ["prof", "johnson"]],
//   sentenceStarters: ["he", "she", "the", "u.s"]
// }

// Customize training thresholds
const strictModel = trainPunktModel(trainingCorpus, {
  minAbbrevCount: 3,
  minCollocationCount: 3,
  minSentenceStarterCount: 3
});

What Gets Learned

Abbreviations: Tokens followed by a period that typically appear before lowercase words
Collocations: Abbreviation-word pairs that commonly occur together (e.g., “Dr. Smith”)
Sentence Starters: Words that frequently begin sentences in the training data
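
To make the role of the count thresholds concrete, here is a minimal sketch of threshold-based abbreviation counting. It is a simplified heuristic written for illustration, not bun_nltk's training code: the helper name countAbbrevCandidates, the whitespace tokenization, and the "next token starts lowercase" rule are all assumptions.

```typescript
// Illustrative only: a simplified heuristic, NOT bun_nltk's actual training logic.
// countAbbrevCandidates is a hypothetical helper name.
function countAbbrevCandidates(text: string, minCount = 2): string[] {
  const counts = new Map<string, number>();
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);

  for (let i = 0; i < tokens.length - 1; i++) {
    const token = tokens[i];
    const next = tokens[i + 1];
    // Candidate: token ends with a period and the next token starts lowercase.
    if (token.length > 1 && token.endsWith(".") && /^[a-z]/.test(next)) {
      const key = token.slice(0, -1).toLowerCase();
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }

  // Keep candidates seen at least minCount times, sorted for stable output.
  return Array.from(counts.entries())
    .filter(([, n]) => n >= minCount)
    .map(([key]) => key)
    .sort();
}

const sample =
  "We met approx. ten people. It took approx. two hours. See fig. three.";
console.log(countAbbrevCandidates(sample));    // ["approx"]
console.log(countAbbrevCandidates(sample, 1)); // ["approx", "fig"]
```

In this sketch, raising minCount drops "fig" because it occurs only once, which is the same trade-off the minAbbrevCount option controls: higher thresholds mean fewer but more reliable entries.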

parsePunktModel

Parse a Punkt model from a JSON string or object.

payload

string | PunktModelSerialized

required

JSON string or model object to parse

model

PunktModelSerialized

Parsed Punkt model

import { parsePunktModel } from 'bun_nltk';

const json = `{
  "version": 1,
  "abbreviations": ["dr", "mr", "ms"],
  "collocations": [["dr", "smith"]],
  "sentenceStarters": ["he", "she", "the"]
}`;

const model = parsePunktModel(json);

// Also accepts object
const model2 = parsePunktModel({
  version: 1,
  abbreviations: ["inc", "corp"],
  collocations: [],
  sentenceStarters: []
});

serializePunktModel

Serialize a Punkt model to a JSON string for storage.

model

PunktModelSerialized

required

Punkt model to serialize

json

string

JSON string representation of the model

import { trainPunktModel, serializePunktModel, parsePunktModel } from 'bun_nltk';

const model = trainPunktModel("Dr. Smith arrived. He was late.");
const json = serializePunktModel(model);

// Save to file
await Bun.write("punkt-model.json", json);

// Load later
const loaded = parsePunktModel(await Bun.file("punkt-model.json").text());

defaultPunktModel

Get the default Punkt model with common English abbreviations.

model

PunktModelSerialized

Default Punkt model

import { defaultPunktModel } from 'bun_nltk';

const model = defaultPunktModel();
console.log(model.abbreviations);
// ["a.m", "dr", "e.g", "etc", "i.e", "jr", "mr", "mrs", "ms", "p.m", "prof", "sr", "st", "u.k", "u.s", "vs"]

console.log(model.collocations);
// [] (empty by default)

console.log(model.sentenceStarters);
// [] (empty by default)

Default Abbreviations

The default model includes:

Titles: mr, mrs, ms, dr, prof, sr, jr
Common: st, vs, etc, e.g, i.e
Geographic: u.s, u.k
Time: a.m, p.m
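
The default list is short, so domain text may need extra entries. The sketch below shows one way to extend a model's abbreviation list, assuming models are plain objects of the PunktModelSerialized shape and that abbreviations are stored lowercase without the trailing period, as the examples on this page suggest. The helper withExtraAbbreviations is hypothetical and not part of bun_nltk; the base object is a hand-written stand-in for the result of defaultPunktModel().

```typescript
type PunktModelSerialized = {
  version: number;
  abbreviations: string[];
  collocations: Array<[string, string]>;
  sentenceStarters: string[];
};

// Hypothetical helper (not a bun_nltk export): returns a copy of the model
// with extra abbreviations merged in, normalized and de-duplicated.
function withExtraAbbreviations(
  model: PunktModelSerialized,
  extra: string[]
): PunktModelSerialized {
  const normalized = extra
    .map((a) => a.trim().toLowerCase().replace(/\.$/, "")) // "Inc." -> "inc"
    .filter((a) => a.length > 0);
  const merged = Array.from(
    new Set([...model.abbreviations, ...normalized])
  ).sort();
  return { ...model, abbreviations: merged };
}

// Stand-in for defaultPunktModel(), trimmed to a few entries.
const base: PunktModelSerialized = {
  version: 1,
  abbreviations: ["dr", "mr", "u.s"],
  collocations: [],
  sentenceStarters: [],
};

const extended = withExtraAbbreviations(base, ["Inc.", "Corp.", "dr"]);
console.log(extended.abbreviations); // ["corp", "dr", "inc", "mr", "u.s"]
```

The helper returns a new object rather than mutating its input, so the original model stays usable as a baseline.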

PunktModelSerialized Type

type PunktModelSerialized = {
  version: number;
  abbreviations: string[];
  collocations: Array<[string, string]>;
  sentenceStarters: string[];
};
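
When a model comes from an untrusted file, it can help to check its shape before use. The following is a minimal runtime type guard for the type above. The function names isStringArray and isPunktModelSerialized are hypothetical and not bun_nltk exports, and this page does not say what validation parsePunktModel itself performs.

```typescript
type PunktModelSerialized = {
  version: number;
  abbreviations: string[];
  collocations: Array<[string, string]>;
  sentenceStarters: string[];
};

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Hypothetical guard: checks that every field has the expected type
// and that each collocation is a pair of exactly two strings.
function isPunktModelSerialized(value: unknown): value is PunktModelSerialized {
  if (typeof value !== "object" || value === null) return false;
  const m = value as Record<string, unknown>;
  return (
    typeof m.version === "number" &&
    isStringArray(m.abbreviations) &&
    isStringArray(m.sentenceStarters) &&
    Array.isArray(m.collocations) &&
    m.collocations.every((pair) => isStringArray(pair) && pair.length === 2)
  );
}

const candidate: unknown = JSON.parse(
  '{"version":1,"abbreviations":["dr"],"collocations":[["dr","smith"]],"sentenceStarters":["he"]}'
);
console.log(isPunktModelSerialized(candidate)); // true
console.log(isPunktModelSerialized({ version: 1, abbreviations: "dr" })); // false
```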

Usage Example

import {
  trainPunktModel,
  serializePunktModel,
  parsePunktModel,
  sentenceTokenizePunkt
} from 'bun_nltk';

// Train on domain-specific corpus
const medicalText = `
  Dr. Anderson specializes in oncology. Prof. Lee works in immunology.
  The N.I.H. funded the research. Results were published in Jan. 2024.
`;

const model = trainPunktModel(medicalText);

// Serialize for storage
const json = serializePunktModel(model);
await Bun.write("medical-punkt.json", json);

// Load and use later
const loadedModel = parsePunktModel(
  await Bun.file("medical-punkt.json").text()
);

const text = "Dr. Anderson reviewed the N.I.H. grant. She approved it.";
const sentences = sentenceTokenizePunkt(text, loadedModel);
console.log(sentences);
// ["Dr. Anderson reviewed the N.I.H. grant.", "She approved it."]
