Sentence tokenization divides text into individual sentences. bun_nltk provides two approaches: a fast heuristic tokenizer and the machine learning-based Punkt tokenizer.

Quick Start

import { sentenceTokenizeSubset } from "bun_nltk";

const text = "Dr. Smith went to the store. He bought milk. It was cold!";
const sentences = sentenceTokenizeSubset(text);
// [
//   "Dr. Smith went to the store.",
//   "He bought milk.",
//   "It was cold!"
// ]

Heuristic Tokenizer

Fast rule-based sentence splitter with abbreviation detection.

Basic Usage

import { sentenceTokenizeSubset } from "bun_nltk";

const text = `Mr. Johnson visited the U.S. last week. 
He met with Prof. Williams. It was productive!`;

const sentences = sentenceTokenizeSubset(text);
Signature:
type SentenceTokenizerOptions = {
  abbreviations?: Iterable<string>;      // Custom abbreviations
  learnAbbreviations?: boolean;          // Auto-detect abbreviations (default: true)
  orthographicHeuristics?: boolean;      // Use capitalization cues (default: true)
};

function sentenceTokenizeSubset(
  text: string, 
  options?: SentenceTokenizerOptions
): string[]

1. Default Behavior

import { sentenceTokenizeSubset } from "bun_nltk";

const text = "Dr. Smith works at NASA. He studies Mars.";
const sentences = sentenceTokenizeSubset(text);
console.log(sentences);
// [
//   "Dr. Smith works at NASA.",
//   "He studies Mars."
// ]
Built-in Abbreviations: mr, mrs, ms, dr, prof, sr, jr, st, vs, etc, e.g, i.e, u.s, u.k, a.m, p.m
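These entries are stored lowercase and without a trailing period, which suggests tokens are normalized before lookup (an assumption about bun_nltk's internals, not documented behavior). A minimal sketch of such normalization:

```typescript
// Hypothetical normalization before abbreviation lookup: lowercase the token
// and strip one trailing period, so "Dr." and "E.g." match "dr" and "e.g".
function normalizeToken(tok: string): string {
  return tok.toLowerCase().replace(/\.$/, "");
}
```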

2. Add Custom Abbreviations

const text = "The CEO of Inc. met with the CTO. They discussed APIs.";

const sentences = sentenceTokenizeSubset(text, {
  abbreviations: ["inc", "ceo", "cto", "api"]
});
console.log(sentences);
// [
//   "The CEO of Inc. met with the CTO.",
//   "They discussed APIs."
// ]

3. Disable Abbreviation Learning

const text = "This is a test. Dr. Smith agrees.";

// Don't auto-detect abbreviations
const sentences = sentenceTokenizeSubset(text, {
  learnAbbreviations: false
});
With learnAbbreviations: true (default), the tokenizer:
  • Scans the text for candidate abbreviations before splitting
  • Checks whether words ending in . are followed by a lowercase word, which suggests the period does not end a sentence
  • Adds frequently occurring candidates to the abbreviation list
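The learning pass above can be sketched roughly as follows. This is an illustrative simplification, not bun_nltk's actual implementation; the candidate rule and threshold are assumptions:

```typescript
// Illustrative sketch of abbreviation learning: a short period-ending token
// that is repeatedly followed by a lowercase word probably does not end a
// sentence, so it is recorded as an abbreviation.
function learnAbbreviationsSketch(text: string, minCount = 2): Set<string> {
  const counts = new Map<string, number>();
  const tokens = text.split(/\s+/).filter(Boolean);
  for (let i = 0; i < tokens.length - 1; i++) {
    const tok = tokens[i];
    const next = tokens[i + 1];
    // Candidate: 1-4 letters ending in "." followed by a lowercase word.
    if (/^[A-Za-z]{1,4}\.$/.test(tok) && /^[a-z]/.test(next)) {
      const key = tok.slice(0, -1).toLowerCase();
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  // Keep only candidates seen often enough to be trusted.
  return new Set(
    Array.from(counts.entries())
      .filter(([, n]) => n >= minCount)
      .map(([word]) => word)
  );
}

const sample = "Dr. smith met Dr. jones. They talked. Mr. lee left early.";
const learned = learnAbbreviationsSketch(sample);
// "dr" appears twice before a lowercase word; "mr" only once.
```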

4. Disable Orthographic Heuristics

const text = "The price is 19.99. the next item costs more.";

const sentences = sentenceTokenizeSubset(text, {
  orthographicHeuristics: false
});
With orthographicHeuristics: true (default), the tokenizer:
  • Uses capitalization to detect sentence starts
  • Checks for common sentence-starting words
  • Gives more accurate results on well-formatted text
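The capitalization cue can be sketched as a single check (a simplification for illustration; bun_nltk's real heuristic is more involved):

```typescript
// Illustrative sketch: after a period, a capitalized next word suggests a new
// sentence; a lowercase next word suggests the period is part of a number,
// abbreviation, or other mid-sentence token.
function capitalCue(nextWord: string, useOrthography: boolean): boolean {
  if (!useOrthography) return true; // heuristic off: treat every "." as a boundary
  return /^[A-Z]/.test(nextWord);
}

// In "The price is 19.99. the next item costs more.", the lowercase "the"
// after "19.99." means the cue reports no boundary there.
```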

Punkt Tokenizer

Machine learning-based tokenizer with trainable models.

Using the Default Model

import { sentenceTokenizePunkt } from "bun_nltk";

const text = "Dr. Smith went to the U.S.A. He met Prof. Williams.";
const sentences = sentenceTokenizePunkt(text);
// [
//   "Dr. Smith went to the U.S.A.",
//   "He met Prof. Williams."
// ]
Signature:
function sentenceTokenizePunkt(
  text: string, 
  model?: PunktModelSerialized
): string[]
When no model is provided, uses the default model with common English abbreviations.

Training a Custom Model

1. Prepare Training Data

import { trainPunktModel } from "bun_nltk";

const trainingText = `
  Dr. Johnson visited the U.S. last year. He enjoyed it.
  Prof. Williams teaches at MIT. She specializes in AI.
  The conference was in Sep. 2024. It was successful.
  The CEO of Tech Inc. announced new products. Sales increased.
`;

2. Train the Model

import { trainPunktModel } from "bun_nltk";

const model = trainPunktModel(trainingText, {
  minAbbrevCount: 2,           // Min occurrences to detect abbreviation
  minCollocationCount: 2,      // Min occurrences for word pairs
  minSentenceStarterCount: 2   // Min occurrences for sentence starters
});

console.log(model.abbreviations);
// ["dr", "prof", "sep", "inc", ...]
Training Options:
type PunktTrainingOptions = {
  minAbbrevCount?: number;          // Default: 2
  minCollocationCount?: number;     // Default: 2  
  minSentenceStarterCount?: number; // Default: 2
};

3. Use the Trained Model

import { sentenceTokenizePunkt } from "bun_nltk";

const newText = "Dr. Johnson arrived. Prof. Williams was waiting.";
const sentences = sentenceTokenizePunkt(newText, model);

4. Save and Load Models

import { 
  serializePunktModel, 
  parsePunktModel,
  sentenceTokenizePunkt 
} from "bun_nltk";

// Save model to JSON
const modelJson = serializePunktModel(model);
await Bun.write("punkt_model.json", modelJson);

// Load model from JSON
const loadedJson = await Bun.file("punkt_model.json").text();
const loadedModel = parsePunktModel(loadedJson);

// Use loaded model
const sentences = sentenceTokenizePunkt(text, loadedModel);

Model Structure

type PunktModelSerialized = {
  version: number;                    // Model format version
  abbreviations: string[];            // Learned abbreviations
  collocations: Array<[string, string]>; // Word pairs (abbrev, following word)
  sentenceStarters: string[];         // Common sentence-starting words
};
Example Model:
{
  "version": 1,
  "abbreviations": ["dr", "prof", "inc", "ltd"],
  "collocations": [["dr", "smith"], ["prof", "jones"]],
  "sentenceStarters": ["he", "she", "they", "the"]
}
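To make the three fields concrete, here is a sketch of how a Punkt-style decision might consult them at a period (illustrative only; bun_nltk's actual algorithm also uses the statistics gathered during training):

```typescript
// Illustrative sketch: decide whether a period after `word` ends a sentence,
// using the three model fields shown above.
type ModelSketch = {
  abbreviations: string[];
  collocations: Array<[string, string]>;
  sentenceStarters: string[];
};

function splitsAfter(word: string, next: string, model: ModelSketch): boolean {
  const w = word.replace(/\.$/, "").toLowerCase();
  const n = next.replace(/\.$/, "").toLowerCase();
  // A learned collocation ("dr" + "smith") keeps the pair in one sentence.
  if (model.collocations.some(([a, b]) => a === w && b === n)) return false;
  // A known abbreviation usually does not end a sentence...
  if (model.abbreviations.includes(w)) {
    // ...unless the next word is a very common sentence starter.
    return model.sentenceStarters.includes(n);
  }
  return true;
}

const model: ModelSketch = {
  abbreviations: ["dr", "prof", "inc", "ltd"],
  collocations: [["dr", "smith"], ["prof", "jones"]],
  sentenceStarters: ["he", "she", "they", "the"],
};
```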

Native Implementation

For maximum performance with the default Punkt model:
import { sentenceTokenizePunktAsciiNative } from "bun_nltk";

const text = "First sentence. Second sentence! Third?";
const sentences = sentenceTokenizePunktAsciiNative(text);
// Uses optimized native implementation
Use sentenceTokenizePunktAsciiNative for 10-50x faster processing when you don't need a custom model.

Advanced Use Cases

Domain-Specific Abbreviations

import { sentenceTokenizeSubset } from "bun_nltk";

// Medical text
const medicalText = "The pt. was diagnosed with HTN. The Dr. prescribed meds.";
const medicalSentences = sentenceTokenizeSubset(medicalText, {
  abbreviations: ["pt", "htn", "dr", "meds"]
});

// Legal text
const legalText = "The def. appealed per Art. 52. The ct. denied it.";
const legalSentences = sentenceTokenizeSubset(legalText, {
  abbreviations: ["def", "art", "ct", "vs", "etc"]
});

Training on Domain Corpus

import { trainPunktModel, sentenceTokenizePunkt } from "bun_nltk";

// Train on scientific papers
const scientificCorpus = await Bun.file("papers.txt").text();
const scientificModel = trainPunktModel(scientificCorpus, {
  minAbbrevCount: 3  // Higher threshold for more confidence
});

// Use for similar texts
const newPaper = "The exp. was conducted by Prof. Lee et al. Results were significant.";
const sentences = sentenceTokenizePunkt(newPaper, scientificModel);

Batch Processing

import { sentenceTokenizePunktAsciiNative } from "bun_nltk";

const documents = [
  "Document one. With multiple sentences.",
  "Document two. Also has sentences.",
  // ... many more
];

const allSentences = documents.flatMap(
  doc => sentenceTokenizePunktAsciiNative(doc)
);

Edge Cases Handled

  • Decimal numbers: "The price is 19.99. The next item costs more."
  • Initials: "J.R.R. Tolkien wrote novels."
  • Ellipsis: "Wait... what happened?"
  • Quotations: "She said 'Hello.' Then left."
  • Multiple punctuation: "Really?! Yes!!!"
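Two of these cases come down to a local check around the period. A sketch of how such guards might look (a simplification, not the library's actual rules):

```typescript
// Illustrative guards: a period between digits (a decimal) or directly after
// a lone capital letter (an initial) should not split the sentence.
function isDecimalPeriod(text: string, i: number): boolean {
  return /\d/.test(text[i - 1] ?? "") && /\d/.test(text[i + 1] ?? "");
}

function isInitialPeriod(text: string, i: number): boolean {
  // Matches "J." in "J.R.R." — a single capital letter before the period.
  return /(?:^|[^A-Za-z])[A-Z]$/.test(text.slice(0, i));
}
```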

Performance Tips

  1. Use native version for default Punkt model (fastest)
  2. Reuse trained models instead of retraining
  3. Provide abbreviations for specialized domains
  4. Disable learning if abbreviations are known
The Punkt tokenizer requires well-formed sentences in the training data. Poor training data will result in inaccurate models.
