sentenceTokenizeSubset
Tokenize text into sentences using heuristic rules and abbreviation detection.
The text to split into sentences
Optional configuration for sentence tokenization
Additional abbreviations to recognize (e.g., ["dr", "prof", "inc"])
Automatically learn abbreviations from the text
Use orthographic features (capitalization patterns) to improve sentence detection
Array of sentence strings
Default Abbreviations
The tokenizer recognizes common abbreviations:
- Titles: mr, mrs, ms, dr, prof, sr, jr
- General: st, vs, etc, e.g, i.e
- Geographic: u.s, u.k
- Time: a.m, p.m
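As a rough illustration of how a lookup against this list could work, the sketch below normalizes a token and checks it against the default set plus any caller-supplied extras. The names `DEFAULT_ABBREVIATIONS` and `isKnownAbbreviation` are hypothetical, and the case-insensitive matching and trailing-period stripping are assumptions rather than documented behavior of the library.

```typescript
// Hypothetical sketch: illustrates the lookup idea, not the library's implementation.
// Entries mirror the documented defaults (lowercase, no trailing period).
const DEFAULT_ABBREVIATIONS: ReadonlySet<string> = new Set([
  "mr", "mrs", "ms", "dr", "prof", "sr", "jr", // titles
  "st", "vs", "etc", "e.g", "i.e",             // general
  "u.s", "u.k",                                // geographic
  "a.m", "p.m",                                // time
]);

// Returns true when `token` (for example "Dr." or "U.S.") matches a known abbreviation.
// Assumption: matching is case-insensitive and ignores one trailing period.
function isKnownAbbreviation(token: string, extra: readonly string[] = []): boolean {
  const normalized = token.toLowerCase().replace(/\.$/, "");
  if (DEFAULT_ABBREVIATIONS.has(normalized)) {
    return true;
  }
  // Caller-supplied abbreviations are compared the same way.
  return extra.some((abbr) => abbr.toLowerCase() === normalized);
}
```

Under these assumptions, `isKnownAbbreviation("Inc.")` is false with the defaults alone and becomes true once `["inc"]` is passed as the second argument.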
Features
- Abbreviation Detection: Won’t split on periods after known abbreviations
- Number Handling: Won’t split on decimal points in numbers (e.g., “3.14”)
- Ellipsis Support: Handles “…” correctly
- Capitalization Heuristics: Uses next word’s capitalization to determine sentence boundaries
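The features above can be approximated in a few lines. The following is a minimal sketch, not the library's implementation: `splitSentencesSketch` and `ABBREVIATIONS` are hypothetical names, the ellipsis is assumed to be written as three ASCII periods, and the rule "do not split when the next word starts with a lowercase letter" is one possible reading of the capitalization heuristic.

```typescript
// Hypothetical sketch of the listed heuristics; not the library's code.
const ABBREVIATIONS: ReadonlySet<string> = new Set([
  "mr", "mrs", "ms", "dr", "prof", "sr", "jr",
  "st", "vs", "etc", "e.g", "i.e", "u.s", "u.k", "a.m", "p.m",
]);

function splitSentencesSketch(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  // Candidate boundary: one or more terminators followed by whitespace.
  // A decimal point such as "3.14" never matches because a digit follows the period.
  const boundary = /([.!?]+)(\s+)/g;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(text)) !== null) {
    const punct = match[1];
    const end = match.index + punct.length; // index just past the punctuation
    const next = text.charAt(boundary.lastIndex); // first character of the next word
    // Word immediately before the punctuation, lowercased for lookup.
    const before = text.slice(start, match.index);
    const lastWord = (before.split(/\s+/).pop() || "").toLowerCase();
    // Abbreviation detection: a single period after a known abbreviation.
    const isAbbreviation = punct === "." && ABBREVIATIONS.has(lastWord);
    // Capitalization heuristic: a lowercase letter suggests the sentence continues
    // (this also keeps "... then" together after an ellipsis).
    const nextIsLowercase =
      next !== "" && next === next.toLowerCase() && next !== next.toUpperCase();
    if (isAbbreviation || nextIsLowercase) {
      continue;
    }
    sentences.push(text.slice(start, end).trim());
    start = boundary.lastIndex;
  }
  // Whatever remains after the last boundary is the final sentence.
  const rest = text.slice(start).trim();
  if (rest) {
    sentences.push(rest);
  }
  return sentences;
}

const example = splitSentencesSketch(
  "Dr. Smith paid 3.14 dollars. He left... then returned! Done."
);
// example: ["Dr. Smith paid 3.14 dollars.", "He left... then returned!", "Done."]
```

A rule-based approach like this is cheap and predictable, but it will split after a known abbreviation only when another boundary follows, so a sentence that genuinely ends in "etc." is merged with the next one. That limitation is part of why a trained model can be preferable for unusual text.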
sentenceTokenizePunkt
Tokenize sentences using a Punkt sentence segmentation model.
The text to split into sentences
Optional trained Punkt model. If omitted, uses the default model or native implementation.
Array of sentence strings
Notes
- When model is omitted, uses the fast native Punkt implementation
- Custom models allow domain-specific abbreviation and collocation learning
- See punkt.mdx for model training details