# trainNgramLanguageModel
Convenience function to train an n-gram language model from tokenized sentences.

**Parameters**

- `sentences` (`string[][]`) - Training corpus as an array of tokenized sentences
- `options` (`NgramLanguageModelOptions`) - Model configuration options
**Returns**

`NgramLanguageModel` instance trained on the provided data
## Basic Usage
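A minimal training call might look like the following; the import path is an assumption and should be adjusted to this package's actual entry point:

```typescript
// Import path is illustrative -- use the package's actual entry point.
import { trainNgramLanguageModel } from "./ngramLanguageModel";

const corpus: string[][] = [
  ["the", "cat", "sat"],
  ["the", "dog", "ran"],
];

// Train a bigram model on the tokenized corpus.
const model = trainNgramLanguageModel(corpus, { order: 2 });
```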
## Training Options
### Order Selection
The `order` parameter determines the n-gram size:
- Higher order = more context, but requires more training data
- Order 2-3 works well for most applications
- Order 4+ may suffer from data sparsity
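To see why higher orders need more data, it helps to look at the n-grams a sentence actually yields. This standalone sketch (not the library's internals) extracts n-grams of a given order:

```typescript
// Extract all contiguous n-grams of the given order from one sentence.
function ngrams(tokens: string[], n: number): string[][] {
  const out: string[][] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n));
  }
  return out;
}

// A 3-token sentence yields two bigrams but only one trigram, so each
// higher-order context is observed less often -- hence the sparsity.
ngrams(["the", "cat", "sat"], 2); // [["the","cat"], ["cat","sat"]]
ngrams(["the", "cat", "sat"], 3); // [["the","cat","sat"]]
```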
### Model Type Selection
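Assuming the options object selects the smoothing scheme by name (the `modelType` field and its values are hypothetical; check `NgramLanguageModelOptions` for the actual shape):

```typescript
// "modelType" is a hypothetical option name -- consult the
// NgramLanguageModelOptions type for the real field and values.
const lidstoneModel = trainNgramLanguageModel(corpus, {
  order: 2,
  modelType: "lidstone", // redistributes mass via the gamma parameter
});

const knModel = trainNgramLanguageModel(corpus, {
  order: 3,
  modelType: "kneser-ney", // uses absolute discounting
});
```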
### Smoothing Parameters
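Both smoothing schemes take a single numeric parameter. A simplified numeric sketch of what each parameter controls (these helpers are illustrative, not the library's implementation):

```typescript
// Lidstone: add gamma pseudo-counts to every vocabulary item,
// so unseen events receive a small share of the probability mass.
function lidstoneProb(count: number, total: number, vocabSize: number, gamma: number): number {
  return (count + gamma) / (total + gamma * vocabSize);
}

// Kneser-Ney style absolute discounting: subtract a fixed discount
// from every observed count (the freed mass funds a backoff term).
function discountedProb(count: number, total: number, discount: number): number {
  return Math.max(count - discount, 0) / total;
}

lidstoneProb(0, 10, 5, 1.0);  // unseen word still gets 1/15
discountedProb(3, 10, 0.75);  // (3 - 0.75) / 10 = 0.225
```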
#### Gamma (Lidstone)
Controls the amount of probability mass redistributed to unseen events.

#### Discount (Kneser-Ney)
Controls absolute discounting in Kneser-Ney smoothing.

## Padding Configuration
### Left Padding (Start Tokens)
Add start tokens to the beginning of sentences:

- Enable `padLeft` to model sentence beginnings
- Disable for mid-sentence predictions only
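Concretely, for an order-n model left padding prepends n-1 start tokens, so even the first real word has a full-length context. A standalone sketch (the `<s>` symbol is illustrative):

```typescript
// Prepend order-1 start tokens, as left padding would before counting.
function padSentenceLeft(tokens: string[], order: number, startToken = "<s>"): string[] {
  return [...Array(order - 1).fill(startToken), ...tokens];
}

padSentenceLeft(["the", "cat", "sat"], 3); // ["<s>", "<s>", "the", "cat", "sat"]
```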
### Right Padding (End Tokens)
Add end tokens to the end of sentences:

- Enable `padRight` to model sentence endings
- Disable for continuous text modeling
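Right padding appends a single end token, which lets the model learn how likely a sentence is to stop after a given context. A standalone sketch (the `</s>` symbol is illustrative):

```typescript
// Append an end token, as right padding would before counting.
function padSentenceRight(tokens: string[], endToken = "</s>"): string[] {
  return [...tokens, endToken];
}

padSentenceRight(["the", "cat", "sat"]); // ["the", "cat", "sat", "</s>"]
```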
### Custom Padding Tokens
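If the default symbols collide with tokens in your corpus, the padding tokens can presumably be overridden. The option names below are hypothetical and should be checked against `NgramLanguageModelOptions`:

```typescript
// "leftPaddingToken" / "rightPaddingToken" are hypothetical option names.
const corpus: string[][] = [["the", "cat", "sat"]];

const model = trainNgramLanguageModel(corpus, {
  order: 2,
  padLeft: true,
  padRight: true,
  leftPaddingToken: "<s>",
  rightPaddingToken: "</s>",
});
```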
### Complete Padding Example
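Putting the documented padding options together (a sketch; `order`, `padLeft`, and `padRight` come from this page, the rest is illustrative):

```typescript
const corpus: string[][] = [
  ["the", "cat", "sat"],
  ["the", "dog", "ran"],
];

// Trigram model that learns both how sentences start and how they end.
const model = trainNgramLanguageModel(corpus, {
  order: 3,
  padLeft: true,  // prepend order-1 start tokens to each sentence
  padRight: true, // append an end token to each sentence
});
```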
## Training Data Format
Input must be an array of tokenized sentences:

- Sentences must be pre-tokenized (split into words)
- Use lowercase tokens for case-insensitive models
- Remove or handle punctuation before training
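A minimal preprocessing helper matching the rules above (illustrative; adapt the punctuation handling to your needs):

```typescript
// Lowercase, strip punctuation, and split on whitespace.
function prepareSentence(sentence: string): string[] {
  return sentence
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

prepareSentence("The cat sat!"); // ["the", "cat", "sat"]
```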