How Tokenization Works
When you insert a document into Orama, the tokenizer:

- Splits text using language-specific regex patterns
- Normalizes tokens by converting to lowercase
- Removes diacritics (e.g., “café” becomes “cafe”)
- Applies stopwords filtering (optional)
- Applies stemming (optional)
- Removes duplicates (by default)
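The steps above can be sketched in a few lines (an illustrative sketch only, not Orama's internals: the stopword list is a tiny stand-in, the split pattern is a generic Unicode one rather than a language-specific pattern, and the stemming step is omitted):

```typescript
// Illustrative sketch of the tokenization pipeline; not Orama's actual code.
const STOPWORDS = new Set(['the', 'a', 'an', 'and', 'of']) // stand-in list

function tokenize(text: string): string[] {
  const tokens = text
    .split(/[^\p{L}\p{N}]+/u)                 // 1. split on word boundaries
    .filter((t) => t.length > 0)
    .map((t) => t.toLowerCase())              // 2. normalize to lowercase
    .map((t) => t.normalize('NFD').replace(/[\u0300-\u036f]/g, '')) // 3. strip diacritics
    .filter((t) => !STOPWORDS.has(t))         // 4. optional stopword filtering
  // 5. optional stemming would run here
  return Array.from(new Set(tokens))          // 6. remove duplicates
}

tokenize('The café and the Café') // → ['cafe']
```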
Language-Specific Splitting
Orama uses different regex patterns for each language to correctly identify word boundaries.

Diacritics Removal
The tokenizer automatically removes diacritics from characters to improve search recall. Diacritics removal is applied during the normalization phase, after tokenization but before stemming.
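The effect is easy to reproduce with Unicode NFD normalization (a sketch of the idea, not Orama's exact implementation):

```typescript
// Strip diacritics by decomposing characters (Unicode NFD) and then
// removing the combining marks (U+0300–U+036F). Sketch of the idea only.
function removeDiacritics(token: string): string {
  return token.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

removeDiacritics('café') // → 'cafe'
removeDiacritics('über') // → 'uber'
```

Because both documents and queries pass through the same normalization, a search for "cafe" will match documents containing "café".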
Skipping Tokenization
For certain fields like IDs or exact-match fields, you may want to skip tokenization.

Duplicate Handling
By default, Orama removes duplicate tokens to save memory and improve performance.

Normalization Cache
Orama caches normalized tokens to improve performance for repeated terms.

Advanced: Custom Tokenizer
You can implement a custom tokenizer for specialized use cases.

Configuration Options
- language: The language to use for tokenization. See Languages for all supported languages.
- allowDuplicates: Whether to keep duplicate tokens in the token array.
- tokenizeSkipProperties: Properties that should not be tokenized (indexed as single values).
- stemmerSkipProperties: Properties where stemming should not be applied.
- stopWords: Custom stopwords array, function, or false to disable stopwords. See Stopwords.

Performance Considerations
Normalization Caching
The normalization cache stores processed tokens with keys like language:property:token. This significantly improves performance when the same terms appear frequently. The cache is stored in memory and grows with unique term variations. For most applications, this overhead is negligible, but consider the trade-off for extremely large datasets.
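The caching scheme can be pictured as a plain map keyed by language:property:token (a sketch; the function name and normalization logic here are illustrative, not Orama's internals):

```typescript
// Sketch of a normalization cache keyed by `language:property:token`.
// normalizeToken (a hypothetical helper) lowercases and strips diacritics;
// the cache avoids re-normalizing terms that appear repeatedly.
const normalizationCache = new Map<string, string>()

function normalizeToken(language: string, prop: string, token: string): string {
  const key = `${language}:${prop}:${token}`
  const cached = normalizationCache.get(key)
  if (cached !== undefined) return cached // repeated term: no re-processing
  const normalized = token
    .toLowerCase()
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
  normalizationCache.set(key, normalized)
  return normalized
}

normalizeToken('english', 'title', 'Café') // → 'cafe' (computed, then cached)
```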
Duplicate Removal
Keeping allowDuplicates: false (the default) reduces memory usage by 20-40% for typical text content, as many words appear multiple times in documents.
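The effect is easy to see by comparing token counts with and without deduplication (illustrative numbers, not a benchmark):

```typescript
// Compare stored-token counts with and without duplicate removal.
const words = 'to be or not to be that is the question'.split(' ')

const withDuplicates = words.length            // 10 tokens stored
const withoutDuplicates = new Set(words).size  // 8 unique tokens (20% fewer)
```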
Skip Properties
Using tokenizeSkipProperties for ID fields and exact-match fields reduces index size and improves insert performance.

Related
Stemming
Learn about stemming and how it reduces words to their root form
Stopwords
Configure stopword filtering to remove common words
Languages
Explore all 30+ supported languages
Searching
Learn how to search with tokenized text