Tokenization is the process of breaking down text into individual tokens (words or terms) that can be indexed and searched. Orama’s tokenizer handles multiple languages with specialized splitting rules and normalization.

How Tokenization Works

When you insert a document into Orama, the tokenizer:
  1. Splits text using language-specific regex patterns
  2. Normalizes tokens by converting to lowercase
  3. Removes diacritics (e.g., “café” becomes “cafe”)
  4. Applies stopwords filtering (optional)
  5. Applies stemming (optional)
  6. Removes duplicates (by default)
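The pipeline above can be sketched in plain JavaScript. This is an illustrative model, not Orama's internal code: the split pattern is the English one shown later in this page, the stopword list is a tiny stand-in, and stemming is omitted.

```js
// Illustrative sketch of the documented tokenization steps (not Orama internals).
const STOP_WORDS = new Set(['a', 'an', 'the', 'is']) // tiny stand-in list

function tokenizeSketch(input, { allowDuplicates = false } = {}) {
  const out = input
    // 1. Split using a language-specific pattern (English shown here)
    .split(/[^A-Za-zàèéìòóù0-9_'-]+/)
    .filter(Boolean)
    // 2. Lowercase
    .map((t) => t.toLowerCase())
    // 3. Strip diacritics: decompose to NFD, drop combining marks
    .map((t) => t.normalize('NFD').replace(/[\u0300-\u036f]/g, ''))
    // 4. Filter stopwords
    .filter((t) => !STOP_WORDS.has(t))
  // 5. (Stemming would run here when enabled)
  // 6. Remove duplicates by default
  return allowDuplicates ? out : [...new Set(out)]
}

console.log(tokenizeSketch('The café is the BEST café'))
// ["cafe", "best"]
```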
In practice, tokenization happens automatically on insert:

```js
import { create, insert } from '@orama/orama'

const db = await create({
  schema: {
    title: 'string',
    description: 'string'
  }
})

// Text is automatically tokenized
await insert(db, {
  title: 'Building Modern Web Applications',
  description: 'Learn how to build scalable web apps'
})

// "Building Modern Web Applications" becomes:
// ["building", "modern", "web", "applications"]
```

Language-Specific Splitting

Orama uses different regex patterns for each language to correctly identify word boundaries:
```js
// English splits on: /[^A-Za-zàèéìòóù0-9_'-]+/gim
const tokens = tokenizer.tokenize("It's a beautiful day!")
// Result: ["it's", "a", "beautiful", "day"]
```
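As a quick, self-contained check of that pattern (the `tokenizer` object above is Orama's; here the regex is applied directly):

```js
// The English split pattern, applied directly with String.prototype.split.
const englishSplit = /[^A-Za-zàèéìòóù0-9_'-]+/gim

const raw = "It's a beautiful day!".split(englishSplit).filter(Boolean)
console.log(raw) // ["It's", "a", "beautiful", "day"]

// Lowercasing happens in the normalization step, giving the final tokens:
console.log(raw.map((t) => t.toLowerCase()))
// ["it's", "a", "beautiful", "day"]
```

Note that the apostrophe and hyphen are inside the character class, so contractions like "it's" and hyphenated words survive as single tokens.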

Diacritics Removal

The tokenizer automatically removes diacritics from characters to improve search recall:
```js
// Input: "café résumé naïve"
// After normalization: "cafe resume naive"

// Users can search for "cafe" and find "café"
await search(db, {
  term: 'cafe'
}) // Matches documents with "café" or "cafe"
```
Diacritics removal is automatically applied during the normalization phase. This happens after tokenization but before stemming.
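A minimal sketch of diacritics removal using standard Unicode normalization: decompose each character to NFD form, then drop the combining-mark range. This mirrors the documented behavior; Orama's internals may differ.

```js
// Strip diacritics: NFD decomposition separates base letters from
// combining accent marks (U+0300–U+036F), which are then removed.
function removeDiacritics(token) {
  return token.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

console.log('café résumé naïve'.split(' ').map(removeDiacritics))
// ["cafe", "resume", "naive"]
```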

Skipping Tokenization

For certain fields like IDs or exact-match fields, you may want to skip tokenization:
```js
import { create, insert } from '@orama/orama'

const db = await create({
  schema: {
    id: 'string',
    sku: 'string',
    title: 'string'
  },
  components: {
    tokenizer: {
      // These properties won't be split into tokens
      tokenizeSkipProperties: ['id', 'sku']
    }
  }
})

await insert(db, {
  id: 'PROD-123-XYZ',
  sku: 'ABC-DEF-GHI',
  title: 'Premium Widget'
})

// "PROD-123-XYZ" is indexed as a single token
// "Premium Widget" is tokenized to ["premium", "widget"]
```

Duplicate Handling

By default, Orama removes duplicate tokens to save memory and improve performance:
```js
// Default behavior (allowDuplicates: false)
const tokens1 = tokenizer.tokenize('the best best best product')
// Result: ["best", "product"] ("the" removed as stopword)

// With duplicates enabled
const db = await create({
  schema: { description: 'string' },
  components: {
    tokenizer: {
      allowDuplicates: true
    }
  }
})

const tokens2 = tokenizer.tokenize('the best best best product')
// Result: ["best", "best", "best", "product"]
```
Enabling allowDuplicates increases memory usage and index size. Only use it if you need to preserve term frequency for relevance scoring.
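Deduplication itself can be sketched in a few lines of JavaScript. A Set preserves first-seen insertion order, which keeps the surviving tokens in their original sequence:

```js
// Drop repeated tokens while preserving first-seen order (default behavior).
function dedupe(tokens, allowDuplicates = false) {
  return allowDuplicates ? tokens : [...new Set(tokens)]
}

console.log(dedupe(['best', 'best', 'best', 'product']))
// ["best", "product"]
```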

Normalization Cache

Orama caches normalized tokens to improve performance for repeated terms:
```js
// First normalization: computed and cached
const token1 = normalizeToken('', 'running')
// Cache key: "english::running"
// Value: "run" (after stemming)

// Second normalization: retrieved from cache
const token2 = normalizeToken('', 'running')
// Returns cached value immediately
```
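That caching strategy amounts to memoization keyed by language, property, and token. A sketch, with a simplified normalization (lowercase plus diacritics removal) standing in for the full pipeline:

```js
// Memoize normalized tokens under "language:property:token" keys,
// mirroring the cache-key shape described above.
const normalizationCache = new Map()

function normalizeToken(prop, token, language = 'english') {
  const key = `${language}:${prop}:${token}`
  if (normalizationCache.has(key)) return normalizationCache.get(key)

  // Stand-in for the real normalization (which may also stem)
  const normalized = token.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '')
  normalizationCache.set(key, normalized)
  return normalized
}

normalizeToken('', 'Café') // computed, cached under "english::Café"
normalizeToken('', 'Café') // second call: returned straight from the cache
console.log(normalizationCache.size) // 1
```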

Advanced: Custom Tokenizer

You can implement a custom tokenizer for specialized use cases:
```ts
import { create, Tokenizer } from '@orama/orama'

const customTokenizer: Tokenizer = {
  tokenize(input: string): string[] {
    // Custom tokenization logic
    return input
      .toLowerCase()
      .split(/[\s,]+/)
      .filter(token => token.length > 0)
  },
  language: 'custom',
  normalizationCache: new Map()
}

const db = await create({
  schema: { text: 'string' },
  components: {
    tokenizer: customTokenizer
  }
})
```

Configuration Options

language (string, default: "english")
The language to use for tokenization. See Languages for all supported languages.

allowDuplicates (boolean, default: false)
Whether to keep duplicate tokens in the token array.

tokenizeSkipProperties (string | string[])
Properties that should not be tokenized (indexed as single values).

stemmerSkipProperties (string | string[])
Properties where stemming should not be applied.

stopWords (string[] | function | false)
Custom stopwords array, a function, or false to disable stopword filtering. See Stopwords.

stemming (boolean)
Enable stemming for the configured language. See Stemming.

stemmer (function)
Custom stemmer function. See Stemming.
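Several of these options can be combined in one tokenizer configuration. The field names and values below are illustrative, not a recommended setup:

```js
import { create } from '@orama/orama'

const db = await create({
  schema: {
    sku: 'string',
    title: 'string',
    body: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',             // default
      allowDuplicates: false,          // default
      tokenizeSkipProperties: ['sku'], // index SKUs as single tokens
      stemmerSkipProperties: ['title'],// keep titles unstemmed
      stopWords: false,                // disable stopword filtering
      stemming: true
    }
  }
})
```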

Performance Considerations

The normalization cache stores processed tokens under keys shaped like language:property:token. This significantly improves performance when the same terms appear frequently. The cache lives in memory and grows with the number of unique term variations; for most applications this is negligible, but consider the trade-off for extremely large datasets.
Keeping allowDuplicates: false (the default) reduces memory usage by 20-40% for typical text content, since many words appear multiple times within a document.
Using tokenizeSkipProperties for ID and exact-match fields reduces index size and improves insert performance.

Stemming

Learn about stemming and how it reduces words to their root form

Stopwords

Configure stopword filtering to remove common words

Languages

Explore all 30+ supported languages

Searching

Learn how to search with tokenized text
