Stemming - Orama

Stemming is the process of reducing words to their base or root form. This improves search recall by matching different variations of the same word. For example:

“running”, “runs” → “run”
“beautiful”, “beautifully” → “beauti”
“connection”, “connected”, “connecting” → “connect”

How Stemming Works

When stemming is enabled, Orama applies the stemming algorithm during both indexing and searching:

// Document indexed:
// "The developer is developing a new development"
// Tokens after stemming ("the", "is", and "a" left out for brevity):
// ["develop", "develop", "new", "develop"]

// Search query:
// "developer"
// After stemming: "develop"

// Result: Matches found!
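
The flow above can be sketched in plain TypeScript. This is a minimal illustration rather than Orama's internals: `toyStem` is a hypothetical lookup-table stand-in for a real stemmer, and `toyTokenize`, `buildIndex`, and `lookup` are names invented for this example. The point it shows is that documents and queries pass through the same stemmer, so they meet on the same stem.

```typescript
// Hypothetical stand-in for a real stemmer: a small lookup table.
const toyStems = new Map<string, string>([
  ['developer', 'develop'],
  ['developing', 'develop'],
  ['development', 'develop']
])

function toyStem(word: string): string {
  return toyStems.get(word) ?? word
}

// Lowercase and split on non-letters; real tokenizers do more than this.
function toyTokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z]+/).filter((t) => t.length > 0)
}

// Inverted index: stem -> ids of the documents that contain it.
function buildIndex(docs: string[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>()
  docs.forEach((doc, id) => {
    for (const token of toyTokenize(doc)) {
      const stem = toyStem(token)
      const ids = index.get(stem) ?? new Set<number>()
      ids.add(id)
      index.set(stem, ids)
    }
  })
  return index
}

// The query goes through the same tokenizer and stemmer as the documents.
function lookup(index: Map<string, Set<number>>, query: string): number[] {
  const hits = new Set<number>()
  for (const token of toyTokenize(query)) {
    const ids = index.get(toyStem(token))
    if (ids) ids.forEach((id) => hits.add(id))
  }
  return Array.from(hits).sort((a, b) => a - b)
}

const index = buildIndex([
  'The developer is developing a new development',
  'Release notes for the new version'
])

console.log(lookup(index, 'developer'))   // [0]
console.log(lookup(index, 'development')) // [0]
```

If only one side were stemmed, the query "developer" would be looked up under a key that the index never stored, which is why the same stemmer has to run at both indexing and search time.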

Stemming is disabled by default. You must explicitly enable it in your tokenizer configuration.

Enabling Stemming

English Stemming

For English, Orama includes a built-in Porter stemmer:

import { create, insert, search } from '@orama/orama'

const db = await create({
  schema: {
    title: 'string',
    content: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',
      stemming: true  // Enable built-in English stemmer
    }
  }
})

await insert(db, {
  title: 'Advanced Programming Techniques',
  content: 'Learn programming through practical examples'
})

// "program", "programs", and "programming" all reduce to the stem "program",
// so a search for any of them matches the document
const results = await search(db, {
  term: 'programs'
})

Other Languages

For other languages, import the stemmer from @orama/stemmers:

import { create } from '@orama/orama'
import { stemmer, language } from '@orama/stemmers/italian'

const db = await create({
  schema: {
    title: 'string'
  },
  components: {
    tokenizer: {
      stemming: true,
      stemmer,
      language
    }
  }
})

Supported Languages

Orama provides stemmers for 30 languages:

Arabic: @orama/stemmers/arabic
Armenian: @orama/stemmers/armenian
Bulgarian: @orama/stemmers/bulgarian
Czech: @orama/stemmers/czech
Danish: @orama/stemmers/danish
Dutch: @orama/stemmers/dutch
English: Built-in
Finnish: @orama/stemmers/finnish
French: @orama/stemmers/french
German: @orama/stemmers/german
Greek: @orama/stemmers/greek
Hungarian: @orama/stemmers/hungarian
Indian: @orama/stemmers/indian
Indonesian: @orama/stemmers/indonesian
Irish: @orama/stemmers/irish
Italian: @orama/stemmers/italian
Lithuanian: @orama/stemmers/lithuanian
Nepali: @orama/stemmers/nepali
Norwegian: @orama/stemmers/norwegian
Portuguese: @orama/stemmers/portuguese
Romanian: @orama/stemmers/romanian
Russian: @orama/stemmers/russian
Sanskrit: @orama/stemmers/sanskrit
Serbian: @orama/stemmers/serbian
Slovenian: @orama/stemmers/slovenian
Spanish: @orama/stemmers/spanish
Swedish: @orama/stemmers/swedish
Tamil: @orama/stemmers/tamil
Turkish: @orama/stemmers/turkish
Ukrainian: @orama/stemmers/ukrainian

Chinese (Mandarin) and Japanese use specialized tokenizers instead of stemmers. See Languages for details.

Skipping Stemming for Specific Properties

You may want to disable stemming for certain fields:

import { create, insert } from '@orama/orama'

const db = await create({
  schema: {
    sku: 'string',
    productName: 'string',
    description: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',
      stemming: true,
      // Don't stem product names or SKUs
      stemmerSkipProperties: ['sku', 'productName']
    }
  }
})

await insert(db, {
  sku: 'RUNNING-SHOES-2024',
  productName: 'Nike Air Running Pro',
  description: 'Perfect running shoes for professional runners'
})

// "RUNNING-SHOES-2024" is NOT stemmed (exact match required)
// "Nike Air Running Pro" is NOT stemmed (exact product name)
// "Perfect running shoes..." IS stemmed: ["perfect", "run", "shoe", ...]

Use stemmerSkipProperties for fields where exact word forms matter, such as product names, brand names, or technical identifiers.
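
As a rough mental model, the skip list can be thought of as a per-property switch in front of the stemmer. The sketch below is plain TypeScript written under that assumption; it is not Orama's tokenizer code. `stripSuffix` is a hypothetical toy stemmer and `tokenizeProperty` is a name invented for this example.

```typescript
type ToyStemmer = (word: string) => string

// Hypothetical toy stemmer: strips a trailing "ing" or a plural "s".
const stripSuffix: ToyStemmer = (word) => {
  if (word.length > 5 && word.endsWith('ing')) return word.slice(0, -3)
  if (word.length > 3 && word.endsWith('s') && !word.endsWith('ss')) {
    return word.slice(0, -1)
  }
  return word
}

function tokenizeProperty(
  text: string,
  property: string,
  skipProperties: string[],
  stem: ToyStemmer
): string[] {
  const tokens = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0)
  // Properties on the skip list keep their exact word forms.
  if (skipProperties.indexOf(property) !== -1) return tokens
  return tokens.map((t) => stem(t))
}

const skip = ['sku', 'productName']

console.log(tokenizeProperty('Nike Air Running Pro', 'productName', skip, stripSuffix))
// ["nike", "air", "running", "pro"]
console.log(tokenizeProperty('Comfortable walking shoes', 'description', skip, stripSuffix))
// ["comfortable", "walk", "shoe"]
```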

English Porter Stemmer

Orama’s built-in English stemmer implements the Porter stemming algorithm with multiple transformation steps:

// Example transformations:
stemmer('running')      // "run"
stemmer('flies')        // "fli"
stemmer('died')         // "die"
stemmer('agreed')       // "agre"
stemmer('national')     // "nation"
stemmer('traditional')  // "tradit"
stemmer('connection')   // "connect"
stemmer('activate')     // "activ"

The algorithm handles:

Plural forms (“cats” → “cat”)
Past tense (“walked” → “walk”)
Continuous forms (“running” → “run”)
Adverbs (“quickly” → “quick”)
Adjectives (“beautiful” → “beauti”)
Nouns (“nationalism” → “nation”)
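
To give a sense of what a single transformation step looks like, here is step 1a of the published Porter algorithm, which handles plurals. This is a standalone sketch based on Porter's description, not a copy of Orama's source, and `porterStep1a` is a name chosen for illustration; Orama's implementation may differ in detail.

```typescript
// Step 1a of the Porter algorithm (plural handling):
//   SSES -> SS, IES -> I, SS -> SS, S -> (removed)
// Rules are tried in this order and only the first match applies.
function porterStep1a(word: string): string {
  if (word.endsWith('sses')) return word.slice(0, -2) // caresses -> caress
  if (word.endsWith('ies')) return word.slice(0, -2)  // flies -> fli
  if (word.endsWith('ss')) return word                // caress -> caress
  if (word.endsWith('s')) return word.slice(0, -1)    // cats -> cat
  return word
}

console.log(porterStep1a('cats'))     // "cat"
console.log(porterStep1a('flies'))    // "fli"
console.log(porterStep1a('caresses')) // "caress"
console.log(porterStep1a('caress'))   // "caress"
```

The later steps follow the same pattern of ordered suffix rules, with extra conditions on the remaining stem (for example, how many vowel-consonant sequences it contains) before a suffix is removed.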

Custom Stemmer

Implement a custom stemmer for specialized vocabularies:

import { create, Stemmer } from '@orama/orama'

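// Simple suffix-stripping stemmer
const customStemmer: Stemmer = (word: string): string => {
  // Remove common English suffixes.
  // 'es' is listed before 's' so the longer suffix is tried first.
  const suffixes = ['ing', 'ed', 'ly', 'es', 's']
  
  for (const suffix of suffixes) {
    // Keep at least three characters so short words ("is", "red") stay intact
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      return word.slice(0, -suffix.length)
    }
  }
  
  return word
}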

const db = await create({
  schema: {
    text: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',
      stemmer: customStemmer,
      stemming: true
    }
  }
})

Custom stemmers should be deterministic and fast. The stemmer is called for every token during both indexing and searching.
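
If a custom stemmer is expensive, one common way to keep it fast is to memoize it so each distinct word is stemmed only once. The sketch below is plain TypeScript; `memoizeStemmer` and `slowStem` are hypothetical names for this example, and the wrapper only works because the stemmer is deterministic. Orama's pipeline already lists a caching step, so treat this as an illustration of the idea rather than something you necessarily need to add.

```typescript
type WordStemmer = (word: string) => string

// Wraps a stemmer so that each distinct word is stemmed only once.
function memoizeStemmer(stem: WordStemmer): WordStemmer {
  const cache = new Map<string, string>()
  return (word) => {
    const cached = cache.get(word)
    if (cached !== undefined) return cached
    const result = stem(word)
    cache.set(word, result)
    return result
  }
}

// Hypothetical stemmer, instrumented to count real invocations.
let calls = 0
const slowStem: WordStemmer = (word) => {
  calls++
  return word.endsWith('s') ? word.slice(0, -1) : word
}

const fastStem = memoizeStemmer(slowStem)
const output = ['cats', 'dogs', 'cats', 'cats'].map((w) => fastStem(w))

console.log(output) // ["cat", "dog", "cat", "cat"]
console.log(calls)  // 2
```

The cache in this sketch grows without bound, which is fine for a fixed vocabulary but worth capping for open-ended input.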

When to Use Stemming

Good Use Cases

General content search: Blog posts, articles, documentation
E-commerce descriptions: Product descriptions with natural language
Support systems: Help articles, FAQs, knowledge bases
Social media: Posts, comments, messages

Avoid Stemming For

Technical terms: API names, function names, code identifiers
Product codes: SKUs, model numbers, part identifiers
Brand names: Company names, product names
Proper nouns: Person names, place names
Legal/medical text: Where exact terminology matters

Performance Impact

Memory Usage

Stemming typically reduces index size by 15-30% because different word forms map to the same stem:

// Without stemming: 3 separate index entries
["running", "runs", "run"]

// With stemming: 1 index entry
["run"]
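
The effect on index size can be sketched by counting distinct terms with and without a stemmer. This is a toy illustration in plain TypeScript: `stemTable` is a hypothetical lookup table standing in for a real stemmer, and the numbers it prints describe only this tiny sample, not real-world savings.

```typescript
// Hypothetical stem table standing in for a real stemmer.
const stemTable = new Map<string, string>([
  ['running', 'run'],
  ['runs', 'run'],
  ['shoes', 'shoe']
])

const tableStem = (word: string): string => stemTable.get(word) ?? word

// Counts how many distinct index keys a list of tokens would produce.
function distinctTerms(tokens: string[], stem?: (word: string) => string): number {
  const terms = new Set<string>()
  for (const token of tokens) {
    terms.add(stem ? stem(token) : token)
  }
  return terms.size
}

const sampleTokens = ['running', 'runs', 'run', 'shoes', 'shoe']

console.log(distinctTerms(sampleTokens))            // 5
console.log(distinctTerms(sampleTokens, tableStem)) // 2
```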

Search Speed

Stemming has minimal impact on search performance (typically less than 5% overhead) but can significantly improve recall.

Indexing Speed

Each token requires stemming during insertion, adding approximately 10-20% overhead to indexing time.

Stemming + Stopwords + Diacritics

The normalization pipeline combines multiple text processing steps:

import { create } from '@orama/orama'
import { stemmer, language } from '@orama/stemmers/french'
import { stopwords } from '@orama/stopwords/french'

const db = await create({
  schema: {
    content: 'string'
  },
  components: {
    tokenizer: {
      language,
      stemmer,
      stemming: true,
      stopWords: stopwords
    }
  }
})

// Input: "Les développeurs créent des applications"
// After tokenization: ["les", "développeurs", "créent", "des", "applications"]
// After stopwords: ["développeurs", "créent", "applications"] ("les", "des" removed)
// After stemming: ["développ", "cré", "applic"]
// After diacritics: ["developp", "cre", "applic"]

The processing order is:

Tokenization (split text)
Lowercase conversion
Stopwords removal
Stemming
Diacritics removal
Caching
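
The order above can be sketched as a small pipeline in plain TypeScript. Everything here is an assumption made for illustration: `toyStopwords` and `toyFrenchStems` are hypothetical stand-ins for the real French stopword list and stemmer, and `normalizeToken` and `analyze` are invented names, not Orama functions.

```typescript
// Hypothetical stand-ins: a tiny stopword list and a lookup-table "stemmer".
const toyStopwords = new Set<string>(['les', 'des'])
const toyFrenchStems = new Map<string, string>([
  ['développeurs', 'développ'],
  ['créent', 'cré'],
  ['applications', 'applic']
])

// Decompose accented characters (NFD), then drop the combining marks.
function removeDiacritics(token: string): string {
  return token.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

// Caches the fully normalized form of each lowercase token.
const tokenCache = new Map<string, string>()

function normalizeToken(raw: string): string | null {
  const lower = raw.toLowerCase()                    // lowercase conversion
  if (toyStopwords.has(lower)) return null           // stopword removal
  const cached = tokenCache.get(lower)
  if (cached !== undefined) return cached
  const stemmed = toyFrenchStems.get(lower) ?? lower // stemming
  const result = removeDiacritics(stemmed)           // diacritics removal
  tokenCache.set(lower, result)                      // caching
  return result
}

function analyze(text: string): string[] {
  return text
    .split(/\s+/)                                    // tokenization
    .filter((t) => t.length > 0)
    .map((t) => normalizeToken(t))
    .filter((t): t is string => t !== null)
}

console.log(analyze('Les développeurs créent des applications'))
// ["developp", "cre", "applic"]
```

Running the stemmer before diacritics removal matters in this sketch because the stem table is keyed on the accented forms; a real stemmer may likewise rely on accents to recognize suffixes.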

Installation

npm install @orama/stemmers

Related

Stopwords: Remove common words that don’t add meaning
Languages: See all supported languages and their features
Tokenization: Learn how text is split into tokens
Search: Use stemming to improve search results
