Stemming is the process of reducing words to their base or root form. This improves search recall by matching different variations of the same word. For example:
  • “running”, “runs” → “run”
  • “beautiful”, “beautifully” → “beauti”
  • “connection”, “connected”, “connecting” → “connect”

How Stemming Works

When stemming is enabled, Orama applies the stemming algorithm during both indexing and searching:
// Document indexed:
// "The developer is developing a new development"
// Tokens after stemming (stop words omitted for brevity): ["develop", "develop", "new", "develop"]

// Search query:
// "developer"
// After stemming: "develop"

// Result: Matches found!
Stemming is disabled by default. You must explicitly enable it in your tokenizer configuration.
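
To illustrate why the stemmer has to run on both the indexed text and the query, here is a minimal sketch of a toy inverted index. It is not Orama's implementation: `toyStem`, `analyze`, `indexDocument`, and `lookup` are hypothetical names, and the stemmer strips only a handful of suffixes.

```typescript
// Toy stemmer for illustration only: strips a few suffixes, longest first.
// A real Porter stemmer applies many more rules.
function toyStem(word: string): string {
  for (const suffix of ['ment', 'ing', 'er']) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      return word.slice(0, -suffix.length)
    }
  }
  return word
}

// Lowercase, split on non-letters, and stem every token.
function analyze(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((token) => token.length > 0)
    .map(toyStem)
}

// Inverted index: stem -> ids of the documents that contain it.
const index = new Map<string, Set<number>>()

function indexDocument(id: number, text: string): void {
  for (const stem of analyze(text)) {
    const ids = index.get(stem) ?? new Set<number>()
    ids.add(id)
    index.set(stem, ids)
  }
}

// The query goes through the same analysis as the documents.
function lookup(query: string): number[] {
  const hits = new Set<number>()
  for (const stem of analyze(query)) {
    index.get(stem)?.forEach((id) => hits.add(id))
  }
  return Array.from(hits)
}

indexDocument(1, 'The developer is developing a new development')
console.log(lookup('developer')) // [1]
```

If the query skipped the stemming step, "developer" would be looked up verbatim and miss the "develop" entry, which is why both sides need the same analysis.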

Enabling Stemming

English Stemming

For English, Orama includes a built-in Porter stemmer:
import { create, insert, search } from '@orama/orama'

const db = await create({
  schema: {
    title: 'string',
    content: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',
      stemming: true  // Enable built-in English stemmer
    }
  }
})

await insert(db, {
  title: 'Advanced Programming Techniques',
  content: 'Learn programming through practical examples'
})

// Queries such as "program", "programs", and "programming" all stem to "program" and match
const results = await search(db, {
  term: 'programs'
})

Other Languages

For other languages, import the stemmer from @orama/stemmers:
import { create } from '@orama/orama'
import { stemmer, language } from '@orama/stemmers/italian'

const db = await create({
  schema: {
    title: 'string'
  },
  components: {
    tokenizer: {
      stemming: true,
      stemmer,
      language
    }
  }
})

Supported Languages

Orama provides stemmers for 30 languages:

  • Arabic: @orama/stemmers/arabic
  • Armenian: @orama/stemmers/armenian
  • Bulgarian: @orama/stemmers/bulgarian
  • Czech: @orama/stemmers/czech
  • Danish: @orama/stemmers/danish
  • Dutch: @orama/stemmers/dutch
  • English: Built-in
  • Finnish: @orama/stemmers/finnish
  • French: @orama/stemmers/french
  • German: @orama/stemmers/german
  • Greek: @orama/stemmers/greek
  • Hungarian: @orama/stemmers/hungarian
  • Indian: @orama/stemmers/indian
  • Indonesian: @orama/stemmers/indonesian
  • Irish: @orama/stemmers/irish
  • Italian: @orama/stemmers/italian
  • Lithuanian: @orama/stemmers/lithuanian
  • Nepali: @orama/stemmers/nepali
  • Norwegian: @orama/stemmers/norwegian
  • Portuguese: @orama/stemmers/portuguese
  • Romanian: @orama/stemmers/romanian
  • Russian: @orama/stemmers/russian
  • Sanskrit: @orama/stemmers/sanskrit
  • Serbian: @orama/stemmers/serbian
  • Slovenian: @orama/stemmers/slovenian
  • Spanish: @orama/stemmers/spanish
  • Swedish: @orama/stemmers/swedish
  • Tamil: @orama/stemmers/tamil
  • Turkish: @orama/stemmers/turkish
  • Ukrainian: @orama/stemmers/ukrainian

Chinese (Mandarin) and Japanese use specialized tokenizers instead of stemmers. See Languages for details.

Skipping Stemming for Specific Properties

You may want to disable stemming for certain fields:
import { create, insert } from '@orama/orama'

const db = await create({
  schema: {
    sku: 'string',
    productName: 'string',
    description: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',
      stemming: true,
      // Don't stem product names or SKUs
      stemmerSkipProperties: ['sku', 'productName']
    }
  }
})

await insert(db, {
  sku: 'RUNNING-SHOES-2024',
  productName: 'Nike Air Running Pro',
  description: 'Perfect running shoes for professional runners'
})

// "RUNNING-SHOES-2024" is NOT stemmed (tokens keep their original word forms)
// "Nike Air Running Pro" is NOT stemmed ("running" stays "running")
// "Perfect running shoes..." IS stemmed: ["perfect", "run", "shoe", ...]
Use stemmerSkipProperties for fields where exact word forms matter, such as product names, brand names, or technical identifiers.
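
Conceptually, a skip list means the stemming step is bypassed for the listed properties while tokenization and lowercasing still apply. The sketch below illustrates that idea under stated assumptions: `analyzeField` and `toyStem` are hypothetical helpers, not part of Orama's API, and the stemmer covers only two suffix patterns.

```typescript
type StemFn = (word: string) => string

// Hypothetical helper: analyze one field, stemming only when the
// property is not in the skip list.
function analyzeField(
  property: string,
  text: string,
  stem: StemFn,
  skipProperties: string[]
): string[] {
  const tokens = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0)

  return skipProperties.includes(property) ? tokens : tokens.map(stem)
}

// Toy stemmer for the example: handles "-ing" and a plural "-s" only.
const toyStem: StemFn = (word) => {
  if (word.length > 5 && word.endsWith('ing')) {
    const base = word.slice(0, -3)
    // Undouble a trailing doubled letter: "runn" -> "run"
    const last = base[base.length - 1]
    return base[base.length - 2] === last ? base.slice(0, -1) : base
  }
  if (word.length > 3 && word.endsWith('s') && !word.endsWith('ss')) {
    return word.slice(0, -1)
  }
  return word
}

const skip = ['sku', 'productName']

console.log(analyzeField('productName', 'Nike Air Running Pro', toyStem, skip))
// ["nike", "air", "running", "pro"]
console.log(analyzeField('description', 'Perfect running shoes', toyStem, skip))
// ["perfect", "run", "shoe"]
```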

English Porter Stemmer

Orama’s built-in English stemmer implements the Porter stemming algorithm with multiple transformation steps:
// Example transformations:
stemmer('running')      // "run"
stemmer('flies')        // "fli"
stemmer('died')         // "die"
stemmer('agreed')       // "agre"
stemmer('national')     // "nation"
stemmer('traditional')  // "tradit"
stemmer('connection')   // "connect"
stemmer('activate')     // "activ"
The algorithm handles:
  • Plural forms (“cats” → “cat”)
  • Past tense (“walked” → “walk”)
  • Continuous forms (“running” → “run”)
  • Adverbs (“quickly” → “quick”)
  • Adjectives (“beautiful” → “beauti”)
  • Nouns (“nationalism” → “nation”)
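
As an illustration of what one transformation step looks like, the sketch below implements step 1a (plural handling) as described in Porter's original algorithm. The function name `porterStep1a` is hypothetical, and Orama's internal implementation may be organized differently; this covers only the first of several steps.

```typescript
// Step 1a of Porter's original algorithm (plural handling).
// Rules are tried in order, and the first matching suffix wins.
function porterStep1a(word: string): string {
  if (word.endsWith('sses')) return word.slice(0, -2) // caresses -> caress
  if (word.endsWith('ies')) return word.slice(0, -2) // ponies -> poni
  if (word.endsWith('ss')) return word // caress -> caress
  if (word.endsWith('s')) return word.slice(0, -1) // cats -> cat
  return word
}

console.log(porterStep1a('cats')) // "cat"
console.log(porterStep1a('flies')) // "fli"
```

Checking "ss" before "s" is what keeps words like "caress" intact, which shows why rule order matters in suffix-stripping stemmers.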

Custom Stemmer

Implement a custom stemmer for specialized vocabularies:
import { create, type Stemmer } from '@orama/orama'

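// Simple suffix-stripping stemmer
const customStemmer: Stemmer = (word: string): string => {
  // Common English suffixes, ordered so that 'es' is tried before 's'
  // (otherwise 's' would always match first and 'es' would never apply)
  const suffixes = ['ing', 'es', 'ed', 'ly', 's']

  for (const suffix of suffixes) {
    // Keep at least three characters so short words such as "is" or "red" survive
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      return word.slice(0, -suffix.length)
    }
  }

  return word
}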

const db = await create({
  schema: {
    text: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',
      stemmer: customStemmer,
      stemming: true
    }
  }
})
Custom stemmers should be deterministic and fast. The stemmer is called for every token during both indexing and searching.
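
Because a deterministic stemmer always returns the same stem for the same word, its results can be memoized safely. The wrapper below is a minimal sketch of that idea; `memoizeStemmer` is a hypothetical helper rather than an Orama API, and it may be redundant if the tokenizer already caches normalized tokens.

```typescript
type StemFn = (word: string) => string

// Wrap a stemmer with a simple in-memory cache (hypothetical helper).
// This is only safe when the wrapped stemmer is deterministic.
function memoizeStemmer(stem: StemFn): StemFn {
  const cache = new Map<string, string>()
  return (word) => {
    const cached = cache.get(word)
    if (cached !== undefined) return cached
    const result = stem(word)
    cache.set(word, result)
    return result
  }
}

// Count how often the underlying stemmer actually runs.
let calls = 0
const countingStem: StemFn = (word) => {
  calls++
  return word.endsWith('s') ? word.slice(0, -1) : word
}

const cachedStem = memoizeStemmer(countingStem)
cachedStem('cats')
cachedStem('cats')
console.log(calls) // 1
```

The cache grows with the vocabulary, so a long-running process with unbounded input might want a size limit.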

When to Use Stemming

Stemming works well for:
  • General content search: Blog posts, articles, documentation
  • E-commerce descriptions: Product descriptions with natural language
  • Support systems: Help articles, FAQs, knowledge bases
  • Social media: Posts, comments, messages

Avoid stemming for:
  • Technical terms: API names, function names, code identifiers
  • Product codes: SKUs, model numbers, part identifiers
  • Brand names: Company names, product names
  • Proper nouns: Person names, place names
  • Legal/medical text: Where exact terminology matters

Performance Impact

Memory Usage

Stemming typically reduces index size by 15-30% because different word forms map to the same stem:
// Without stemming: 3 separate index entries
["running", "runs", "run"]

// With stemming: 1 index entry
["run"]
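
The effect on index size can be estimated on your own data by counting distinct terms before and after stemming. The sketch below does this for a small token list; `distinctTerms` and `toyStem` are hypothetical helpers, and the toy stemmer is far cruder than a real Porter implementation, so real savings will differ.

```typescript
// Toy stemmer: strips "-ing", "-ed", or "-s", then undoubles a trailing
// doubled letter ("runn" -> "run"). For illustration only.
function toyStem(word: string): string {
  let base = word
  for (const suffix of ['ing', 'ed', 's']) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      base = word.slice(0, -suffix.length)
      break
    }
  }
  const n = base.length
  if (n >= 2 && base[n - 1] === base[n - 2]) base = base.slice(0, -1)
  return base
}

// Number of distinct index terms, with or without a stemmer applied.
function distinctTerms(tokens: string[], stem?: (word: string) => string): number {
  return new Set(stem ? tokens.map(stem) : tokens).size
}

const tokens = ['running', 'runs', 'run', 'walked', 'walking', 'walks']

console.log(distinctTerms(tokens)) // 6
console.log(distinctTerms(tokens, toyStem)) // 2
```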

Search Speed

Stemming has minimal impact on search performance (typically less than 5% overhead) but can significantly improve recall.

Indexing Speed

Each token requires stemming during insertion, adding approximately 10-20% overhead to indexing time.

Stemming + Stopwords + Diacritics

The normalization pipeline combines multiple text processing steps:
import { create } from '@orama/orama'
import { stemmer, language } from '@orama/stemmers/french'
import { stopwords } from '@orama/stopwords/french'

const db = await create({
  schema: {
    content: 'string'
  },
  components: {
    tokenizer: {
      language,
      stemmer,
      stemming: true,
      stopWords: stopwords
    }
  }
})

// Input: "Les développeurs créent des applications"
// After tokenization and lowercasing: ["les", "développeurs", "créent", "des", "applications"]
// After stopwords: ["développeurs", "créent", "applications"] ("les", "des" removed)
// After stemming: ["développ", "cré", "applic"]
// After diacritics: ["developp", "cre", "applic"]
The processing order is:
  1. Tokenization (split text)
  2. Lowercase conversion
  3. Stopwords removal
  4. Stemming
  5. Diacritics removal
  6. Caching
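
The order above can be sketched as a small standalone pipeline. This is an illustration under stated assumptions, not Orama's tokenizer: `createPipeline` and `toyStem` are hypothetical names, tokenization is a plain whitespace split, and the stemmer only strips a trailing "s", so the output differs from what a real French stemmer produces.

```typescript
type StemFn = (word: string) => string

interface PipelineOptions {
  stopWords: Set<string>
  stem: StemFn
}

function createPipeline({ stopWords, stem }: PipelineOptions) {
  // 6. Caching: lowercase token -> fully normalized token.
  const cache = new Map<string, string>()

  return (text: string): string[] => {
    const output: string[] = []
    // 1. Tokenization: split on whitespace (a simplification).
    for (const raw of text.split(/\s+/)) {
      if (raw.length === 0) continue
      // 2. Lowercase conversion.
      const lower = raw.toLowerCase()
      // 3. Stopwords removal.
      if (stopWords.has(lower)) continue
      let normalized = cache.get(lower)
      if (normalized === undefined) {
        // 4. Stemming, then 5. diacritics removal: decompose accented
        // characters and drop the combining marks.
        normalized = stem(lower).normalize('NFD').replace(/[\u0300-\u036f]/g, '')
        cache.set(lower, normalized)
      }
      output.push(normalized)
    }
    return output
  }
}

// Toy stemmer: strips a trailing "s" only.
const toyStem: StemFn = (word) => (word.endsWith('s') ? word.slice(0, -1) : word)

const analyze = createPipeline({ stopWords: new Set(['les', 'des']), stem: toyStem })

console.log(analyze('Les développeurs créent des applications'))
// ["developpeur", "creent", "application"]
```

Running the stemmer before diacritics removal lets language-specific rules see the accented characters they were written for, which is one plausible reason for this ordering.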

Installation

npm install @orama/stemmers

Related Topics

  • Stopwords: Remove common words that don’t add meaning
  • Languages: See all supported languages and their features
  • Tokenization: Learn how text is split into tokens
  • Search: Use stemming to improve search results
