Stopwords - Orama

Stopwords are common words that appear frequently in text but usually don’t carry significant meaning for search. Words like “the”, “is”, “at”, “which”, and “on” are typically considered stopwords. Removing stopwords:

Reduces index size, typically by 20-40% depending on content type
Improves search relevance
Speeds up query processing
Reduces false positive matches

How Stopwords Work

During tokenization, Orama checks each token against the stopwords list and removes matches:

// Input text:
// "The quick brown fox jumps over the lazy dog"

// Without stopwords:
// ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

// With English stopwords:
// ["quick", "brown", "fox", "jumps", "lazy", "dog"]
// "the" and "over" removed

By default, Orama initializes with an empty stopwords array. You must explicitly provide stopwords to enable filtering.
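
The filtering step can be pictured with a minimal, self-contained sketch. It is not Orama's actual tokenizer: `simpleTokenize` is a hypothetical stand-in, and the split rule (lowercase, then split on anything that is not a letter or digit) is an assumption made for illustration.

```typescript
// Hypothetical, simplified tokenizer used only to illustrate stopword filtering.
// It is NOT Orama's implementation.
function simpleTokenize(text: string, stopWords: string[] = []): string[] {
  const blocked = new Set(stopWords)
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/) // assumption: split on non-alphanumeric characters
    .filter((token) => token.length > 0 && !blocked.has(token))
}

const sentence = 'The quick brown fox jumps over the lazy dog'

// Empty list (the default described above): nothing is filtered.
const unfiltered = simpleTokenize(sentence)

// A tiny illustrative list: "the" and "over" are dropped.
const filtered = simpleTokenize(sentence, ['the', 'over'])

console.log(unfiltered.length) // 9
console.log(filtered) // ["quick", "brown", "fox", "jumps", "lazy", "dog"]
```

With an empty list all nine tokens survive, which matches the default behavior described above.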

Using Built-in Stopwords

Orama provides stopword lists for 30 languages through @orama/stopwords:

import { create, insert } from '@orama/orama'
import { stopwords } from '@orama/stopwords/english'

const db = await create({
  schema: {
    title: 'string',
    content: 'string'
  },
  components: {
    tokenizer: {
      stopWords: stopwords
    }
  }
})

await insert(db, {
  title: 'The Art of Programming',
  content: 'Learn the best practices for coding'
})

// "The", "of", "the", "for" are filtered out
// Indexed tokens: ["art", "programming", "learn", "best", "practices", "coding"]

Supported Languages

Orama provides stopword lists for 30 languages:

Arabic: @orama/stopwords/arabic
Armenian: @orama/stopwords/armenian
Bulgarian: @orama/stopwords/bulgarian
Chinese: @orama/stopwords/chinese
Danish: @orama/stopwords/danish
Dutch: @orama/stopwords/dutch
English: @orama/stopwords/english
Finnish: @orama/stopwords/finnish
French: @orama/stopwords/french
German: @orama/stopwords/german
Greek: @orama/stopwords/greek
Hungarian: @orama/stopwords/hungarian
Indian: @orama/stopwords/indian
Indonesian: @orama/stopwords/indonesian
Irish: @orama/stopwords/irish
Italian: @orama/stopwords/italian
Japanese: @orama/stopwords/japanese
Nepali: @orama/stopwords/nepali
Norwegian: @orama/stopwords/norwegian
Portuguese: @orama/stopwords/portuguese
Romanian: @orama/stopwords/romanian
Russian: @orama/stopwords/russian
Sanskrit: @orama/stopwords/sanskrit
Serbian: @orama/stopwords/serbian
Slovenian: @orama/stopwords/slovenian
Spanish: @orama/stopwords/spanish
Swedish: @orama/stopwords/swedish
Tamil: @orama/stopwords/tamil
Turkish: @orama/stopwords/turkish
Ukrainian: @orama/stopwords/ukrainian

Custom Stopwords

Provide your own stopwords as an array:

import { create, insert } from '@orama/orama'

const customStopwords = ['inc', 'ltd', 'corp', 'co', 'llc']

const db = await create({
  schema: {
    companyName: 'string'
  },
  components: {
    tokenizer: {
      stopWords: customStopwords
    }
  }
})

await insert(db, {
  companyName: 'Acme Corp'
})

// "Corp" is filtered out
// Indexed: ["acme"]
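
In the example above, the lowercase entry 'corp' removes the token produced from "Corp", which suggests that matching happens on lowercased tokens. Under that assumption, it can help to normalize a hand-written list before passing it to `stopWords`. The helper below, `normalizeStopwords`, is a hypothetical utility and not part of Orama:

```typescript
// Hypothetical helper (not an Orama API): lowercases, trims, and de-duplicates
// a custom stopword list so entries line up with lowercased tokens.
function normalizeStopwords(words: string[]): string[] {
  const seen = new Set<string>()
  for (const word of words) {
    const cleaned = word.trim().toLowerCase()
    if (cleaned.length > 0) seen.add(cleaned)
  }
  return Array.from(seen)
}

const companySuffixes = normalizeStopwords(['Inc', ' LTD ', 'corp', 'Corp', 'co', 'LLC', ''])
console.log(companySuffixes) // ["inc", "ltd", "corp", "co", "llc"]
```

The resulting array can be passed as `stopWords` in the same way as `customStopwords` above.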

Extending Built-in Stopwords

Combine built-in stopwords with custom additions:

import { create } from '@orama/orama'
import { stopwords as englishStopwords } from '@orama/stopwords/english'

const db = await create({
  schema: {
    content: 'string'
  },
  components: {
    tokenizer: {
      stopWords: (defaultStopWords) => [
        ...englishStopwords,
        ...defaultStopWords,
        // Add custom domain-specific stopwords
        'lorem',
        'ipsum',
        'dolor',
        'click',
        'here'
      ]
    }
  }
})

The function receives an empty array by default since Orama doesn’t have default stopwords. This pattern allows you to chain stopword modifications.
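
To see why the function form composes, here is a small sketch of how an option that accepts either an array or a function could be resolved. `StopWordsOption` and `resolveStopWords` are hypothetical names used for illustration, and `englishSample` stands in for an imported list; Orama's internal code may differ.

```typescript
// Hypothetical types and resolver, written only to illustrate the pattern.
type StopWordsOption = false | string[] | ((defaults: string[]) => string[])

function resolveStopWords(option: StopWordsOption | undefined): string[] {
  const defaults: string[] = [] // per the text above, there are no default stopwords
  if (option === undefined || option === false) return defaults
  if (Array.isArray(option)) return option
  return option(defaults) // function form: receives the defaults, returns the final list
}

// Stand-in for a list imported from @orama/stopwords/english.
const englishSample: string[] = ['the', 'is', 'at']

const resolved = resolveStopWords((defaultStopWords) => [
  ...englishSample,
  ...defaultStopWords,
  'lorem',
  'ipsum'
])

console.log(resolved) // ["the", "is", "at", "lorem", "ipsum"]
```

Because the defaults are empty, spreading `defaultStopWords` adds nothing here; the callback would only pick up extra entries if defaults were ever supplied.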

Disabling Stopwords

Explicitly disable stopword filtering:

import { create } from '@orama/orama'

const db = await create({
  schema: {
    content: 'string'
  },
  components: {
    tokenizer: {
      stopWords: false  // Disable stopwords completely
    }
  }
})

// All words are indexed, including "the", "is", "at", etc.

English Stopwords List

The English stopwords package includes 204 common words, including the following (grouped here by category):

// Pronouns
['i', 'me', 'my', 'myself', 'we', 'us', 'our', 'ours', 'ourselves',
 'you', 'your', 'yours', 'yourself', 'yourselves',
 'he', 'him', 'his', 'himself',
 'she', 'her', 'hers', 'herself',
 'it', 'its', 'itself',
 'they', 'them', 'their', 'theirs', 'themselves']

// Question words & demonstratives
['what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those']

// Verbs (to be, to have, to do)
['am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
 'have', 'has', 'had', 'having',
 'do', 'does', 'did', 'doing']

// Modal verbs
['will', 'would', 'shall', 'should', 'can', 'could',
 'may', 'might', 'must', 'ought']

// Contractions
["i'm", "you're", "he's", "she's", "it's", "we're", "they're",
 "i've", "you've", "we've", "they've",
 "i'd", "you'd", "he'd", "she'd", "we'd", "they'd",
 "i'll", "you'll", "he'll", "she'll", "we'll", "they'll",
 "isn't", "aren't", "wasn't", "weren't",
 "hasn't", "haven't", "hadn't",
 "doesn't", "don't", "didn't",
 "won't", "wouldn't", "shan't", "shouldn't",
 "can't", "cannot", "couldn't", "mustn't"]

// Articles & conjunctions
['an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while']

// Prepositions
['of', 'at', 'by', 'for', 'with', 'about', 'against', 'between',
 'into', 'through', 'during', 'before', 'after', 'above', 'below',
 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under']

// Adverbs
['again', 'further', 'then', 'once',
 'here', 'there', 'when', 'where', 'why', 'how']

// Quantifiers
['all', 'any', 'both', 'each', 'few', 'more', 'most',
 'other', 'some', 'such']

// Negation & emphasis
['no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very']

When to Use Stopwords

Good Use Cases

Long-form content: Articles, blog posts, documentation
General search: When searching natural language text
Large datasets: To reduce index size significantly
E-commerce: Product descriptions with common filler words

Avoid Stopwords For

Short content: Tweets, headlines, titles (stopwords may be significant)
Technical content: Code, commands, where “to”, “in”, “at” may be important
Phrase search: When exact phrases like “to be or not to be” matter
Small datasets: Limited benefits if you have fewer than 1000 documents
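
The phrase-search caveat is easy to demonstrate: a query made up entirely of stopwords leaves nothing to match on. The sketch below uses a hypothetical `filterQuery` helper and a four-word subset of the English list; it only models the filtering step, not Orama's search.

```typescript
// Hypothetical helper: models only the stopword-filtering step of a query.
function filterQuery(query: string, stopWords: string[]): string[] {
  const blocked = new Set(stopWords)
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0 && !blocked.has(term))
}

// Four entries that also appear in the English list shown earlier.
const subset = ['to', 'be', 'or', 'not']

const hamlet = filterQuery('To be or not to be', subset)
const mixed = filterQuery('Ways to be productive', subset)

console.log(hamlet) // [] because every term was a stopword
console.log(mixed) // ["ways", "productive"]
```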

Performance Impact

Index Size Reduction

Stopwords typically reduce index size by 20-40% depending on content type:

// Without stopwords: 1,000,000 tokens indexed
// With English stopwords: ~600,000 tokens indexed
// Reduction: 40%
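
The percentage above is simple arithmetic: tokens removed divided by tokens before filtering. To measure it on your own content rather than rely on the typical range, something like the following sketch can be used. `tokenReduction` is a hypothetical helper, and the whitespace split is a simplification of real tokenization.

```typescript
// Hypothetical helper: estimates how many tokens a stopword list would remove.
function tokenReduction(
  text: string,
  stopWords: string[]
): { before: number; after: number; percent: number } {
  const blocked = new Set(stopWords)
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0)
  const kept = tokens.filter((t) => !blocked.has(t))
  const removed = tokens.length - kept.length
  const percent = tokens.length === 0 ? 0 : (removed * 100) / tokens.length
  return { before: tokens.length, after: kept.length, percent }
}

const stats = tokenReduction(
  'the cat sat on the mat and dog slept soundly',
  ['the', 'on', 'and']
)

console.log(stats) // { before: 10, after: 6, percent: 40 }
```

Running this over a representative sample of documents gives a rough estimate before committing to a stopword list.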

Search Performance

Fewer tokens mean:

Faster searches (10-30% improvement)
Lower memory usage
Reduced disk I/O for persistent indexes

Insert Performance

Minimal overhead (~2-5%) for checking tokens against the stopwords list.

Stopwords with Stemming

Stopwords are removed before stemming in the normalization pipeline:

import { create } from '@orama/orama'
import { stopwords } from '@orama/stopwords/english'

const db = await create({
  schema: {
    content: 'string'
  },
  components: {
    tokenizer: {
      language: 'english',
      stemming: true,
      stopWords: stopwords
    }
  }
})

// Input: "The developers are developing applications"
// 1. Tokenize: ["the", "developers", "are", "developing", "applications"]
// 2. Remove stopwords: ["developers", "developing", "applications"]
// 3. Stem: ["develop", "develop", "applic"]
// 4. Result: ["develop", "applic"] (duplicates removed)
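
The order matters because the stopword list holds unstemmed words. The sketch below mimics the four steps with a deliberately naive suffix-stripping function. `toyStem` and `analyze` are hypothetical stand-ins, not the stemmer or pipeline Orama uses, so the stems differ from the ones listed above.

```typescript
// Toy stemmer for illustration only: strips a few common suffixes.
function toyStem(token: string): string {
  for (const suffix of ['ing', 'ers', 's']) {
    if (token.length > suffix.length + 2 && token.endsWith(suffix)) {
      return token.slice(0, -suffix.length)
    }
  }
  return token
}

function analyze(text: string, stopWords: string[]): string[] {
  const blocked = new Set(stopWords)
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0) // 1. tokenize
  const kept = tokens.filter((t) => !blocked.has(t)) // 2. remove stopwords
  const stems = kept.map(toyStem) // 3. stem
  return Array.from(new Set(stems)) // 4. de-duplicate
}

const analyzed = analyze('The developers are developing applications', ['the', 'are'])
console.log(analyzed) // ["develop", "application"]
```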

Validation

Orama validates stopwords configuration:

// ✅ Valid: Array of strings
stopWords: ['the', 'is', 'at']

// ✅ Valid: Function returning array of strings
stopWords: (defaults) => [...defaults, 'custom']

// ✅ Valid: Disable stopwords
stopWords: false

// ❌ Invalid: Number
stopWords: 123  // Error: CUSTOM_STOP_WORDS_MUST_BE_FUNCTION_OR_ARRAY

// ❌ Invalid: Array with non-strings
stopWords: ['the', 123, 'is']  // Error: CUSTOM_STOP_WORDS_MUST_BE_FUNCTION_OR_ARRAY

// ❌ Invalid: Function not returning array
stopWords: () => 'the'  // Error: CUSTOM_STOP_WORDS_MUST_BE_FUNCTION_OR_ARRAY
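
A sketch of checks that would produce the outcomes listed above. `checkStopWords` and `isValid` are hypothetical functions written for illustration; only the error code string comes from the examples, and Orama's real validation code may be structured differently.

```typescript
// Hypothetical validator mirroring the valid/invalid cases listed above.
const ERROR_CODE = 'CUSTOM_STOP_WORDS_MUST_BE_FUNCTION_OR_ARRAY'

function checkStopWords(option: unknown): string[] {
  if (option === false || option === undefined) return [] // filtering disabled
  let candidate: unknown = option
  if (typeof option === 'function') {
    candidate = option([]) // function form receives the (empty) defaults
  }
  if (!Array.isArray(candidate) || candidate.some((w) => typeof w !== 'string')) {
    throw new Error(ERROR_CODE)
  }
  return candidate as string[]
}

function isValid(option: unknown): boolean {
  try {
    checkStopWords(option)
    return true
  } catch {
    return false
  }
}

console.log(isValid(['the', 'is', 'at'])) // true
console.log(isValid((defaults: string[]) => [...defaults, 'custom'])) // true
console.log(isValid(false)) // true
console.log(isValid(123)) // false
console.log(isValid(['the', 123, 'is'])) // false
console.log(isValid(() => 'the')) // false
```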

Domain-Specific Example

For an e-commerce site, you might want to filter brand-specific filler words:

import { create } from '@orama/orama'
import { stopwords as englishStopwords } from '@orama/stopwords/english'

const ecommerceStopwords = [
  ...englishStopwords,
  // Product description filler
  'new',
  'now',
  'available',
  'shop',
  'buy',
  'get',
  'free',
  'shipping',
  // Brand-specific
  'official',
  'authentic',
  'genuine'
]

const db = await create({
  schema: {
    productName: 'string',
    description: 'string'
  },
  components: {
    tokenizer: {
      stopWords: ecommerceStopwords,
      // stemmerSkipProperties only skips stemming for product names;
      // stopwords are still removed from every property
      stemmerSkipProperties: ['productName']
    }
  }
})

Installation

npm install @orama/stopwords

Related

Stemming: Combine stopwords with stemming for optimal search
Tokenization: Learn how tokenization and stopwords work together
Languages: See stopwords support for all 30+ languages
Search: How stopwords affect search results
