Orama provides comprehensive multi-language support with language-specific tokenization rules, stemmers, and stopword lists for over 30 languages.

Supported Languages

Orama supports the following languages with full text analysis capabilities:

Arabic

Code: ar / arabic
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Armenian

Code: am / armenian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Bulgarian

Code: bg / bulgarian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Chinese (Mandarin)

Code: zh / mandarin
  • Tokenization: ✅ (Intl.Segmenter)
  • Stemming: ❌ (Not applicable)
  • Stopwords: ✅

Czech

Code: cz / czech
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Danish

Code: dk / danish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Dutch

Code: nl / dutch
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

English

Code: en / english (default)
  • Tokenization: ✅
  • Stemming: ✅ (Built-in Porter)
  • Stopwords: ✅

Finnish

Code: fi / finnish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

French

Code: fr / french
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

German

Code: de / german
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Greek

Code: gr / greek
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Hungarian

Code: hu / hungarian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Indian (Hindi)

Code: in / indian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Indonesian

Code: id / indonesian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Irish

Code: ie / irish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Italian

Code: it / italian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Japanese

Code: ja / japanese
  • Tokenization: ✅ (Intl.Segmenter)
  • Stemming: ❌ (Not applicable)
  • Stopwords: ✅

Lithuanian

Code: lt / lithuanian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Nepali

Code: np / nepali
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Norwegian

Code: no / norwegian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Portuguese

Code: pt / portuguese
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Romanian

Code: ro / romanian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Russian

Code: ru / russian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Sanskrit

Code: sk / sanskrit
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Serbian

Code: rs / serbian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Slovenian

Code: sl / slovenian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Spanish

Code: es / spanish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Swedish

Code: se / swedish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Tamil

Code: ta / tamil
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Turkish

Code: tr / turkish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Ukrainian

Code: uk / ukrainian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Quick Start by Language

Western European Languages

import { create } from '@orama/orama'

// English is the default language
const db = await create({
  schema: { content: 'string' },
  components: {
    tokenizer: {
      language: 'english',
      stemming: true  // Built-in Porter stemmer
    }
  }
})

Asian Languages

import { create, insert } from '@orama/orama'
import { createTokenizer } from '@orama/tokenizers/mandarin'
import { stopwords } from '@orama/stopwords/chinese'

const db = await create({
  schema: { content: 'string' },
  components: {
    tokenizer: await createTokenizer({
      language: 'mandarin',
      stopWords: stopwords
    })
  }
})

// Uses Intl.Segmenter for word boundary detection
await insert(db, {
  content: '我爱编程' // "I love programming"
})
Chinese and Japanese use Intl.Segmenter for tokenization instead of regex patterns. This provides more accurate word boundary detection for languages without spaces.

Slavic Languages

import { create } from '@orama/orama'
import { stemmer, language } from '@orama/stemmers/russian'
import { stopwords } from '@orama/stopwords/russian'

const db = await create({
  schema: { content: 'string' },
  components: {
    tokenizer: {
      language,
      stemmer,
      stemming: true,
      stopWords: stopwords
    }
  }
})

Language-Specific Tokenization Rules

Each language has a specialized regex pattern that preserves its language-specific characters. French, for example:
// French pattern: /[^a-z0-9äâàéèëêïîöôùüûœç-]+/gim

// Preserves:
// - Accented characters: é, è, ê, à, â, ù, û, ô, ö, etc.
// - Ligatures: œ
// - Cedilla: ç

tokenize("L'été est très beau à Montréal")
// Result: ["l", "été", "est", "très", "beau", "à", "montréal"]
// (the apostrophe is not in the preserved set, so "l'été" splits in two)
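A minimal sketch of this lowercase-and-split step, using the French pattern above (illustrative only; Orama's real tokenizer also applies stemming and stopword removal):

```typescript
// Illustrative tokenizer sketch: lowercase, split on the French pattern,
// drop empty tokens. The apostrophe is not in the preserved character set,
// so "L'été" yields two tokens.
const FRENCH_SPLIT = /[^a-z0-9äâàéèëêïîöôùüûœç-]+/gim

function tokenizeFrench(input: string): string[] {
  return input
    .toLowerCase()
    .split(FRENCH_SPLIT)
    .filter((token) => token.length > 0)
}

tokenizeFrench("L'été est très beau à Montréal")
// → ['l', 'été', 'est', 'très', 'beau', 'à', 'montréal']
```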

Multi-Language Applications

For applications supporting multiple languages, create separate Orama instances:
import { create, insert, search } from '@orama/orama'
import { stemmer as frStemmer, language as frLang } from '@orama/stemmers/french'
import { stemmer as deStemmer, language as deLang } from '@orama/stemmers/german'
import { stopwords as enStopwords } from '@orama/stopwords/english'
import { stopwords as frStopwords } from '@orama/stopwords/french'
import { stopwords as deStopwords } from '@orama/stopwords/german'

const databases = {
  en: await create({
    schema: { title: 'string', content: 'string' },
    components: {
      tokenizer: {
        language: 'english',
        stemming: true, // English Porter stemmer is built in
        stopWords: enStopwords
      }
    }
  }),

  fr: await create({
    schema: { title: 'string', content: 'string' },
    components: {
      tokenizer: {
        language: frLang,
        stemmer: frStemmer,
        stemming: true,
        stopWords: frStopwords
      }
    }
  }),

  de: await create({
    schema: { title: 'string', content: 'string' },
    components: {
      tokenizer: {
        language: deLang,
        stemmer: deStemmer,
        stemming: true,
        stopWords: deStopwords
      }
    }
  })
}

// Search the appropriate database
function searchDocuments(lang: 'en' | 'fr' | 'de', query: string) {
  return search(databases[lang], { term: query })
}

// Insert into the language-specific database
async function addDocument(lang: 'en' | 'fr' | 'de', doc: { title: string; content: string }) {
  await insert(databases[lang], doc)
}

Language Detection

Orama doesn’t include automatic language detection. Consider using a library like franc or cld for language detection:
import { franc } from 'franc'
import { insert } from '@orama/orama'

// Map franc language codes to Orama language names
const langMap = {
  eng: 'english',
  fra: 'french',
  deu: 'german',
  spa: 'spanish',
  ita: 'italian',
  // ... add more mappings
}

async function addDocumentWithDetection(text: string) {
  const detectedLang = franc(text) // returns an ISO 639-3 code, e.g. 'eng'
  const oramaLang = langMap[detectedLang] ?? 'english'

  // Insert into the matching database
  // (assumes a `databases` map keyed by Orama language name)
  await insert(databases[oramaLang], { content: text })
}

Package Structure

@orama/orama

Core package with built-in English support:
  • English tokenization
  • English Porter stemmer
  • No default stopwords (must be imported)

@orama/stemmers

Language-specific stemmers (30 languages):
import { stemmer, language } from '@orama/stemmers/[language]'

@orama/stopwords

Language-specific stopword lists (30 languages):
import { stopwords } from '@orama/stopwords/[language]'

@orama/tokenizers

Specialized tokenizers for Asian languages:
  • Chinese (Mandarin)
  • Japanese
import { createTokenizer } from '@orama/tokenizers/[language]'

Installation

npm install @orama/orama

# Optional packages for non-English languages
npm install @orama/stemmers @orama/stopwords @orama/tokenizers

Performance by Language

English, French, German, Spanish, Italian, Portuguese, Romanian
  • Tokenization: Very fast (regex-based)
  • Stemming: Fast (optimized algorithms)
  • Memory: Standard
Best performance due to simple character sets and well-optimized algorithms.
Russian, Ukrainian, Bulgarian, Serbian
  • Tokenization: Very fast (regex-based)
  • Stemming: Fast
  • Memory: Standard
Similar performance to Latin-based languages.
Chinese (Mandarin), Japanese
  • Tokenization: Moderate (Intl.Segmenter)
  • Stemming: N/A
  • Memory: Higher (more complex characters)
Slightly slower tokenization due to Intl.Segmenter, but still performant for most use cases.
Arabic, Tamil, Nepali, Sanskrit
  • Tokenization: Fast (regex-based)
  • Stemming: Fast
  • Memory: Standard to Higher
Performance comparable to Latin languages, though character encoding may use slightly more memory.

Troubleshooting

// Error: LANGUAGE_NOT_SUPPORTED

// Make sure the language string matches exactly:
language: 'english'  // ✅ Correct
language: 'en'       // ❌ Wrong - use full name
language: 'English'  // ❌ Wrong - lowercase only

// Error: MISSING_STEMMER

// For non-English languages, import the stemmer:
import { stemmer, language } from '@orama/stemmers/french'

components: {
  tokenizer: {
    language,      // ✅ Required
    stemmer,       // ✅ Required
    stemming: true // ✅ Required
  }
}

// Characters disappearing after tokenization?
// Ensure you're using the correct language:

// ❌ Wrong: German text with English tokenizer
language: 'english'
// "Schöne Straße" → ["sch", "ne", "stra", "e"]

// ✅ Correct: German text with German tokenizer
language: 'german'
// "Schöne Straße" → ["schöne", "straße"]
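You can reproduce the mangling with a plain-TypeScript sketch. The ASCII-only pattern below is an assumption standing in for the English tokenizer's split rule; it shows why umlauts and ß vanish:

```typescript
// Sketch: an ASCII-only split pattern (a stand-in for the English rule)
// treats any character outside [a-z0-9-] as a boundary, so 'ö' and 'ß'
// are stripped and the surrounding fragments become separate tokens.
const ENGLISH_SPLIT = /[^a-z0-9-]+/gim

const mangled = 'Schöne Straße'
  .toLowerCase()
  .split(ENGLISH_SPLIT)
  .filter((token) => token.length > 0)

console.log(mangled) // → ['sch', 'ne', 'stra', 'e']
```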

Tokenization

Deep dive into how tokenization works

Stemming

Learn about stemming algorithms

Stopwords

Configure stopword filtering

Search

Use language features in search
