Orama provides comprehensive multi-language support with language-specific tokenization rules, stemmers, and stopword lists for over 30 languages.

Supported Languages

Orama supports the following languages with full text analysis capabilities:

Arabic

Code: ar / arabic
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Armenian

Code: am / armenian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Bulgarian

Code: bg / bulgarian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Chinese (Mandarin)

Code: zh / mandarin
  • Tokenization: ✅ (Intl.Segmenter)
  • Stemming: ❌ (Not applicable)
  • Stopwords: ✅

Czech

Code: cz / czech
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Danish

Code: dk / danish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Dutch

Code: nl / dutch
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

English

Code: en / english (default)
  • Tokenization: ✅
  • Stemming: ✅ (Built-in Porter)
  • Stopwords: ✅

Finnish

Code: fi / finnish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

French

Code: fr / french
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

German

Code: de / german
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Greek

Code: gr / greek
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Hungarian

Code: hu / hungarian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Indian (Hindi)

Code: in / indian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Indonesian

Code: id / indonesian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Irish

Code: ie / irish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Italian

Code: it / italian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Japanese

Code: ja / japanese
  • Tokenization: ✅ (Intl.Segmenter)
  • Stemming: ❌ (Not applicable)
  • Stopwords: ✅

Lithuanian

Code: lt / lithuanian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Nepali

Code: np / nepali
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Norwegian

Code: no / norwegian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Portuguese

Code: pt / portuguese
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Romanian

Code: ro / romanian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Russian

Code: ru / russian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Sanskrit

Code: sk / sanskrit
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Serbian

Code: rs / serbian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Slovenian

Code: sl / slovenian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Spanish

Code: es / spanish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Swedish

Code: se / swedish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Tamil

Code: ta / tamil
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Turkish

Code: tr / turkish
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Ukrainian

Code: uk / ukrainian
  • Tokenization: ✅
  • Stemming: ✅
  • Stopwords: ✅

Quick Start by Language

Western European Languages

import { create } from '@orama/orama'

// English is the default language
const db = await create({
  schema: { content: 'string' },
  components: {
    tokenizer: {
      language: 'english',
      stemming: true  // Built-in Porter stemmer
    }
  }
})

Asian Languages

import { create, insert } from '@orama/orama'
import { createTokenizer } from '@orama/tokenizers/mandarin'
import { stopwords } from '@orama/stopwords/chinese'

const db = await create({
  schema: { content: 'string' },
  components: {
    tokenizer: await createTokenizer({
      language: 'mandarin',
      stopWords: stopwords
    })
  }
})

// Uses Intl.Segmenter for word boundary detection
await insert(db, {
  content: '我爱编程' // "I love programming"
})
Chinese and Japanese use Intl.Segmenter for tokenization instead of regex patterns. This provides more accurate word boundary detection for languages without spaces.

Slavic Languages

import { create } from '@orama/orama'
import { stemmer, language } from '@orama/stemmers/russian'
import { stopwords } from '@orama/stopwords/russian'

const db = await create({
  schema: { content: 'string' },
  components: {
    tokenizer: {
      language,
      stemmer,
      stemming: true,
      stopWords: stopwords
    }
  }
})

Language-Specific Tokenization Rules

Each language has a specialized regex pattern that preserves its language-specific characters. French, for example:
// French pattern: /[^a-z0-9äâàéèëêïîöôùüûœç-]+/gim

// Preserves:
// - Accented characters: é, è, ê, à, â, ù, û, ô, ö, etc.
// - Ligatures: œ
// - Cedilla: ç

tokenize("L'été est très beau à Montréal")
// Result: ["l", "été", "est", "très", "beau", "à", "montréal"]
// (the apostrophe is not in the preserved set, so "l'été" splits in two)
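A minimal sketch of this lowercase-and-split step, using the French pattern above (illustrative only; Orama's real tokenizer also applies stemming and stopword removal):

```typescript
// Illustrative tokenizer sketch: lowercase, split on the French pattern,
// drop empty tokens. The apostrophe is not in the preserved character set,
// so "L'été" yields two tokens.
const FRENCH_SPLIT = /[^a-z0-9äâàéèëêïîöôùüûœç-]+/gim

function tokenizeFrench(input: string): string[] {
  return input
    .toLowerCase()
    .split(FRENCH_SPLIT)
    .filter((token) => token.length > 0)
}

tokenizeFrench("L'été est très beau à Montréal")
// → ['l', 'été', 'est', 'très', 'beau', 'à', 'montréal']
```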

Multi-Language Applications

For applications supporting multiple languages, create separate Orama instances:
import { create, insert, search } from '@orama/orama'
import { stemmer as frStemmer, language as frLang } from '@orama/stemmers/french'
import { stemmer as deStemmer, language as deLang } from '@orama/stemmers/german'
import { stopwords as enStopwords } from '@orama/stopwords/english'
import { stopwords as frStopwords } from '@orama/stopwords/french'
import { stopwords as deStopwords } from '@orama/stopwords/german'

const databases = {
  en: await create({
    schema: { title: 'string', content: 'string' },
    components: {
      tokenizer: {
        language: 'english',
        stemming: true, // English Porter stemmer is built in
        stopWords: enStopwords
      }
    }
  }),

  fr: await create({
    schema: { title: 'string', content: 'string' },
    components: {
      tokenizer: {
        language: frLang,
        stemmer: frStemmer,
        stemming: true,
        stopWords: frStopwords
      }
    }
  }),

  de: await create({
    schema: { title: 'string', content: 'string' },
    components: {
      tokenizer: {
        language: deLang,
        stemmer: deStemmer,
        stemming: true,
        stopWords: deStopwords
      }
    }
  })
}

// Search the appropriate database
function searchDocuments(lang: 'en' | 'fr' | 'de', query: string) {
  return search(databases[lang], { term: query })
}

// Insert into the language-specific database
async function addDocument(lang: 'en' | 'fr' | 'de', doc: { title: string; content: string }) {
  await insert(databases[lang], doc)
}

Language Detection

Orama doesn’t include automatic language detection. Consider using a library like franc or cld for language detection:
import { franc } from 'franc'
import { insert } from '@orama/orama'

// Map franc language codes to Orama language names
const langMap = {
  eng: 'english',
  fra: 'french',
  deu: 'german',
  spa: 'spanish',
  ita: 'italian',
  // ... add more mappings
}

async function addDocumentWithDetection(text: string) {
  const detectedLang = franc(text) // returns an ISO 639-3 code, e.g. 'eng'
  const oramaLang = langMap[detectedLang] ?? 'english'

  // Insert into the matching database
  // (assumes a `databases` map keyed by Orama language name)
  await insert(databases[oramaLang], { content: text })
}

Package Structure

@orama/orama

Core package with built-in English support:
  • English tokenization
  • English Porter stemmer
  • No default stopwords (must be imported)

@orama/stemmers

Language-specific stemmers (30 languages):
import { stemmer, language } from '@orama/stemmers/[language]'

@orama/stopwords

Language-specific stopword lists (30 languages):
import { stopwords } from '@orama/stopwords/[language]'

@orama/tokenizers

Specialized tokenizers for Asian languages:
  • Chinese (Mandarin)
  • Japanese
import { createTokenizer } from '@orama/tokenizers/[language]'

Installation

npm install @orama/orama

# Optional packages for non-English languages
npm install @orama/stemmers @orama/stopwords @orama/tokenizers

Performance by Language

English, French, German, Spanish, Italian, Portuguese, Romanian
  • Tokenization: Very fast (regex-based)
  • Stemming: Fast (optimized algorithms)
  • Memory: Standard
Best performance due to simple character sets and well-optimized algorithms.
Russian, Ukrainian, Bulgarian, Serbian
  • Tokenization: Very fast (regex-based)
  • Stemming: Fast
  • Memory: Standard
Similar performance to Latin-based languages.
Chinese (Mandarin), Japanese
  • Tokenization: Moderate (Intl.Segmenter)
  • Stemming: N/A
  • Memory: Higher (more complex characters)
Slightly slower tokenization due to Intl.Segmenter, but still performant for most use cases.
Arabic, Tamil, Nepali, Sanskrit
  • Tokenization: Fast (regex-based)
  • Stemming: Fast
  • Memory: Standard to Higher
Performance comparable to Latin languages, though character encoding may use slightly more memory.

Troubleshooting

// Error: LANGUAGE_NOT_SUPPORTED

// Make sure the language string matches exactly:
language: 'english'  // ✅ Correct
language: 'en'       // ❌ Wrong - use full name
language: 'English'  // ❌ Wrong - lowercase only

// Error: MISSING_STEMMER

// For non-English languages, import the stemmer:
import { stemmer, language } from '@orama/stemmers/french'

components: {
  tokenizer: {
    language,      // ✅ Required
    stemmer,       // ✅ Required
    stemming: true // ✅ Required
  }
}

// Characters disappearing after tokenization?
// Ensure you're using the correct language:

// ❌ Wrong: German text with English tokenizer
language: 'english'
// "Schöne Straße" → ["sch", "ne", "stra", "e"]

// ✅ Correct: German text with German tokenizer
language: 'german'
// "Schöne Straße" → ["schöne", "straße"]
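You can reproduce the mangling with a plain-TypeScript sketch. The ASCII-only pattern below is an assumption standing in for the English tokenizer's split rule; it shows why umlauts and ß vanish:

```typescript
// Sketch: an ASCII-only split pattern (a stand-in for the English rule)
// treats any character outside [a-z0-9-] as a boundary, so 'ö' and 'ß'
// are stripped and the surrounding fragments become separate tokens.
const ENGLISH_SPLIT = /[^a-z0-9-]+/gim

const mangled = 'Schöne Straße'
  .toLowerCase()
  .split(ENGLISH_SPLIT)
  .filter((token) => token.length > 0)

console.log(mangled) // → ['sch', 'ne', 'stra', 'e']
```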

Tokenization

Deep dive into how tokenization works

Stemming

Learn about stemming algorithms

Stopwords

Configure stopword filtering

Search

Use language features in search
