Supported Languages
Orama supports the following languages with full-text analysis capabilities:
Arabic
Code: ar / arabic
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Armenian
Code: am / armenian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Bulgarian
Code: bg / bulgarian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Chinese (Mandarin)
Code: zh / mandarin
- Tokenization: ✅ (Intl.Segmenter)
- Stemming: ❌ (Not applicable)
- Stopwords: ✅
Czech
Code: cz / czech
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Danish
Code: dk / danish
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Dutch
Code: nl / dutch
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
English
Code: en / english (default)
- Tokenization: ✅
- Stemming: ✅ (Built-in Porter)
- Stopwords: ✅
Finnish
Code: fi / finnish
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
French
Code: fr / french
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
German
Code: de / german
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Greek
Code: gr / greek
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Hungarian
Code: hu / hungarian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Indian (Hindi)
Code: in / indian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Indonesian
Code: id / indonesian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Irish
Code: ie / irish
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Italian
Code: it / italian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Japanese
Code: ja / japanese
- Tokenization: ✅ (Intl.Segmenter)
- Stemming: ❌ (Not applicable)
- Stopwords: ✅
Lithuanian
Code: lt / lithuanian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Nepali
Code: np / nepali
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Norwegian
Code: no / norwegian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Portuguese
Code: pt / portuguese
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Romanian
Code: ro / romanian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Russian
Code: ru / russian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Sanskrit
Code: sk / sanskrit
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Serbian
Code: rs / serbian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Slovenian
Code: ru / slovenian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Spanish
Code: es / spanish
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Swedish
Code: se / swedish
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Tamil
Code: ta / tamil
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Turkish
Code: tr / turkish
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Ukrainian
Code: uk / ukrainian
- Tokenization: ✅
- Stemming: ✅
- Stopwords: ✅
Quick Start by Language
Western European Languages
Asian Languages
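As a rough illustration of segmenter-based tokenization for Chinese and Japanese (not Orama's internal code), the sketch below uses the standard Intl.Segmenter API with word granularity and keeps only word-like segments. The helper name segmentWords is made up for this example, and the exact word boundaries depend on the ICU data shipped with the runtime.

```typescript
// Illustrative only: `segmentWords` is a hypothetical helper, not an Orama API.
function segmentWords(text: string, locale: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words: string[] = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for punctuation and whitespace segments.
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}

// Neither sentence contains spaces, so a whitespace split would return one token.
const japaneseWords = segmentWords("今日は天気がいいですね。", "ja");
const mandarinWords = segmentWords("我喜欢学习新的语言。", "zh");
console.log(japaneseWords, mandarinWords);
```

Because the segmenter is locale-aware, the same helper also works for space-delimited text, which makes it easy to compare against a regex-based split.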
Chinese and Japanese use Intl.Segmenter for tokenization instead of regex patterns. This provides more accurate word boundary detection for languages that do not separate words with spaces.
Slavic Languages
Language-Specific Tokenization Rules
Each language has specialized regex patterns that preserve language-specific characters:
- French
- German
- Russian
- Turkish
- Arabic
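To make the idea of language-specific patterns concrete, here is a minimal sketch of regex-based tokenization. The separator patterns and the tokenize helper are assumptions written for this example; they are not the patterns shipped with Orama. Each pattern matches runs of characters that are not part of a word, so letters such as ß, é, or ё survive the split.

```typescript
// Illustrative separator patterns; these are assumptions for the example,
// not the patterns shipped with Orama.
const SEPARATORS: Record<string, RegExp> = {
  french: /[^a-z0-9àâäçéèêëîïôöùûüœ-]+/g,
  german: /[^a-z0-9äöüß]+/g,
  russian: /[^a-z0-9а-яё]+/g,
};

// Hypothetical helper: lowercases the text, then splits on the language's separators.
function tokenize(text: string, language: string): string[] {
  const separator = SEPARATORS[language];
  if (separator === undefined) {
    throw new Error(`No separator pattern defined for "${language}"`);
  }
  return text
    .toLowerCase()
    .split(separator)
    .filter((token) => token.length > 0);
}

console.log(tokenize("Größe und Maß", "german"));
console.log(tokenize("Привет, мир!", "russian"));
```

With an ASCII-only pattern such as /[^a-z0-9]+/, the German input would be cut at every umlaut and at ß, which is the kind of character loss that language-specific patterns avoid.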
Multi-Language Applications
For applications supporting multiple languages, create a separate Orama instance for each language.
Language Detection
Orama doesn’t include automatic language detection. Consider using a library such as franc or cld to detect the language of each document or query before choosing an instance.
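The sketch below shows one way detection could feed a per-language setup: each text is assigned a language, and the resulting groups would then be inserted into the instance created for that language. Everything here is hypothetical. In particular, detectLanguage is a crude script-based stand-in for a real detector such as franc or cld, and it cannot distinguish languages that share a script.

```typescript
type Language = "english" | "russian" | "japanese" | "mandarin";

// Crude stand-in for a real detector: it only looks at Unicode ranges,
// so every Latin-script text is reported as English.
function detectLanguage(text: string): Language {
  if (/[\u3040-\u30ff]/.test(text)) return "japanese"; // Hiragana or Katakana
  if (/[\u4e00-\u9fff]/.test(text)) return "mandarin"; // CJK ideographs
  if (/[\u0400-\u04ff]/.test(text)) return "russian"; // Cyrillic
  return "english";
}

// Group texts by detected language; each group would go to its own instance.
function groupByLanguage(texts: string[]): Map<Language, string[]> {
  const groups = new Map<Language, string[]>();
  for (const text of texts) {
    const language = detectLanguage(text);
    const bucket = groups.get(language) ?? [];
    bucket.push(text);
    groups.set(language, bucket);
  }
  return groups;
}

const grouped = groupByLanguage([
  "Hello world",
  "Привет, мир",
  "こんにちは世界",
  "你好，世界",
]);
console.log(grouped);
```

The Japanese check runs before the ideograph check because Japanese text mixes kana with the same ideographs used in Chinese. The same detection step can be applied to incoming queries so that each query is sent to the matching instance.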
Package Structure
@orama/orama
Core package with built-in English support:
- English tokenization
- English Porter stemmer
- No default stopwords (must be imported)
@orama/stemmers
Language-specific stemmers (30 languages).
@orama/stopwords
Language-specific stopword lists (30 languages).
@orama/tokenizers
Specialized tokenizers for Asian languages:
- Chinese (Mandarin)
- Japanese
Installation
Performance by Language
Latin-Based Languages
English, French, German, Spanish, Italian, Portuguese, Romanian
- Tokenization: Very fast (regex-based)
- Stemming: Fast (optimized algorithms)
- Memory: Standard
Cyrillic Languages
Russian, Ukrainian, Bulgarian, Serbian
- Tokenization: Very fast (regex-based)
- Stemming: Fast
- Memory: Standard
Asian Languages
Chinese (Mandarin), Japanese
- Tokenization: Moderate (Intl.Segmenter)
- Stemming: N/A
- Memory: Higher (more complex characters)
Tokenization is slower than the regex-based approach because it relies on Intl.Segmenter, but still performant for most use cases.
Complex Scripts
Arabic, Tamil, Nepali, Sanskrit
- Tokenization: Fast (regex-based)
- Stemming: Fast
- Memory: Standard to Higher
Troubleshooting
Language Not Supported Error
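One way to catch this class of problem early is to validate the language name before creating an instance. The guard below is a sketch under two assumptions: that the error stems from a name outside the list on this page, and that the full language name (rather than the two-letter code) is what gets configured. The helper names are hypothetical.

```typescript
// Language names copied from the list on this page.
const SUPPORTED_LANGUAGES = [
  "arabic", "armenian", "bulgarian", "mandarin", "czech", "danish", "dutch", "english",
  "finnish", "french", "german", "greek", "hungarian", "indian", "indonesian", "irish",
  "italian", "japanese", "lithuanian", "nepali", "norwegian", "portuguese", "romanian", "russian",
  "sanskrit", "serbian", "slovenian", "spanish", "swedish", "tamil", "turkish", "ukrainian",
] as const;

type SupportedLanguage = (typeof SUPPORTED_LANGUAGES)[number];

function isSupportedLanguage(name: string): name is SupportedLanguage {
  return (SUPPORTED_LANGUAGES as readonly string[]).includes(name);
}

// Hypothetical helper: normalizes the input and fails early with a readable message.
function assertSupportedLanguage(name: string): SupportedLanguage {
  const normalized = name.trim().toLowerCase();
  if (!isSupportedLanguage(normalized)) {
    throw new Error(`Language "${name}" is not in the supported list`);
  }
  return normalized;
}

console.log(assertSupportedLanguage(" French "));
```

Running user-supplied or configuration-supplied names through a guard like this turns a runtime failure deep inside index creation into an immediate, descriptive error.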
Missing Stemmer Error
Special Characters Lost
Related
Tokenization
Deep dive into how tokenization works
Stemming
Learn about stemming algorithms
Stopwords
Configure stopword filtering
Search
Use language features in search