Language model selection
The key to multi-lingual support is choosing models that support your target languages.

GPT-4 and GPT-4 Turbo
Excellent support for 50+ languages including European, Asian, and Middle Eastern languages
GPT-3.5 Turbo
Good support for major languages, more cost-effective for large-scale processing
Multilingual embeddings
text-embedding-3-small and text-embedding-3-large support 100+ languages
Azure OpenAI
Same language support with additional compliance and regional deployment options
Basic configuration
No special configuration is needed for most languages. The default settings in settings.yaml work well.
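As a sketch, a default configuration along these lines needs no language-specific changes. The keys follow GraphRAG's 0.x-style settings.yaml schema; verify key names and model names against your installed version:

```yaml
# Sketch of a default settings.yaml (0.x-style keys; adapt to your version)
llm:
  type: openai_chat
  model: gpt-4-turbo-preview
  api_key: ${GRAPHRAG_API_KEY}

embeddings:
  llm:
    type: openai_embedding
    model: text-embedding-3-small   # supports 100+ languages

chunks:
  size: 1200    # tokens per chunk; the default works for most languages
  overlap: 100
```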
Language-specific prompt tuning
For better results, customize prompts in your target language:
- Spanish
- French
- Japanese
- Chinese
For example, place a translated entity-extraction prompt at prompts/entity_extraction_es.txt.
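To wire the translated prompt in, point the entity-extraction prompt path at that file. The key name below assumes the 0.x-style settings.yaml schema; check it against your GraphRAG version:

```yaml
entity_extraction:
  prompt: "prompts/entity_extraction_es.txt"   # Spanish prompt file
```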
Processing mixed-language documents
When working with documents in multiple languages:

Use language-agnostic prompts
For truly mixed content, use English prompts with instructions to handle multiple languages:
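One possible phrasing of such an instruction, appended to an English prompt (the wording is illustrative, not a GraphRAG default):

```text
The source text may contain multiple languages. Extract entities from all
languages present. Keep entity names in their original script, and write
entity descriptions in English.
```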
Querying in multiple languages
GraphRAG supports queries in different languages. The LLM will attempt to respond in the same language as your query. For mixed-language datasets, you can specify the desired response language in your query.
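For example, using the graphrag CLI (invocation style varies by version; the flags below match recent releases, so check `graphrag query --help`):

```shell
# English query
graphrag query --root ./ragtest --method global --query "What are the main themes?"

# Spanish query: the model will usually answer in Spanish
graphrag query --root ./ragtest --method global --query "¿Cuáles son los temas principales?"

# Force a response language for a mixed-language corpus
graphrag query --root ./ragtest --method local --query "Summarize the findings. Respond in English."
```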
Best practices for specific languages
Chinese (Simplified/Traditional)
- Use larger chunk sizes due to character density
- Consider using gpt-4 for better understanding of classical vs. modern Chinese
- Test entity extraction with both simplified and traditional characters
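A chunking tweak for Chinese corpora might look like the fragment below. The chunks.size/overlap keys are 0.x-style settings.yaml names counted in tokens, and the numbers are illustrative, not recommendations:

```yaml
chunks:
  size: 1500    # larger chunks: dense CJK text carries more content per token window
  overlap: 150
```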
Japanese
- Account for mixed scripts (Hiragana, Katakana, Kanji)
- Use character-based rather than token-based chunking
- Test entity extraction with company names (often use Kanji)
Arabic/Hebrew (RTL languages)
- Ensure text encoding is UTF-8
- Be aware that some entity names may be transliterated
- Test with mixed RTL/LTR content (common in technical documents)
German
- Long compound words may need special handling
- Consider noun capitalization in entity extraction
- Adjust chunk sizes for longer words
Example: Multi-lingual research corpus
Here’s a complete example for processing academic papers in multiple languages:
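A sketch of what that configuration could look like for a multilingual academic corpus. Keys follow the 0.x-style settings.yaml schema; the model names, numbers, and prompt path are assumptions to adapt:

```yaml
llm:
  type: openai_chat
  model: gpt-4-turbo-preview       # strong multilingual understanding
  api_key: ${GRAPHRAG_API_KEY}

embeddings:
  llm:
    type: openai_embedding
    model: text-embedding-3-large  # 100+ languages, higher retrieval quality

chunks:
  size: 1500                       # generous chunks for dense academic prose
  overlap: 200

entity_extraction:
  prompt: "prompts/entity_extraction_multilingual.txt"  # hypothetical custom prompt
```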
Language detection and routing

For advanced use cases, implement language detection:
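A minimal, dependency-free sketch that routes documents to per-script prompt files. GraphRAG has no built-in language routing, so the function names and prompt paths here are hypothetical; a production pipeline might use a dedicated library such as langdetect instead:

```python
import unicodedata

# Hypothetical per-script prompt files (paths are illustrative)
PROMPTS = {
    "cjk": "prompts/entity_extraction_cjk.txt",
    "arabic": "prompts/entity_extraction_ar.txt",
    "hebrew": "prompts/entity_extraction_he.txt",
    "cyrillic": "prompts/entity_extraction_ru.txt",
    "latin": "prompts/entity_extraction_en.txt",
}

def detect_script(text: str) -> str:
    """Return the dominant Unicode script among alphabetic characters."""
    counts = dict.fromkeys(PROMPTS, 0)
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if any(s in name for s in ("CJK", "HIRAGANA", "KATAKANA")):
            counts["cjk"] += 1
        elif "ARABIC" in name:
            counts["arabic"] += 1
        elif "HEBREW" in name:
            counts["hebrew"] += 1
        elif "CYRILLIC" in name:
            counts["cyrillic"] += 1
        else:
            counts["latin"] += 1
    return max(counts, key=counts.get)

def route_prompt(text: str) -> str:
    """Pick the prompt file matching a document's dominant script."""
    return PROMPTS[detect_script(text)]
```

Script detection is a coarse proxy for language detection (it cannot distinguish Spanish from French), but it is often enough to route documents to the right prompt family.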
Troubleshooting

Poor entity extraction in target language
Solutions:
- Use language-specific prompts
- Run auto prompt tuning with documents in the target language
- Increase chunk size for languages with longer words/characters
- Verify model supports your target language well
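Auto prompt tuning can be pointed at a target language from the CLI. This assumes the graphrag prompt-tune command and its --language flag as found in recent releases; confirm with `graphrag prompt-tune --help`:

```shell
# Generate extraction prompts tuned on Spanish documents
graphrag prompt-tune --root ./ragtest --config ./ragtest/settings.yaml --language "Spanish"
```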
Mixed language responses
Solutions:
- Explicitly specify response language in your query
- Use language-specific system prompts
- Separate documents by language during indexing
Encoding issues
Solutions:
- Ensure all files are UTF-8 encoded
- Verify the .env file doesn’t have encoding issues
- Check that storage systems support Unicode
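A quick, self-contained check to find input files that aren’t valid UTF-8 before indexing. The helper names are mine, not part of GraphRAG:

```python
from pathlib import Path

def is_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def find_non_utf8(root: Path, pattern: str = "*.txt") -> list:
    """List input files that would cause encoding errors during indexing."""
    return [p for p in root.rglob(pattern) if not is_utf8(p)]
```

Re-encode any offending files (for example with `iconv -t UTF-8`) before indexing rather than letting the pipeline fail mid-run.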
Supported languages
OpenAI models (GPT-4, GPT-3.5 Turbo, embeddings) have strong support for European, Asian, Middle Eastern, and other languages. European languages with strong support include:
- English, Spanish, French, German, Italian
- Portuguese, Dutch, Polish, Russian
- Swedish, Norwegian, Danish, Finnish
- Greek, Turkish, Czech, Romanian
Next steps
- Custom prompts: Create language-specific prompts
- Document Q&A: Build multilingual Q&A systems
- Azure deployment: Deploy with regional Azure endpoints
- Prompt tuning: Auto-tune prompts for your language