GraphRAG supports processing and querying documents in multiple languages by leveraging multilingual language models. This guide shows you how to configure GraphRAG for non-English content and mixed-language datasets.

Language model selection

The key to multilingual support is choosing models that support your target languages:

GPT-4 and GPT-4 Turbo

Excellent support for 50+ languages including European, Asian, and Middle Eastern languages

GPT-3.5 Turbo

Good support for major languages, more cost-effective for large-scale processing

Multilingual embeddings

text-embedding-3-small and text-embedding-3-large support 100+ languages

Azure OpenAI

Same language support with additional compliance and regional deployment options

Basic configuration

No special configuration is needed for most languages. The default settings work well:
settings.yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4  # Supports multiple languages out of the box
    api_key: ${GRAPHRAG_API_KEY}

embedding_models:
  default_embedding_model:
    model_provider: openai
    model: text-embedding-3-small  # Multilingual embeddings
    api_key: ${GRAPHRAG_API_KEY}

Language-specific prompt tuning

For better results, customize prompts in your target language:
prompts/entity_extraction_es.txt
-Objetivo-
Dado un documento de texto, identifique todas las entidades y sus relaciones.

-Pasos-
1. Identifique todas las entidades en el texto
2. Para cada entidad, extraiga los atributos relevantes
3. Identifique las relaciones entre entidades
4. Formatee la salida según lo especificado

-Tipos de Entidad-
- PERSONA: Individuos humanos
- ORGANIZACIÓN: Empresas, instituciones
- UBICACIÓN: Lugares físicos o virtuales
- EVENTO: Sucesos significativos

Processing mixed-language documents

When working with documents in multiple languages:
1. Organize by language

Optionally separate documents by language for better tracking:
input/
├── en/
│   ├── document1.txt
│   └── document2.txt
├── es/
│   ├── documento1.txt
│   └── documento2.txt
└── fr/
    ├── document1.txt
    └── document2.txt
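With this layout, a small helper can report how many documents each language folder contains before indexing. This is a sketch using only the standard library; the `input/` structure above is assumed:

```python
from pathlib import Path

def count_documents_by_language(root: str) -> dict[str, int]:
    """Count .txt documents under each top-level language folder."""
    counts = {}
    for lang_dir in sorted(Path(root).iterdir()):
        if lang_dir.is_dir():
            counts[lang_dir.name] = len(list(lang_dir.glob("*.txt")))
    return counts
```

Running it against the tree above would return a mapping like `{"en": 2, "es": 2, "fr": 2}`, which is a quick way to confirm nothing was dropped when splitting the corpus.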
2. Use language-agnostic prompts

For truly mixed content, use English prompts with instructions to handle multiple languages:
-Goal-
Given a text document that may contain content in multiple languages,
identify all entities and relationships. Preserve entity names in their
original language and provide descriptions in the source language.

-Instructions-
- Maintain entity names in original language
- Generate descriptions in the same language as the source text
- Identify language of each extracted entity
3. Configure language metadata

Track language information in your chunking configuration:
chunking:
  prepend_metadata: ["language", "source_file"]
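Conceptually, `prepend_metadata` prefixes each chunk with the listed fields as key/value lines before the chunk text, so the language travels with the chunk into extraction. A rough sketch of the idea (not GraphRAG's actual implementation):

```python
def prepend_metadata(chunk_text: str, metadata: dict, keys: list[str]) -> str:
    """Prefix a chunk with selected metadata fields, one per line."""
    header = "\n".join(f"{k}: {metadata[k]}" for k in keys if k in metadata)
    return f"{header}\n\n{chunk_text}" if header else chunk_text
```

A chunk from a Spanish document would then start with lines like `language: es` and `source_file: documento1.txt`, giving the model an explicit signal about the source language.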

Querying in multiple languages

GraphRAG supports queries in different languages:
graphrag query "What are the main themes in this dataset?"
The LLM will attempt to respond in the same language as your query. For mixed-language datasets, you can specify the desired response language in your query.
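One lightweight way to pin the response language is to append an explicit instruction to the query string before passing it to `graphrag query`. A trivial helper; the language-name mapping and English fallback are illustrative choices:

```python
# Illustrative mapping; extend with whatever language codes your corpus uses
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French", "ja": "Japanese"}

def with_response_language(query: str, lang_code: str) -> str:
    """Append an explicit response-language instruction to a query."""
    name = LANGUAGE_NAMES.get(lang_code, "English")
    return f"{query} Respond in {name}."
```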

Best practices for specific languages

Chinese
  • Use larger chunk sizes due to character density
  • Consider using gpt-4 for better understanding of classical vs. modern Chinese
  • Test entity extraction with both simplified and traditional characters
chunking:
  size: 800  # Increase from default 400
  overlap: 200  # Increase overlap proportionally
Japanese
  • Account for mixed scripts (Hiragana, Katakana, Kanji)
  • Use character-based rather than token-based chunking
  • Test entity extraction with company names (often use Kanji)
chunking:
  type: sentence  # Better for Japanese text
  size: 600
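The sentence-based strategy above can be approximated by splitting on Japanese sentence-ending punctuation. A simplified sketch; GraphRAG's own chunker is more sophisticated than this:

```python
import re

def split_japanese_sentences(text: str) -> list[str]:
    """Split text after Japanese sentence enders (。！？), keeping the punctuation."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]
```

Splitting on sentence boundaries avoids cutting a company or person name in half mid-chunk, which matters for Kanji entity names.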
Arabic and Hebrew
  • Ensure text encoding is UTF-8
  • Be aware that some entity names may be transliterated
  • Test with mixed RTL/LTR content (common in technical documents)
input:
  encoding: utf-8
German
  • Long compound words may need special handling
  • Consider noun capitalization in entity extraction
  • Adjust chunk sizes for longer words
chunking:
  size: 500  # Slightly larger for compound words

Example: Multilingual research corpus

Here’s a complete example for processing academic papers in multiple languages:
1. Prepare data

Organize papers with language metadata:
input/papers.csv
id,title,text,language,author,year
1,"Machine Learning Basics","Full text...",en,"Smith",2023
2,"Apprentissage Automatique","Texte complet...",fr,"Dubois",2023
3,"機械学習の基礎","全文...",ja,"田中",2023
2. Configure for CSV input

settings.yaml
input:
  type: csv
  file_pattern: .*\.csv$
  id_column: id
  title_column: title
  text_column: text

chunking:
  prepend_metadata: ["language", "author", "year"]
  size: 600
  overlap: 100
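Before indexing, it is worth sanity-checking the CSV for the columns the configuration expects. A stdlib-only sketch; the papers.csv layout above is assumed:

```python
import csv
from collections import Counter

def language_counts(csv_path: str) -> Counter:
    """Count rows per language, verifying required columns exist."""
    required = {"id", "title", "text", "language"}
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = required - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return Counter(row["language"] for row in reader)
```

For the three-paper example above this would return one document each for `en`, `fr`, and `ja`; a surprising count usually means a mislabeled or missing `language` value.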
3. Create multilingual prompts

prompts/entity_extraction_multilingual.txt
-Goal-
Extract academic entities from research papers in any language.
Preserve technical terms and names in their original language.

-Entity Types-
- RESEARCHER: Authors and cited researchers
- CONCEPT: Scientific concepts and methods
- INSTITUTION: Universities and research organizations
- PUBLICATION: Papers, journals, conferences

-Instructions-
- Keep researcher names in original form
- Preserve technical terminology in source language
- Translate descriptions to match source document language
4. Query across languages

# Query in English
graphrag query "What machine learning concepts are discussed?" --method global

# Query in French
graphrag query "Quels chercheurs sont mentionnés?" --method local

# Query in Japanese  
graphrag query "どの研究機関が関与していますか?" --method local

Language detection and routing

For advanced use cases, implement language detection:
from langdetect import detect
import pandas as pd

# Detect the language of each document
df = pd.read_csv('input/documents.csv')
df['detected_language'] = df['text'].apply(detect)

# Map detected languages to language-specific prompt files
language_prompts = {
    'en': 'prompts/entity_extraction_en.txt',
    'es': 'prompts/entity_extraction_es.txt',
    'fr': 'prompts/entity_extraction_fr.txt',
    'ja': 'prompts/entity_extraction_ja.txt',
}

# Choose a prompt per document, falling back to English for unlisted languages
df['prompt_file'] = df['detected_language'].map(language_prompts).fillna(language_prompts['en'])

Troubleshooting

Entity extraction quality is poor

Solutions:
  • Use language-specific prompts
  • Run auto prompt tuning with documents in target language
  • Increase chunk size for languages with longer words/characters
  • Verify model supports your target language well
Responses come back in the wrong language

Solutions:
  • Explicitly specify response language in your query
  • Use language-specific system prompts
  • Separate documents by language during indexing
Text appears garbled or mis-encoded

Solutions:
  • Ensure all files are UTF-8 encoded
  • Verify .env file doesn’t have encoding issues
  • Check that storage systems support Unicode
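The encoding check can be automated with a small stdlib helper that flags input files which are not valid UTF-8:

```python
from pathlib import Path

def find_non_utf8_files(root: str) -> list[str]:
    """Return paths of .txt files under root that fail UTF-8 decoding."""
    bad = []
    for path in Path(root).rglob("*.txt"):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(str(path))
    return sorted(bad)
```

Files it reports typically need re-encoding (for example from Latin-1 or Windows-1256) before indexing.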

Supported languages

OpenAI models (GPT-4, GPT-3.5-turbo, embeddings) have strong support for:
  • English, Spanish, French, German, Italian
  • Portuguese, Dutch, Polish, Russian
  • Swedish, Norwegian, Danish, Finnish
  • Greek, Turkish, Czech, Romanian
  • Chinese, Japanese, Korean, Arabic, Hebrew

Next steps

Custom prompts

Create language-specific prompts

Document Q&A

Build multilingual Q&A systems

Azure deployment

Deploy with regional Azure endpoints

Prompt tuning

Auto-tune prompts for your language
