GraphRAG supports processing and querying documents in multiple languages by leveraging multilingual language models. This guide shows you how to configure GraphRAG for non-English content and mixed-language datasets.

Language model selection

The key to multilingual support is choosing models that support your target languages:

GPT-4 and GPT-4 Turbo

Excellent support for 50+ languages including European, Asian, and Middle Eastern languages

GPT-3.5 Turbo

Good support for major languages, more cost-effective for large-scale processing

Multilingual embeddings

text-embedding-3-small and text-embedding-3-large support 100+ languages

Azure OpenAI

Same language support with additional compliance and regional deployment options

Basic configuration

No special configuration is needed for most languages. The default settings work well:
settings.yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4  # Supports multiple languages out of the box
    api_key: ${GRAPHRAG_API_KEY}

embedding_models:
  default_embedding_model:
    model_provider: openai
    model: text-embedding-3-small  # Multilingual embeddings
    api_key: ${GRAPHRAG_API_KEY}

Language-specific prompt tuning

For better results, customize prompts in your target language:
prompts/entity_extraction_es.txt
-Objetivo-
Dado un documento de texto, identifique todas las entidades y sus relaciones.

-Pasos-
1. Identifique todas las entidades en el texto
2. Para cada entidad, extraiga los atributos relevantes
3. Identifique las relaciones entre entidades
4. Formatee la salida según lo especificado

-Tipos de Entidad-
- PERSONA: Individuos humanos
- ORGANIZACIÓN: Empresas, instituciones
- UBICACIÓN: Lugares físicos o virtuales
- EVENTO: Sucesos significativos

Processing mixed-language documents

When working with documents in multiple languages:
1. Organize by language

Optionally separate documents by language for better tracking:
input/
├── en/
│   ├── document1.txt
│   └── document2.txt
├── es/
│   ├── documento1.txt
│   └── documento2.txt
└── fr/
    ├── document1.txt
    └── document2.txt
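With this layout, a small helper can report how many documents each language folder contains before indexing. This is a sketch using only the standard library; the `input/` structure above is assumed:

```python
from pathlib import Path

def count_documents_by_language(root: str) -> dict[str, int]:
    """Count .txt documents under each top-level language folder."""
    counts = {}
    for lang_dir in sorted(Path(root).iterdir()):
        if lang_dir.is_dir():
            counts[lang_dir.name] = len(list(lang_dir.glob("*.txt")))
    return counts
```

Running it against the tree above would return a mapping like `{"en": 2, "es": 2, "fr": 2}`, which is a quick way to confirm nothing was dropped when splitting the corpus.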
2. Use language-agnostic prompts

For truly mixed content, use English prompts with instructions to handle multiple languages:
-Goal-
Given a text document that may contain content in multiple languages,
identify all entities and relationships. Preserve entity names in their
original language and provide descriptions in the source language.

-Instructions-
- Maintain entity names in original language
- Generate descriptions in the same language as the source text
- Identify language of each extracted entity
3. Configure language metadata

Track language information in your chunking configuration:
chunking:
  prepend_metadata: ["language", "source_file"]
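Conceptually, `prepend_metadata` prefixes each chunk with the listed fields as key/value lines before the chunk text, so the language travels with the chunk into extraction. A rough sketch of the idea (not GraphRAG's actual implementation):

```python
def prepend_metadata(chunk_text: str, metadata: dict, keys: list[str]) -> str:
    """Prefix a chunk with selected metadata fields, one per line."""
    header = "\n".join(f"{k}: {metadata[k]}" for k in keys if k in metadata)
    return f"{header}\n\n{chunk_text}" if header else chunk_text
```

A chunk from a Spanish document would then start with lines like `language: es` and `source_file: documento1.txt`, giving the model an explicit signal about the source language.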

Querying in multiple languages

GraphRAG supports queries in different languages:
graphrag query "What are the main themes in this dataset?"
The LLM will attempt to respond in the same language as your query. For mixed-language datasets, you can specify the desired response language in your query.
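One lightweight way to pin the response language is to append an explicit instruction to the query string before passing it to `graphrag query`. A trivial helper; the language-name mapping and English fallback are illustrative choices:

```python
# Illustrative mapping; extend with whatever language codes your corpus uses
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French", "ja": "Japanese"}

def with_response_language(query: str, lang_code: str) -> str:
    """Append an explicit response-language instruction to a query."""
    name = LANGUAGE_NAMES.get(lang_code, "English")
    return f"{query} Respond in {name}."
```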

Best practices for specific languages

Chinese
  • Use larger chunk sizes due to character density
  • Consider using gpt-4 for better understanding of classical vs. modern Chinese
  • Test entity extraction with both simplified and traditional characters
chunking:
  size: 800  # Increase from default 400
  overlap: 200  # Increase overlap proportionally
Japanese
  • Account for mixed scripts (Hiragana, Katakana, Kanji)
  • Use character-based rather than token-based chunking
  • Test entity extraction with company names (often use Kanji)
chunking:
  type: sentence  # Better for Japanese text
  size: 600
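The sentence-based strategy above can be approximated by splitting on Japanese sentence-ending punctuation. A simplified sketch; GraphRAG's own chunker is more sophisticated than this:

```python
import re

def split_japanese_sentences(text: str) -> list[str]:
    """Split text after Japanese sentence enders (。！？), keeping the punctuation."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]
```

Splitting on sentence boundaries avoids cutting a company or person name in half mid-chunk, which matters for Kanji entity names.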
Arabic and Hebrew
  • Ensure text encoding is UTF-8
  • Be aware that some entity names may be transliterated
  • Test with mixed RTL/LTR content (common in technical documents)
input:
  encoding: utf-8
German
  • Long compound words may need special handling
  • Consider noun capitalization in entity extraction
  • Adjust chunk sizes for longer words
chunking:
  size: 500  # Slightly larger for compound words

Example: Multilingual research corpus

Here’s a complete example for processing academic papers in multiple languages:
1. Prepare data

Organize papers with language metadata:
input/papers.csv
id,title,text,language,author,year
1,"Machine Learning Basics","Full text...",en,"Smith",2023
2,"Apprentissage Automatique","Texte complet...",fr,"Dubois",2023
3,"機械学習の基礎","全文...",ja,"田中",2023
2. Configure for CSV input

settings.yaml
input:
  type: csv
  file_pattern: .*\.csv$
  id_column: id
  title_column: title
  text_column: text

chunking:
  prepend_metadata: ["language", "author", "year"]
  size: 600
  overlap: 100
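Before indexing, it is worth sanity-checking the CSV for the columns the configuration expects. A stdlib-only sketch; the papers.csv layout above is assumed:

```python
import csv
from collections import Counter

def language_counts(csv_path: str) -> Counter:
    """Count rows per language, verifying required columns exist."""
    required = {"id", "title", "text", "language"}
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = required - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return Counter(row["language"] for row in reader)
```

For the three-paper example above this would return one document each for `en`, `fr`, and `ja`; a surprising count usually means a mislabeled or missing `language` value.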
3. Create multilingual prompts

prompts/entity_extraction_multilingual.txt
-Goal-
Extract academic entities from research papers in any language.
Preserve technical terms and names in their original language.

-Entity Types-
- RESEARCHER: Authors and cited researchers
- CONCEPT: Scientific concepts and methods
- INSTITUTION: Universities and research organizations
- PUBLICATION: Papers, journals, conferences

-Instructions-
- Keep researcher names in original form
- Preserve technical terminology in source language
- Translate descriptions to match source document language
4. Query across languages

# Query in English
graphrag query "What machine learning concepts are discussed?" --method global

# Query in French
graphrag query "Quels chercheurs sont mentionnés?" --method local

# Query in Japanese  
graphrag query "どの研究機関が関与していますか?" --method local

Language detection and routing

For advanced use cases, implement language detection:
from langdetect import detect
import pandas as pd

# Detect the language of each document
df = pd.read_csv('input/documents.csv')
df['detected_language'] = df['text'].apply(detect)

# Map detected languages to language-specific prompt files
language_prompts = {
    'en': 'prompts/entity_extraction_en.txt',
    'es': 'prompts/entity_extraction_es.txt',
    'fr': 'prompts/entity_extraction_fr.txt',
    'ja': 'prompts/entity_extraction_ja.txt',
}

# Choose a prompt per document, falling back to English for unlisted languages
df['prompt_file'] = df['detected_language'].map(language_prompts).fillna(language_prompts['en'])

Troubleshooting

Entity extraction quality is poor

Solutions:
  • Use language-specific prompts
  • Run auto prompt tuning with documents in target language
  • Increase chunk size for languages with longer words/characters
  • Verify model supports your target language well
Responses come back in the wrong language

Solutions:
  • Explicitly specify response language in your query
  • Use language-specific system prompts
  • Separate documents by language during indexing
Text appears garbled or mis-encoded

Solutions:
  • Ensure all files are UTF-8 encoded
  • Verify .env file doesn’t have encoding issues
  • Check that storage systems support Unicode
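The encoding check can be automated with a small stdlib helper that flags input files which are not valid UTF-8:

```python
from pathlib import Path

def find_non_utf8_files(root: str) -> list[str]:
    """Return paths of .txt files under root that fail UTF-8 decoding."""
    bad = []
    for path in Path(root).rglob("*.txt"):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(str(path))
    return sorted(bad)
```

Files it reports typically need re-encoding (for example from Latin-1 or Windows-1256) before indexing.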

Supported languages

OpenAI models (GPT-4, GPT-3.5-turbo, embeddings) have strong support for:
  • English, Spanish, French, German, Italian
  • Portuguese, Dutch, Polish, Russian
  • Swedish, Norwegian, Danish, Finnish
  • Greek, Turkish, Czech, Romanian
  • Chinese, Japanese, Korean, Arabic, Hebrew

Next steps

Custom prompts

Create language-specific prompts

Document Q&A

Build multilingual Q&A systems

Azure deployment

Deploy with regional Azure endpoints

Prompt tuning

Auto-tune prompts for your language
