Learn how to use your existing graph data with GraphRAG for community detection and query
Several users have asked if they can bring their own existing graph and have it summarized for query with GraphRAG. This page describes a simple method that aligns with the existing GraphRAG workflows.
To cover the basic use cases for GraphRAG query, you should have two or three tables derived from your data:
1. **Entities table**: the list of entities (nodes) in your graph.
2. **Relationships table**: the list of relationships (edges) in your graph.
3. **Text units table** (optional): the source text chunks the graph was extracted from. Required for some query methods.
The approach is to run a custom GraphRAG workflow pipeline that assumes text chunking, entity extraction, and relationship extraction have already occurred.
For graph summarization purposes, you need the following fields from the full relationships schema:
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | str | Yes | Unique identifier for the relationship |
| source | str | Yes | Name of the source entity |
| target | str | Yes | Name of the target entity |
| description | str | Yes | Description of the relationship |
| weight | float | Yes | Edge weight (important for Leiden communities!) |
| text_unit_ids | str[] | Optional | List of source text chunks (if available) |
The weight field is critical because Leiden community detection uses it when partitioning the graph. Make sure to provide meaningful weights (e.g., 0.0 to 1.0 based on relationship strength).
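If your existing graph stores raw co-occurrence counts rather than normalized weights, one simple option is min-max scaling into the 0.0 to 1.0 range. This is an illustrative heuristic, not a GraphRAG requirement:

```python
import pandas as pd

# Raw edge counts from an existing graph (illustrative data)
relationships = pd.DataFrame({
    "source": ["Microsoft", "Microsoft", "Azure"],
    "target": ["Azure", "OpenAI", "OpenAI"],
    "count": [40, 25, 5],
})

# Min-max scale the counts into [0.0, 1.0] for the weight field
lo, hi = relationships["count"].min(), relationships["count"].max()
relationships["weight"] = (relationships["count"] - lo) / (hi - lo)

print(relationships[["source", "target", "weight"]])
```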
Example relationships.parquet
```python
import pandas as pd
from uuid import uuid4

# Create your relationships DataFrame
relationships = pd.DataFrame([
    {
        "id": str(uuid4()),
        "source": "Microsoft",
        "target": "Azure",
        "description": "Microsoft develops and operates Azure",
        "weight": 0.95,
        "text_unit_ids": ["unit1"],
    },
    {
        "id": str(uuid4()),
        "source": "Microsoft",
        "target": "OpenAI",
        "description": "Microsoft has invested in and partnered with OpenAI",
        "weight": 0.85,
        "text_unit_ids": ["unit2"],
    },
])

# Write to Parquet
relationships.to_parquet("output/relationships.parquet")
```
Text units are chunks of your documents, sized to fit into the context window of your model. Some search methods use them. See the full text_units schema for all fields.
Example text_units.parquet
```python
import pandas as pd

# Create your text units DataFrame
text_units = pd.DataFrame([
    {
        "id": "unit1",
        "text": "Microsoft Corporation develops Azure cloud platform...",
        "n_tokens": 1200,
        "document_id": "doc1",
        "entity_ids": ["ent1", "ent2"],
        "relationship_ids": ["rel1"],
    },
])

# Write to Parquet
text_units.to_parquet("output/text_units.parquet")
```
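If you still need to produce text units from raw documents, here is a rough sketch of an overlapping token-window chunker. Whitespace words stand in for model tokens; a real pipeline would count tokens with the model's tokenizer:

```python
def chunk_text(text: str, max_tokens: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly max_tokens words.

    Whitespace words approximate model tokens here; swap in a real
    tokenizer (e.g., your model's) for accurate context-window sizing.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last chunk already covers the tail; avoid overlap-only chunks
    return chunks

# 1000 words with a 300-word window and 50-word overlap -> 4 chunks
chunks = chunk_text("word " * 1000, max_tokens=300, overlap=50)
```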
GraphRAG lets you run only the workflow steps you need. For basic graph summarization and query, configure the following in your settings.yaml:
1
Prepare your data
Create Parquet files for entities and relationships (and optionally text_units) following the schemas above.
```python
import pandas as pd
from pathlib import Path

# Create output directory
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

# Save your DataFrames
entities_df.to_parquet(output_dir / "entities.parquet")
relationships_df.to_parquet(output_dir / "relationships.parquet")
# text_units_df.to_parquet(output_dir / "text_units.parquet")  # if available
```
2
Configure workflows
Update your settings.yaml to only run the workflows you need:
settings.yaml
```yaml
workflows:
  - create_communities
  - create_community_reports
  # - generate_text_embeddings  # if needed for local/drift search

storage:
  type: file
  base_dir: "output"  # Where your parquet files are
```
3
Run indexing
Run the GraphRAG indexer:
```shell
graphrag index --root <your_project_root>
```
This will:

- Skip document loading and graph extraction (already done)
- Perform community detection on your existing graph
- Generate community reports
- (Optionally) generate embeddings
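To sanity-check that community detection ran, you can inspect the generated tables. A small helper sketch follows; the parquet file names are assumptions based on current GraphRAG output conventions, so verify them against your version's output directory:

```python
import pandas as pd
from pathlib import Path

def summarize_outputs(output_dir: str) -> dict[str, int]:
    """Return row counts for the community tables that exist in output_dir."""
    counts = {}
    for name in ["communities.parquet", "community_reports.parquet"]:
        path = Path(output_dir) / name
        if path.exists():
            counts[name] = len(pd.read_parquet(path))
    return counts

# Empty dict means community detection produced no tables (or wrong base_dir)
print(summarize_outputs("output"))
```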
4
Query your graph
Once indexing completes, you can query using GraphRAG:
```shell
graphrag query --root <your_project_root> --method global "What are the main themes in this dataset?"
```
Here’s a complete settings.yaml for bring-your-own-graph scenarios:
settings.yaml
```yaml
# Minimal configuration for existing graphs
# Only run community detection and reporting
workflows:
  - create_communities
  - create_community_reports
  # Uncomment if you need local/drift search:
  # - generate_text_embeddings

# Storage configuration
storage:
  type: file
  base_dir: "output"

# Community detection settings
cluster_graph:
  max_cluster_size: 10  # Adjust based on your graph size
  use_lcc: true         # Use largest connected component
  seed: 42              # For reproducible results

# LLM settings for community reports
llm:
  api_key: ${OPENAI_API_KEY}
  model: gpt-4-turbo-preview
  max_tokens: 4000

# Embedding settings (if using generate_text_embeddings)
embeddings:
  llm:
    api_key: ${OPENAI_API_KEY}
    model: text-embedding-3-small
```