Learn about supported input formats, schemas, and chunking strategies for GraphRAG indexing
GraphRAG supports several input formats to simplify ingesting your data. This page discusses the mechanics and features available for input files and text chunking.
All input formats are loaded within GraphRAG and passed to the indexing pipeline as a documents DataFrame. This DataFrame has a row for each document using a shared column schema:
| Column | Type | Description |
| --- | --- | --- |
| id | str | ID of the document. Generated using a hash of the text content to ensure stability across runs. |
| text | str | The full text of the document. |
| title | str | Name of the document. Some formats allow this to be configured. |
| creation_date | str | The creation date of the document, represented as an ISO8601 string. Harvested from the source file system. |
| metadata | dict | Optional additional document metadata. |
See the outputs documentation for the final documents table schema saved to Parquet after pipeline completion.
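To make the schema concrete, here is a minimal sketch of how one row could be assembled from a plain-text file. The function name `build_document_row` is hypothetical, SHA-256 is only an illustrative hash (GraphRAG's actual hashing may differ), and the fallback to modification time is an assumption for platforms that do not expose a creation time.

```python
import hashlib
import os
from datetime import datetime, timezone


def build_document_row(path: str) -> dict:
    """Build one row matching the documents schema from a plain-text file."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()

    stat = os.stat(path)
    # Not every platform exposes a true creation time, so fall back to mtime.
    created = getattr(stat, "st_birthtime", stat.st_mtime)

    return {
        # Hashing the text means identical content always yields the same ID.
        # SHA-256 is an illustrative choice, not necessarily GraphRAG's.
        "id": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "text": text,
        "title": os.path.basename(path),
        "creation_date": datetime.fromtimestamp(created, tz=timezone.utc).isoformat(),
        "metadata": {},
    }
```

Because the ID depends only on the text, re-reading an unchanged file produces the same `id`, which is what keeps IDs stable across runs.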
GraphRAG’s indexing API allows you to pass in your own pandas DataFrame and bypass all input loading/parsing.
    import asyncio

    import pandas as pd
    from graphrag.api.index import build_index
    from graphrag.config.models.graph_rag_config import GraphRagConfig

    # Create your custom DataFrame
    documents = pd.DataFrame({
        'id': ['doc1', 'doc2'],
        'text': ['Document 1 text...', 'Document 2 text...'],
        'title': ['Document 1', 'Document 2'],
        'creation_date': ['2024-01-01T00:00:00Z', '2024-01-02T00:00:00Z'],
        'metadata': [{}, {}]
    })

    config = GraphRagConfig.from_yaml("settings.yaml")


    async def main():
        # Pass your DataFrame to the indexer
        return await build_index(
            config=config,
            input_documents=documents,  # Your custom DataFrame
            verbose=True
        )


    # build_index is awaited, so it must run inside an event loop
    results = asyncio.run(main())
You must ensure that your input DataFrame conforms to the schema described above. All chunking behavior will proceed the same way as with file-based inputs.
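One way to catch schema mismatches early is to check your records before building the DataFrame. The sketch below is not part of GraphRAG. The helper name `find_schema_problems` is hypothetical, and treating `metadata` as the only optional column is an assumption based on the schema table.

```python
# Column names and Python types taken from the documents schema.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "title": str,
    "creation_date": str,
    "metadata": dict,
}
# Assumption: only metadata may be omitted, since the schema calls it optional.
OPTIONAL_COLUMNS = {"metadata"}


def find_schema_problems(records: list[dict]) -> list[str]:
    """Return a human-readable message for every schema violation found."""
    problems = []
    for index, record in enumerate(records):
        for column, expected in EXPECTED_TYPES.items():
            if column not in record:
                if column not in OPTIONAL_COLUMNS:
                    problems.append(f"row {index}: missing column '{column}'")
            elif not isinstance(record[column], expected):
                problems.append(
                    f"row {index}: '{column}' should be {expected.__name__}"
                )
    return problems
```

If the returned list is empty, the records can be handed to `pd.DataFrame(records)` and then to the indexer.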
GraphRAG uses an injectable InputReader provider class. You can implement any input file handling you want in a class that extends InputReader and register it with the InputReaderFactory.
    from graphrag.index.input import InputReader, InputReaderFactory
    import pandas as pd


    class PDFReader(InputReader):
        """Custom PDF input reader."""

        async def read(self, path: str) -> pd.DataFrame:
            # Your PDF parsing logic here
            documents = []
            # ... parse PDFs ...
            return pd.DataFrame(documents)


    # Register your custom reader
    InputReaderFactory.register("pdf", PDFReader)
See the architecture page for more info on the standard provider pattern.
    input:
      type: csv
      base_dir: "./input"
      text_column: text # defaults to "text"
      title_column: title # optional
      metadata: [category] # optional metadata columns
If you don’t configure text_column, it defaults to “text”. If title_column is not configured, the title will be the filename. If an “id” column is present, it will be used; otherwise the ID will be generated from the text hash.
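The fallback rules above can be sketched in a few lines of standard-library Python. This is an illustration of the described behavior, not GraphRAG's loader. The function name `rows_to_documents` is hypothetical and SHA-256 stands in for whatever hash GraphRAG actually uses.

```python
import csv
import hashlib
import io


def rows_to_documents(csv_text, filename, text_column="text", title_column=None):
    """Apply the CSV defaults: text column, title fallback, and ID fallback."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row[text_column]
        if "id" in row:
            # An "id" column is present, so use it as-is.
            doc_id = row["id"]
        else:
            # No "id" column: derive the ID from the text (illustrative hash).
            doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
        documents.append({
            "id": doc_id,
            "text": text,
            # Without a configured title column, the filename is the title.
            "title": row[title_column] if title_column else filename,
        })
    return documents
```

Calling `rows_to_documents(data, "articles.csv")` with no other arguments reads the `text` column and titles every row `articles.csv`.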
JSON files (typically with a .json extension) containing structured objects.
As described on the dataflow page, documents are chunked into smaller “text units” for processing because document content size often exceeds the available context window for language models.
Imagine indexing a collection of news articles where each article starts with a headline and author. When documents are chunked, they are split evenly according to your configured chunk size.
The problem: Front matter at the beginning of the document (like headline and author) is not copied to each chunk. It only exists in the first chunk.
When you later retrieve those chunks for summarization, they may be missing shared information about the source document.
Input files:

US to lift most federal COVID-19 vaccine mandates.txt:

    WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday...

Chunk 1:

    title: US to lift most federal COVID-19 vaccine mandates.txt
    WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors...

Chunk 2:

    title: US to lift most federal COVID-19 vaccine mandates.txt
    the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness...
The title (filename) is prepended to each chunk but not included in the computed chunk size.
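A rough sketch of this behavior: the body is split first, and the title line is added afterwards so it never counts against the chunk size. The function name `chunk_with_title` is hypothetical, and whitespace-separated words stand in for model tokens, which is a simplifying assumption.

```python
def chunk_with_title(title: str, text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks and prepend the title to each one."""
    tokens = text.split()  # assumption: whitespace words approximate tokens
    chunks = []
    for start in range(0, len(tokens), size):
        # Only the body is measured against the chunk size.
        body = " ".join(tokens[start:start + size])
        # The title is added after sizing, so it costs nothing from the budget.
        chunks.append(f"title: {title}\n{body}")
    return chunks
```

Every chunk produced this way carries the source document's title, so later summarization steps can see which document a chunk came from.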
Input file (article1.json):
    {
      "headline": "US to lift most federal COVID-19 vaccine mandates",
      "content": "WASHINGTON (AP) The Biden administration will end..."
    }
Configuration:
settings.yaml
    input:
      type: json
      title_column: headline
      text_column: content
    chunks:
      size: 100
      overlap: 10 # Last 10 tokens are shared between chunks
Result chunks:
Chunk 1 (100 tokens):

    WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends...

Chunk 2 (starts with 10 tokens from chunk 1):

    ...federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19...
Overlap helps maintain context across chunk boundaries, which is especially useful for relationship extraction.
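The sliding-window idea behind `size` and `overlap` can be sketched as follows. This is a simplified illustration, not GraphRAG's chunker. The function name `chunk_with_overlap` is hypothetical, and it operates on an already-tokenized list so that no particular tokenizer is assumed.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

With `size=100` and `overlap=10`, each window advances by 90 tokens, so the last 10 tokens of one chunk are also the first 10 tokens of the next.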