GraphRAG supports several input formats to simplify ingesting your data. This page discusses the mechanics and features available for input files and text chunking.

Input loading and schema

All input formats are loaded within GraphRAG and passed to the indexing pipeline as a documents DataFrame. This DataFrame has a row for each document using a shared column schema:
  • id (str): ID of the document. Generated using a hash of the text content to ensure stability across runs.
  • text (str): The full text of the document.
  • title (str): Name of the document. Some formats allow this to be configured.
  • creation_date (str): The creation date of the document, represented as an ISO 8601 string. Harvested from the source file system.
  • metadata (dict): Optional additional document metadata.
See the outputs documentation for the final documents table schema saved to Parquet after pipeline completion.

Bring your own DataFrame

GraphRAG’s indexing API allows you to pass in your own pandas DataFrame and bypass all input loading/parsing.
import pandas as pd
from graphrag.api.index import build_index
from graphrag.config.models.graph_rag_config import GraphRagConfig

# Create your custom DataFrame
documents = pd.DataFrame({
    'id': ['doc1', 'doc2'],
    'text': ['Document 1 text...', 'Document 2 text...'],
    'title': ['Document 1', 'Document 2'],
    'creation_date': ['2024-01-01T00:00:00Z', '2024-01-02T00:00:00Z'],
    'metadata': [{}, {}]
})

config = GraphRagConfig.from_yaml("settings.yaml")

# Pass your DataFrame to the indexer
results = await build_index(
    config=config,
    input_documents=documents,  # Your custom DataFrame
    verbose=True
)
You must ensure that your input DataFrame conforms to the schema described above. All chunking behavior will proceed the same way as with file-based inputs.
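Since the pipeline expects this exact schema, a quick pre-flight check can catch mistakes before indexing. The helper below is a hypothetical sketch, not part of the GraphRAG API:

```python
import pandas as pd

# Required columns and their expected Python types, per the schema above.
REQUIRED_COLUMNS = {
    "id": str,
    "text": str,
    "title": str,
    "creation_date": str,
    "metadata": dict,
}

def validate_documents(df: pd.DataFrame) -> None:
    """Raise if the DataFrame does not conform to the documents schema."""
    missing = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for column, expected in REQUIRED_COLUMNS.items():
        bad = df[~df[column].map(lambda v: isinstance(v, expected))]
        if not bad.empty:
            raise TypeError(f"column '{column}' has non-{expected.__name__} values")

documents = pd.DataFrame({
    "id": ["doc1"],
    "text": ["Document 1 text..."],
    "title": ["Document 1"],
    "creation_date": ["2024-01-01T00:00:00Z"],
    "metadata": [{}],
})
validate_documents(documents)  # passes silently
```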

Custom file handling

GraphRAG uses an injectable InputReader provider class. You can implement any input file handling you want in a class that extends InputReader and register it with the InputReaderFactory.
from graphrag.index.input import InputReader, InputReaderFactory
import pandas as pd

class PDFReader(InputReader):
    """Custom PDF input reader."""

    async def read(self, path: str) -> pd.DataFrame:
        # Your PDF parsing logic here. Each row must conform to the
        # documents schema: id, text, title, creation_date, metadata.
        documents = []
        # ... parse PDFs, appending one dict per document, e.g.:
        # documents.append({
        #     "id": content_hash,        # hypothetical variables
        #     "text": extracted_text,
        #     "title": file_name,
        #     "creation_date": created_at_iso,
        #     "metadata": {},
        # })
        return pd.DataFrame(documents)

# Register your custom reader
InputReaderFactory.register("pdf", PDFReader)
See the architecture page for more info on the standard provider pattern.

Supported formats

GraphRAG supports three file formats out-of-the-box, covering the overwhelming majority of use cases.
Plain text files (typically with a .txt extension).
  • The entire file contents become the text field
  • The title is always the filename
  • Simplest format for getting started
Example:
article.txt
This is the content of my article.
It can span multiple lines and paragraphs.
Configuration:
settings.yaml
input:
  type: text
  base_dir: "./input"

Metadata

With structured file formats (CSV and JSON), you can configure any number of columns to be added to a persisted metadata field in the DataFrame.

Configuration

settings.yaml
input:
  metadata: [title, tag, author]  # List of column names to collect
If configured, the output metadata column will have a dict containing a key for each column and the value of that column for that document.

Example

software.csv:
text,title,tag
My first program,Hello World,tutorial
An early space shooter game,Space Invaders,arcade
settings.yaml:
input:
  metadata: [title, tag]
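Given that configuration, each loaded row's metadata field holds a dict of the selected columns. A hand-built sketch of the relevant columns of the resulting documents DataFrame (illustrative values, omitting id and creation_date):

```python
import pandas as pd

# What loading software.csv with `metadata: [title, tag]` would roughly
# produce (hand-built for illustration, not actual GraphRAG output).
documents = pd.DataFrame({
    "text": ["My first program", "An early space shooter game"],
    "title": ["Hello World", "Space Invaders"],
    "metadata": [
        {"title": "Hello World", "tag": "tutorial"},
        {"title": "Space Invaders", "tag": "arcade"},
    ],
})

print(documents.loc[0, "metadata"])  # {'title': 'Hello World', 'tag': 'tutorial'}
```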

Chunking and metadata

As described on the dataflow page, documents are chunked into smaller “text units” for processing because document content size often exceeds the available context window for language models.

Chunking configuration

settings.yaml
chunks:
  size: 1200  # Token count (default)
  overlap: 100  # Overlap between chunks
  prepend_metadata: false  # Whether to prepend metadata to each chunk
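The size/overlap mechanics can be sketched with a simple splitter. GraphRAG counts model tokens; the example below uses whitespace-separated words as stand-in tokens for illustration:

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a token sequence into chunks of `size`, sharing `overlap` tokens."""
    assert overlap < size, "overlap must be smaller than chunk size"
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = "one two three four five six seven eight".split()
for chunk in chunk_tokens(tokens, size=4, overlap=1):
    print(" ".join(chunk))
# one two three four
# four five six seven
# seven eight
```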

Metadata prepending

Imagine indexing a collection of news articles where each article starts with a headline and author. When documents are chunked, they are split evenly according to your configured chunk size.
The problem: Front matter at the beginning of the document (like headline and author) is not copied to each chunk. It only exists in the first chunk.
When you later retrieve those chunks for summarization, they may be missing shared information about the source document.

Solution: prepend metadata

You can configure the chunker to copy metadata into each text chunk:
1. Configure metadata columns

Specify which columns to include as metadata during document import.
settings.yaml
input:
  metadata: [title, author]
2. Enable prepend_metadata

Instruct the chunker to copy metadata to the start of every text chunk.
settings.yaml
chunks:
  size: 100
  overlap: 0
  prepend_metadata: true
Metadata is copied as key: value pairs on new lines at the beginning of each chunk.
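This hypothetical helper mimics the described formatting (key: value pairs on new lines, followed by the chunk text); it is a sketch of the behavior, not the GraphRAG implementation:

```python
def prepend_metadata(chunk_text: str, metadata: dict[str, str]) -> str:
    """Prefix a chunk with its document metadata as key: value lines."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n{chunk_text}"

print(prepend_metadata(
    "The Biden administration will end most of the last remaining...",
    {"title": "US to lift most federal COVID-19 vaccine mandates.txt"},
))
```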

Chunking examples

Input file:
US to lift most federal COVID-19 vaccine mandates.txt:
WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday...
Configuration:
settings.yaml
input:
  type: text
  metadata: [title]  # filename becomes metadata

chunks:
  size: 100
  overlap: 0
  prepend_metadata: true
Result chunks:
Chunk 1:
title: US to lift most federal COVID-19 vaccine mandates.txt
WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors...

Chunk 2:
title: US to lift most federal COVID-19 vaccine mandates.txt
the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness...
The title (filename) is prepended to each chunk but not included in the computed chunk size.

Best practices

Chunk size
  • 1200 tokens (default): Good balance for most use cases
  • 300-600 tokens: Better for precise entity extraction
  • 50-100 tokens: Recommended for FastGraphRAG
  • Consider your model’s context window

Metadata
  • Include metadata that provides context across all chunks
  • Use prepend_metadata when chunks need document-level context
  • Common metadata: title, author, date, category, source

Overlap
  • 0 tokens: Fastest processing, no redundancy
  • 50-100 tokens: Better context preservation
  • 10-20%: Good rule of thumb (e.g., 100 tokens for 1000-token chunks)

Format selection
  • Text: Simplest, best for unstructured content
  • CSV: Best for structured data with metadata
  • JSON: Best for complex nested metadata
  • Use a custom DataFrame for unsupported formats

Next steps

Outputs

Learn about the Parquet output formats

Data flow

See how inputs flow through the pipeline

Configuration

Configure all indexing parameters
