GraphRAG supports multiple storage backends for different parts of the pipeline. You can use local files, Azure Blob Storage, or Azure Cosmos DB depending on your requirements.

Storage types

GraphRAG uses storage in five main areas:

Input storage: where source documents are read from
Output storage: where processed artifacts are written
Cache storage: where LLM responses are cached
Vector storage: where embeddings are stored for similarity search
Reporting storage: where pipeline logs and reports are written

Input storage

Configure where GraphRAG reads source documents:

File storage (default)

input_storage:
  type: file
  base_dir: "input"
Place your documents in the input/ directory:
project/
├── input/
│   ├── document1.txt
│   ├── document2.txt
│   └── data.csv
├── settings.yaml
└── .env

Azure Blob Storage

input_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-input
  account_url: https://myaccount.blob.core.windows.net/
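The ${AZURE_STORAGE_CONNECTION_STRING} reference is resolved from the environment (typically populated from your .env file). A minimal sketch of that substitution using only the Python standard library; this illustrates the ${VAR_NAME} convention, not GraphRAG's internal loader:

```python
import os
import re

def expand_env_refs(text: str) -> str:
    """Replace ${VAR_NAME} references with values from the environment.

    Unresolved references are left as-is rather than replaced with empty strings.
    """
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )

os.environ["AZURE_STORAGE_CONNECTION_STRING"] = "DefaultEndpointsProtocol=https;..."
yaml_snippet = "connection_string: ${AZURE_STORAGE_CONNECTION_STRING}"
print(expand_env_refs(yaml_snippet))
# connection_string: DefaultEndpointsProtocol=https;...
```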

Azure Cosmos DB

input_storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
  container_name: graphrag-input
  database_name: graphrag
  account_url: https://myaccount.documents.azure.com:443/

Output storage

Configure where GraphRAG writes processed artifacts:

File storage (default)

output_storage:
  type: file
  base_dir: "output"
Outputs are organized by artifact type:
project/
└── output/
    ├── create_base_text_units.parquet
    ├── create_final_entities.parquet
    ├── create_final_relationships.parquet
    ├── create_final_communities.parquet
    ├── create_final_community_reports.parquet
    └── lancedb/  # Vector store

Azure Blob Storage

output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output
  account_url: https://myaccount.blob.core.windows.net/

Azure Cosmos DB

output_storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
  container_name: graphrag-output
  database_name: graphrag
  account_url: https://myaccount.documents.azure.com:443/

Update output storage

For incremental indexing, specify a separate output location:
update_output_storage:
  type: file
  base_dir: "update_output"
This preserves original outputs when re-indexing with updated data.

Cache storage

Caching stores LLM responses to avoid redundant API calls:

JSON cache (default)

cache:
  type: json
  storage:
    type: file
    base_dir: "cache"
The cache directory structure:
project/
└── cache/
    ├── extract_graph/
    │   └── responses.json
    ├── summarize_descriptions/
    │   └── responses.json
    └── community_reporting/
        └── responses.json

Memory cache

For temporary caching (not persisted):
cache:
  type: memory
Memory cache is lost when the process ends; use it only for testing.

Disable caching

cache:
  type: none

Azure Blob cache

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://myaccount.blob.core.windows.net/

Azure Cosmos DB cache

cache:
  type: json
  storage:
    type: cosmosdb
    connection_string: ${COSMOS_CONNECTION_STRING}
    container_name: graphrag-cache
    database_name: graphrag
    account_url: https://myaccount.documents.azure.com:443/

Vector storage

Configure where embeddings are stored for similarity search:

LanceDB (default)

vector_store:
  type: lancedb
  db_uri: output/lancedb
LanceDB provides fast vector search with automatic indexing and works well for most use cases.

Azure AI Search

vector_store:
  type: azure_ai_search
  url: https://my-search-service.search.windows.net
  api_key: ${AZURE_SEARCH_API_KEY}
  audience: https://search.azure.com/.default  # for managed identity

Azure Cosmos DB

vector_store:
  type: cosmosdb
  url: https://myaccount.documents.azure.com:443/
  connection_string: ${COSMOS_CONNECTION_STRING}
  database_name: graphrag

Custom index schema

Customize field names and vector sizes:
vector_store:
  type: lancedb
  db_uri: output/lancedb
  index_schema:
    text_unit_text:
      index_name: "text-unit-embeddings"
      id_field: "id_custom"
      vector_field: "vector_custom"
      vector_size: 3072
    entity_description:
      index_name: "entity-embeddings"
      id_field: "id"
      vector_field: "vector"
      vector_size: 3072
    community_full_content:
      index_name: "community-embeddings"
      vector_size: 3072
index_name (string): Name for the embedding index/table
id_field (string, default "id"): Field name for document IDs
vector_field (string, default "vector"): Field name for embedding vectors
vector_size (integer, default 3072): Dimension of the embedding vectors (must match the embedding model)

Reporting storage

Configure where pipeline logs and reports are written:

File reporting (default)

reporting:
  type: file
  base_dir: "logs"

Azure Blob reporting

reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-logs
  storage_account_blob_url: https://myaccount.blob.core.windows.net/

Storage parameters

Common parameters

type (string, required): Storage backend: file, blob, memory, or cosmosdb
encoding (string, default "utf-8"): Character encoding for file operations
base_dir (string): Base directory for file storage (relative to project root)

Azure Blob parameters

connection_string (string, required): Azure Storage connection string (use an environment variable)
container_name (string, required): Name of the blob container
account_url (string): Storage account blob URL

Azure Cosmos DB parameters

connection_string (string, required): Cosmos DB connection string (use an environment variable)
container_name (string, required): Name of the Cosmos container
database_name (string, required): Name of the Cosmos database
account_url (string): Cosmos DB account URL
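The required parameters differ per backend. A hedged validation sketch derived from the parameter tables above; it mirrors this documentation, not GraphRAG's internal validation logic:

```python
# Required keys per storage backend, per the parameter tables above.
REQUIRED_KEYS = {
    "file": {"type"},
    "memory": {"type"},
    "blob": {"type", "connection_string", "container_name"},
    "cosmosdb": {"type", "connection_string", "container_name", "database_name"},
}

def missing_keys(storage_config: dict) -> set[str]:
    """Return the required keys absent from a storage config block."""
    backend = storage_config.get("type")
    if backend not in REQUIRED_KEYS:
        raise ValueError(f"unknown storage type: {backend!r}")
    return REQUIRED_KEYS[backend] - storage_config.keys()

cfg = {"type": "cosmosdb", "connection_string": "...", "container_name": "graphrag-input"}
print(missing_keys(cfg))
# {'database_name'}
```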

Example: Full Azure configuration

Using Azure services for all storage:
# Environment variables in .env
# AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;...
# COSMOS_CONNECTION_STRING=AccountEndpoint=https://...;AccountKey=...;
# AZURE_SEARCH_API_KEY=your-search-api-key

input_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-input
  account_url: https://mystorage.blob.core.windows.net/

output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output
  account_url: https://mystorage.blob.core.windows.net/

update_output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-update-output
  account_url: https://mystorage.blob.core.windows.net/

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://mystorage.blob.core.windows.net/

vector_store:
  type: azure_ai_search
  url: https://my-search.search.windows.net
  api_key: ${AZURE_SEARCH_API_KEY}

reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-logs
  storage_account_blob_url: https://mystorage.blob.core.windows.net/

Example: Hybrid configuration

Using local files for development with cloud caching:
input_storage:
  type: file
  base_dir: "input"

output_storage:
  type: file
  base_dir: "output"

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://mystorage.blob.core.windows.net/

vector_store:
  type: lancedb
  db_uri: output/lancedb

reporting:
  type: file
  base_dir: "logs"

Best practices

Never commit connection strings or API keys to version control. Store them in .env files and reference with ${VAR_NAME} syntax.
Use JSON file cache during development to avoid redundant API calls. Consider blob cache for team sharing.
Use different containers/directories for production and testing to prevent data mixing.
Azure Blob and Cosmos DB incur storage and transaction costs. Monitor usage and consider file storage for development.
Cache files can save significant API costs. Back them up before clearing or re-indexing.
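To help enforce the first practice, a minimal sketch that flags settings files containing literal Azure credentials instead of ${VAR} references. The patterns are assumptions covering only the obvious connection-string markers; a real secret scanner would check far more:

```python
import re

# Markers that suggest a hardcoded secret rather than a ${VAR} reference (assumption).
SECRET_PATTERNS = [
    re.compile(r"AccountKey="),
    re.compile(r"DefaultEndpointsProtocol=https;"),
    re.compile(r"api_key:\s*(?!\$\{)\S+"),  # api_key value not using ${...}
]

def find_literal_secrets(text: str) -> list[int]:
    """Return 1-based line numbers that appear to contain hardcoded secrets."""
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if any(p.search(line) for p in SECRET_PATTERNS)
    ]

settings = "connection_string: ${AZURE_STORAGE_CONNECTION_STRING}\napi_key: sk-abc123\n"
print(find_literal_secrets(settings))
# [2]
```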

Next steps

Caching: learn more about cache configuration and optimization
Settings reference: complete configuration options
LLM models: configure language models
Start indexing: begin processing documents
