GraphRAG supports multiple storage backends for different parts of the pipeline. You can use local files, Azure Blob Storage, or Azure Cosmos DB depending on your requirements.

Storage types

GraphRAG uses storage in five main areas:

Input storage: where source documents are read from
Output storage: where processed artifacts are written
Cache storage: where LLM responses are cached
Vector storage: where embeddings are stored for similarity search
Reporting storage: where pipeline logs and reports are written

Input storage

Configure where GraphRAG reads source documents:

File storage (default)

input_storage:
  type: file
  base_dir: "input"
Place your documents in the input/ directory:
project/
├── input/
│   ├── document1.txt
│   ├── document2.txt
│   └── data.csv
├── settings.yaml
└── .env

Azure Blob Storage

input_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-input
  account_url: https://myaccount.blob.core.windows.net/
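The ${AZURE_STORAGE_CONNECTION_STRING} reference is resolved from the environment (typically populated from your .env file). A minimal sketch of that substitution using only the Python standard library; this illustrates the ${VAR_NAME} convention, not GraphRAG's internal loader:

```python
import os
import re

def expand_env_refs(text: str) -> str:
    """Replace ${VAR_NAME} references with values from the environment.

    Unresolved references are left as-is rather than replaced with empty strings.
    """
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )

os.environ["AZURE_STORAGE_CONNECTION_STRING"] = "DefaultEndpointsProtocol=https;..."
yaml_snippet = "connection_string: ${AZURE_STORAGE_CONNECTION_STRING}"
print(expand_env_refs(yaml_snippet))
# connection_string: DefaultEndpointsProtocol=https;...
```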

Azure Cosmos DB

input_storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
  container_name: graphrag-input
  database_name: graphrag
  account_url: https://myaccount.documents.azure.com:443/

Output storage

Configure where GraphRAG writes processed artifacts:

File storage (default)

output_storage:
  type: file
  base_dir: "output"
Outputs are organized by artifact type:
project/
└── output/
    ├── create_base_text_units.parquet
    ├── create_final_entities.parquet
    ├── create_final_relationships.parquet
    ├── create_final_communities.parquet
    ├── create_final_community_reports.parquet
    └── lancedb/  # Vector store

Azure Blob Storage

output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output
  account_url: https://myaccount.blob.core.windows.net/

Azure Cosmos DB

output_storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
  container_name: graphrag-output
  database_name: graphrag
  account_url: https://myaccount.documents.azure.com:443/

Update output storage

For incremental indexing, specify a separate output location:
update_output_storage:
  type: file
  base_dir: "update_output"
This preserves original outputs when re-indexing with updated data.

Cache storage

Caching stores LLM responses to avoid redundant API calls:

JSON cache (default)

cache:
  type: json
  storage:
    type: file
    base_dir: "cache"
The cache directory structure:
project/
└── cache/
    ├── extract_graph/
    │   └── responses.json
    ├── summarize_descriptions/
    │   └── responses.json
    └── community_reporting/
        └── responses.json

Memory cache

For temporary caching (not persisted):
cache:
  type: memory
Memory cache is lost when the process ends; use it only for testing.

Disable caching

cache:
  type: none

Azure Blob cache

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://myaccount.blob.core.windows.net/

Azure Cosmos DB cache

cache:
  type: json
  storage:
    type: cosmosdb
    connection_string: ${COSMOS_CONNECTION_STRING}
    container_name: graphrag-cache
    database_name: graphrag
    account_url: https://myaccount.documents.azure.com:443/

Vector storage

Configure where embeddings are stored for similarity search:

LanceDB (default)

vector_store:
  type: lancedb
  db_uri: output/lancedb
LanceDB provides fast vector search with automatic indexing and works well for most use cases.

Azure AI Search

vector_store:
  type: azure_ai_search
  url: https://my-search-service.search.windows.net
  api_key: ${AZURE_SEARCH_API_KEY}
  audience: https://search.azure.com/.default  # for managed identity

Azure Cosmos DB

vector_store:
  type: cosmosdb
  url: https://myaccount.documents.azure.com:443/
  connection_string: ${COSMOS_CONNECTION_STRING}
  database_name: graphrag

Custom index schema

Customize field names and vector sizes:
vector_store:
  type: lancedb
  db_uri: output/lancedb
  index_schema:
    text_unit_text:
      index_name: "text-unit-embeddings"
      id_field: "id_custom"
      vector_field: "vector_custom"
      vector_size: 3072
    entity_description:
      index_name: "entity-embeddings"
      id_field: "id"
      vector_field: "vector"
      vector_size: 3072
    community_full_content:
      index_name: "community-embeddings"
      vector_size: 3072
index_name (string): Name for the embedding index/table
id_field (string, default "id"): Field name for document IDs
vector_field (string, default "vector"): Field name for embedding vectors
vector_size (integer, default 3072): Dimension of the embedding vectors (must match the embedding model)

Reporting storage

Configure where pipeline logs and reports are written:

File reporting (default)

reporting:
  type: file
  base_dir: "logs"

Azure Blob reporting

reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-logs
  storage_account_blob_url: https://myaccount.blob.core.windows.net/

Storage parameters

Common parameters

type (string, required): Storage backend: file, blob, memory, or cosmosdb
encoding (string, default "utf-8"): Character encoding for file operations
base_dir (string): Base directory for file storage (relative to project root)

Azure Blob parameters

connection_string (string, required): Azure Storage connection string (use an environment variable)
container_name (string, required): Name of the blob container
account_url (string): Storage account blob URL

Azure Cosmos DB parameters

connection_string (string, required): Cosmos DB connection string (use an environment variable)
container_name (string, required): Name of the Cosmos container
database_name (string, required): Name of the Cosmos database
account_url (string): Cosmos DB account URL
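The required parameters differ per backend. A hedged validation sketch derived from the parameter tables above; it mirrors this documentation, not GraphRAG's internal validation logic:

```python
# Required keys per storage backend, per the parameter tables above.
REQUIRED_KEYS = {
    "file": {"type"},
    "memory": {"type"},
    "blob": {"type", "connection_string", "container_name"},
    "cosmosdb": {"type", "connection_string", "container_name", "database_name"},
}

def missing_keys(storage_config: dict) -> set[str]:
    """Return the required keys absent from a storage config block."""
    backend = storage_config.get("type")
    if backend not in REQUIRED_KEYS:
        raise ValueError(f"unknown storage type: {backend!r}")
    return REQUIRED_KEYS[backend] - storage_config.keys()

cfg = {"type": "cosmosdb", "connection_string": "...", "container_name": "graphrag-input"}
print(missing_keys(cfg))
# {'database_name'}
```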

Example: Full Azure configuration

Using Azure services for all storage:
# Environment variables in .env
# AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;...
# COSMOS_CONNECTION_STRING=AccountEndpoint=https://...;AccountKey=...;
# AZURE_SEARCH_API_KEY=your-search-api-key

input_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-input
  account_url: https://mystorage.blob.core.windows.net/

output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output
  account_url: https://mystorage.blob.core.windows.net/

update_output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-update-output
  account_url: https://mystorage.blob.core.windows.net/

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://mystorage.blob.core.windows.net/

vector_store:
  type: azure_ai_search
  url: https://my-search.search.windows.net
  api_key: ${AZURE_SEARCH_API_KEY}

reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-logs
  storage_account_blob_url: https://mystorage.blob.core.windows.net/

Example: Hybrid configuration

Using local files for development with cloud caching:
input_storage:
  type: file
  base_dir: "input"

output_storage:
  type: file
  base_dir: "output"

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://mystorage.blob.core.windows.net/

vector_store:
  type: lancedb
  db_uri: output/lancedb

reporting:
  type: file
  base_dir: "logs"

Best practices

Never commit connection strings or API keys to version control. Store them in .env files and reference with ${VAR_NAME} syntax.
Use JSON file cache during development to avoid redundant API calls. Consider blob cache for team sharing.
Use different containers/directories for production and testing to prevent data mixing.
Azure Blob and Cosmos DB incur storage and transaction costs. Monitor usage and consider file storage for development.
Cache files can save significant API costs. Back them up before clearing or re-indexing.
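To help enforce the first practice, a minimal sketch that flags settings files containing literal Azure credentials instead of ${VAR} references. The patterns are assumptions covering only the obvious connection-string markers; a real secret scanner would check far more:

```python
import re

# Markers that suggest a hardcoded secret rather than a ${VAR} reference (assumption).
SECRET_PATTERNS = [
    re.compile(r"AccountKey="),
    re.compile(r"DefaultEndpointsProtocol=https;"),
    re.compile(r"api_key:\s*(?!\$\{)\S+"),  # api_key value not using ${...}
]

def find_literal_secrets(text: str) -> list[int]:
    """Return 1-based line numbers that appear to contain hardcoded secrets."""
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if any(p.search(line) for p in SECRET_PATTERNS)
    ]

settings = "connection_string: ${AZURE_STORAGE_CONNECTION_STRING}\napi_key: sk-abc123\n"
print(find_literal_secrets(settings))
# [2]
```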

Next steps

Caching: learn more about cache configuration and optimization
Settings reference: complete configuration options
LLM models: configure language models
Start indexing: begin processing documents
