GraphRAG supports multiple storage backends for different parts of the pipeline. You can use local files, Azure Blob Storage, or Azure Cosmos DB depending on your requirements.
Storage types
GraphRAG uses storage in four main areas:
Input storage: where source documents are read from
Output storage: where processed artifacts are written
Cache storage: where LLM responses are cached
Vector storage: where embeddings are stored for search
Input storage
Configure where GraphRAG reads source documents:
File storage (default)
input_storage:
  type: file
  base_dir: "input"
Place your documents in the input/ directory:
project/
├── input/
│   ├── document1.txt
│   ├── document2.txt
│   └── data.csv
├── settings.yaml
└── .env
Azure Blob Storage
input_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-input
  account_url: https://myaccount.blob.core.windows.net/
Azure Cosmos DB
input_storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
  container_name: graphrag-input
  database_name: graphrag
  account_url: https://myaccount.documents.azure.com:443/
Output storage
Configure where GraphRAG writes processed artifacts:
File storage (default)
output_storage:
  type: file
  base_dir: "output"
Outputs are organized by artifact type:
project/
└── output/
    ├── create_base_text_units.parquet
    ├── create_final_entities.parquet
    ├── create_final_relationships.parquet
    ├── create_final_communities.parquet
    ├── create_final_community_reports.parquet
    └── lancedb/  # Vector store
Azure Blob Storage
output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output
  account_url: https://myaccount.blob.core.windows.net/
Azure Cosmos DB
output_storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
  container_name: graphrag-output
  database_name: graphrag
  account_url: https://myaccount.documents.azure.com:443/
Update output storage
For incremental indexing, specify a separate output location:
update_output_storage:
  type: file
  base_dir: "update_output"
This preserves original outputs when re-indexing with updated data.
Cache storage
Caching stores LLM responses to avoid redundant API calls:
JSON cache (default)
cache:
  type: json
  storage:
    type: file
    base_dir: "cache"
The cache directory structure:
project/
└── cache/
    ├── extract_graph/
    │   └── responses.json
    ├── summarize_descriptions/
    │   └── responses.json
    └── community_reporting/
        └── responses.json
Memory cache
For temporary caching (not persisted):
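A minimal sketch of an in-memory cache configuration, assuming `memory` is accepted as a cache `type` in the same position as `json` in the default example (an assumption; confirm against the settings reference):

```yaml
cache:
  type: memory  # assumed type name; responses are held only in process memory
```

No `storage` block is shown because nothing is written to disk or to a remote service in this mode.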
Memory cache is lost when the process ends. Only use for testing.
Disable caching
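A minimal sketch of turning the cache off, assuming `none` is the cache `type` that disables caching (an assumption; confirm against the settings reference):

```yaml
cache:
  type: none  # assumed type name; no responses are stored or reused
```

With caching disabled, every indexing run sends all LLM requests again, so this setting is best kept for cases where stale responses must be ruled out.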
Azure Blob cache
cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://myaccount.blob.core.windows.net/
Azure Cosmos DB cache
cache:
  type: json
  storage:
    type: cosmosdb
    connection_string: ${COSMOS_CONNECTION_STRING}
    container_name: graphrag-cache
    database_name: graphrag
    account_url: https://myaccount.documents.azure.com:443/
Vector storage
Configure where embeddings are stored for similarity search:
LanceDB (default)
vector_store:
  type: lancedb
  db_uri: output/lancedb
LanceDB provides fast vector search with automatic indexing and works well for most use cases.
Azure AI Search
vector_store:
  type: azure_ai_search
  url: https://my-search-service.search.windows.net
  api_key: ${AZURE_SEARCH_API_KEY}
  audience: https://search.azure.com/.default  # for managed identity
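For managed identity, a plausible variant drops `api_key` and keeps `audience`. This is only a sketch, under the assumption that the vector store falls back to Azure identity credentials when no key is supplied; the service URL is a placeholder.

```yaml
vector_store:
  type: azure_ai_search
  url: https://my-search-service.search.windows.net
  # api_key is left out on purpose; this is assumed to select identity-based auth
  audience: https://search.azure.com/.default
```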
Azure Cosmos DB
vector_store:
  type: cosmosdb
  url: https://myaccount.documents.azure.com:443/
  connection_string: ${COSMOS_CONNECTION_STRING}
  database_name: graphrag
Custom index schema
Customize field names and vector sizes:
vector_store:
  type: lancedb
  db_uri: output/lancedb
  index_schema:
    text_unit_text:
      index_name: "text-unit-embeddings"
      id_field: "id_custom"
      vector_field: "vector_custom"
      vector_size: 3072
    entity_description:
      index_name: "entity-embeddings"
      id_field: "id"
      vector_field: "vector"
      vector_size: 3072
    community_full_content:
      index_name: "community-embeddings"
      vector_size: 3072
index_name: Name for the embedding index/table
id_field: Field name for document IDs
vector_field: Field name for embedding vectors
vector_size: Dimension of embedding vectors (must match the output dimension of the embedding model)
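Because the vector size has to match the embedding model, switching models means updating every schema entry. As an illustration, OpenAI's text-embedding-3-large produces 3072-dimensional vectors by default and text-embedding-3-small produces 1536. A sketch of the schema for the smaller model, reusing the index names from the example above:

```yaml
vector_store:
  type: lancedb
  db_uri: output/lancedb
  index_schema:
    text_unit_text:
      index_name: "text-unit-embeddings"
      vector_size: 1536  # default dimension of text-embedding-3-small
    entity_description:
      index_name: "entity-embeddings"
      vector_size: 1536
    community_full_content:
      index_name: "community-embeddings"
      vector_size: 1536
```

Vectors indexed at one dimension cannot be compared with query vectors of another, so a model change generally means rebuilding the indexes rather than editing the number alone.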
Reporting storage
Configure where pipeline logs and reports are written:
File reporting (default)
reporting:
  type: file
  base_dir: "logs"
Azure Blob reporting
reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-logs
  storage_account_blob_url: https://myaccount.blob.core.windows.net/
Storage parameters
Common parameters
type: Storage backend (file, blob, memory, or cosmosdb)
encoding: Character encoding for file operations
base_dir: Base directory for file storage (relative to project root)
Azure Blob parameters
connection_string: Azure Storage connection string (use an environment variable)
container_name: Name of the blob container
Azure Cosmos DB parameters
connection_string: Cosmos DB connection string (use an environment variable)
container_name: Name of the Cosmos container
database_name: Name of the Cosmos database
Example: Full Azure configuration
Using Azure services for all storage:
# Environment variables in .env
# AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;...
# COSMOS_CONNECTION_STRING=AccountEndpoint=https://...;AccountKey=...;
# AZURE_SEARCH_API_KEY=your-search-api-key

input_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-input
  account_url: https://mystorage.blob.core.windows.net/

output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output
  account_url: https://mystorage.blob.core.windows.net/

update_output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-update-output
  account_url: https://mystorage.blob.core.windows.net/

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://mystorage.blob.core.windows.net/

vector_store:
  type: azure_ai_search
  url: https://my-search.search.windows.net
  api_key: ${AZURE_SEARCH_API_KEY}

reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-logs
  storage_account_blob_url: https://mystorage.blob.core.windows.net/
Example: Hybrid configuration
Using local files for development with cloud caching:
input_storage:
  type: file
  base_dir: "input"

output_storage:
  type: file
  base_dir: "output"

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache
    account_url: https://mystorage.blob.core.windows.net/

vector_store:
  type: lancedb
  db_uri: output/lancedb

reporting:
  type: file
  base_dir: "logs"
Best practices
Use environment variables for credentials
Never commit connection strings or API keys to version control. Store them in a .env file and reference them with the ${VAR_NAME} syntax.
Enable caching for development
Use the JSON file cache during development to avoid redundant API calls. Consider a blob cache when a team needs to share cached responses.
Separate production and test storage
Use different containers/directories for production and testing to prevent data mixing.
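One way to keep environments apart is a separate settings file per environment, each pointing at its own containers. A sketch for a test project follows; the `-test` container names are hypothetical and the account URL is a placeholder.

```yaml
# settings.yaml for a test project (container names are illustrative only)
output_storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output-test
  account_url: https://mystorage.blob.core.windows.net/

cache:
  type: json
  storage:
    type: blob
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    container_name: graphrag-cache-test
    account_url: https://mystorage.blob.core.windows.net/
```

Giving the cache its own container as well keeps test prompts and responses from being reused in production runs.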
Azure Blob and Cosmos DB incur storage and transaction costs. Monitor usage and consider file storage for development.
Back up cache and outputs
Cache files can save significant API costs. Back them up before clearing or re-indexing.
Next steps
Caching: learn more about cache configuration and optimization
Settings reference: complete configuration options
LLM models: configure language models
Start indexing: begin processing documents