Ollama enables running open-source LLMs locally for privacy-focused applications, offline deployment, and cost-free inference.

Installation

Ollama support is included in the base installation:
pip install graphiti-core

Prerequisites

Install Ollama

Download and install Ollama from ollama.com, then start the server:
ollama serve

Pull Models

Download the models you’ll use:
# Pull LLM model
ollama pull deepseek-r1:7b

# Pull embedding model
ollama pull nomic-embed-text

Configuration

Environment Variables

.env
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_LLM_MODEL=deepseek-r1:7b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
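
The examples below read these variables with os.getenv. A small illustrative helper (not part of graphiti-core) that collects them with the same fallbacks used throughout this guide:

```python
import os

def ollama_settings() -> dict:
    """Gather the Ollama settings used in this guide from the
    environment, falling back to the guide's defaults."""
    return {
        # Graphiti's OpenAI-compatible clients expect the /v1 suffix.
        "base_url": os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
        "llm_model": os.getenv("OLLAMA_LLM_MODEL", "deepseek-r1:7b"),
        "embedding_model": os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
    }
```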

Basic Setup

Initialize Graphiti with Ollama:
import os
from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig
from graphiti_core.cross_encoder.openai_reranker_client import OpenAIRerankerClient

# Configure Ollama LLM client
llm_config = LLMConfig(
    api_key="ollama",  # Placeholder (required but not used)
    model=os.getenv("OLLAMA_LLM_MODEL", "deepseek-r1:7b"),
    small_model=os.getenv("OLLAMA_LLM_MODEL", "deepseek-r1:7b"),
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
)

llm_client = OpenAIGenericClient(config=llm_config)

# Configure Ollama embedder
embedder = OpenAIEmbedder(
    config=OpenAIEmbedderConfig(
        api_key="ollama",  # Placeholder
        embedding_model=os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
        embedding_dim=768,  # nomic-embed-text dimension
        base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
    )
)

# Configure cross-encoder (reranker)
cross_encoder = OpenAIRerankerClient(
    client=llm_client,
    config=llm_config
)

# Initialize Graphiti
graphiti = Graphiti(
    "bolt://localhost:7687",
    "neo4j",
    "password",
    llm_client=llm_client,
    embedder=embedder,
    cross_encoder=cross_encoder
)

Important Notes

Use OpenAIGenericClient

Always use OpenAIGenericClient for Ollama, not OpenAIClient:
# ✓ Correct
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
llm_client = OpenAIGenericClient(config=llm_config)

# ✗ Wrong
from graphiti_core.llm_client.openai_client import OpenAIClient
llm_client = OpenAIClient(config=llm_config)  # May have issues with local models
Why OpenAIGenericClient?
  • Higher default max tokens (16K vs 8K)
  • Better compatibility with local models
  • Full structured output support
  • Optimized for OpenAI-compatible APIs

Ollama API Endpoint

Ollama provides an OpenAI-compatible API at:
http://localhost:11434/v1
This endpoint implements the OpenAI API format, enabling compatibility with OpenAI client libraries.
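
As a quick sanity check, you can hit this endpoint directly with Python's standard library. A minimal sketch (the model name and prompt are just examples, and the server must already be running with the model pulled):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    # Standard OpenAI chat-completions request body; Ollama accepts
    # the same shape at /v1/chat/completions.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str,
         base_url: str = "http://localhost:11434/v1") -> str:
    # Requires a running Ollama server with the model already pulled.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder key, not checked
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("deepseek-r1:7b", "Reply with the single word: pong"))
```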

Language Models

  • deepseek-r1:7b (recommended): Fast reasoning model, 7B parameters
  • qwen2.5:7b: Strong general-purpose model
  • llama3.3:70b: High quality, requires more resources
  • gemma2:9b: Efficient Google model
  • mistral:7b: Fast and capable

Embedding Models

  • nomic-embed-text (recommended): 768 dimensions, excellent quality
  • mxbai-embed-large: 1024 dimensions, high quality
  • all-minilm: 384 dimensions, lightweight
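
Whichever model you pick, the embedding_dim passed to OpenAIEmbedderConfig must match that model's output size, or the stored vectors won't line up with the index. An illustrative lookup for the models above (dimensions as listed in this guide):

```python
# Output dimensions for the embedding models listed above.
EMBEDDING_DIMS = {
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
    "all-minilm": 384,
}

def embedding_dim(model: str) -> int:
    # Fail loudly rather than silently storing mis-sized vectors.
    try:
        return EMBEDDING_DIMS[model]
    except KeyError:
        raise ValueError(
            f"unknown embedding model {model!r}; set embedding_dim manually"
        )
```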

Model Selection Guide

| Model | Size | RAM Needed | Speed | Quality |
|---|---|---|---|---|
| deepseek-r1:7b | 4.7GB | 8GB | Fast | Good |
| qwen2.5:7b | 4.7GB | 8GB | Fast | Good |
| llama3.3:70b | 40GB | 64GB | Slow | Excellent |
| gemma2:9b | 5.5GB | 10GB | Medium | Good |

Configuration Options

LLM Client

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | "ollama" | Placeholder (required but unused) |
| model | str | Required | Ollama model name |
| small_model | str | Same as model | Model for simpler tasks |
| base_url | str | "http://localhost:11434/v1" | Ollama API endpoint |
| temperature | float | 0.7 | Sampling temperature |
| max_tokens | int | 16384 | Maximum output tokens |

Embedder

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | "ollama" | Placeholder (required but unused) |
| embedding_model | str | Required | Ollama embedding model |
| embedding_dim | int | Model-specific | Output dimensions |
| base_url | str | "http://localhost:11434/v1" | Ollama API endpoint |

Complete Example

import asyncio
import os
from datetime import datetime, timezone
from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig
from graphiti_core.cross_encoder.openai_reranker_client import OpenAIRerankerClient
from graphiti_core.nodes import EpisodeType

async def main():
    # Configure Ollama LLM
    llm_config = LLMConfig(
        api_key="ollama",
        model="deepseek-r1:7b",
        small_model="deepseek-r1:7b",
        base_url="http://localhost:11434/v1"
    )
    
    llm_client = OpenAIGenericClient(config=llm_config)
    
    # Configure Ollama embedder
    embedder = OpenAIEmbedder(
        config=OpenAIEmbedderConfig(
            api_key="ollama",
            embedding_model="nomic-embed-text",
            embedding_dim=768,
            base_url="http://localhost:11434/v1"
        )
    )
    
    # Configure cross-encoder
    cross_encoder = OpenAIRerankerClient(
        client=llm_client,
        config=llm_config
    )
    
    # Initialize Graphiti
    graphiti = Graphiti(
        "bolt://localhost:7687",
        "neo4j",
        "password",
        llm_client=llm_client,
        embedder=embedder,
        cross_encoder=cross_encoder
    )
    
    try:
        # Add an episode
        await graphiti.add_episode(
            name="Local AI Test",
            episode_body="Ollama enables running LLMs locally for privacy and offline use.",
            source=EpisodeType.text,
            reference_time=datetime.now(timezone.utc)
        )
        print("Added episode using local Ollama model")
        
        # Search the graph
        results = await graphiti.search("What are the benefits of local LLMs?")
        for result in results:
            print(f"Fact: {result.fact}")
    
    finally:
        await graphiti.close()

if __name__ == "__main__":
    asyncio.run(main())

Structured Output Limitations

Local models can struggle to follow structured output schemas reliably. Best practices:
  • Use larger models (7B+) for better structured output adherence
  • Enable JSON mode in Ollama modelfile if available
  • Monitor extraction quality and adjust prompts if needed
  • Consider using quantized versions for faster inference
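
When a local model wraps its JSON in prose or code fences, a small post-processing step can often rescue the parse. An illustrative sketch (not graphiti-core's own handling):

```python
import json

def extract_json(raw: str) -> dict:
    # Locate the outermost {...} in the model output, tolerating
    # surrounding prose or ``` fences, then parse it.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model output")
    return json.loads(raw[start:end + 1])
```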

Performance Optimization

Hardware Acceleration

GPU Support:
# Ollama automatically detects and uses GPU if available
# NVIDIA GPU: CUDA support
# Apple Silicon: Metal support
# AMD GPU: ROCm support

Concurrency Control

Local models are slower than cloud APIs. Reduce concurrency:
.env
SEMAPHORE_LIMIT=2  # Lower concurrency for local models
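
SEMAPHORE_LIMIT caps how many LLM calls run at once. The underlying pattern is an asyncio semaphore; a minimal sketch of the idea (not graphiti-core's actual code — fake_llm_call stands in for a slow local-model request):

```python
import asyncio
import os

LIMIT = int(os.getenv("SEMAPHORE_LIMIT", "2"))

async def run_bounded(coros):
    # At most LIMIT coroutines are past the semaphore at any moment,
    # so a slow local model never receives a burst of requests.
    sem = asyncio.Semaphore(LIMIT)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coros))

async def demo():
    async def fake_llm_call(i):  # stand-in for a slow local-model call
        await asyncio.sleep(0.01)
        return i
    return await run_bounded(fake_llm_call(i) for i in range(8))

if __name__ == "__main__":
    print(asyncio.run(demo()))
```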

Model Optimization

  • Use Quantized Models: Faster inference, lower memory
  • Tune Context Length: Balance quality vs speed
  • Batch Requests: Process multiple items together

When to Use Ollama

Choose Ollama if you:
  • Need complete data privacy (no external API calls)
  • Want offline operation
  • Prefer zero API costs
  • Have capable local hardware (GPU recommended)
  • Need air-gapped deployment
Choose Cloud APIs if you:
  • Need the highest quality outputs
  • Want faster response times
  • Don’t have powerful local hardware
  • Need enterprise support and SLAs

Troubleshooting

Ollama Not Running

# Start Ollama server
ollama serve

# Check if models are available
ollama list
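
You can also probe the server from Python: Ollama answers a plain GET on its base URL when it is up. A small illustrative check:

```python
import urllib.request

def ollama_up(base_url: str = "http://localhost:11434") -> bool:
    # Ollama responds to GET on its base URL ("Ollama is running");
    # any connection error means the server is down.
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False
```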

Model Not Found

# Pull the model
ollama pull deepseek-r1:7b

Slow Performance

  • Use GPU: Ensure GPU acceleration is enabled
  • Reduce Concurrency: Set SEMAPHORE_LIMIT=1
  • Use Smaller Models: Try 7B instead of 70B
  • Quantization: Use quantized model variants

Out of Memory

  • Use Smaller Models: Switch to 7B or smaller
  • Increase Swap: Configure system swap space
  • Reduce Context: Lower max_tokens parameter
