Installation
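If Graphiti isn't installed yet, the base package is enough (assuming the `graphiti-core` PyPI package name; check the project README if your distribution differs):

```shell
# Install Graphiti; Ollama support ships with the base package.
pip install graphiti-core
```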
Ollama support is included in the base installation.

Prerequisites
Install Ollama
Download and install Ollama:

- macOS/Linux: ollama.ai/download
- Windows: Follow Ollama Windows instructions
Pull Models
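Models are fetched with `ollama pull`; for example, using two of the models recommended later in this guide (any listed model works the same way):

```shell
# Pull a language model and an embedding model from the Ollama registry.
ollama pull deepseek-r1:7b
ollama pull nomic-embed-text
```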
Download the models you’ll use.

Configuration
Environment Variables
.env
Basic Setup
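A minimal initialization sketch. The `OpenAIGenericClient` is named below in this guide; the exact module paths and the `OpenAIEmbedder` class are taken from recent `graphiti-core` releases and may differ in yours, and the Neo4j URI and credentials are placeholders for a default local setup:

```python
from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig

OLLAMA_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

llm_config = LLMConfig(
    api_key="ollama",          # placeholder: required by the client, ignored by Ollama
    model="deepseek-r1:7b",
    small_model="deepseek-r1:7b",
    base_url=OLLAMA_URL,
)

graphiti = Graphiti(
    "bolt://localhost:7687",   # assumption: local Neo4j instance
    "neo4j",
    "password",
    llm_client=OpenAIGenericClient(config=llm_config),
    embedder=OpenAIEmbedder(
        config=OpenAIEmbedderConfig(
            api_key="ollama",
            embedding_model="nomic-embed-text",
            embedding_dim=768,
            base_url=OLLAMA_URL,
        )
    ),
)
```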
Initialize Graphiti with Ollama.

Important Notes
Use OpenAIGenericClient
Always use OpenAIGenericClient for Ollama, not OpenAIClient:
- Higher default max tokens (16K vs 8K)
- Better compatibility with local models
- Full structured output support
- Optimized for OpenAI-compatible APIs
Ollama API Endpoint
Ollama provides an OpenAI-compatible API at http://localhost:11434/v1.

Recommended Models
Language Models
- deepseek-r1:7b (recommended): Fast reasoning model, 7B parameters
- qwen2.5:7b: Strong general-purpose model
- llama3.3:70b: High quality, requires more resources
- gemma2:9b: Efficient Google model
- mistral:7b: Fast and capable
Embedding Models
- nomic-embed-text (recommended): 768 dimensions, excellent quality
- mxbai-embed-large: 1024 dimensions, high quality
- all-minilm: 384 dimensions, lightweight
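When wiring the embedder, `embedding_dim` must match the model's output size. The list above can be captured as a small lookup table (a sketch; the helper is hypothetical, model names as published on the Ollama registry):

```python
# Output dimensions for the Ollama embedding models recommended above.
EMBEDDING_DIMS = {
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
    "all-minilm": 384,
}

def embedding_dim(model: str) -> int:
    """Return the embedding_dim to pass to the embedder config."""
    try:
        return EMBEDDING_DIMS[model]
    except KeyError:
        raise ValueError(f"unknown embedding model: {model!r}")
```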
Model Selection Guide
| Model | Size | RAM Needed | Speed | Quality |
|---|---|---|---|---|
| deepseek-r1:7b | 4.7GB | 8GB | Fast | Good |
| qwen2.5:7b | 4.7GB | 8GB | Fast | Good |
| llama3.3:70b | 40GB | 64GB | Slow | Excellent |
| gemma2:9b | 5.5GB | 10GB | Medium | Good |
Configuration Options
LLM Client
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | "ollama" | Placeholder (required but unused) |
| model | str | Required | Ollama model name |
| small_model | str | Same as model | Model for simpler tasks |
| base_url | str | "http://localhost:11434/v1" | Ollama API endpoint |
| temperature | float | 0.7 | Sampling temperature |
| max_tokens | int | 16384 | Maximum output tokens |
Embedder
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | "ollama" | Placeholder (required but unused) |
| embedding_model | str | Required | Ollama embedding model |
| embedding_dim | int | Model-specific | Output dimensions |
| base_url | str | "http://localhost:11434/v1" | Ollama API endpoint |
Complete Example
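The original example code did not survive extraction; the sketch below reconstructs an end-to-end flow under stated assumptions (module paths from recent `graphiti-core` releases, a local Neo4j at the default bolt port, and the episode text is invented for illustration):

```python
import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.embedder.openai import OpenAIEmbedder, OpenAIEmbedderConfig
from graphiti_core.nodes import EpisodeType

OLLAMA_URL = "http://localhost:11434/v1"

async def main():
    graphiti = Graphiti(
        "bolt://localhost:7687", "neo4j", "password",  # assumption: local Neo4j
        llm_client=OpenAIGenericClient(
            config=LLMConfig(
                api_key="ollama",           # placeholder, unused by Ollama
                model="deepseek-r1:7b",
                base_url=OLLAMA_URL,
            )
        ),
        embedder=OpenAIEmbedder(
            config=OpenAIEmbedderConfig(
                api_key="ollama",
                embedding_model="nomic-embed-text",
                embedding_dim=768,
                base_url=OLLAMA_URL,
            )
        ),
    )
    try:
        await graphiti.build_indices_and_constraints()
        await graphiti.add_episode(
            name="note-1",
            episode_body="Alice manages the data platform team.",
            source=EpisodeType.text,
            source_description="team notes",
            reference_time=datetime.now(timezone.utc),
        )
        results = await graphiti.search("Who manages the data platform team?")
        for edge in results:
            print(edge.fact)
    finally:
        await graphiti.close()

asyncio.run(main())
```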
Structured Output Limitations
Local models may have challenges with structured outputs.

Best Practices:

- Use larger models (7B+) for better structured output adherence
- Enable JSON mode in Ollama modelfile if available
- Monitor extraction quality and adjust prompts if needed
- Consider using quantized versions for faster inference
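One practical mitigation is defensive parsing of the model's response before validation, since local models sometimes wrap JSON in prose or code fences. A minimal sketch (a hypothetical helper, not part of Graphiti):

```python
import json

def parse_structured(raw: str) -> dict:
    """Extract the first JSON object from a model response.

    Strips a ```json fence if present, then scans for the outermost
    braces before parsing, so stray prose around the JSON is tolerated.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Take the content between the opening and closing fence.
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])
```

On a parse failure you would typically retry the request rather than crash the extraction pipeline.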
Performance Optimization
Hardware Acceleration
GPU Support: Ollama uses GPU acceleration automatically when supported hardware is detected (Metal on Apple Silicon, CUDA or ROCm elsewhere).

Concurrency Control
Local models are slower than cloud APIs. Reduce concurrency:

.env
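For example (SEMAPHORE_LIMIT appears again under Troubleshooting below; the right value depends on your hardware):

```
# Limit concurrent LLM calls for slower local inference.
SEMAPHORE_LIMIT=2
```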
Model Optimization
- Use Quantized Models: Faster inference, lower memory
- Tune Context Length: Balance quality vs speed
- Batch Requests: Process multiple items together
When to Use Ollama
Choose Ollama if you:

- Need complete data privacy (no external API calls)
- Want offline operation
- Prefer zero API costs
- Have capable local hardware (GPU recommended)
- Need air-gapped deployment

Choose a cloud provider instead if you:

- Need the highest quality outputs
- Want faster response times
- Don’t have powerful local hardware
- Need enterprise support and SLAs
Troubleshooting
Ollama Not Running
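A quick check against the default port, and the command to start the server (assuming a standard install; desktop users can launch the Ollama app instead):

```shell
# Does the server respond? /api/tags lists locally available models.
curl http://localhost:11434/api/tags

# If not, start the server.
ollama serve
```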
Model Not Found
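If Graphiti reports a missing model, confirm it is present locally and pull it if not (the model name here is an example from the recommendations above):

```shell
ollama list                  # show models available locally
ollama pull deepseek-r1:7b   # fetch a missing model
```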
Slow Performance
- Use GPU: Ensure GPU acceleration is enabled
- Reduce Concurrency: Set SEMAPHORE_LIMIT=1
- Use Smaller Models: Try 7B instead of 70B
- Quantization: Use quantized model variants
Out of Memory
- Use Smaller Models: Switch to 7B or smaller
- Increase Swap: Configure system swap space
- Reduce Context: Lower max_tokens parameter