Overview
Hinbox supports running entity extraction entirely on your local machine using Ollama. This is ideal for:

- Sensitive historical research requiring data privacy
- Cost control when processing large document collections
- Offline processing without internet connectivity
- Custom model experimentation with different LLMs
Model Configuration
Hinbox uses environment variables to configure both cloud and local models, defined in src/constants.py:7-8:
Default Models
- Cloud: gemini/gemini-2.0-flash (via LiteLLM)
- Local: qwen2.5:32b-instruct-q5_K_M (~23GB download)
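A minimal sketch of what the constants referenced above might look like. HINBOX_OLLAMA_MODEL is the env var documented on this page; the cloud-side variable name (HINBOX_CLOUD_MODEL) is an assumption for illustration, so check src/constants.py:7-8 for the actual names:

```python
import os

# Hedged sketch of src/constants.py:7-8 -- defaults taken from this page.
# HINBOX_CLOUD_MODEL is a hypothetical name; HINBOX_OLLAMA_MODEL is documented.
CLOUD_MODEL = os.environ.get("HINBOX_CLOUD_MODEL", "gemini/gemini-2.0-flash")
OLLAMA_MODEL = os.environ.get("HINBOX_OLLAMA_MODEL", "ollama/qwen2.5:32b-instruct-q5_K_M")
```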
Setup
Install Ollama
Download and install Ollama from ollama.com
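On Linux, Ollama publishes a one-line install script (macOS and Windows users can download the app from the same site):

```shell
# Official Ollama install script for Linux
curl -fsSL https://ollama.com/install.sh | sh
```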
Pull the Default Model
Download the Qwen 2.5 32B model (default for Hinbox):
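```shell
# Pull the default Hinbox model (note: Ollama uses the bare model tag,
# without the ollama/ prefix used in Hinbox's config)
ollama pull qwen2.5:32b-instruct-q5_K_M
```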
This downloads ~23GB. Qwen 2.5 32B provides excellent extraction quality while fitting in 24GB+ VRAM systems.
Configure Context Window
Ollama defaults to conservative context windows. Qwen 2.5 supports up to 131K tokens:
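One way to raise the limit is Ollama's context-length environment variable (also used in the troubleshooting section below), set before starting the server:

```shell
# Raise the context window before starting `ollama serve`;
# Qwen 2.5 supports up to 131K tokens
export OLLAMA_CONTEXT_LENGTH=131072
```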
Running with Local Models
Use the --local flag to switch to Ollama:

- Uses the model from HINBOX_OLLAMA_MODEL for extraction
- Switches to local embeddings
- Disables all LLM telemetry (see Privacy Mode)
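The exact entry point depends on your install; the script name below is an assumption based on the src/process_and_extract.py file referenced later on this page:

```shell
# Run extraction against the local Ollama server instead of the cloud API
# (script path is an assumption -- adapt to your checkout)
python src/process_and_extract.py --local
```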
API Configuration
By default, Hinbox connects to http://localhost:11434/v1 (defined in src/constants.py:15). Override this via environment variable:
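A sketch of pointing Hinbox at a remote or non-default Ollama server. The variable name below is an assumption; check src/constants.py:15 for the real one:

```shell
# Point Hinbox at an Ollama server on another host
# (HINBOX_OLLAMA_API_BASE is a hypothetical name for illustration)
export HINBOX_OLLAMA_API_BASE=http://192.168.1.50:11434/v1
```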
Switching Models
Using a Different Ollama Model
You can override the default without editing code by setting HINBOX_OLLAMA_MODEL.

Supported Model Families

Any Ollama-compatible model works via LiteLLM integration. Popular choices:

- Qwen 2.5 32B: the default. Excellent quality, fits 24GB VRAM
- Llama 3.1 70B: highest quality, requires 48GB+ VRAM
- Gemma 2 27B: fast and efficient, 16GB+ VRAM
- Mistral Large: balanced performance, 32GB VRAM
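For example, to override the default model via the documented environment variable (keeping the ollama/ prefix, which Hinbox strips before calling the Ollama API):

```shell
# Switch Hinbox to Llama 3.1 70B for this shell session
export HINBOX_OLLAMA_MODEL=ollama/llama3.1:70b
```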
Concurrency Settings
Local models have different concurrency limits than cloud APIs. The limits are set in configs/guantanamo/config.yaml:50-55 and applied in src/process_and_extract.py:798-809.
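A hedged sketch of the relevant config fragment. Only the ollama_in_flight key is documented on this page (see the troubleshooting section); the surrounding structure is an assumption, so consult configs/guantanamo/config.yaml:50-55 for the real layout:

```yaml
# Hypothetical sketch -- only ollama_in_flight is documented here
processing:
  ollama_in_flight: 2   # concurrent requests allowed against the local Ollama server
```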
Model Name Handling
The Ollama API expects bare model names without the ollama/ prefix. Hinbox automatically strips it via src/constants.py:20-22:

- Config: HINBOX_OLLAMA_MODEL=ollama/qwen2.5:32b-instruct-q5_K_M
- API call: qwen2.5:32b-instruct-q5_K_M
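The stripping described above could be as simple as the following sketch (the function name is illustrative, not taken from the source):

```python
def strip_ollama_prefix(model: str) -> str:
    """Drop the LiteLLM routing prefix before calling the Ollama API."""
    # "ollama/qwen2.5:32b-instruct-q5_K_M" -> "qwen2.5:32b-instruct-q5_K_M";
    # names without the prefix pass through unchanged
    return model.removeprefix("ollama/")
```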
Performance Considerations
GPU Memory Requirements
| Model Size | Quantization | VRAM Required | Typical Speed |
|---|---|---|---|
| 7B | Q5_K_M | 6GB | ~50 tok/s |
| 13B | Q5_K_M | 10GB | ~30 tok/s |
| 32B | Q5_K_M | 24GB | ~20 tok/s |
| 70B | Q4_K_M | 48GB | ~10 tok/s |
CPU-Only Mode
Ollama can run on CPU, but extraction will be 10-50x slower.

Troubleshooting
Connection Refused
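If Hinbox cannot reach Ollama, first confirm the server is running and listening on the default port:

```shell
# List installed models via Ollama's HTTP API; a connection error here
# means the server is not running (start it with: ollama serve)
curl -s http://localhost:11434/api/tags
```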
Out of Memory

- Reduce the context window: export OLLAMA_CONTEXT_LENGTH=8192
- Lower concurrency: set ollama_in_flight: 1 in config.yaml
- Use a smaller model: switch to qwen2.5:14b or gemma2:9b
Slow Extraction

Check GPU utilization first, then:

- Increase ollama_in_flight (if you have VRAM headroom)
- Verify the context window isn't too large
- Check for CPU bottlenecks with htop
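On NVIDIA hardware, GPU utilization and VRAM use can be watched while extraction runs:

```shell
# Refresh GPU utilization and memory figures every 2 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 2
```

Low GPU utilization with high CPU load usually means the model has spilled out of VRAM into system memory.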
Cloud vs. Local Comparison
| Aspect | Cloud (Gemini) | Local (Ollama) |
|---|---|---|
| Cost | ~$0.50-2 per 1000 docs | Free (electricity only) |
| Privacy | Data sent to Google | 100% local |
| Speed | ~2-5 docs/sec | ~0.5-2 docs/sec (GPU) |
| Setup | API key only | GPU, model download |
| Quality | Excellent | Excellent (32B+ models) |
| Internet | Required | Not required |
Next Steps
- Privacy Mode: learn how --local enforces complete data privacy
- Performance Tuning: optimize concurrency and batching settings
- Quality Controls: understand extraction QC and retry logic
- Caching: avoid redundant LLM calls with extraction cache