Overview

Hinbox supports running entity extraction entirely on your local machine using Ollama. This is ideal for:
  • Sensitive historical research requiring data privacy
  • Cost control when processing large document collections
  • Offline processing without internet connectivity
  • Custom model experimentation with different LLMs

Model Configuration

Hinbox uses environment variables to configure both cloud and local models, defined in src/constants.py:7-8:
CLOUD_MODEL = os.getenv("HINBOX_CLOUD_MODEL", "gemini/gemini-2.0-flash")
OLLAMA_MODEL = os.getenv("HINBOX_OLLAMA_MODEL", "ollama/qwen2.5:32b-instruct-q5_K_M")

Default Models

  • Cloud: gemini/gemini-2.0-flash (via LiteLLM)
  • Local: qwen2.5:32b-instruct-q5_K_M (~23GB download)

Setup

Step 1: Install Ollama

Download and install Ollama from ollama.com
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com

Step 2: Pull the Default Model

Download the Qwen 2.5 32B model (default for Hinbox):
ollama pull qwen2.5:32b-instruct-q5_K_M
This downloads ~23GB. Qwen 2.5 32B provides excellent extraction quality while fitting in 24GB+ VRAM systems.

Step 3: Configure Context Window

Ollama defaults to conservative context windows. Qwen 2.5 supports up to 131K tokens:
# Add to your shell profile or systemd unit
export OLLAMA_CONTEXT_LENGTH=32768
Higher context uses more VRAM. Start with 32768 and adjust based on your GPU memory.
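An application reading this setting typically falls back to Ollama's own default when the variable is unset. A minimal sketch (the helper name and the 2048-token fallback are illustrative assumptions, not Hinbox code):

```python
import os

# Hypothetical helper: read the context window Ollama was configured with,
# falling back to a conservative default when the variable is unset.
def get_context_length(default: int = 2048) -> int:
    raw = os.getenv("OLLAMA_CONTEXT_LENGTH", "").strip()
    return int(raw) if raw.isdigit() else default

os.environ["OLLAMA_CONTEXT_LENGTH"] = "32768"
print(get_context_length())  # → 32768
```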

Step 4: Verify Ollama is Running

The Ollama server should auto-start. Test it:
curl http://localhost:11434/v1/models
Expected output: JSON list of available models
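The `/v1` endpoint speaks the OpenAI-compatible schema, so the response is a JSON object with a `data` array of model entries. A minimal parser for that shape (the sample payload below is illustrative, not captured output):

```python
import json

# Extract model ids from an OpenAI-style /v1/models response.
def model_ids(payload: dict) -> list[str]:
    return [m["id"] for m in payload.get("data", [])]

# Sample payload mimicking the OpenAI-compatible response shape.
sample = json.loads(
    '{"object": "list", "data": [{"id": "qwen2.5:32b-instruct-q5_K_M", "object": "model"}]}'
)
print(model_ids(sample))  # → ['qwen2.5:32b-instruct-q5_K_M']
```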

Running with Local Models

Use the --local flag to switch to Ollama:
just process --domain guantanamo --limit 5 --local
This automatically:
  • Uses the HINBOX_OLLAMA_MODEL for extraction
  • Switches to local embeddings
  • Disables all LLM telemetry (see Privacy Mode)

API Configuration

By default, Hinbox connects to http://localhost:11434/v1. Override via environment variable:
# .env file
OLLAMA_API_URL=http://192.168.1.100:11434/v1
From src/constants.py:15:
OLLAMA_API_URL = os.getenv("OLLAMA_API_URL", "http://localhost:11434/v1").strip()
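The trailing `.strip()` matters in practice: a stray newline or space in a `.env` file would otherwise end up inside the URL and break every request. A quick demonstration of the same pattern:

```python
import os

# Simulate a .env value that picked up a trailing newline.
os.environ["OLLAMA_API_URL"] = "http://192.168.1.100:11434/v1\n"

# Same pattern as the constant above: .strip() removes the stray whitespace.
OLLAMA_API_URL = os.getenv("OLLAMA_API_URL", "http://localhost:11434/v1").strip()
print(OLLAMA_API_URL)  # → http://192.168.1.100:11434/v1
```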

Switching Models

Using a Different Ollama Model

Override the default without editing code:
# .env file
HINBOX_OLLAMA_MODEL=ollama/llama3.1:70b
Then pull the new model:
ollama pull llama3.1:70b

Supported Model Families

Any Ollama-compatible model works via LiteLLM integration. Popular choices:

Qwen 2.5 32B

Default. Excellent quality, fits 24GB VRAM
ollama pull qwen2.5:32b-instruct-q5_K_M

Llama 3.1 70B

Highest quality, requires 48GB+ VRAM
ollama pull llama3.1:70b

Gemma 2 27B

Fast and efficient, 16GB+ VRAM
ollama pull gemma2:27b

Mistral Large

Strong quality, but at 123B parameters it needs far more VRAM than the 32B options above
ollama pull mistral-large

Concurrency Settings

Local models have different concurrency limits than cloud APIs. From configs/guantanamo/config.yaml:50-55:
performance:
  concurrency:
    extract_workers: 8        # parallel articles in extraction phase
    extract_per_article: 4    # parallel entity types within article
    llm_in_flight: 16         # max concurrent cloud LLM calls
    ollama_in_flight: 2       # max concurrent Ollama calls (local mode)
Local mode uses ollama_in_flight: 2 to avoid overloading GPU memory. Adjust based on your VRAM:
  • 16GB VRAM: ollama_in_flight: 1
  • 24GB VRAM: ollama_in_flight: 2 (default)
  • 48GB+ VRAM: ollama_in_flight: 4
The concurrency settings are loaded from src/process_and_extract.py:798-809:
cc = config.get_concurrency_config()
configure_llm_concurrency(
    cloud_in_flight=cc["llm_in_flight"] if model_type == "gemini" else None,
    local_in_flight=cc["ollama_in_flight"] if model_type == "ollama" else None,
)

Model Name Handling

Ollama API expects bare model names without the ollama/ prefix. Hinbox automatically strips it via src/constants.py:20-22:
def get_ollama_model_name(model: str) -> str:
    """Strip 'ollama/' prefix if present for Ollama API calls."""
    return model.replace("ollama/", "") if model.startswith("ollama/") else model
This means:
  • Config: HINBOX_OLLAMA_MODEL=ollama/qwen2.5:32b-instruct-q5_K_M
  • API call: qwen2.5:32b-instruct-q5_K_M
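Reproducing the helper above shows the round trip from config value to the bare name the Ollama API expects, and that names without the prefix pass through unchanged:

```python
# Same helper as in src/constants.py, shown with both input shapes.
def get_ollama_model_name(model: str) -> str:
    """Strip 'ollama/' prefix if present for Ollama API calls."""
    return model.replace("ollama/", "") if model.startswith("ollama/") else model

print(get_ollama_model_name("ollama/qwen2.5:32b-instruct-q5_K_M"))  # → qwen2.5:32b-instruct-q5_K_M
print(get_ollama_model_name("llama3.1:70b"))                        # unchanged → llama3.1:70b
```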

Performance Considerations

GPU Memory Requirements

| Model Size | Quantization | VRAM Required | Typical Speed |
|------------|--------------|---------------|---------------|
| 7B         | Q5_K_M       | 6GB           | ~50 tok/s     |
| 13B        | Q5_K_M       | 10GB          | ~30 tok/s     |
| 32B        | Q5_K_M       | 24GB          | ~20 tok/s     |
| 70B        | Q4_K_M       | 48GB          | ~10 tok/s     |
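These figures follow a rough rule of thumb: quantized weights occupy roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime. A sketch of that arithmetic (the 1.1 overhead factor and ~5.5 bits/weight for Q5_K_M are assumptions, not Ollama figures):

```python
# Back-of-envelope VRAM estimate for a quantized model.
def approx_vram_gb(params_billions: float, bits_per_weight: float,
                   overhead: float = 1.1) -> float:
    weight_gb = params_billions * bits_per_weight / 8   # raw weight storage
    return round(weight_gb * overhead, 1)               # + KV cache / runtime

print(approx_vram_gb(32, 5.5))  # → 24.2 (close to the 24GB row above)
```

Larger context windows inflate the KV-cache share of the overhead, which is why reducing `OLLAMA_CONTEXT_LENGTH` is an effective out-of-memory fix.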

CPU-Only Mode

Ollama can run on CPU, but extraction will be 10-50x slower:
# Force CPU mode
OLLAMA_DEVICE=cpu ollama serve
CPU mode is only practical for small test runs (--limit 5). GPU is strongly recommended for production use.

Troubleshooting

Connection Refused

Error: Connection refused to http://localhost:11434/v1
Solution: Verify Ollama is running:
pgrep ollama  # Should show a process ID
ollama list   # Should list downloaded models
Restart if needed:
ollama serve

Out of Memory

Error: CUDA out of memory
Solutions:
  1. Reduce context window: export OLLAMA_CONTEXT_LENGTH=8192
  2. Lower concurrency: Set ollama_in_flight: 1 in config.yaml
  3. Use smaller model: Switch to qwen2.5:14b or gemma2:9b

Slow Extraction

Check GPU utilization:
nvidia-smi  # Should show ~90%+ GPU usage during extraction
If GPU usage is low:
  • Increase ollama_in_flight (if you have VRAM headroom)
  • Verify context window isn’t too large
  • Check CPU bottlenecks with htop

Cloud vs. Local Comparison

| Aspect   | Cloud (Gemini)         | Local (Ollama)           |
|----------|------------------------|--------------------------|
| Cost     | ~$0.50-2 per 1000 docs | Free (electricity only)  |
| Privacy  | Data sent to Google    | 100% local               |
| Speed    | ~2-5 docs/sec          | ~0.5-2 docs/sec (GPU)    |
| Setup    | API key only           | GPU, model download      |
| Quality  | Excellent              | Excellent (32B+ models)  |
| Internet | Required               | Not required             |

Next Steps

Privacy Mode

Learn how --local enforces complete data privacy

Performance Tuning

Optimize concurrency and batching settings

Quality Controls

Understand extraction QC and retry logic

Caching

Avoid redundant LLM calls with extraction cache
