ClinicalPilot supports running local medical LLMs via Ollama, eliminating the need for OpenAI API calls while maintaining clinical accuracy.

Why Local LLMs?

  • Privacy: All processing happens on your machine
  • Cost: No per-token API charges
  • Offline: Works without internet connectivity
  • Compliance: Easier HIPAA compliance (no third-party data transmission)
Local LLMs require significant computational resources. MedGemma-2 9B needs 16GB+ RAM, while the 27B variant requires 32GB+ RAM.

Supported Models

ClinicalPilot is optimized for MedGemma-2, Google’s medical language model:
  • medgemma2:9b — 9 billion parameters (~5GB download, 16GB RAM)
  • medgemma2:27b — 27 billion parameters (~15GB download, 32GB+ RAM)

Installation

1. Install Ollama

Download and install Ollama for your platform:
brew install ollama

2. Start Ollama Server

ollama serve
This starts the Ollama API server on http://localhost:11434.
On macOS, Ollama may start automatically as a background service after installation.

3. Pull MedGemma Model

Download the model weights:
# 9B version (recommended for most users)
ollama pull medgemma2:9b

# 27B version (requires 32GB+ RAM)
ollama pull medgemma2:27b
First pull will take several minutes depending on your connection speed.
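Once a model is pulled, the backend can query it through Ollama's HTTP `/api/generate` endpoint. A minimal sketch of building a non-streaming request using only the standard library — `build_generate_request` and the example prompt are illustrative, not part of ClinicalPilot:

```python
import json
import urllib.request

def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint (stream=False
    returns the full completion in a single JSON response)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical usage against the default local endpoint:
req = build_generate_request(
    "http://localhost:11434",
    "medgemma2:9b",
    "Summarize the contraindications of metformin.",
)
```

Sending the request with `urllib.request.urlopen(req)` returns a JSON body whose `response` field holds the model's completion.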

4. Configure Environment

Update your .env file:
USE_LOCAL_LLM=true
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=medgemma2:9b

5. Verify Setup

Test the connection:
curl http://localhost:11434/api/tags
You should see a JSON response listing your installed models.
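The same check can be wrapped into a small availability probe in code. A sketch using only the standard library; `is_ollama_available` is a hypothetical helper, not part of the ClinicalPilot backend:

```python
import urllib.error
import urllib.request

def is_ollama_available(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers /api/tags within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, etc.
        return False
```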

Configuration Options

| Environment Variable | Default                | Description                         |
| -------------------- | ---------------------- | ----------------------------------- |
| USE_LOCAL_LLM        | false                  | Enable local LLM instead of OpenAI  |
| OLLAMA_BASE_URL      | http://localhost:11434 | Ollama API endpoint                 |
| OLLAMA_MODEL         | medgemma2:9b           | Model to use for clinical reasoning |
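These variables might be read along the following lines. `load_llm_settings` is an illustrative helper that mirrors the documented defaults, not the actual backend code:

```python
import os

def load_llm_settings(env=None) -> dict:
    """Read the local-LLM settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "use_local_llm": env.get("USE_LOCAL_LLM", "false").lower() == "true",
        "ollama_base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "ollama_model": env.get("OLLAMA_MODEL", "medgemma2:9b"),
    }

# With no variables set, the defaults from the table above apply:
defaults = load_llm_settings({})
```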

Fallback Behavior

If Ollama is unreachable or the model fails to load, ClinicalPilot automatically falls back to OpenAI (GPT-4o). This ensures the system remains operational even if the local LLM is unavailable.
Check logs for fallback events:
python -m uvicorn backend.main:app --log-level=info
Look for:
WARNING: Ollama connection failed, falling back to OpenAI
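The fallback logic can be sketched as a thin wrapper around the two clients. `generate_with_fallback` and both callables are hypothetical stand-ins for the real Ollama and OpenAI clients in the backend:

```python
import logging

logger = logging.getLogger("clinicalpilot.llm")

def generate_with_fallback(prompt, call_local, call_openai):
    """Try the local Ollama model first; on any failure, log a warning and
    fall back to the OpenAI client."""
    try:
        return call_local(prompt)
    except Exception as exc:
        logger.warning("Ollama connection failed, falling back to OpenAI: %s", exc)
        return call_openai(prompt)

# Usage: simulate a local failure to exercise the fallback path.
def failing_local(prompt):
    raise ConnectionError("connection refused")

answer = generate_with_fallback("test", failing_local, lambda p: "openai-response")
```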

Performance Comparison

| Model          | Latency (avg) | RAM Usage | Quality   |
| -------------- | ------------- | --------- | --------- |
| GPT-4o (API)   | 8-12s         | N/A       | Excellent |
| MedGemma-2 9B  | 15-25s        | ~10GB     | Very Good |
| MedGemma-2 27B | 30-45s        | ~20GB     | Excellent |
Local LLMs are slower than cloud APIs. The full debate pipeline (3 rounds × 4 agents) may take 3-5 minutes with MedGemma-2 9B, compared to ~100s with GPT-4o.
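The 3-5 minute figure follows from simple arithmetic, assuming one model call per agent per round and calls running serially:

```python
def estimate_pipeline_seconds(rounds: int, agents: int, seconds_per_call: float) -> float:
    """Rough serial estimate: one model call per agent per round."""
    return rounds * agents * seconds_per_call

# 3 rounds x 4 agents at MedGemma-2 9B's 15-25s per call:
low = estimate_pipeline_seconds(3, 4, 15)   # 180 s = 3 min
high = estimate_pipeline_seconds(3, 4, 25)  # 300 s = 5 min
```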

Hardware Acceleration

Apple Silicon (M1/M2/M3)

Ollama uses Metal for GPU acceleration on Apple Silicon. No additional configuration needed.

NVIDIA GPUs (Linux/Windows)

Ollama automatically detects and uses NVIDIA GPUs with CUDA support. Verify GPU usage:
nvidia-smi

CPU-Only Mode

Ollama works on CPU-only systems but will be significantly slower, typically 2-3× the latency of GPU-accelerated inference.

Troubleshooting

"Connection refused" Error

# Check if Ollama is running
ps aux | grep ollama

# Restart the service
ollama serve

Model Not Found

# List installed models
ollama list

# Pull the model if missing
ollama pull medgemma2:9b

Out of Memory

Reduce model size or close other applications:
# Switch to smaller model
OLLAMA_MODEL=medgemma2:9b  # instead of 27b

Slow Performance

First request is always slower (20-30s) as the model loads into memory. Subsequent requests are faster (10-15s).
For production, consider:
  • Using a smaller model (9B vs 27B)
  • Increasing system RAM
  • Using GPU acceleration
  • Hybrid mode: local LLM for non-critical agents, OpenAI for Clinical agent

Hybrid Configuration

You can use local LLMs for some agents and cloud APIs for others by modifying backend/config.py:
# Example: use MedGemma for the Literature agent, GPT-4o for everything else
def use_local_llm_for(agent_type: str) -> bool:
    """Route non-critical agents to the local LLM; keep Clinical on OpenAI."""
    return agent_type == "literature"

Security Considerations

Local LLMs never send data to external servers. All processing happens on your machine, making them ideal for HIPAA-compliant deployments where PHI cannot leave the organization’s network.

Next Steps

RAG Setup

Configure the LanceDB vector store for medical literature retrieval

Observability

Set up LangSmith tracing to monitor agent performance
