Why Local LLMs?
- Privacy: All processing happens on your machine
- Cost: No per-token API charges
- Offline: Works without internet connectivity
- Compliance: Easier HIPAA compliance (no third-party data transmission)
Supported Models
ClinicalPilot is optimized for MedGemma-2, Google’s medical language model:
- medgemma2:9b — 9 billion parameters (~5GB download, 16GB RAM)
- medgemma2:27b — 27 billion parameters (~15GB download, 32GB+ RAM)
Installation
Start Ollama Server
The Ollama server listens at http://localhost:11434 by default. On macOS, Ollama may start automatically as a background service after installation.
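If the server is not already running, a typical way to start it manually (standard Ollama CLI usage, not ClinicalPilot-specific) is:

```shell
# Start the Ollama server in the foreground (listens on http://localhost:11434)
ollama serve
```

From another terminal, `curl http://localhost:11434` should respond once the server is up.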
Pull MedGemma Model
Download the model weights. The first pull will take several minutes depending on your connection speed.
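Using the model tags listed above, the pull looks like:

```shell
# Download the 9B model weights (~5GB)
ollama pull medgemma2:9b

# Or the larger 27B variant (~15GB)
# ollama pull medgemma2:27b
```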
Configuration Options
| Environment Variable | Default | Description |
|---|---|---|
| USE_LOCAL_LLM | false | Enable the local LLM instead of OpenAI |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API endpoint |
| OLLAMA_MODEL | medgemma2:9b | Model to use for clinical reasoning |
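For example, a `.env`-style configuration that enables the local LLM (values mirror the defaults in the table above):

```shell
# Enable the local LLM and point ClinicalPilot at the Ollama server
export USE_LOCAL_LLM=true
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_MODEL=medgemma2:9b
```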
Fallback Behavior
If Ollama is unreachable or the model fails to load, ClinicalPilot automatically falls back to OpenAI (GPT-4o). This ensures the system remains operational even if the local LLM is unavailable.
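A minimal sketch of this fallback logic, assuming a simple reachability probe (the function and model identifiers below are illustrative, not ClinicalPilot's actual code):

```python
import urllib.request
import urllib.error


def ollama_reachable(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers at its base URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


def select_backend(use_local: bool,
                   base_url: str = "http://localhost:11434") -> str:
    """Use the local model when enabled and reachable; otherwise fall back to GPT-4o."""
    if use_local and ollama_reachable(base_url):
        return "ollama/medgemma2:9b"
    return "openai/gpt-4o"
```

In the real system the probe would also need to confirm the model loads successfully, not just that the server answers.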
Performance Comparison
| Model | Latency (avg) | RAM Usage | Quality |
|---|---|---|---|
| GPT-4o (API) | 8-12s | N/A | Excellent |
| MedGemma-2 9B | 15-25s | ~10GB | Very Good |
| MedGemma-2 27B | 30-45s | ~20GB | Excellent |
Hardware Acceleration
Apple Silicon (M1/M2/M3)
Ollama uses Metal for GPU acceleration on Apple Silicon. No additional configuration is needed.
NVIDIA GPUs (Linux/Windows)
Ollama automatically detects and uses NVIDIA GPUs with CUDA support. Verify GPU usage, for example by running nvidia-smi while a request is in flight and confirming the ollama process appears.
CPU-Only Mode
Ollama works on CPU-only systems but will be significantly slower (roughly 2-3x the latency of GPU-accelerated inference).
Troubleshooting
"Connection refused" Error
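This usually means the Ollama server is not running. A quick check against the endpoint from the configuration table above:

```shell
# Verify the server is listening; /api/tags lists installed models
curl http://localhost:11434/api/tags

# If the connection is refused, start the server
ollama serve
```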
Model Not Found
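If the configured model has not been pulled yet, requests will fail until it is downloaded:

```shell
# List locally installed models
ollama list

# Pull the missing model tag
ollama pull medgemma2:9b
```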
Out of Memory
Reduce the model size (for example, switch OLLAMA_MODEL from medgemma2:27b to medgemma2:9b) or close other applications to free memory.
Slow Performance
The first request is always slower (20-30s) because the model must load into memory. Subsequent requests are faster (10-15s). If latency is still too high, consider:
- Using a smaller model (9B vs 27B)
- Increasing system RAM
- Using GPU acceleration
- Hybrid mode: local LLM for non-critical agents, OpenAI for Clinical agent
Hybrid Configuration
You can use local LLMs for some agents and cloud APIs for others by modifying backend/config.py:
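A sketch of what such a per-agent mapping might look like; the agent names and helper below are hypothetical, not ClinicalPilot's actual configuration:

```python
# Hypothetical per-agent model routing: local LLM for lower-stakes agents,
# OpenAI for the Clinical agent where answer quality matters most.
AGENT_MODELS = {
    "intake": "ollama/medgemma2:9b",
    "summarizer": "ollama/medgemma2:9b",
    "clinical": "openai/gpt-4o",  # keep the highest-quality model here
}

DEFAULT_MODEL = "ollama/medgemma2:9b"


def model_for_agent(agent_name: str) -> str:
    """Resolve which backend/model a given agent should call."""
    return AGENT_MODELS.get(agent_name, DEFAULT_MODEL)
```

This gives the hybrid mode described above: PHI-heavy routine work stays local, while the Clinical agent keeps the strongest model.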
Security Considerations
Local LLMs never send data to external servers. All processing happens on your machine, making them ideal for HIPAA-compliant deployments where PHI cannot leave the organization’s network.
Next Steps
RAG Setup
Configure the LanceDB vector store for medical literature retrieval
Observability
Set up LangSmith tracing to monitor agent performance