Why Local LLMs?
- Privacy: All processing happens on your machine
- Cost: No per-token API charges
- Offline: Works without internet connectivity
- Compliance: Easier HIPAA compliance (no third-party data transmission)
Supported Models
ClinicalPilot is optimized for MedGemma-2, Google’s medical language model:
- medgemma2:9b — 9 billion parameters (~5GB download, 16GB RAM)
- medgemma2:27b — 27 billion parameters (~15GB download, 32GB+ RAM)
Installation
Start Ollama Server
The Ollama server listens at http://localhost:11434 by default. On macOS, Ollama may start automatically as a background service after installation.
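If the server is not already running, a typical way to start it manually (standard Ollama CLI usage, not ClinicalPilot-specific) is:

```shell
# Start the Ollama server in the foreground (listens on http://localhost:11434)
ollama serve
```

From another terminal, `curl http://localhost:11434` should respond once the server is up.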
Pull MedGemma Model
Download the model weights. The first pull will take several minutes depending on your connection speed.
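Using the model tags listed above, the pull looks like:

```shell
# Download the 9B model weights (~5GB)
ollama pull medgemma2:9b

# Or the larger 27B variant (~15GB)
# ollama pull medgemma2:27b
```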
Configuration Options
| Environment Variable | Default | Description |
|---|---|---|
| USE_LOCAL_LLM | false | Enable the local LLM instead of OpenAI |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API endpoint |
| OLLAMA_MODEL | medgemma2:9b | Model to use for clinical reasoning |
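For example, a `.env`-style configuration that enables the local LLM (values mirror the defaults in the table above):

```shell
# Enable the local LLM and point ClinicalPilot at the Ollama server
export USE_LOCAL_LLM=true
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_MODEL=medgemma2:9b
```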
Fallback Behavior
If Ollama is unreachable or the model fails to load, ClinicalPilot automatically falls back to OpenAI (GPT-4o). This ensures the system remains operational even if the local LLM is unavailable.
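A minimal sketch of this fallback logic, assuming a simple reachability probe (the function and model identifiers below are illustrative, not ClinicalPilot's actual code):

```python
import urllib.request
import urllib.error


def ollama_reachable(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers at its base URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


def select_backend(use_local: bool,
                   base_url: str = "http://localhost:11434") -> str:
    """Use the local model when enabled and reachable; otherwise fall back to GPT-4o."""
    if use_local and ollama_reachable(base_url):
        return "ollama/medgemma2:9b"
    return "openai/gpt-4o"
```

In the real system the probe would also need to confirm the model loads successfully, not just that the server answers.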
Performance Comparison
| Model | Latency (avg) | RAM Usage | Quality |
|---|---|---|---|
| GPT-4o (API) | 8-12s | N/A | Excellent |
| MedGemma-2 9B | 15-25s | ~10GB | Very Good |
| MedGemma-2 27B | 30-45s | ~20GB | Excellent |
Hardware Acceleration
Apple Silicon (M1/M2/M3)
Ollama uses Metal for GPU acceleration on Apple Silicon. No additional configuration is needed.
NVIDIA GPUs (Linux/Windows)
Ollama automatically detects and uses NVIDIA GPUs with CUDA support. Verify GPU usage, for example by running nvidia-smi while a request is in flight and confirming the ollama process appears.
CPU-Only Mode
Ollama works on CPU-only systems but will be significantly slower (roughly 2-3x the latency of GPU-accelerated inference).
Troubleshooting
"Connection refused" Error
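This usually means the Ollama server is not running. A quick check against the endpoint from the configuration table above:

```shell
# Verify the server is listening; /api/tags lists installed models
curl http://localhost:11434/api/tags

# If the connection is refused, start the server
ollama serve
```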
Model Not Found
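If the configured model has not been pulled yet, requests will fail until it is downloaded:

```shell
# List locally installed models
ollama list

# Pull the missing model tag
ollama pull medgemma2:9b
```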
Out of Memory
Reduce the model size (for example, switch OLLAMA_MODEL from medgemma2:27b to medgemma2:9b) or close other applications to free memory.
Slow Performance
The first request is always slower (20-30s) because the model must load into memory. Subsequent requests are faster (10-15s). If latency is still too high, consider:
- Using a smaller model (9B vs 27B)
- Increasing system RAM
- Using GPU acceleration
- Hybrid mode: local LLM for non-critical agents, OpenAI for Clinical agent
Hybrid Configuration
You can use local LLMs for some agents and cloud APIs for others by modifying backend/config.py:
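A sketch of what such a per-agent mapping might look like; the agent names and helper below are hypothetical, not ClinicalPilot's actual configuration:

```python
# Hypothetical per-agent model routing: local LLM for lower-stakes agents,
# OpenAI for the Clinical agent where answer quality matters most.
AGENT_MODELS = {
    "intake": "ollama/medgemma2:9b",
    "summarizer": "ollama/medgemma2:9b",
    "clinical": "openai/gpt-4o",  # keep the highest-quality model here
}

DEFAULT_MODEL = "ollama/medgemma2:9b"


def model_for_agent(agent_name: str) -> str:
    """Resolve which backend/model a given agent should call."""
    return AGENT_MODELS.get(agent_name, DEFAULT_MODEL)
```

This gives the hybrid mode described above: PHI-heavy routine work stays local, while the Clinical agent keeps the strongest model.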
Security Considerations
Local LLMs never send data to external servers. All processing happens on your machine, making them ideal for HIPAA-compliant deployments where PHI cannot leave the organization’s network.
Next Steps
RAG Setup
Configure the LanceDB vector store for medical literature retrieval
Observability
Set up LangSmith tracing to monitor agent performance