Overview
The Ollama Provider enables LLM Gateway Core to connect to a local Ollama instance, providing:

- Privacy: All data stays on your infrastructure
- Cost: No API charges for inference
- Flexibility: Support for multiple open-source models
- Control: Full control over model versions and configurations
Features
- HTTP API integration with local Ollama server
- Dynamic model selection
- Token usage tracking
- Configurable timeouts
- Default fallback to `llama3.1`
Prerequisites
Install Ollama
Download and install Ollama from ollama.ai
Pull a Model
Download a model to use with the gateway, for example `ollama pull llama3.1`. Other popular models:

- `ollama pull llama3.1:70b` - Larger, more capable
- `ollama pull mistral` - Fast and efficient
- `ollama pull codellama` - Optimized for code
Configuration
Environment Variables
Configure the Ollama provider in your `.env` file.
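A minimal example, using the two settings referenced on this page (any further variables would need checking against `app/core/config.py`):

```env
OLLAMA_BASE_URL=http://localhost:11434
PROVIDER_TIMEOUT_SECONDS=60
```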
Configuration Settings
app/core/config.py
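The settings object can be sketched roughly as follows. This is a sketch, not the actual source: only `OLLAMA_BASE_URL` and `PROVIDER_TIMEOUT_SECONDS` are named on this page; the class layout and defaults are assumptions.

```python
# app/core/config.py - a sketch, not the actual source
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    # Base URL of the local Ollama server (Ollama's default port is 11434)
    OLLAMA_BASE_URL: str = field(
        default_factory=lambda: os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    )
    # Per-request timeout for provider HTTP calls, in seconds
    PROVIDER_TIMEOUT_SECONDS: float = field(
        default_factory=lambda: float(os.getenv("PROVIDER_TIMEOUT_SECONDS", "60"))
    )


settings = Settings()
```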
Typical `OLLAMA_BASE_URL` values:

- Local Development: `http://localhost:11434`
- Docker Compose: `http://ollama:11434` (the Ollama service name in `docker-compose.yml`)
- Remote Server: `http://<host>:11434`, pointing at your remote Ollama instance
Implementation
Source Code
The Ollama provider is implemented in `app/providers/ollama.py`.
Key Implementation Details
Model Selection Logic
The provider uses intelligent model selection:
- If `request.model` is a specific model name (e.g., `"mistral"`), use it
- If `request.model` is generic (`"ollama"`, `"local"`, `None`), default to `"llama3.1"`
- This allows clients to specify exact models or use routing hints
Message Format
Ollama’s API uses the same message format as OpenAI, so no conversion is needed - messages are passed through directly.
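For example, an OpenAI-style message list is valid for Ollama as-is:

```python
# OpenAI-format messages, accepted unchanged by Ollama's chat API
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
]
```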
Token Tracking
Ollama provides actual token counts:
- `prompt_eval_count` - Tokens in the prompt
- `eval_count` - Tokens generated
- Both are included in the response for accurate usage tracking
HTTP Client
Uses `httpx.AsyncClient` for async HTTP requests with:

- Configurable timeout via `PROVIDER_TIMEOUT_SECONDS`
- Automatic connection pooling
- HTTP status error raising with `raise_for_status()`
Usage
Routing to Ollama
The gateway routes requests to Ollama when a Model Hint, an Explicit Model, a Specific Model, or the Default route applies - for example, when `"ollama"` or `"secure"` is passed as a hint.

Example Request
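A request using a routing hint might look like this. The payload mirrors the OpenAI chat format; the exact gateway endpoint and any extra fields are not shown on this page.

```python
# Hypothetical request body; "ollama" is a generic hint,
# so the provider falls back to the default model llama3.1
example_request = {
    "model": "ollama",
    "messages": [{"role": "user", "content": "Summarize this document."}],
}
```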
Example Response
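An illustrative response shape. The top-level field names are assumptions; the token counts map from Ollama's `prompt_eval_count` and `eval_count` as described above.

```python
# Illustrative response; the numbers are made up for the example
example_response = {
    "model": "llama3.1",
    "content": "Here is a summary...",
    "usage": {
        "prompt_tokens": 12,      # from prompt_eval_count
        "completion_tokens": 85,  # from eval_count
    },
}
```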
Available Models
Recommended Models
llama3.1

- Size: 8B parameters (default)
- Best For: General conversation, Q&A
- Speed: Fast

llama3.1:70b

- Size: 70B parameters
- Best For: Complex reasoning, detailed responses
- Speed: Slower, requires more resources

mistral

- Size: 7B parameters
- Best For: Efficient, fast inference
- Speed: Very fast

codellama

- Size: 7B-34B parameters
- Best For: Code generation and explanation
- Speed: Fast
Installing Models
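Models are installed with the Ollama CLI, for example:

```bash
# Pull the default model used by the gateway
ollama pull llama3.1

# Verify which models are installed locally
ollama list
```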
Error Handling
Common Errors
Connection Refused

Error: `httpx.ConnectError: [Errno 61] Connection refused`
Cause: Ollama server is not running
Solution: Start the server, e.g. `ollama serve`

Model Not Found

Error: `model 'model-name' not found`
Cause: Requested model is not installed
Solution: Pull it first, e.g. `ollama pull llama3.1`

Timeout

Error: `httpx.TimeoutException`
Cause: Request exceeded timeout limit
Solution: Increase the timeout in `.env`, e.g. `PROVIDER_TIMEOUT_SECONDS=120`

Invalid Base URL

Error: `Invalid URL`
Cause: Malformed `OLLAMA_BASE_URL`
Solution: Ensure the URL includes a protocol, e.g. `OLLAMA_BASE_URL=http://localhost:11434`

Performance Tuning
Resource Requirements
Model performance depends heavily on your hardware:
- 8B models: 8GB+ RAM recommended
- 13B models: 16GB+ RAM recommended
- 70B models: 64GB+ RAM or GPU required
Optimization Tips
Docker Deployment
Running Ollama in Docker alongside the gateway (`docker-compose.yml`):
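A minimal sketch of such a compose file; the gateway service definition, image tags, and volume names are assumptions, while the `ollama` service name matches the `http://ollama:11434` base URL used with Docker Compose.

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama   # persist pulled models across restarts
  gateway:
    build: .
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434  # reach Ollama by service name
    depends_on:
      - ollama
volumes:
  ollama_data:
```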
Next Steps
- Gemini Provider - Compare with cloud-based Gemini
- Custom Providers - Build your own provider
- Router Configuration - Configure intelligent routing
- Deployment - Production deployment guide