Overview

The Ollama Provider enables LLM Gateway Core to connect to a local Ollama instance, providing:
  • Privacy: All data stays on your infrastructure
  • Cost: No API charges for inference
  • Flexibility: Support for multiple open-source models
  • Control: Full control over model versions and configurations

Features

  • HTTP API integration with local Ollama server
  • Dynamic model selection
  • Token usage tracking
  • Configurable timeouts
  • Default fallback to llama3.1

Prerequisites

1. Install Ollama

Download and install Ollama from ollama.ai
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
2. Pull a Model

Download a model to use with the gateway:
ollama pull llama3.1
Other popular models:
  • ollama pull llama3.1:70b - Larger, more capable
  • ollama pull mistral - Fast and efficient
  • ollama pull codellama - Optimized for code
3. Start Ollama Server

Ensure Ollama is running:
ollama serve
By default, Ollama listens on http://localhost:11434

Configuration

Environment Variables

Configure the Ollama provider in your .env file:
.env
OLLAMA_BASE_URL=http://localhost:11434
PROVIDER_TIMEOUT_SECONDS=60

Configuration Settings

app/core/config.py
class Settings(BaseSettings):
    OLLAMA_BASE_URL: str = "http://localhost:11434"
    PROVIDER_TIMEOUT_SECONDS: int = 60
    PROVIDER_MAX_RETRIES: int = 3
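In the app these values are loaded by pydantic's BaseSettings. The override behavior can be sketched with the stdlib alone (load_settings below is a hypothetical stand-in, not part of the gateway):

```python
import os

# Hypothetical stdlib-only stand-in for the Settings class above,
# showing how environment variables override the defaults.
def load_settings(env=None) -> dict:
    env = os.environ if env is None else env
    return {
        "OLLAMA_BASE_URL": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "PROVIDER_TIMEOUT_SECONDS": int(env.get("PROVIDER_TIMEOUT_SECONDS", "60")),
        "PROVIDER_MAX_RETRIES": int(env.get("PROVIDER_MAX_RETRIES", "3")),
    }

# Defaults apply when a variable is unset; .env values win when present.
assert load_settings({})["PROVIDER_TIMEOUT_SECONDS"] == 60
assert load_settings({"PROVIDER_TIMEOUT_SECONDS": "120"})["PROVIDER_TIMEOUT_SECONDS"] == 120
```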

Implementation

Source Code

Here’s the complete implementation of the Ollama provider:
app/providers/ollama.py
import httpx
from app.providers.base import LLMProvider
from app.api.v1.schemas import ChatRequest, ChatResponse, Usage
from app.core.config import settings
import uuid

class OllamaProvider(LLMProvider):
    @property
    def name(self) -> str:
        return "ollama"
    
    async def chat(self, request: ChatRequest) -> ChatResponse:
        """
        Execute a chat completion request against a local Ollama instance.
        """
        url = f"{settings.OLLAMA_BASE_URL}/api/chat"
        
        # Convert ChatRequest messages to Ollama format
        ollama_messages = [
            {"role": msg.role, "content": msg.content} 
            for msg in request.messages
        ]
        
        # Determine target model
        target_model = request.model if request.model not in ["ollama", "local", None] else "llama3.1"
        
        payload = {
            "model": target_model,
            "messages": ollama_messages,
            "stream": False
        }
        
        async with httpx.AsyncClient(timeout=settings.PROVIDER_TIMEOUT_SECONDS) as client:
            try:
                response = await client.post(url, json=payload)
                response.raise_for_status()
                data = response.json()
                
                return ChatResponse(
                    id=str(uuid.uuid4()),
                    provider=self.name,
                    content=data["message"]["content"],
                    usage=Usage(
                        prompt_tokens=data.get("prompt_eval_count", 0),
                        completion_tokens=data.get("eval_count", 0),
                        total_tokens=data.get("prompt_eval_count", 0) + data.get("eval_count", 0)
                    )
                )
            except Exception as e:
                print(f"[OLLAMA ERROR] {e}")
                raise

Key Implementation Details

The provider uses intelligent model selection:
target_model = request.model if request.model not in ["ollama", "local", None] else "llama3.1"
  • If request.model is a specific model name (e.g., "mistral"), use it
  • If request.model is generic ("ollama", "local", None), default to "llama3.1"
  • This allows clients to specify exact models or use routing hints
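The rule above can be isolated as a tiny helper (select_model is illustrative only; the provider inlines this expression):

```python
from typing import Optional

def select_model(requested: Optional[str], default: str = "llama3.1") -> str:
    """Pass explicit model names through; map generic hints to the default."""
    return requested if requested not in ("ollama", "local", None) else default

assert select_model("mistral") == "mistral"   # explicit model wins
assert select_model("local") == "llama3.1"    # generic hint -> default
assert select_model(None) == "llama3.1"       # no model -> default
```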
Ollama’s API uses the same message format as OpenAI:
{
    "role": "user" | "assistant" | "system",
    "content": "message text"
}
No field remapping is needed; each message's role and content are passed through to Ollama unchanged.
Ollama provides actual token counts:
  • prompt_eval_count - Tokens in the prompt
  • eval_count - Tokens generated
  • Both are included in the response for accurate usage tracking
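Given a raw Ollama response body, the usage accounting reduces to the following (usage_from_ollama is a hypothetical helper mirroring the provider code above):

```python
def usage_from_ollama(data: dict) -> dict:
    """Mirror the Usage fields built in OllamaProvider.chat."""
    prompt = data.get("prompt_eval_count", 0)
    completion = data.get("eval_count", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }

# Missing counters default to zero rather than raising.
assert usage_from_ollama({"prompt_eval_count": 15, "eval_count": 248})["total_tokens"] == 263
assert usage_from_ollama({})["total_tokens"] == 0
```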
Uses httpx.AsyncClient for async HTTP requests with:
  • Configurable timeout via PROVIDER_TIMEOUT_SECONDS
  • Automatic connection pooling
  • HTTP status error raising with raise_for_status()

Usage

Routing to Ollama

The gateway routes requests to Ollama when the request carries a local routing hint, for example:
{
  "model_hint": "local",
  "messages": [...]
}
Also accepts "ollama" or "secure" as hints.
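A minimal sketch of that routing rule, assuming the router simply matches hints against a set (the real router lives elsewhere in the gateway and may weigh other signals):

```python
# Hints that steer a request to the local Ollama provider.
LOCAL_HINTS = {"ollama", "local", "secure"}

def routes_to_ollama(model_hint) -> bool:
    """True when the hint asks for local inference."""
    return model_hint in LOCAL_HINTS

assert routes_to_ollama("local")
assert routes_to_ollama("secure")
assert not routes_to_ollama("gemini-pro")
```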

Example Request

curl -X POST http://localhost:8000/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-gateway-123" \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'

Example Response

{
  "id": "b4c9d0e3-f5a6-7b8c-9d0e-1f2a3b4c5d6e",
  "provider": "ollama",
  "content": "Quantum computing is a revolutionary approach to computation that leverages the principles of quantum mechanics...",
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 248,
    "total_tokens": 263
  }
}
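A few sanity checks a client might run over a response of this shape (the JSON is the example above, inlined for self-containment):

```python
import json
import uuid

response_text = """
{
  "id": "b4c9d0e3-f5a6-7b8c-9d0e-1f2a3b4c5d6e",
  "provider": "ollama",
  "content": "Quantum computing is a revolutionary approach to computation...",
  "usage": {"prompt_tokens": 15, "completion_tokens": 248, "total_tokens": 263}
}
"""
data = json.loads(response_text)
assert data["provider"] == "ollama"

# Token totals are internally consistent.
usage = data["usage"]
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]

uuid.UUID(data["id"])  # ids parse as UUID strings
```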

Available Models

llama3.1

  • Size: 8B parameters (default)
  • Best For: General conversation, Q&A
  • Speed: Fast

llama3.1:70b

  • Size: 70B parameters
  • Best For: Complex reasoning, detailed responses
  • Speed: Slower, requires more resources

mistral

  • Size: 7B parameters
  • Best For: Efficient, fast inference
  • Speed: Very fast

codellama

  • Size: 7B-34B parameters
  • Best For: Code generation and explanation
  • Speed: Fast

Installing Models

# List available models
ollama list

# Pull a new model
ollama pull mistral

# Remove a model
ollama rm llama3.1:70b

Error Handling

Common Errors

Error: httpx.ConnectError: [Errno 61] Connection refused
Cause: Ollama server is not running
Solution:
ollama serve

Error: model 'model-name' not found
Cause: Requested model is not installed
Solution:
ollama pull model-name

Error: httpx.TimeoutException
Cause: Request exceeded timeout limit
Solution: Increase timeout in .env:
PROVIDER_TIMEOUT_SECONDS=120

Error: Invalid URL
Cause: Malformed OLLAMA_BASE_URL
Solution: Ensure URL includes protocol:
OLLAMA_BASE_URL=http://localhost:11434
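The Settings class also exposes PROVIDER_MAX_RETRIES, though the provider shown above raises on the first failure. A hedged sketch of how transient connection errors could be retried with exponential backoff (with_retries is illustrative, not part of the gateway):

```python
import time

def with_retries(call, max_retries: int = 3, base_delay: float = 0.5,
                 retry_on=(ConnectionError,)):
    """Invoke call(); on a retryable error, back off and try again."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of budget: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Demo with an injected flaky callable (no network needed).
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("connection refused")
    return "ok"

assert with_retries(flaky, base_delay=0.0) == "ok"
assert len(attempts) == 3  # two failures, then success
```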

Performance Tuning

Resource Requirements

Model performance depends heavily on your hardware:
  • 8B models: 8GB+ RAM recommended
  • 13B models: 16GB+ RAM recommended
  • 70B models: 64GB+ RAM or GPU required

Optimization Tips

1. Use GPU Acceleration

Ollama automatically uses a GPU when one is available (NVIDIA, AMD, or Apple Metal).
2. Adjust Context Window

Reduce the context size for faster responses. Inside an interactive ollama run session:
/set parameter num_ctx 2048
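The same knob is available per request: Ollama's /api/chat accepts an options object, and num_ctx sets the context window. A payload sketch:

```python
# Per-request context window via the API's "options" field.
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": False,
    "options": {"num_ctx": 2048},  # smaller context -> less memory, faster prompts
}

assert payload["options"]["num_ctx"] == 2048
```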
3. Pre-load Models

Keep models in memory:
ollama run llama3.1 "" # Loads model without prompting

Docker Deployment

Running Ollama in Docker alongside the gateway:
docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  
  gateway:
    build: .
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
GPU support in Docker requires nvidia-docker runtime for NVIDIA GPUs.

Next Steps

Gemini Provider

Compare with cloud-based Gemini

Custom Providers

Build your own provider

Router Configuration

Configure intelligent routing

Deployment

Production deployment guide
