Overview

The Ollama Provider enables LLM Gateway Core to connect to a local Ollama instance, providing:
  • Privacy: All data stays on your infrastructure
  • Cost: No API charges for inference
  • Flexibility: Support for multiple open-source models
  • Control: Full control over model versions and configurations

Features

  • HTTP API integration with local Ollama server
  • Dynamic model selection
  • Token usage tracking
  • Configurable timeouts
  • Default fallback to llama3.1

Prerequisites

1. Install Ollama

Download and install Ollama from ollama.ai
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
2. Pull a Model

Download a model to use with the gateway:
ollama pull llama3.1
Other popular models:
  • ollama pull llama3.1:70b - Larger, more capable
  • ollama pull mistral - Fast and efficient
  • ollama pull codellama - Optimized for code
3. Start Ollama Server

Ensure Ollama is running:
ollama serve
By default, Ollama listens on http://localhost:11434

Configuration

Environment Variables

Configure the Ollama provider in your .env file:
.env
OLLAMA_BASE_URL=http://localhost:11434
PROVIDER_TIMEOUT_SECONDS=60

Configuration Settings

app/core/config.py
class Settings(BaseSettings):
    OLLAMA_BASE_URL: str = "http://localhost:11434"
    PROVIDER_TIMEOUT_SECONDS: int = 60
    PROVIDER_MAX_RETRIES: int = 3
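In the app these values are loaded by pydantic's BaseSettings. The override behavior can be sketched with the stdlib alone (load_settings below is a hypothetical stand-in, not part of the gateway):

```python
import os

# Hypothetical stdlib-only stand-in for the Settings class above,
# showing how environment variables override the defaults.
def load_settings(env=None) -> dict:
    env = os.environ if env is None else env
    return {
        "OLLAMA_BASE_URL": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "PROVIDER_TIMEOUT_SECONDS": int(env.get("PROVIDER_TIMEOUT_SECONDS", "60")),
        "PROVIDER_MAX_RETRIES": int(env.get("PROVIDER_MAX_RETRIES", "3")),
    }

# Defaults apply when a variable is unset; .env values win when present.
assert load_settings({})["PROVIDER_TIMEOUT_SECONDS"] == 60
assert load_settings({"PROVIDER_TIMEOUT_SECONDS": "120"})["PROVIDER_TIMEOUT_SECONDS"] == 120
```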

Implementation

Source Code

Here’s the complete implementation of the Ollama provider:
app/providers/ollama.py
import httpx
from app.providers.base import LLMProvider
from app.api.v1.schemas import ChatRequest, ChatResponse, Usage
from app.core.config import settings
import uuid

class OllamaProvider(LLMProvider):
    @property
    def name(self) -> str:
        return "ollama"
    
    async def chat(self, request: ChatRequest) -> ChatResponse:
        """
        Execute a chat completion request against a local Ollama instance.
        """
        url = f"{settings.OLLAMA_BASE_URL}/api/chat"
        
        # Convert ChatRequest messages to Ollama format
        ollama_messages = [
            {"role": msg.role, "content": msg.content} 
            for msg in request.messages
        ]
        
        # Determine target model
        target_model = request.model if request.model not in ["ollama", "local", None] else "llama3.1"
        
        payload = {
            "model": target_model,
            "messages": ollama_messages,
            "stream": False
        }
        
        async with httpx.AsyncClient(timeout=settings.PROVIDER_TIMEOUT_SECONDS) as client:
            try:
                response = await client.post(url, json=payload)
                response.raise_for_status()
                data = response.json()
                
                return ChatResponse(
                    id=str(uuid.uuid4()),
                    provider=self.name,
                    content=data["message"]["content"],
                    usage=Usage(
                        prompt_tokens=data.get("prompt_eval_count", 0),
                        completion_tokens=data.get("eval_count", 0),
                        total_tokens=data.get("prompt_eval_count", 0) + data.get("eval_count", 0)
                    )
                )
            except Exception as e:
                print(f"[OLLAMA ERROR] {e}")
                raise

Key Implementation Details

The provider uses intelligent model selection:
target_model = request.model if request.model not in ["ollama", "local", None] else "llama3.1"
  • If request.model is a specific model name (e.g., "mistral"), use it
  • If request.model is generic ("ollama", "local", None), default to "llama3.1"
  • This allows clients to specify exact models or use routing hints
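The rule above can be isolated as a tiny helper (select_model is illustrative only; the provider inlines this expression):

```python
from typing import Optional

def select_model(requested: Optional[str], default: str = "llama3.1") -> str:
    """Pass explicit model names through; map generic hints to the default."""
    return requested if requested not in ("ollama", "local", None) else default

assert select_model("mistral") == "mistral"   # explicit model wins
assert select_model("local") == "llama3.1"    # generic hint -> default
assert select_model(None) == "llama3.1"       # no model -> default
```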
Ollama’s API uses the same message format as OpenAI:
{
    "role": "user" | "assistant" | "system",
    "content": "message text"
}
No field remapping is needed; each message's role and content are passed through to Ollama unchanged.
Ollama provides actual token counts:
  • prompt_eval_count - Tokens in the prompt
  • eval_count - Tokens generated
  • Both are included in the response for accurate usage tracking
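Given a raw Ollama response body, the usage accounting reduces to the following (usage_from_ollama is a hypothetical helper mirroring the provider code above):

```python
def usage_from_ollama(data: dict) -> dict:
    """Mirror the Usage fields built in OllamaProvider.chat."""
    prompt = data.get("prompt_eval_count", 0)
    completion = data.get("eval_count", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }

# Missing counters default to zero rather than raising.
assert usage_from_ollama({"prompt_eval_count": 15, "eval_count": 248})["total_tokens"] == 263
assert usage_from_ollama({})["total_tokens"] == 0
```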
Uses httpx.AsyncClient for async HTTP requests with:
  • Configurable timeout via PROVIDER_TIMEOUT_SECONDS
  • Automatic connection pooling
  • HTTP status error raising with raise_for_status()

Usage

Routing to Ollama

The gateway routes requests to Ollama when the request carries a local routing hint, for example:
{
  "model_hint": "local",
  "messages": [...]
}
Also accepts "ollama" or "secure" as hints.
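A minimal sketch of that routing rule, assuming the router simply matches hints against a set (the real router lives elsewhere in the gateway and may weigh other signals):

```python
# Hints that steer a request to the local Ollama provider.
LOCAL_HINTS = {"ollama", "local", "secure"}

def routes_to_ollama(model_hint) -> bool:
    """True when the hint asks for local inference."""
    return model_hint in LOCAL_HINTS

assert routes_to_ollama("local")
assert routes_to_ollama("secure")
assert not routes_to_ollama("gemini-pro")
```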

Example Request

curl -X POST http://localhost:8000/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-gateway-123" \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'

Example Response

{
  "id": "b4c9d0e3-f5a6-7b8c-9d0e-1f2a3b4c5d6e",
  "provider": "ollama",
  "content": "Quantum computing is a revolutionary approach to computation that leverages the principles of quantum mechanics...",
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 248,
    "total_tokens": 263
  }
}
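A few sanity checks a client might run over a response of this shape (the JSON is the example above, inlined for self-containment):

```python
import json
import uuid

response_text = """
{
  "id": "b4c9d0e3-f5a6-7b8c-9d0e-1f2a3b4c5d6e",
  "provider": "ollama",
  "content": "Quantum computing is a revolutionary approach to computation...",
  "usage": {"prompt_tokens": 15, "completion_tokens": 248, "total_tokens": 263}
}
"""
data = json.loads(response_text)
assert data["provider"] == "ollama"

# Token totals are internally consistent.
usage = data["usage"]
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]

uuid.UUID(data["id"])  # ids parse as UUID strings
```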

Available Models

llama3.1

  • Size: 8B parameters (default)
  • Best For: General conversation, Q&A
  • Speed: Fast

llama3.1:70b

  • Size: 70B parameters
  • Best For: Complex reasoning, detailed responses
  • Speed: Slower, requires more resources

mistral

  • Size: 7B parameters
  • Best For: Efficient, fast inference
  • Speed: Very fast

codellama

  • Size: 7B-34B parameters
  • Best For: Code generation and explanation
  • Speed: Fast

Installing Models

# List available models
ollama list

# Pull a new model
ollama pull mistral

# Remove a model
ollama rm llama3.1:70b

Error Handling

Common Errors

Error: httpx.ConnectError: [Errno 61] Connection refused
Cause: Ollama server is not running
Solution:
ollama serve

Error: model 'model-name' not found
Cause: Requested model is not installed
Solution:
ollama pull model-name

Error: httpx.TimeoutException
Cause: Request exceeded timeout limit
Solution: Increase timeout in .env:
PROVIDER_TIMEOUT_SECONDS=120

Error: Invalid URL
Cause: Malformed OLLAMA_BASE_URL
Solution: Ensure URL includes protocol:
OLLAMA_BASE_URL=http://localhost:11434
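The Settings class also exposes PROVIDER_MAX_RETRIES, though the provider shown above raises on the first failure. A hedged sketch of how transient connection errors could be retried with exponential backoff (with_retries is illustrative, not part of the gateway):

```python
import time

def with_retries(call, max_retries: int = 3, base_delay: float = 0.5,
                 retry_on=(ConnectionError,)):
    """Invoke call(); on a retryable error, back off and try again."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of budget: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Demo with an injected flaky callable (no network needed).
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("connection refused")
    return "ok"

assert with_retries(flaky, base_delay=0.0) == "ok"
assert len(attempts) == 3  # two failures, then success
```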

Performance Tuning

Resource Requirements

Model performance depends heavily on your hardware:
  • 8B models: 8GB+ RAM recommended
  • 13B models: 16GB+ RAM recommended
  • 70B models: 64GB+ RAM or GPU required

Optimization Tips

1. Use GPU Acceleration

Ollama automatically uses a GPU when one is available (NVIDIA, AMD, or Apple Metal).
2. Adjust Context Window

Reduce the context size for faster responses. Inside an interactive ollama run session:
/set parameter num_ctx 2048
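The same knob is available per request: Ollama's /api/chat accepts an options object, and num_ctx sets the context window. A payload sketch:

```python
# Per-request context window via the API's "options" field.
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": False,
    "options": {"num_ctx": 2048},  # smaller context -> less memory, faster prompts
}

assert payload["options"]["num_ctx"] == 2048
```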
3. Pre-load Models

Keep models in memory:
ollama run llama3.1 "" # Loads model without prompting

Docker Deployment

Running Ollama in Docker alongside the gateway:
docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  
  gateway:
    build: .
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
GPU support in Docker requires nvidia-docker runtime for NVIDIA GPUs.

Next Steps

Gemini Provider

Compare with cloud-based Gemini

Custom Providers

Build your own provider

Router Configuration

Configure intelligent routing

Deployment

Production deployment guide
