
Overview

Ollama lets you run large language models locally. LiteLLM provides seamless integration with Ollama, supporting chat, embeddings, function calling, and reasoning models.

Quick Start

Step 1: Install Ollama

Download and install Ollama from ollama.ai
# Pull a model
ollama pull llama3.3
Step 2: Install LiteLLM

pip install litellm
Step 3: Make Your First Call

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434"
)
print(response.choices[0].message.content)
Llama Models

Meta's Llama models, pulled and called like any other Ollama model:
# Pull models
ollama pull llama3.3
ollama pull llama3.1
from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Explain AI"}],
    api_base="http://localhost:11434"
)

Configuration

If api_base is omitted, LiteLLM defaults to the local Ollama endpoint at http://localhost:11434.

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}]
    # Defaults to http://localhost:11434
)

Streaming

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Write a story"}],
    api_base="http://localhost:11434",
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Function Calling

Ollama 0.4+ supports native function calling.
from litellm import completion

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools,
    api_base="http://localhost:11434"
)

if response.choices[0].message.tool_calls:
    print("Tool calls:", response.choices[0].message.tool_calls)
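After the model returns tool calls, your code is responsible for executing them. A minimal dispatcher sketch; `get_weather` and the simulated payload below are stand-ins for your real functions and the model's actual output:

```python
import json

def get_weather(location):
    # Stand-in implementation; replace with a real weather lookup
    return f"Sunny in {location}"

# Map tool names from the tools definition to local callables
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(name, arguments_json):
    """Look up a tool by name and call it with the model-supplied JSON arguments."""
    func = TOOL_REGISTRY[name]
    args = json.loads(arguments_json)
    return func(**args)

# Simulated arguments shaped like tool_calls[0].function.arguments
result = run_tool_call("get_weather", '{"location": "San Francisco"}')
print(result)  # Sunny in San Francisco
```

With a live response, pass `tool_call.function.name` and `tool_call.function.arguments` for each entry in `response.choices[0].message.tool_calls`.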

Reasoning Models

Use reasoning capabilities with compatible models.
from litellm import completion

response = completion(
    model="ollama/gpt-oss-120b",
    messages=[{"role": "user", "content": "Solve this problem..."}],
    reasoning_effort="medium",  # low, medium, high
    api_base="http://localhost:11434"
)

if response.choices[0].message.reasoning_content:
    print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

JSON Mode

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "List 3 colors in JSON"}],
    response_format={"type": "json_object"},
    api_base="http://localhost:11434"
)

import json
data = json.loads(response.choices[0].message.content)
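Local models sometimes wrap their JSON in markdown code fences even in JSON mode, so defensive parsing helps. A hedged sketch of a fallback parser:

```python
import json
import re

def parse_json_reply(text):
    """Parse model output as JSON, stripping markdown code fences if present."""
    text = text.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

print(parse_json_reply('```json\n{"colors": ["red", "green", "blue"]}\n```'))
```

This can replace the bare `json.loads` call above when a model's JSON-mode output is inconsistent.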

Vision Models

Use vision-capable models with images.
from litellm import completion

response = completion(
    model="ollama/llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    api_base="http://localhost:11434"
)
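For local image files, you can pass a base64 data URL instead of a remote URL. A minimal sketch (the `photo.png` path in the comment is a placeholder):

```python
import base64

def to_data_url(image_bytes, mime="image/png"):
    """Build a data URL suitable for the image_url field of a vision message."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# With a real file: to_data_url(open("photo.png", "rb").read())
url = to_data_url(b"abc")
print(url)  # data:image/png;base64,YWJj
```

The resulting string goes in `{"type": "image_url", "image_url": {"url": url}}` in place of the remote URL above.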

Embeddings

from litellm import embedding

response = embedding(
    model="ollama/nomic-embed-text",
    input=["Text to embed", "Another text"],
    api_base="http://localhost:11434"
)

embeddings = [data.embedding for data in response.data]
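Embeddings are typically compared with cosine similarity. A minimal pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# e.g. compare the two embeddings returned above:
# score = cosine_similarity(embeddings[0], embeddings[1])
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```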

Advanced Configuration

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434",
    # OpenAI params
    temperature=0.8,
    max_tokens=500,
    top_p=0.9,
    frequency_penalty=0.5,
    seed=42,
    # Ollama-specific params
    num_ctx=4096,  # Context window size
    num_predict=200,  # Max tokens to generate
    repeat_penalty=1.1,  # Penalize repetition
    top_k=40,  # Top-k sampling
    mirostat=0,  # Mirostat sampling (0=off, 1=v1, 2=v2)
    keep_alive="5m"  # Keep model loaded
)

Supported Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| temperature | float | Randomness (0-1) |
| max_tokens | int | Max output tokens |
| max_completion_tokens | int | Alternative to max_tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Maps to repeat_penalty |
| stop | list | Stop sequences |
| seed | int | Reproducibility |
| num_ctx | int | Context window size |
| num_predict | int | Max tokens to generate |
| repeat_penalty | float | Penalize repetition |
| top_k | int | Top-k sampling |
| mirostat | int | Mirostat mode (0/1/2) |
| keep_alive | str | Keep model loaded duration |

Error Handling

from litellm import completion
from litellm.exceptions import APIError

try:
    response = completion(
        model="ollama/llama3.3",
        messages=[{"role": "user", "content": "Hello!"}],
        api_base="http://localhost:11434"
    )
except APIError as e:
    print(f"Error: {e.status_code} - {e.message}")
    # Check if Ollama is running
    # Check if model is pulled
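A local server can be briefly unavailable, for example while a model is loading, so a simple retry with exponential backoff is often useful. A generic sketch, demonstrated with a stand-in flaky function rather than a live completion call:

```python
import time

def with_retries(func, attempts=3, base_delay=1.0):
    """Call func, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for the completion(...) call; fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Ollama not ready")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In practice, wrap the `completion(...)` call in a lambda or small function and pass it to `with_retries`.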

LiteLLM Proxy

Serve Ollama models through the LiteLLM proxy by defining them in config.yaml:

model_list:
  - model_name: llama3.3
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434
  
  - model_name: codellama
    litellm_params:
      model: ollama/codellama
      api_base: http://192.168.1.100:11434
Then call the proxy with any OpenAI-compatible client:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

  • Pull models before use: ollama pull model-name
  • Use keep_alive to keep frequently-used models loaded
  • Monitor system resources (RAM, GPU memory)
  • Use GPU acceleration when available
  • Adjust num_ctx based on your needs
  • Smaller models (7B/8B) for speed, larger (70B+) for quality
  • Function calling requires Ollama 0.4+; not all models support it equally, so test with your specific model before production

Troubleshooting

Connection errors: make sure Ollama is running.

# Check Ollama is running
ollama list

# Start Ollama if needed
ollama serve

Model not found: pull the model before calling it.

# Pull the model first
ollama pull llama3.3

# List available models
ollama list

Out-of-memory errors:
  • Use smaller models or quantized versions
  • Reduce num_ctx to lower memory usage
  • Close other applications
