
Overview

Ollama lets you run large language models locally. LiteLLM provides seamless integration with Ollama, supporting chat, embeddings, function calling, and reasoning models.

Quick Start

Step 1: Install Ollama

Download and install Ollama from ollama.ai
# Pull a model
ollama pull llama3.3
Step 2: Install LiteLLM

pip install litellm
Step 3: Make Your First Call

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434"
)
print(response.choices[0].message.content)
Llama Models

Meta's Llama models, pulled and called like any other Ollama model:
# Pull models
ollama pull llama3.3
ollama pull llama3.1
from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Explain AI"}],
    api_base="http://localhost:11434"
)

Configuration

If api_base is omitted, LiteLLM defaults to the local Ollama endpoint at http://localhost:11434.

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}]
    # Defaults to http://localhost:11434
)

Streaming

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Write a story"}],
    api_base="http://localhost:11434",
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Function Calling

Ollama 0.4+ supports native function calling.
from litellm import completion

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools,
    api_base="http://localhost:11434"
)

if response.choices[0].message.tool_calls:
    print("Tool calls:", response.choices[0].message.tool_calls)
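After the model returns tool calls, your code is responsible for executing them. A minimal dispatcher sketch; `get_weather` and the simulated payload below are stand-ins for your real functions and the model's actual output:

```python
import json

def get_weather(location):
    # Stand-in implementation; replace with a real weather lookup
    return f"Sunny in {location}"

# Map tool names from the tools definition to local callables
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(name, arguments_json):
    """Look up a tool by name and call it with the model-supplied JSON arguments."""
    func = TOOL_REGISTRY[name]
    args = json.loads(arguments_json)
    return func(**args)

# Simulated arguments shaped like tool_calls[0].function.arguments
result = run_tool_call("get_weather", '{"location": "San Francisco"}')
print(result)  # Sunny in San Francisco
```

With a live response, pass `tool_call.function.name` and `tool_call.function.arguments` for each entry in `response.choices[0].message.tool_calls`.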

Reasoning Models

Use reasoning capabilities with compatible models.
from litellm import completion

response = completion(
    model="ollama/gpt-oss-120b",
    messages=[{"role": "user", "content": "Solve this problem..."}],
    reasoning_effort="medium",  # low, medium, high
    api_base="http://localhost:11434"
)

if response.choices[0].message.reasoning_content:
    print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

JSON Mode

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "List 3 colors in JSON"}],
    response_format={"type": "json_object"},
    api_base="http://localhost:11434"
)

import json
data = json.loads(response.choices[0].message.content)
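Local models sometimes wrap their JSON in markdown code fences even in JSON mode, so defensive parsing helps. A hedged sketch of a fallback parser:

```python
import json
import re

def parse_json_reply(text):
    """Parse model output as JSON, stripping markdown code fences if present."""
    text = text.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

print(parse_json_reply('```json\n{"colors": ["red", "green", "blue"]}\n```'))
```

This can replace the bare `json.loads` call above when a model's JSON-mode output is inconsistent.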

Vision Models

Use vision-capable models with images.
from litellm import completion

response = completion(
    model="ollama/llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    api_base="http://localhost:11434"
)
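For local image files, you can pass a base64 data URL instead of a remote URL. A minimal sketch (the `photo.png` path in the comment is a placeholder):

```python
import base64

def to_data_url(image_bytes, mime="image/png"):
    """Build a data URL suitable for the image_url field of a vision message."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# With a real file: to_data_url(open("photo.png", "rb").read())
url = to_data_url(b"abc")
print(url)  # data:image/png;base64,YWJj
```

The resulting string goes in `{"type": "image_url", "image_url": {"url": url}}` in place of the remote URL above.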

Embeddings

from litellm import embedding

response = embedding(
    model="ollama/nomic-embed-text",
    input=["Text to embed", "Another text"],
    api_base="http://localhost:11434"
)

embeddings = [data.embedding for data in response.data]
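Embeddings are typically compared with cosine similarity. A minimal pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# e.g. compare the two embeddings returned above:
# score = cosine_similarity(embeddings[0], embeddings[1])
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```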

Advanced Configuration

from litellm import completion

response = completion(
    model="ollama/llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434",
    # OpenAI params
    temperature=0.8,
    max_tokens=500,
    top_p=0.9,
    frequency_penalty=0.5,
    seed=42,
    # Ollama-specific params
    num_ctx=4096,  # Context window size
    num_predict=200,  # Max tokens to generate
    repeat_penalty=1.1,  # Penalize repetition
    top_k=40,  # Top-k sampling
    mirostat=0,  # Mirostat sampling (0=off, 1=v1, 2=v2)
    keep_alive="5m"  # Keep model loaded
)

Supported Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| temperature | float | Randomness (0-1) |
| max_tokens | int | Max output tokens |
| max_completion_tokens | int | Alternative to max_tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Maps to repeat_penalty |
| stop | list | Stop sequences |
| seed | int | Reproducibility |
| num_ctx | int | Context window size |
| num_predict | int | Max tokens to generate |
| repeat_penalty | float | Penalize repetition |
| top_k | int | Top-k sampling |
| mirostat | int | Mirostat mode (0/1/2) |
| keep_alive | str | Keep model loaded duration |

Error Handling

from litellm import completion
from litellm.exceptions import APIError

try:
    response = completion(
        model="ollama/llama3.3",
        messages=[{"role": "user", "content": "Hello!"}],
        api_base="http://localhost:11434"
    )
except APIError as e:
    print(f"Error: {e.status_code} - {e.message}")
    # Check if Ollama is running
    # Check if model is pulled
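A local server can be briefly unavailable, for example while a model is loading, so a simple retry with exponential backoff is often useful. A generic sketch, demonstrated with a stand-in flaky function rather than a live completion call:

```python
import time

def with_retries(func, attempts=3, base_delay=1.0):
    """Call func, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for the completion(...) call; fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Ollama not ready")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In practice, wrap the `completion(...)` call in a lambda or small function and pass it to `with_retries`.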

LiteLLM Proxy

Serve Ollama models through the LiteLLM proxy by defining them in config.yaml:

model_list:
  - model_name: llama3.3
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434
  
  - model_name: codellama
    litellm_params:
      model: ollama/codellama
      api_base: http://192.168.1.100:11434
Then call the proxy with any OpenAI-compatible client:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

  • Pull models before use: ollama pull model-name
  • Use keep_alive to keep frequently-used models loaded
  • Monitor system resources (RAM, GPU memory)
  • Use GPU acceleration when available
  • Adjust num_ctx based on your needs
  • Smaller models (7B/8B) for speed, larger (70B+) for quality
  • Function calling requires Ollama 0.4+; not all models support it equally, so test with your specific model before production

Troubleshooting

Connection errors: make sure Ollama is running.

# Check Ollama is running
ollama list

# Start Ollama if needed
ollama serve

Model not found: pull the model before calling it.

# Pull the model first
ollama pull llama3.3

# List available models
ollama list

Out-of-memory errors:
  • Use smaller models or quantized versions
  • Reduce num_ctx to lower memory usage
  • Close other applications
