Overview

Ollama enables you to run large language models locally on your own hardware. It is well suited to development, testing, privacy-sensitive applications, and offline use. Access Llama, Mistral, Gemma, and many more models without any API costs.

Base URL: your local Ollama server (default: `http://localhost:11434`)

Supported Features

  • ✅ Chat Completions
  • ✅ Streaming
  • ✅ Embeddings
  • ✅ Vision (multimodal models)
  • ✅ Custom Models
  • ✅ Model Library (100+ models)
  • ⚠️ Function Calling (limited support)
  • ❌ Image Generation

Prerequisites

Install Ollama

# Download from ollama.com or use:
brew install ollama

# Start Ollama
ollama serve
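
Before sending requests through a client, it can help to verify that the server is actually reachable. A minimal sketch using only the standard library, assuming the default port 11434 and Ollama's `/api/tags` endpoint (which lists locally available models):

```python
# Quick reachability check for a local Ollama server.
# Adjust `host` if your server runs on a different port or machine.
import json
import urllib.request


def is_ollama_running(host: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on `host`."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            # /api/tags returns the locally downloaded models as JSON
            json.load(resp)
            return True
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response
        return False
```

Calling `is_ollama_running()` before constructing the client gives a clearer error message than a failed chat request.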

Pull a Model

# Pull Llama 3.1 (4.7GB)
ollama pull llama3.1

# Pull Mistral (4.1GB)
ollama pull mistral

# Pull a vision model
ollama pull llava

# List downloaded models
ollama list

Quick Start

Chat Completions

from portkey_ai import Portkey

client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Explain the benefits of running models locally"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Meta Llama

| Model | Size | Memory | Description |
|---|---|---|---|
| llama3.3 | 43GB | 32GB | Latest Llama 3.3 70B |
| llama3.1 | 4.7GB | 8GB | Llama 3.1 8B |
| llama3.1:70b | 40GB | 48GB | Llama 3.1 70B |
| llama3.1:405b | 231GB | 256GB+ | Largest Llama |
| llama2 | 3.8GB | 8GB | Llama 2 7B |

Mistral & Mixtral

| Model | Size | Memory | Description |
|---|---|---|---|
| mistral | 4.1GB | 8GB | Mistral 7B |
| mistral-large | 40GB | 48GB | Mistral Large |
| mixtral | 26GB | 32GB | Mixtral 8x7B MoE |

Google Gemma

| Model | Size | Memory | Description |
|---|---|---|---|
| gemma2 | 5.4GB | 8GB | Gemma 2 9B |
| gemma2:27b | 16GB | 20GB | Gemma 2 27B |
| gemma | 5.0GB | 8GB | Gemma 7B |

Vision Models

| Model | Size | Memory | Description |
|---|---|---|---|
| llava | 4.7GB | 8GB | Llama with vision |
| llava:34b | 20GB | 24GB | Larger vision model |
| bakllava | 4.7GB | 8GB | Alternative vision |

Specialized Models

| Model | Size | Purpose |
|---|---|---|
| codellama | 3.8GB | Code generation |
| phi3 | 2.3GB | Microsoft's small model |
| qwen2.5 | 4.7GB | Multilingual |
| deepseek-coder | 3.8GB | Advanced coding |
| nous-hermes2 | 4.1GB | General purpose |

Ollama excels at:
  • Privacy - Data never leaves your machine
  • Zero cost - No API fees
  • Offline use - Works without internet
  • Fast iteration - No network latency
  • Customization - Create and modify models

Configuration Options

client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"  # Your Ollama server URL
)

Remote Ollama Server

# Connect to Ollama on another machine
client = Portkey(
    provider="ollama",
    custom_host="http://192.168.1.100:11434"
)

Docker Container

# If Ollama is in Docker
client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"
)

Advanced Features

System Messages

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful Python programming expert."
        },
        {
            "role": "user",
            "content": "How do I read a file?"
        }
    ]
)

Vision (Multimodal)

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"}
            }
        ]
    }]
)
Local image:
import base64

with open("local_image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}
            }
        ]
    }]
)

Embeddings

response = client.embeddings.create(
    model="llama3.1",
    input="Local embeddings with Ollama"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")
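
Once you have embeddings, you can compare texts locally with cosine similarity. A minimal helper in pure Python (no extra dependencies); it accepts the lists returned in `response.data[i].embedding`:

```python
# Compare two embedding vectors with cosine similarity.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity in [-1, 1]; higher means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

For example, embed two documents, then call `cosine_similarity(emb1, emb2)` to rank them by relatedness, all without data leaving your machine.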

Temperature Control

# Deterministic
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# Creative
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=1.0
)

Model Management

List Models

# List all downloaded models
ollama list

Pull Models

# Download a model
ollama pull llama3.1

# Pull specific size
ollama pull llama3.1:70b

# Pull with tag
ollama pull mistral:7b-instruct-v0.2-q4_0

Remove Models

# Free up space
ollama rm llama2

Run Interactive

# Chat with model in terminal
ollama run llama3.1

# Exit with /bye

Custom Models

Create a Custom Model

  1. Create a Modelfile:

FROM llama3.1

# Set custom parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9

# Set custom system message
SYSTEM You are a helpful Python coding assistant. Always provide working code examples.

  2. Create the model:

ollama create python-expert -f Modelfile

  3. Use your custom model:
response = client.chat.completions.create(
    model="python-expert",
    messages=[{"role": "user", "content": "Write a quicksort function"}]
)

Fallback Configuration

Use local Ollama first, fallback to cloud:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "ollama",
            "custom_host": "http://localhost:11434",
            "override_params": {"model": "llama3.1"}
        },
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o-mini"}
        }
    ]
}

client = Portkey().with_options(config=config)

Best Practices

  1. Choose appropriate model size - Match to your hardware
  2. Use quantized models - Smaller, faster (q4_0, q5_1)
  3. Monitor memory usage - Leave headroom for system
  4. Keep models updated - ollama pull to update
  5. Use GPU if available - Much faster inference
  6. Warm up models - First request may be slow
  7. Batch similar requests - Amortize startup cost
  8. Create custom models - Optimize for your use case
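
Points 6 and 7 above can be combined in a small helper: send one cheap request at startup so the model is already loaded when real traffic arrives. A sketch, assuming `client` is any OpenAI-style client such as the Portkey one shown earlier (`warm_up` is a hypothetical helper, not part of either SDK):

```python
def warm_up(client, model: str) -> None:
    """Send a minimal request so Ollama loads the model into memory.

    The first request after `ollama serve` starts (or after a model is
    evicted from memory) pays the model-load cost; doing it eagerly at
    startup keeps user-facing requests fast.
    """
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,  # we only care about the side effect of loading
    )
```

Call `warm_up(client, "llama3.1")` once during application startup, before serving real requests.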

Hardware Requirements

Minimum Specs

  • CPU: Modern quad-core
  • RAM: 8GB (for 7B models)
  • Disk: 10GB free space

Recommended Specs

  • CPU: 8+ cores
  • RAM: 16GB+ (for 13B models)
  • GPU: NVIDIA with 8GB+ VRAM (optional but recommended)
  • Disk: 50GB+ SSD

For Larger Models

  • 70B models: 48GB+ RAM
  • 405B models: 256GB+ RAM or multi-GPU setup
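
A rough rule of thumb behind these numbers: a model's weight footprint is approximately parameter count times bits per weight. A hypothetical estimator (the 20% overhead factor is an assumption covering the KV cache and runtime buffers, not an Ollama-documented value):

```python
def estimate_model_size_gb(params_billions: float,
                           bits_per_weight: int = 4,
                           overhead: float = 1.2) -> float:
    """Rough memory estimate: params * (bits / 8) bytes, plus runtime overhead."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead
```

An 8B model at 4-bit quantization comes out near the ~4.7GB quoted in the model tables above; a 70B model at 4 bits lands around the 40-48GB range.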

Performance Tips

Use GPU

Ollama automatically uses GPU if available (NVIDIA, Apple Silicon).

Quantization Levels

| Suffix | Size | Quality | Speed |
|---|---|---|---|
| q4_0 | Smallest | Good | Fastest |
| q4_1 | Small | Better | Fast |
| q5_0 | Medium | Good | Medium |
| q5_1 | Medium | Better | Medium |
| q8_0 | Large | Best | Slow |
| (none) | Largest | Full | Slowest |

Example:
# Faster, smaller
ollama pull llama3.1:8b-instruct-q4_0

# Best quality
ollama pull llama3.1:8b-instruct-q8_0

Use Cases

Development & Testing

# Test locally before deploying to production
dev_client = Portkey(
    provider="ollama",
    custom_host="http://localhost:11434"
)

Privacy-Sensitive Applications

# Keep sensitive data on-premises
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Analyze this private data..."}]
)

Offline Applications

# Works without internet
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Help me while offline"}]
)

Cost Optimization

# Zero API costs
for query in large_batch:
    response = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": query}]
    )

Pricing

Ollama is completely free!
  • No API costs
  • No rate limits
  • No usage tracking
  • Run unlimited requests
Only costs: Your hardware and electricity

Troubleshooting

Model Not Found

# Make sure model is pulled
ollama pull llama3.1
ollama list  # Verify it's there

Out of Memory

# Use smaller model or quantized version
ollama pull llama3.1:8b-instruct-q4_0

Slow Performance

# Check if GPU is being used
ollama ps

# Use smaller/quantized model
ollama pull mistral:7b-instruct-q4_0

Model Library

Browse 100+ available models

Fallback Routing

Fallback to cloud when needed

Cost Optimization

Optimize AI costs

Privacy

Private AI deployments
