The OllamaEmbeddings class provides integration with locally hosted Ollama models for privacy-focused, offline embedding generation.

Installation

Install Ollama

First, install and set up Ollama on your system:
# Download from https://ollama.com
# Or use package manager:
brew install ollama  # macOS

Pull an embedding model

ollama pull llama3
View available models at Ollama Model Library.

Install LangChain integration

pip install langchain-ollama

Usage

Start Ollama server

ollama serve

Basic usage

from langchain_ollama import OllamaEmbeddings

embed = OllamaEmbeddings(
    model="llama3"
)

Embed single text

text = "The meaning of life is 42"
vector = embed.embed_query(text)
print(vector[:3])
[-0.024603435769677162, -0.007543657906353474, 0.0039630369283258915]

Embed multiple texts

texts = ["Document 1...", "Document 2..."]
vectors = embed.embed_documents(texts)
print(len(vectors))
print(vectors[0][:3])
2
[-0.024603435769677162, -0.007543657906353474, 0.0039630369283258915]
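
The vectors returned by embed_documents can be compared directly; a common next step is cosine-similarity ranking. A minimal sketch in plain Python with no extra dependencies — the unit vectors below are placeholders for the embeddings produced above:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With real embeddings you would rank documents against a query:
#   scores = [cosine_similarity(query_vec, v) for v in vectors]
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```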

Async usage

vector = await embed.aembed_query(text)
vectors = await embed.aembed_documents(texts)
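
The async methods let you fan out many embedding calls concurrently with asyncio.gather. A sketch of the pattern — the FakeEmbed stand-in is hypothetical (not part of langchain-ollama) so the snippet runs without a live server; swap in your OllamaEmbeddings instance in practice:

```python
import asyncio

async def embed_all(embed, queries):
    # Issue one aembed_query call per input and await them together.
    return await asyncio.gather(*(embed.aembed_query(q) for q in queries))

# Stand-in embedder so the sketch runs without a running Ollama server:
class FakeEmbed:
    async def aembed_query(self, text):
        return [float(len(text))]

vectors = asyncio.run(embed_all(FakeEmbed(), ["a", "bb", "ccc"]))
print(vectors)  # [[1.0], [2.0], [3.0]]
```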

Configuration

Supported models

Popular embedding models on Ollama:
  • llama3 - Meta’s Llama 3 model
  • nomic-embed-text - Nomic’s text embedding model
  • mxbai-embed-large - MixedBread.ai’s large embedding model
  • all-minilm - Sentence-transformers MiniLM model
View and pull models:
# List pulled models
ollama list

# Pull a specific model
ollama pull nomic-embed-text

# Pull specific version
ollama pull llama3:13b-v1.5-16k-q4_0

Custom base URL

Connect to Ollama running on a different host:
embed = OllamaEmbeddings(
    model="llama3",
    base_url="http://192.168.1.100:11434"
)

Authentication

For Ollama behind a proxy:
embed = OllamaEmbeddings(
    model="llama3",
    base_url="http://username:password@localhost:11434"
)
Userinfo authentication (credentials embedded in the URL) is not secure; use it only for local testing in trusted environments, never in production or over unsecured networks.
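
If your proxy accepts header-based authentication, a safer alternative is to pass credentials through client_kwargs, which are forwarded to the underlying httpx client. A configuration sketch — the bearer token is a placeholder:

```python
from langchain_ollama import OllamaEmbeddings

# Auth header forwarded to the httpx client, keeping credentials
# out of the URL itself:
embed = OllamaEmbeddings(
    model="llama3",
    base_url="http://localhost:11434",
    client_kwargs={"headers": {"Authorization": "Bearer <token>"}},
)
```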

Model parameters

Configure sampling and performance:
embed = OllamaEmbeddings(
    model="llama3",
    temperature=0.8,  # Sampling temperature
    num_ctx=2048,  # Context window size
    num_gpu=1,  # Number of GPUs to use
    num_thread=8  # CPU threads for computation
)

Keep-alive

Control how long the model stays loaded:
embed = OllamaEmbeddings(
    model="llama3",
    keep_alive=300  # Keep loaded for 5 minutes
)
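
The underlying Ollama API also accepts duration strings and sentinel values for keep_alive, which langchain-ollama passes through. A configuration sketch — verify the behavior against your Ollama version:

```python
from langchain_ollama import OllamaEmbeddings

# Duration string instead of seconds:
embed = OllamaEmbeddings(model="llama3", keep_alive="5m")

# -1 keeps the model loaded indefinitely; 0 unloads it after each request.
embed_pinned = OllamaEmbeddings(model="llama3", keep_alive=-1)
```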

Validate model on init

Check if model exists locally before using:
embed = OllamaEmbeddings(
    model="llama3",
    validate_model_on_init=True
)

Parameters

model
string
required
Name of the Ollama model to use.
base_url
string
Base URL where Ollama is hosted. Defaults to the Ollama client's default (usually http://localhost:11434).
validate_model_on_init
boolean
default:"false"
Whether to validate the model exists in Ollama locally on initialization.
temperature
float
Sampling temperature. Higher values make output more creative.
num_ctx
integer
default:"2048"
Size of the context window.
num_gpu
integer
Number of GPUs to use. Defaults to 1 on macOS (for Metal support), 0 to disable.
num_thread
integer
Number of threads to use during computation. Defaults to a value detected for optimal performance on the host system.
keep_alive
integer
How long (in seconds) the model stays loaded in memory. Defaults to 300 seconds (5 minutes).
client_kwargs
object
default:"{}"
Additional kwargs to pass to httpx client (e.g., headers).
