Overview

The OpenAIDenseEmbedding class provides text-to-vector embedding capabilities using OpenAI’s embedding models. It supports various models with different dimensions and includes automatic result caching for improved performance.

Installation

pip install openai

Authentication

Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="sk-..."
Alternatively, pass the API key directly to the constructor:
emb_func = OpenAIDenseEmbedding(api_key="sk-...")
Obtain your API key from OpenAI Platform.

Basic Usage

from zvec.extension import OpenAIDenseEmbedding
import os

# Set API key
os.environ["OPENAI_API_KEY"] = "sk-..."

# Initialize with default model (text-embedding-3-small)
emb_func = OpenAIDenseEmbedding()
vector = emb_func.embed("Hello, world!")

print(f"Dimension: {len(vector)}")
# Output: Dimension: 1536

Model Selection

OpenAI offers several embedding models:
| Model                  | Dimensions | Description                                |
|------------------------|------------|--------------------------------------------|
| text-embedding-3-small | 1536       | Cost-efficient, good performance (default) |
| text-embedding-3-large | 3072       | Highest quality                            |
| text-embedding-ada-002 | 1536       | Legacy model                               |
# Using text-embedding-3-large
emb_func = OpenAIDenseEmbedding(
    model="text-embedding-3-large",
    dimension=1024,  # Optional: custom dimension
    api_key="sk-..."
)

vector = emb_func.embed("Machine learning is fascinating")
print(f"Dimension: {len(vector)}")
# Output: Dimension: 1024

Custom Dimensions

For text-embedding-3 models, you can specify custom dimensions to reduce vector size:
emb_func = OpenAIDenseEmbedding(
    model="text-embedding-3-small",
    dimension=512  # Reduce from default 1536 to 512
)

vector = emb_func.embed("Natural language processing")
print(f"Dimension: {len(vector)}")
# Output: Dimension: 512
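Shortened text-embedding-3 vectors are equivalent to taking the leading entries of the full embedding and re-normalizing to unit length, which is why reduced dimensions still work for similarity search. A minimal dependency-free sketch of that idea (the toy vector is illustrative, not a real embedding):

```python
import math

def shorten_embedding(vector, target_dim):
    """Truncate an embedding to target_dim entries and re-normalize
    to unit length -- the same effect as requesting a smaller
    dimension from a text-embedding-3 model."""
    truncated = vector[:target_dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

# Toy 4-dimensional "embedding", shortened to 2 dimensions
full = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(full, 2)
print(len(short))                  # 2
print(sum(x * x for x in short))   # ~1.0 (unit length preserved)
```

Because the result stays unit-length, cosine similarity remains meaningful on the shortened vectors.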

Azure OpenAI

Use a custom base URL for Azure OpenAI or compatible services:
emb_func = OpenAIDenseEmbedding(
    model="text-embedding-ada-002",
    api_key="your-azure-key",
    base_url="https://your-resource.openai.azure.com/"
)

vector = emb_func.embed("Azure OpenAI integration")

Using with Zvec Collections

from zvec import Collection, DataType
from zvec.extension import OpenAIDenseEmbedding

# Initialize embedding function
emb_func = OpenAIDenseEmbedding(
    model="text-embedding-3-small",
    dimension=1536
)

# Create collection with OpenAI embeddings
collection = Collection(name="documents")
collection.create_field("id", DataType.INT64, is_primary=True)
collection.create_field("text", DataType.VARCHAR, max_length=512)
collection.create_field(
    name="vector",
    dtype=DataType.VECTOR_FP32,
    dimension=1536,
    embedding_function=emb_func
)
collection.create()

# Insert data - embeddings are generated automatically
collection.insert([
    {"id": 1, "text": "Introduction to machine learning"},
    {"id": 2, "text": "Deep learning with neural networks"},
    {"id": 3, "text": "Natural language processing basics"}
])

# Query with automatic embedding
results = collection.query(
    data={"vector": ["machine learning algorithms"]},
    output_fields=["id", "text"],
    topk=2
)

for result in results:
    print(f"ID: {result['id']}, Text: {result['text']}")

Batch Processing

The embedding function includes automatic caching for repeated inputs:
emb_func = OpenAIDenseEmbedding()

texts = [
    "First document",
    "Second document",
    "First document"  # This will use cached result
]

vectors = [emb_func.embed(text) for text in texts]
# Third call returns cached result for "First document"
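The effect of the cache can be sketched with `functools.lru_cache` around a stub embed function; the stub and the call counter below are illustrative stand-ins, not zvec internals:

```python
from functools import lru_cache

api_calls = 0  # counts how often the "API" is actually hit

@lru_cache(maxsize=10)
def embed(text: str) -> tuple:
    """Stub standing in for an API-backed embed call; a real
    implementation would call the OpenAI API here."""
    global api_calls
    api_calls += 1
    return (float(len(text)),)  # placeholder vector

for text in ["First document", "Second document", "First document"]:
    embed(text)

print(api_calls)  # 2 -- the repeated input is served from the cache
```

Note that with a small cache (`maxsize=10` in the notes below), only the most recently used inputs are retained; older entries are evicted.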

Error Handling

try:
    emb_func.embed("")  # Empty string
except ValueError as e:
    print(f"Error: {e}")
    # Output: Error: Input text cannot be empty or whitespace only

try:
    emb_func.embed(123)  # Non-string input
except TypeError as e:
    print(f"Error: {e}")
    # Output: Error: Expected 'input' to be str, got int

Configuration Options

model (string, default: "text-embedding-3-small")
    OpenAI embedding model identifier.

dimension (int, default: None)
    Desired output embedding dimension. If None, uses the model's default dimension.

api_key (string, default: None)
    OpenAI API authentication key. If None, reads from the OPENAI_API_KEY environment variable.

base_url (string, default: None)
    Custom API base URL for OpenAI-compatible services.

Notes

  • Results are cached (LRU cache, maxsize=10) to reduce API calls
  • API usage incurs costs based on your OpenAI subscription plan
  • Rate limits apply based on your OpenAI account tier
  • Network connectivity to OpenAI API endpoints is required
  • Maximum input length is 8191 tokens for most models
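Inputs over the token limit can be rejected before any API call is made. Exact counts require a tokenizer such as tiktoken; as a dependency-free approximation, English text averages roughly 4 characters per token, so a rough pre-check might look like this (the function name and heuristic are illustrative, not part of the class):

```python
MAX_TOKENS = 8191
CHARS_PER_TOKEN = 4  # rough average for English text

def roughly_within_limit(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Heuristic pre-check; use a real tokenizer for exact counts."""
    return len(text) / CHARS_PER_TOKEN <= max_tokens

print(roughly_within_limit("Hello, world!"))  # True
print(roughly_within_limit("x" * 50_000))     # False (well over the limit)
```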
