
Overview

OpenAIDenseEmbedding provides text-to-vector embedding capabilities using OpenAI’s embedding models. It supports various models with configurable dimensions and includes automatic result caching for improved performance. Location: python/zvec/extension/openai_embedding_function.py:24

Installation

pip install openai

Class Definition

from zvec.extension import OpenAIDenseEmbedding

class OpenAIDenseEmbedding(DenseEmbeddingFunction[TEXT]):
    """Dense text embedding function using OpenAI API."""

Constructor

OpenAIDenseEmbedding(
    model: str = "text-embedding-3-small",
    dimension: Optional[int] = None,
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    **kwargs
)

Parameters

model
str
default: "text-embedding-3-small"
OpenAI embedding model identifier. Common options:
  • text-embedding-3-small: 1536 dims, cost-efficient, good performance
  • text-embedding-3-large: 3072 dims, highest quality
  • text-embedding-ada-002: 1536 dims, legacy model
dimension
Optional[int]
default: None
Desired output embedding dimension. If None, uses model’s default dimension. For text-embedding-3 models, you can specify custom dimensions (e.g., 256, 512, 1024, 1536).
api_key
Optional[str]
default: None
OpenAI API authentication key. If None, reads from OPENAI_API_KEY environment variable. Obtain your key from: https://platform.openai.com/api-keys
base_url
Optional[str]
default: None
Custom API base URL for OpenAI-compatible services (e.g., Azure OpenAI). Defaults to official OpenAI endpoint.
**kwargs
dict
Additional parameters for API calls:
  • encoding_format (str): Format of embeddings, "float" or "base64"
  • user (str): User identifier for tracking

Properties

dimension

@property
def dimension(self) -> int:
    """The embedding vector dimension."""

extra_params

@property
def extra_params(self) -> dict:
    """Extra parameters for model-specific customization."""

Methods

embed()

@lru_cache(maxsize=10)
def embed(self, input: TEXT) -> DenseVectorType:
    """Generate dense embedding vector for the input text."""
Parameters:
  • input (str): Input text string to embed. Must be non-empty. Maximum length is 8191 tokens for most models.
Returns:
  • DenseVectorType: A list of floats representing the embedding vector. Length equals self.dimension.
Raises:
  • TypeError: If input is not a string
  • ValueError: If input is empty/whitespace-only or API returns error
  • RuntimeError: If network connectivity issues or OpenAI service errors occur
The embed() method is wrapped in an LRU cache (maxsize=10): identical inputs return the cached vector without a new API call, and only the 10 most recently used unique inputs are retained.
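The cache follows standard functools.lru_cache semantics. The sketch below uses a stand-in function instead of real API calls to show both effects: repeated inputs skip recomputation, and the 11th distinct input evicts the least recently used entry.

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=10)
def fake_embed(text: str) -> tuple:
    """Stand-in for embed(): counts how often the 'API' is actually hit."""
    global calls
    calls += 1
    return (float(len(text)),)  # dummy vector

fake_embed("hello")
fake_embed("hello")           # cache hit, no new call
print(calls)                  # 1

for i in range(10):           # 10 new unique inputs fill the cache
    fake_embed(f"text-{i}")
fake_embed("hello")           # "hello" was evicted, so this is a fresh call
print(calls)                  # 12
```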

Usage Examples

Basic Usage

from zvec.extension import OpenAIDenseEmbedding
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

emb_func = OpenAIDenseEmbedding()
vector = emb_func.embed("Hello, world!")
print(len(vector))  # 1536

Custom Model and Dimension

emb_func = OpenAIDenseEmbedding(
    model="text-embedding-3-large",
    dimension=1024,
    api_key="sk-..."
)

vector = emb_func.embed("Machine learning is fascinating")
print(len(vector))  # 1024

Azure OpenAI

emb_func = OpenAIDenseEmbedding(
    model="text-embedding-ada-002",
    api_key="your-azure-key",
    base_url="https://your-resource.openai.azure.com/"
)

vector = emb_func.embed("Natural language processing")
print(isinstance(vector, list))  # True

Batch Processing with Caching

texts = ["First text", "Second text", "First text"]
vectors = [emb_func.embed(text) for text in texts]
# Third call uses cached result for "First text"

Callable Interface

emb_func = OpenAIDenseEmbedding()

# Both work the same
vector1 = emb_func.embed("Hello world")
vector2 = emb_func("Hello world")

Semantic Similarity

import numpy as np

emb_func = OpenAIDenseEmbedding()

v1 = emb_func.embed("The cat sits on the mat")
v2 = emb_func.embed("A feline rests on a rug")
v3 = emb_func.embed("Python programming")

# Normalize vectors
v1_norm = v1 / np.linalg.norm(v1)
v2_norm = v2 / np.linalg.norm(v2)
v3_norm = v3 / np.linalg.norm(v3)

# Calculate cosine similarity
similarity_high = np.dot(v1_norm, v2_norm)  # Similar sentences
similarity_low = np.dot(v1_norm, v3_norm)   # Different topics

print(similarity_high > similarity_low)  # True
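The normalize-then-dot-product steps above can be wrapped in a small reusable helper (cosine_similarity is an illustrative name, not part of zvec):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors (lists or arrays)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```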

Error Handling

try:
    vector = emb_func.embed("")  # Empty string
except ValueError as e:
    print(f"Error: {e}")
    # Error: Input text cannot be empty or whitespace only

try:
    vector = emb_func.embed(123)  # Wrong type
except TypeError as e:
    print(f"Error: {e}")
    # Error: Expected 'input' to be str, got int
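Based on the error messages above, you can reproduce the same checks in a pre-validation helper to fail fast before spending an API call (validate_input is a hypothetical helper, not part of zvec):

```python
def validate_input(text) -> str:
    """Raise the same error types embed() documents, before any API call."""
    if not isinstance(text, str):
        raise TypeError(f"Expected 'input' to be str, got {type(text).__name__}")
    if not text.strip():
        raise ValueError("Input text cannot be empty or whitespace only")
    return text
```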

Model Comparison

text-embedding-3-small

Dimensions: 1536
Cost: Low
Performance: Good

text-embedding-3-large

Dimensions: 3072
Cost: Medium
Performance: Highest

text-embedding-ada-002

Dimensions: 1536
Cost: Low
Performance: Legacy
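When choosing between these models, dimension also drives index storage cost. A rough estimate, assuming vectors are stored as float32 (4 bytes per component; index overhead not included):

```python
def index_size_bytes(num_vectors: int, dimension: int, bytes_per_float: int = 4) -> int:
    """Approximate raw storage for a flat vector index."""
    return num_vectors * dimension * bytes_per_float

# One million vectors:
print(index_size_bytes(1_000_000, 1536) / 1e9)  # ~6.1 GB
print(index_size_bytes(1_000_000, 3072) / 1e9)  # ~12.3 GB
```

Halving the dimension of a text-embedding-3 model (via the dimension parameter) halves storage and roughly halves search cost, at some loss of retrieval quality.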

Best Practices

Environment Variables: Store API keys in environment variables instead of hardcoding:
export OPENAI_API_KEY="sk-..."
Then simply:
emb_func = OpenAIDenseEmbedding()  # Auto-reads from env
Rate Limits: OpenAI API has rate limits based on your account tier. Handle rate limit errors appropriately:
import time
from openai import RateLimitError

def embed_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            return emb_func.embed(text)
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
Caching: Results are cached automatically. For better cache hits, consider:
  • Normalizing text (lowercasing, whitespace trimming)
  • Pre-processing inputs consistently
  • Using the same embedding instance for similar queries
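A minimal normalization helper along these lines (normalize_for_cache is an illustrative name):

```python
def normalize_for_cache(text: str) -> str:
    """Collapse whitespace runs and lowercase, so equivalent inputs share a cache entry."""
    return " ".join(text.split()).lower()

print(normalize_for_cache("  Hello,   World!  "))  # "hello, world!"
# vector = emb_func.embed(normalize_for_cache(user_query))
```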

Notes

  • Requires Python 3.10, 3.11, or 3.12
  • Requires the openai package: pip install openai
  • Network connectivity to OpenAI API endpoints is required
  • API usage incurs costs based on your OpenAI subscription plan
  • Rate limits apply based on your OpenAI account tier
  • Embedding results are cached (LRU cache, maxsize=10)
