Overview

DenseEmbeddingFunction is a Protocol class that defines the interface for dense vector embedding functions in Zvec. Dense embeddings map multimodal input (text, image, or audio) to fixed-length real-valued vectors.
Protocol Class: This is a Protocol class - it only defines the embed() interface. Implementations are free to define their own __init__, properties, and additional methods.

Class Definition

Import:

from zvec.extension import DenseEmbeddingFunction

Definition (abridged; MD and DenseVectorType are defined elsewhere in zvec):

from abc import abstractmethod
from typing import Protocol, runtime_checkable

@runtime_checkable
class DenseEmbeddingFunction(Protocol[MD]):
    """Protocol for dense vector embedding functions."""
    
    @abstractmethod
    def embed(self, input: MD) -> DenseVectorType:
        """Generate a dense embedding vector for the input data."""
        ...
Location: python/zvec/extension/embedding_function.py:23

Type Parameters

  • MD: The type of input data (bound to Embeddable: TEXT, IMAGE, or AUDIO)
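Because the protocol is decorated with @runtime_checkable, any object that defines an embed() method passes an isinstance check, with no inheritance required. A minimal sketch (using a simplified local stand-in for the protocol, so the snippet is self-contained; the real class comes from zvec.extension):

```python
from typing import Protocol, runtime_checkable

# Simplified stand-in for zvec's DenseEmbeddingFunction, for illustration only
@runtime_checkable
class DenseEmbeddingFunction(Protocol):
    def embed(self, input): ...

class MyTextEmbedding:
    """No inheritance from the protocol — structural typing is enough."""
    def embed(self, input: str) -> list[float]:
        return [0.0, 1.0]  # placeholder vector

emb = MyTextEmbedding()
print(isinstance(emb, DenseEmbeddingFunction))  # True
```

Note that runtime_checkable isinstance checks only verify that the method exists, not its signature or return type.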

Methods

embed()

def embed(self, input: MD) -> DenseVectorType:
    """Generate a dense embedding vector for the input data."""
Parameters:
  • input (MD): Multimodal input data to embed. Can be:
    • TEXT (str): Text string
    • IMAGE (str | bytes | np.ndarray): Image file path, raw bytes, or array
    • AUDIO (str | bytes | np.ndarray): Audio file path, raw bytes, or array
Returns:
  • DenseVectorType: A dense vector representing the embedding
    • Can be list[float], list[int], or np.ndarray
    • Length should match the implementation’s dimension

Built-in Implementations

Zvec provides several built-in dense embedding implementations:

OpenAIDenseEmbedding

Dense embeddings using OpenAI’s API (text-embedding-3-small, text-embedding-3-large)

QwenDenseEmbedding

Dense embeddings using Alibaba’s Qwen/DashScope API

DefaultLocalDenseEmbedding

Local dense embeddings using all-MiniLM-L6-v2 model

Custom Implementation

Implement the embed() method to create your own dense embedding function:

Text Embedding Example

class MyTextEmbedding:
    def __init__(self, dimension: int, model_name: str):
        self.dimension = dimension
        # load_model is a placeholder for your own model-loading logic
        self.model = load_model(model_name)
    
    def embed(self, input: str) -> list[float]:
        """Generate dense embedding for text input."""
        return self.model.encode(input).tolist()

# Usage
emb = MyTextEmbedding(dimension=768, model_name="my-model")
vector = emb.embed("Hello world")
print(len(vector))  # 768

Image Embedding Example

import numpy as np
from typing import Union

class MyImageEmbedding:
    def __init__(self, dimension: int = 512):
        self.dimension = dimension
        # load_image_model is a placeholder for your own model-loading logic
        self.model = load_image_model()
    
    def embed(self, input: Union[str, bytes, np.ndarray]) -> list[float]:
        """Generate dense embedding for image input."""
        if isinstance(input, str):
            # load_image_from_path is a placeholder for your image loader
            image = load_image_from_path(input)
        else:
            # Use bytes or array directly
            image = input
        return self.model.extract_features(image).tolist()

# Usage
emb = MyImageEmbedding(dimension=512)

# From file path
vector1 = emb.embed("/path/to/image.jpg")

# From numpy array
image_array = np.random.rand(224, 224, 3)
vector2 = emb.embed(image_array)

Usage with Built-in Implementations

OpenAI Example

from zvec.extension import OpenAIDenseEmbedding
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

emb_func = OpenAIDenseEmbedding(
    model="text-embedding-3-small",
    dimension=1536
)
vector = emb_func.embed("Hello world")
print(len(vector))  # 1536

Qwen Example

from zvec.extension import QwenDenseEmbedding
import os

os.environ["DASHSCOPE_API_KEY"] = "your-api-key"

emb_func = QwenDenseEmbedding(
    dimension=1024,
    model="text-embedding-v4"
)
vector = emb_func.embed("Machine learning")
print(len(vector))  # 1024

Local Model Example

from zvec.extension import DefaultLocalDenseEmbedding

# No API key needed - runs locally
emb_func = DefaultLocalDenseEmbedding()
vector = emb_func.embed("Hello world")
print(len(vector))  # 384

Common Patterns

Batch Processing

texts = ["First text", "Second text", "Third text"]
vectors = [emb_func.embed(text) for text in texts]
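When the batch is needed as a single matrix (for example, to compute similarities against many vectors at once), the per-text results can be stacked with NumPy. A sketch, assuming every embed() call returns a vector of the same length (the embed function below is a hypothetical stand-in):

```python
import numpy as np

# Hypothetical embed(): two toy features per text, fixed length for every input
def embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]

texts = ["First text", "Second text", "Third text"]
# Stack the per-text vectors into one (num_texts, dimension) matrix
matrix = np.stack([embed(t) for t in texts])
print(matrix.shape)  # (3, 2)
```

np.stack raises a ValueError if any vector has a different length, which doubles as a dimension-consistency check.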

Semantic Similarity

import numpy as np

v1 = emb_func.embed("The cat sits on the mat")
v2 = emb_func.embed("A feline rests on a rug")

# Cosine similarity (if vectors are normalized)
similarity = np.dot(v1, v2)
print(f"Similarity: {similarity}")
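If the vectors are not guaranteed to be unit length, divide by the norms explicitly. A general-purpose helper (not part of zvec; shown here as a sketch):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity for arbitrary (possibly non-normalized) dense vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # ~0.7071
```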

Error Handling

try:
    vector = emb_func.embed("")  # Empty string
except ValueError as e:
    print(f"Error: {e}")
    # Error: Input text cannot be empty or whitespace only

Best Practices

Normalization: For cosine similarity calculations, normalize vectors to unit length:

import numpy as np
vector = vector / np.linalg.norm(vector)

Dimension Consistency: Ensure all embeddings use the same dimension when building a vector database. Mixing dimensions will cause indexing errors.
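A cheap guard that fails fast before a mismatched vector ever reaches the index (a sketch; expected_dim is whatever dimension your collection was created with):

```python
def check_dimension(vector, expected_dim: int) -> None:
    """Raise early instead of hitting an indexing error later."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has dimension {len(vector)}, expected {expected_dim}"
        )

check_dimension([0.1, 0.2, 0.3], expected_dim=3)  # passes silently
```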

Notes

  • The embed() method is the only required interface
  • Implementations can define custom initialization parameters
  • Most implementations provide a dimension property
  • Some implementations cache results for performance
  • API-based implementations require network connectivity
