Overview
DenseEmbeddingFunction is a Protocol class that defines the interface for dense vector embedding functions in Zvec. Dense embeddings map multimodal input (text, image, or audio) to fixed-length real-valued vectors.
Protocol Class: This is a Protocol class - it only defines the embed() interface. Implementations are free to define their own __init__, properties, and additional methods.
Class Definition
```python
from zvec.extension import DenseEmbeddingFunction
```

The protocol is defined as:

```python
from abc import abstractmethod
from typing import Protocol, TypeVar, runtime_checkable

MD = TypeVar("MD", bound=Embeddable)

@runtime_checkable
class DenseEmbeddingFunction(Protocol[MD]):
    """Protocol for dense vector embedding functions."""

    @abstractmethod
    def embed(self, input: MD) -> DenseVectorType:
        """Generate a dense embedding vector for the input data."""
        ...
```
Location: python/zvec/extension/embedding_function.py:23
Type Parameters
- MD: The type of input data (bound to Embeddable: TEXT, IMAGE, or AUDIO)
Methods
embed()
```python
def embed(self, input: MD) -> DenseVectorType:
    """Generate a dense embedding vector for the input data."""
```

Parameters:
- input (MD): Multimodal input data to embed. Can be:
  - TEXT (str): Text string
  - IMAGE (str | bytes | np.ndarray): Image file path, raw bytes, or array
  - AUDIO (str | bytes | np.ndarray): Audio file path, raw bytes, or array

Returns:
- DenseVectorType: A dense vector representing the embedding
  - Can be list[float], list[int], or np.ndarray
  - Length should match the implementation's dimension
Built-in Implementations
Zvec provides several built-in dense embedding implementations:
OpenAIDenseEmbedding
Dense embeddings using OpenAI’s API (text-embedding-3-small, text-embedding-3-large)
QwenDenseEmbedding
Dense embeddings using Alibaba’s Qwen/DashScope API
DefaultLocalDenseEmbedding
Local dense embeddings using all-MiniLM-L6-v2 model
Custom Implementation
Implement the embed() method to create your own dense embedding function:
Text Embedding Example
```python
class MyTextEmbedding:
    def __init__(self, dimension: int, model_name: str):
        self.dimension = dimension
        # load_model is a placeholder for your own model-loading logic
        self.model = load_model(model_name)

    def embed(self, input: str) -> list[float]:
        """Generate dense embedding for text input."""
        return self.model.encode(input).tolist()

# Usage
emb = MyTextEmbedding(dimension=768, model_name="my-model")
vector = emb.embed("Hello world")
print(len(vector))  # 768
```
Image Embedding Example
```python
import numpy as np

class MyImageEmbedding:
    def __init__(self, dimension: int = 512):
        self.dimension = dimension
        # load_image_model is a placeholder for your own model-loading logic
        self.model = load_image_model()

    def embed(self, input: str | bytes | np.ndarray) -> list[float]:
        """Generate dense embedding for image input."""
        if isinstance(input, str):
            # Load from file path
            image = load_image_from_path(input)
        else:
            # Use bytes or array directly
            image = input
        return self.model.extract_features(image).tolist()

# Usage
emb = MyImageEmbedding(dimension=512)

# From file path
vector1 = emb.embed("/path/to/image.jpg")

# From numpy array
image_array = np.random.rand(224, 224, 3)
vector2 = emb.embed(image_array)
```
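Because the protocol is decorated with @runtime_checkable, conformance can be verified with isinstance() at runtime, with no subclassing required. A minimal sketch (DenseProto below is a simplified stand-in for the real DenseEmbeddingFunction, and the constant-vector embedding is toy data):

```python
from typing import Protocol, runtime_checkable

# Simplified stand-in for zvec's DenseEmbeddingFunction protocol
@runtime_checkable
class DenseProto(Protocol):
    def embed(self, input): ...

class MyTextEmbedding:
    def __init__(self, dimension: int):
        self.dimension = dimension

    def embed(self, input: str) -> list[float]:
        # Toy embedding: constant vector, just to satisfy the interface
        return [0.0] * self.dimension

emb = MyTextEmbedding(dimension=4)
# Structural match: MyTextEmbedding never inherits from the protocol
print(isinstance(emb, DenseProto))  # True
```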
Usage with Built-in Implementations
OpenAI Example
```python
import os
from zvec.extension import OpenAIDenseEmbedding

os.environ["OPENAI_API_KEY"] = "sk-..."

emb_func = OpenAIDenseEmbedding(
    model="text-embedding-3-small",
    dimension=1536,
)

vector = emb_func.embed("Hello world")
print(len(vector))  # 1536
```
Qwen Example
```python
import os
from zvec.extension import QwenDenseEmbedding

os.environ["DASHSCOPE_API_KEY"] = "your-api-key"

emb_func = QwenDenseEmbedding(
    dimension=1024,
    model="text-embedding-v4",
)

vector = emb_func.embed("Machine learning")
print(len(vector))  # 1024
```
Local Model Example
```python
from zvec.extension import DefaultLocalDenseEmbedding

# No API key needed - runs locally
emb_func = DefaultLocalDenseEmbedding()

vector = emb_func.embed("Hello world")
print(len(vector))  # 384
```
Common Patterns
Batch Processing
```python
texts = ["First text", "Second text", "Third text"]
vectors = [emb_func.embed(text) for text in texts]
```
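For API-backed implementations, each embed() call is a network round trip, so issuing requests concurrently can cut batch latency. A sketch using only the standard library (embed_batch and the dummy embedding class are illustrative, not part of Zvec):

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(emb_func, texts, max_workers=4):
    # pool.map preserves input order even when calls finish out of order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(emb_func.embed, texts))

# Usage with any object exposing embed()
class DummyEmbedding:
    def embed(self, input: str) -> list[float]:
        return [float(len(input))]

vectors = embed_batch(DummyEmbedding(), ["First text", "Second text", "Third text"])
print(len(vectors))  # 3
```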
Semantic Similarity
```python
import numpy as np

v1 = np.asarray(emb_func.embed("The cat sits on the mat"))
v2 = np.asarray(emb_func.embed("A feline rests on a rug"))

# Cosine similarity; the norms can be dropped if the vectors are unit-length
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"Similarity: {similarity}")
```
Error Handling
```python
try:
    vector = emb_func.embed("")  # Empty string
except ValueError as e:
    print(f"Error: {e}")
    # Error: Input text cannot be empty or whitespace only
```
Best Practices
Normalization: For cosine similarity calculations, normalize vectors to unit length:

```python
import numpy as np

vector = np.asarray(vector) / np.linalg.norm(vector)
```
Dimension Consistency: Ensure all embeddings use the same dimension when building a vector database. Mixing dimensions will cause indexing errors.
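A lightweight guard can reject mismatched vectors before they reach the index (check_dimension is a hypothetical helper sketched here, not a Zvec API):

```python
def check_dimension(vector, expected_dim: int):
    # Fail fast with a clear message instead of a later indexing error
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding dimension {len(vector)} != expected {expected_dim}"
        )
    return vector

check_dimension([0.1, 0.2, 0.3], 3)  # OK
```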
Notes
- The embed() method is the only required interface
- Implementations can define custom initialization parameters
- Most implementations provide a dimension property
- Some implementations cache results for performance
- API-based implementations require network connectivity
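The caching noted above can also be layered onto any implementation from the outside. A sketch using functools.lru_cache (CachedEmbedding is illustrative, not part of Zvec, and only works for hashable inputs such as text):

```python
from functools import lru_cache

class CachedEmbedding:
    """Wrap any embedding function and memoize repeated text inputs."""

    def __init__(self, base, maxsize: int = 1024):
        self._cached = lru_cache(maxsize=maxsize)(base.embed)

    def embed(self, input: str) -> list[float]:
        # Copy the cached list so callers can't mutate the cache entry
        return list(self._cached(input))

# Usage with a dummy base implementation that counts real calls
class DummyEmbedding:
    def __init__(self):
        self.calls = 0

    def embed(self, input: str) -> list[float]:
        self.calls += 1
        return [float(len(input))]

base = DummyEmbedding()
cached = CachedEmbedding(base)
cached.embed("Hello world")
cached.embed("Hello world")
print(base.calls)  # 1: the second call is served from the cache
```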
See Also