Overview

Embedding functions in Zvec transform multimodal input (text, image, or audio) into vector representations suitable for similarity search and retrieval. Zvec provides two primary types of embeddings:
  • Dense Embeddings: Fixed-length real-valued vectors (e.g., [0.123, -0.456, 0.789, ...])
  • Sparse Embeddings: Dictionary-based vectors with only non-zero dimensions (e.g., {10: 0.5, 245: 0.8, ...})

Base Classes

Zvec defines two Protocol classes that establish the embedding function interface:

DenseEmbeddingFunction

Protocol for dense vector embedding functions that map input to fixed-length vectors.
from zvec.extension import DenseEmbeddingFunction
from typing import Protocol, runtime_checkable

@runtime_checkable
class DenseEmbeddingFunction(Protocol[MD]):
    """Protocol for dense vector embedding functions."""
    
    def embed(self, input: MD) -> DenseVectorType:
        """Generate a dense embedding vector for the input data."""
        ...
Type Parameters:
  • MD: The type of input data (TEXT, IMAGE, or AUDIO)
Returns:
  • DenseVectorType: A list of floats, list of ints, or numpy array
See Dense Embeddings for detailed documentation.

SparseEmbeddingFunction

Protocol for sparse vector embedding functions that map input to sparse representations.
from zvec.extension import SparseEmbeddingFunction
from typing import Protocol, runtime_checkable

@runtime_checkable
class SparseEmbeddingFunction(Protocol[MD]):
    """Protocol for sparse vector embedding functions."""
    
    def embed(self, input: MD) -> SparseVectorType:
        """Generate a sparse embedding for the input data."""
        ...
Type Parameters:
  • MD: The type of input data (TEXT, IMAGE, or AUDIO)
Returns:
  • SparseVectorType: Dictionary mapping dimension index to non-zero weight
See Sparse Embeddings for detailed documentation.
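Because both interfaces are `@runtime_checkable` Protocols, any class with a matching `embed()` method satisfies them structurally, and `isinstance()` checks work at runtime. A minimal sketch (the protocol is redefined locally here so the snippet runs without Zvec installed; `ToyEmbedding` is a made-up class for illustration):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class DenseEmbeddingFunction(Protocol):
    """Local stand-in mirroring Zvec's protocol, for illustration."""
    def embed(self, input: str) -> list[float]: ...

class ToyEmbedding:
    """No inheritance needed: a matching embed() method is enough."""
    def embed(self, input: str) -> list[float]:
        return [float(len(input)), 0.0]

# Structural typing: ToyEmbedding never names the protocol, yet conforms.
assert isinstance(ToyEmbedding(), DenseEmbeddingFunction)
```

Note that `runtime_checkable` only verifies that the method exists, not its signature or return type, so type checkers remain the primary line of defense.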

Built-in Implementations

Zvec provides several ready-to-use embedding implementations:

Dense Embedding Implementations

  • OpenAI: Dense embeddings using OpenAI’s API
  • Qwen: Dense embeddings using Alibaba’s Qwen API
  • Sentence Transformers: Local dense embeddings with HuggingFace models

Sparse Embedding Implementations

  • BM25: Lexical search with the BM25 algorithm
  • Qwen Sparse: Sparse embeddings using the Qwen API
  • SPLADE: Local sparse embeddings with the SPLADE model

Custom Implementations

You can create custom embedding functions by implementing the required embed() method. Since these are Protocol classes, you don’t need to explicitly inherit from them.

Custom Dense Embedding Example

class MyTextEmbedding:
    def __init__(self, dimension: int, model_name: str):
        self.dimension = dimension
        self.model = load_model(model_name)  # placeholder: substitute your own model loader
    
    def embed(self, input: str) -> list[float]:
        return self.model.encode(input).tolist()

# Use your custom embedding
emb = MyTextEmbedding(dimension=768, model_name="my-model")
vector = emb.embed("Hello world")

Custom Sparse Embedding Example

class MyBM25Embedding:
    def __init__(self, vocab_size: int = 10000):
        self.vocab_size = vocab_size
        self.tokenizer = MyTokenizer()
    
    def embed(self, input: str) -> dict[int, float]:
        tokens = self.tokenizer.tokenize(input)
        sparse_vector = {}
        for token_id, weight in self._calculate_bm25(tokens):
            if weight > 0:
                sparse_vector[token_id] = weight
        return sparse_vector
    
    def _calculate_bm25(self, tokens):
        # BM25 scoring logic goes here; must return an iterable
        # of (token_id, weight) pairs for embed() to consume.
        return []
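As a runnable stand-in for the BM25 scoring placeholder, the sketch below uses raw term frequencies over a growing vocabulary (whitespace tokenization and the `TFSparseEmbedding` class are illustrative assumptions, not Zvec APIs):

```python
from collections import Counter

class TFSparseEmbedding:
    """Toy sparse embedding: term frequencies over a growing vocabulary."""

    def __init__(self):
        self.vocab: dict[str, int] = {}  # token -> dimension index

    def _dim(self, token: str) -> int:
        # Assign the next free dimension index to unseen tokens.
        return self.vocab.setdefault(token, len(self.vocab))

    def embed(self, input: str) -> dict[int, float]:
        counts = Counter(input.lower().split())
        # Only non-zero dimensions appear in the result.
        return {self._dim(tok): float(n) for tok, n in counts.items()}

emb = TFSparseEmbedding()
vec = emb.embed("the cat sat on the mat")  # "the" appears twice, so 5 unique dimensions
```

Replacing the raw counts with BM25's saturation and document-length terms turns this into a proper BM25 embedding.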

Custom Image Embedding Example

import numpy as np
from typing import Union

class MyImageEmbedding:
    def __init__(self, dimension: int = 512):
        self.dimension = dimension
        self.model = load_image_model()  # placeholder: substitute your vision model loader
    
    def embed(self, input: Union[str, bytes, np.ndarray]) -> list[float]:
        if isinstance(input, str):
            # A string input is treated as a file path.
            image = load_image_from_path(input)  # placeholder image loader
        else:
            image = input
        return self.model.extract_features(image).tolist()

Input Types

Embedding functions support multimodal input:
  • TEXT (str): Text strings
  • IMAGE (str | bytes | np.ndarray): Image file path, raw bytes, or array
  • AUDIO (str | bytes | np.ndarray): Audio file path, raw bytes, or array

Best Practices

  • Dense embeddings: Better for semantic similarity and cross-modal search
  • Sparse embeddings: Better for exact keyword matching and interpretability
  • Hybrid approach: Combine both for optimal retrieval performance
  • API-based (OpenAI, Qwen): No setup, always up-to-date, requires network and API key
  • Local models (Sentence Transformers, BM25): No API costs, works offline, requires initial download
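One common way to realize the hybrid approach is a weighted fusion of dense cosine similarity and sparse dot-product scores. This is a sketch, not a Zvec API; the `alpha` weight and function names are assumptions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors; only overlapping dimensions contribute."""
    return sum(w * b[i] for i, w in a.items() if i in b)

def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha: float = 0.5) -> float:
    # alpha balances semantic (dense) against lexical (sparse) relevance.
    return alpha * cosine(dense_q, dense_d) + (1 - alpha) * sparse_dot(sparse_q, sparse_d)
```

In practice `alpha` is tuned on a held-out query set; production systems often use rank-based fusion (e.g. reciprocal rank fusion) instead of raw score mixing, since dense and sparse scores live on different scales.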
For dense embeddings, normalize vectors to unit length when using cosine similarity (guarding against the zero vector):
import numpy as np
norm = np.linalg.norm(vector)
if norm > 0:
    vector = vector / norm
