Overview

Embedding functions in Zvec transform multimodal input (text, image, or audio) into vector representations suitable for similarity search and retrieval. Zvec provides two primary types of embeddings:
  • Dense Embeddings: Fixed-length real-valued vectors (e.g., [0.123, -0.456, 0.789, ...])
  • Sparse Embeddings: Dictionary-based vectors with only non-zero dimensions (e.g., {10: 0.5, 245: 0.8, ...})

Base Classes

Zvec defines two Protocol classes that establish the embedding function interface:

DenseEmbeddingFunction

Protocol for dense vector embedding functions that map input to fixed-length vectors.
from zvec.extension import DenseEmbeddingFunction
from typing import Protocol, runtime_checkable

@runtime_checkable
class DenseEmbeddingFunction(Protocol[MD]):
    """Protocol for dense vector embedding functions."""
    
    def embed(self, input: MD) -> DenseVectorType:
        """Generate a dense embedding vector for the input data."""
        ...
Type Parameters:
  • MD: The type of input data (TEXT, IMAGE, or AUDIO)
Returns:
  • DenseVectorType: A list of floats, list of ints, or numpy array
See Dense Embeddings for detailed documentation.

SparseEmbeddingFunction

Protocol for sparse vector embedding functions that map input to sparse representations.
from zvec.extension import SparseEmbeddingFunction
from typing import Protocol, runtime_checkable

@runtime_checkable
class SparseEmbeddingFunction(Protocol[MD]):
    """Protocol for sparse vector embedding functions."""
    
    def embed(self, input: MD) -> SparseVectorType:
        """Generate a sparse embedding for the input data."""
        ...
Type Parameters:
  • MD: The type of input data (TEXT, IMAGE, or AUDIO)
Returns:
  • SparseVectorType: Dictionary mapping dimension index to non-zero weight
See Sparse Embeddings for detailed documentation.
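Because both interfaces are `@runtime_checkable` Protocols, any class with a matching `embed()` method satisfies them structurally, and `isinstance()` checks work at runtime. A minimal sketch (the protocol is redefined locally here so the snippet runs without Zvec installed; `ToyEmbedding` is a made-up class for illustration):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class DenseEmbeddingFunction(Protocol):
    """Local stand-in mirroring Zvec's protocol, for illustration."""
    def embed(self, input: str) -> list[float]: ...

class ToyEmbedding:
    """No inheritance needed: a matching embed() method is enough."""
    def embed(self, input: str) -> list[float]:
        return [float(len(input)), 0.0]

# Structural typing: ToyEmbedding never names the protocol, yet conforms.
assert isinstance(ToyEmbedding(), DenseEmbeddingFunction)
```

Note that `runtime_checkable` only verifies that the method exists, not its signature or return type, so type checkers remain the primary line of defense.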

Built-in Implementations

Zvec provides several ready-to-use embedding implementations:

Dense Embedding Implementations

  • OpenAI: Dense embeddings using OpenAI’s API
  • Qwen: Dense embeddings using Alibaba’s Qwen API
  • Sentence Transformers: Local dense embeddings with HuggingFace models

Sparse Embedding Implementations

  • BM25: Lexical search with the BM25 algorithm
  • Qwen Sparse: Sparse embeddings using the Qwen API
  • SPLADE: Local sparse embeddings with the SPLADE model

Custom Implementations

You can create custom embedding functions by implementing the required embed() method. Since these are Protocol classes, you don’t need to explicitly inherit from them.

Custom Dense Embedding Example

class MyTextEmbedding:
    def __init__(self, dimension: int, model_name: str):
        self.dimension = dimension
        self.model = load_model(model_name)  # placeholder: substitute your own model loader
    
    def embed(self, input: str) -> list[float]:
        return self.model.encode(input).tolist()

# Use your custom embedding
emb = MyTextEmbedding(dimension=768, model_name="my-model")
vector = emb.embed("Hello world")

Custom Sparse Embedding Example

class MyBM25Embedding:
    def __init__(self, vocab_size: int = 10000):
        self.vocab_size = vocab_size
        self.tokenizer = MyTokenizer()
    
    def embed(self, input: str) -> dict[int, float]:
        tokens = self.tokenizer.tokenize(input)
        sparse_vector = {}
        for token_id, weight in self._calculate_bm25(tokens):
            if weight > 0:
                sparse_vector[token_id] = weight
        return sparse_vector
    
    def _calculate_bm25(self, tokens):
        # BM25 scoring logic goes here; must return an iterable
        # of (token_id, weight) pairs for embed() to consume.
        return []
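As a runnable stand-in for the BM25 scoring placeholder, the sketch below uses raw term frequencies over a growing vocabulary (whitespace tokenization and the `TFSparseEmbedding` class are illustrative assumptions, not Zvec APIs):

```python
from collections import Counter

class TFSparseEmbedding:
    """Toy sparse embedding: term frequencies over a growing vocabulary."""

    def __init__(self):
        self.vocab: dict[str, int] = {}  # token -> dimension index

    def _dim(self, token: str) -> int:
        # Assign the next free dimension index to unseen tokens.
        return self.vocab.setdefault(token, len(self.vocab))

    def embed(self, input: str) -> dict[int, float]:
        counts = Counter(input.lower().split())
        # Only non-zero dimensions appear in the result.
        return {self._dim(tok): float(n) for tok, n in counts.items()}

emb = TFSparseEmbedding()
vec = emb.embed("the cat sat on the mat")  # "the" appears twice, so 5 unique dimensions
```

Replacing the raw counts with BM25's saturation and document-length terms turns this into a proper BM25 embedding.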

Custom Image Embedding Example

import numpy as np
from typing import Union

class MyImageEmbedding:
    def __init__(self, dimension: int = 512):
        self.dimension = dimension
        self.model = load_image_model()  # placeholder: substitute your vision model loader
    
    def embed(self, input: Union[str, bytes, np.ndarray]) -> list[float]:
        if isinstance(input, str):
            # A string input is treated as a file path.
            image = load_image_from_path(input)  # placeholder image loader
        else:
            image = input
        return self.model.extract_features(image).tolist()

Input Types

Embedding functions support multimodal input:
  • TEXT (str): Text strings
  • IMAGE (str | bytes | np.ndarray): Image file path, raw bytes, or array
  • AUDIO (str | bytes | np.ndarray): Audio file path, raw bytes, or array

Best Practices

  • Dense embeddings: Better for semantic similarity and cross-modal search
  • Sparse embeddings: Better for exact keyword matching and interpretability
  • Hybrid approach: Combine both for optimal retrieval performance
  • API-based (OpenAI, Qwen): No setup, always up-to-date, requires network and API key
  • Local models (Sentence Transformers, BM25): No API costs, works offline, requires initial download
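One common way to realize the hybrid approach is a weighted fusion of dense cosine similarity and sparse dot-product scores. This is a sketch, not a Zvec API; the `alpha` weight and function names are assumptions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors; only overlapping dimensions contribute."""
    return sum(w * b[i] for i, w in a.items() if i in b)

def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha: float = 0.5) -> float:
    # alpha balances semantic (dense) against lexical (sparse) relevance.
    return alpha * cosine(dense_q, dense_d) + (1 - alpha) * sparse_dot(sparse_q, sparse_d)
```

In practice `alpha` is tuned on a held-out query set; production systems often use rank-based fusion (e.g. reciprocal rank fusion) instead of raw score mixing, since dense and sparse scores live on different scales.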
For dense embeddings, normalize vectors to unit length when using cosine similarity (guarding against the zero vector):
import numpy as np
norm = np.linalg.norm(vector)
if norm > 0:
    vector = vector / norm
