Overview
The embeddings module provides a unified interface for generating text embeddings through multiple providers (cloud APIs or local models), with intelligent caching and automatic fallback strategies.
Key Features:
Multi-provider support: Cloud (via LiteLLM) or local (sentence-transformers)
Mode flexibility: AUTO, LOCAL, CLOUD, or HYBRID with automatic fallback
Built-in caching: Thread-safe LRU cache with fingerprint-based invalidation
Batch processing: Efficient batch embedding with per-text cache lookups
Async/sync APIs: Both async and synchronous wrappers available
Source: src/utils/embeddings/
EmbeddingManager
The EmbeddingManager is the primary interface for all embedding operations. Always use this class rather than instantiating providers directly.
Constructor
from src.utils.embeddings.manager import EmbeddingManager, EmbeddingMode
manager = EmbeddingManager(
    mode=EmbeddingMode.AUTO,
    cloud_config=None,
    local_config=None,
    domain="guantanamo"
)
mode (EmbeddingMode, default: None)
Embedding generation mode. If None, reads from the EMBEDDING_MODE env var or domain config. Options:
AUTO: Auto-detect (local if sentence-transformers available, else cloud)
LOCAL: Force local embedding with sentence-transformers
CLOUD: Force cloud embedding via LiteLLM
HYBRID: Try cloud first, fallback to local on failure
cloud_config (EmbeddingConfig, default: None)
Cloud provider configuration. If None, loads from domain config.
local_config (EmbeddingConfig, default: None)
Local provider configuration. If None, loads from domain config.
domain (str)
Domain name for loading configuration and cache settings.
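The AUTO option's detection step amounts to a simple import probe: use the local provider if sentence-transformers is installed, otherwise fall back to cloud. The sketch below is illustrative only; `resolve_mode` is a hypothetical helper name, not part of the module's API.

```python
import importlib.util
from enum import Enum


class EmbeddingMode(str, Enum):
    AUTO = "auto"
    LOCAL = "local"
    CLOUD = "cloud"
    HYBRID = "hybrid"


def resolve_mode(mode: EmbeddingMode) -> EmbeddingMode:
    """Resolve AUTO to LOCAL or CLOUD based on installed packages."""
    if mode is not EmbeddingMode.AUTO:
        return mode
    # Local if sentence-transformers is importable, else cloud
    if importlib.util.find_spec("sentence_transformers") is not None:
        return EmbeddingMode.LOCAL
    return EmbeddingMode.CLOUD
```

Explicit modes pass through unchanged, so only AUTO triggers the probe.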
Core Methods
embed_text
await manager.embed_text(text: str, use_cache: bool = True) -> List[float]
Embed a single text asynchronously with automatic caching.
use_cache (bool, default: True)
Whether to use the LRU cache for lookup and storage.
Returns: The embedding vector for the input text.
import asyncio
from src.utils.embeddings.manager import EmbeddingManager
manager = EmbeddingManager()
vector = await manager.embed_text("The Guantanamo Bay detention camp")
print(f"Embedding dimension: {len(vector)}")
embed_batch
await manager.embed_batch(
    texts: List[str],
    use_cache: bool = True
) -> List[List[float]]
Embed multiple texts efficiently with per-text cache lookups.
Batch embedding is significantly faster than multiple single embeddings, especially for local models which process batches in a single GPU/CPU pass.
use_cache (bool, default: True)
Whether to use the LRU cache. Cache hits skip embedding computation.
Returns: List of embedding vectors, one per input text (in the same order).
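The per-text cache lookup works by embedding only the misses in one batched call and reassembling results in input order. A minimal sketch of that pattern; the helper name `embed_batch_cached` and the plain-dict cache are assumptions, not the project's API:

```python
from typing import Callable, Dict, List


def embed_batch_cached(
    texts: List[str],
    cache: Dict[str, List[float]],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed only cache misses; return vectors in input order."""
    results: List = [None] * len(texts)
    miss_indices = []
    for i, text in enumerate(texts):
        if text in cache:
            results[i] = cache[text]  # cache hit: no recomputation
        else:
            miss_indices.append(i)
    if miss_indices:
        # One batched call covers all misses
        vectors = embed_fn([texts[i] for i in miss_indices])
        for i, vec in zip(miss_indices, vectors):
            cache[texts[i]] = vec
            results[i] = vec
    return results
```

A real implementation might additionally deduplicate repeated texts within the miss list before calling the provider.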
Cache-Aware Batching
from src.utils.embeddings.manager import EmbeddingManager
manager = EmbeddingManager()
texts = [
    "Abdul Rahman Ahmed",
    "Mohamedou Ould Slahi",
    "Abdul Rahman Ahmed",  # Duplicate - will be cached
]
vectors = await manager.embed_batch(texts)
print(f"Generated {len(vectors)} embeddings")
print(f"Cache stats: {manager.cache_stats}")
# Output: {'hits': 1, 'misses': 2, 'hit_rate': 0.33, 'size': 2}
embed_text_result
await manager.embed_text_result(text: str) -> EmbeddingResult
Embed a single text and return full metadata (model, dimension, usage).
model: Model name used to generate embeddings (e.g., "jinaai/jina-embeddings-v3").
dimension: Embedding dimension (e.g., 1024).
usage: Token usage information (cloud providers only).
metadata: Additional provider-specific metadata.
Utility Methods
get_active_model_name
manager.get_active_model_name() -> str
Return the model name of the currently active provider.
manager = EmbeddingManager(mode=EmbeddingMode.CLOUD)
model = manager.get_active_model_name()
print(model)  # "jinaai/jina-embeddings-v3"
cache_stats
manager.cache_stats -> dict
Return LRU cache hit/miss statistics for diagnostics.
hit_rate: Hit rate as a ratio (0.0 to 1.0).
size: Current number of cached items.
manager = EmbeddingManager()
await manager.embed_batch(["text1", "text2", "text1"])
stats = manager.cache_stats
print(f"Hit rate: {stats['hit_rate']:.1%}")  # "Hit rate: 33.3%"
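The statistics above come from the manager's thread-safe LRU cache. A minimal sketch of such a cache, assuming an `OrderedDict`-based design with a lock around each operation (illustrative, not the project's implementation):

```python
import threading
from collections import OrderedDict


class LRUCache:
    """Thread-safe LRU cache that tracks hit/miss statistics."""

    def __init__(self, max_items: int = 4096):
        self._data: OrderedDict = OrderedDict()
        self._lock = threading.Lock()
        self._max_items = max_items
        self._hits = 0
        self._misses = 0

    def get(self, key):
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)  # mark as most recently used
                self._hits += 1
                return self._data[key]
            self._misses += 1
            return None

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self._max_items:
                self._data.popitem(last=False)  # evict least recently used

    @property
    def stats(self) -> dict:
        with self._lock:
            total = self._hits + self._misses
            return {
                "hits": self._hits,
                "misses": self._misses,
                "hit_rate": self._hits / total if total else 0.0,
                "size": len(self._data),
            }
```

The `lru_max_items` setting in the domain config would correspond to `max_items` here.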
fingerprint_from_result
EmbeddingManager.fingerprint_from_result(
    result: EmbeddingResult
) -> Optional[str]
Build a stable fingerprint string from an EmbeddingResult.
Format: "{model}:{dimension}" (e.g., "jinaai/jina-embeddings-v3:1024").
Used to detect when an entity’s stored embedding was produced by a different model than the currently active one.
result = await manager.embed_text_result("Some text")
fingerprint = EmbeddingManager.fingerprint_from_result(result)
print(fingerprint)  # "jinaai/jina-embeddings-v3:1024"
Synchronous Wrappers
All async methods have synchronous equivalents:
Async Method              Sync Wrapper
embed_text()              embed_text_sync()
embed_batch()             embed_batch_sync()
embed_text_result()       embed_text_result_sync()
—                         embed_batch_result_sync()
Sync wrappers use a persistent event loop to avoid asyncio.run() overhead per call.
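One common way to build such wrappers is a background thread running a long-lived event loop, with coroutines submitted via `asyncio.run_coroutine_threadsafe`. The sketch below shows the general pattern under that assumption; `SyncBridge` and the toy `embed_text` are illustrative names, not the project's code:

```python
import asyncio
import threading


class SyncBridge:
    """Run coroutines on one persistent background event loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Submit to the shared loop instead of paying asyncio.run()
        # startup/teardown cost on every call
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()


bridge = SyncBridge()


async def embed_text(text: str) -> list:
    await asyncio.sleep(0)  # stand-in for real async embedding work
    return [float(len(text))]


def embed_text_sync(text: str) -> list:
    return bridge.run(embed_text(text))
```

Each sync call blocks only the caller's thread; the loop thread stays alive for subsequent calls.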
Embedding Providers
Base Classes
EmbeddingConfig
from src.utils.embeddings.base import EmbeddingConfig
config = EmbeddingConfig(
    model_name="jinaai/jina-embeddings-v3",
    batch_size=32,
    max_retries=3,
    timeout=30,
    device="auto",
    metadata={"project": "hinbox"}
)
model_name (str)
Model identifier (e.g., "jinaai/jina-embeddings-v3", "text-embedding-3-small").
batch_size (int)
Number of texts to process in a single batch.
max_retries (int)
Maximum retry attempts for cloud API calls.
timeout (int)
Request timeout in seconds (cloud providers only).
device (str)
Device for local models: "auto", "cpu", "cuda", or "mps". "auto" lets sentence-transformers choose the best available device.
metadata (Dict[str, Any], default: {})
Additional metadata to attach to API calls.
EmbeddingResult
Returned by embed_text_result() and provider embed_batch() methods.
See embed_text_result for field details.
CloudEmbeddingProvider
Source: src/utils/embeddings/cloud.py
Cloud embedding provider using LiteLLM for multi-provider API access.
Supports any embedding model available through LiteLLM, including OpenAI, Jina AI, Cohere, and others.
from src.utils.embeddings.cloud import CloudEmbeddingProvider
from src.utils.embeddings.base import EmbeddingConfig
config = EmbeddingConfig(
    model_name="jinaai/jina-embeddings-v3",
    batch_size=100,
    max_retries=3,
    metadata={"project": "hinbox"}
)
provider = CloudEmbeddingProvider(config)
vector = await provider.embed_single("The detention camp at Guantanamo Bay")
Key Features:
Exponential backoff retry with configurable attempts
Empty text filtering (returns empty vectors)
Usage tracking (token counts)
Metadata propagation to LiteLLM callbacks
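The exponential-backoff retry listed above can be sketched as a generic async wrapper. This is an illustration of the technique, not the provider's actual code; `with_retries` and its parameters are assumed names:

```python
import asyncio
import random


async def with_retries(call, max_retries: int = 3, base_delay: float = 0.5):
    """Retry an async call with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts, surface the error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

Jitter spreads out retries from concurrent callers so they do not hammer the API in lockstep.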
LocalEmbeddingProvider
Source: src/utils/embeddings/local.py
Local embedding provider using sentence-transformers (PyTorch).
Requires the local-embeddings extra: pip install 'hinbox[local-embeddings]'
The provider lazily imports sentence-transformers only when first used, so environments without PyTorch can still run the application in cloud mode.
from src.utils.embeddings.local import LocalEmbeddingProvider
from src.utils.embeddings.base import EmbeddingConfig
config = EmbeddingConfig(
    model_name="jinaai/jina-embeddings-v3",  # or "huggingface/..."
    batch_size=32,
    device="auto"  # or "cuda", "cpu", "mps"
)
provider = LocalEmbeddingProvider(config)
vectors = await provider.embed_batch(["text1", "text2"])
Model Name Transformations:
huggingface/... → removes prefix (sentence-transformers default)
jina_ai/... → converts to jinaai/... (sentence-transformers format)
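The listed transformations amount to simple prefix rewriting. A sketch consistent with the two rules above (`normalize_model_name` is a hypothetical helper name):

```python
def normalize_model_name(name: str) -> str:
    """Map provider-prefixed model names to sentence-transformers format."""
    if name.startswith("huggingface/"):
        # sentence-transformers loads HF repos directly, so drop the prefix
        return name[len("huggingface/"):]
    if name.startswith("jina_ai/"):
        # "jina_ai/" is rewritten to the "jinaai/" form sentence-transformers expects
        return "jinaai/" + name[len("jina_ai/"):]
    return name
```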
Device Resolution:
"auto": Let sentence-transformers auto-detect (CUDA > MPS > CPU)
"cuda", "cpu", "mps": Explicit device selection
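The "auto" case typically probes torch for available backends in priority order (CUDA, then MPS, then CPU). A sketch under that assumption; the real provider delegates this to sentence-transformers:

```python
def resolve_device(device: str = "auto") -> str:
    """Resolve 'auto' to the best available device; pass explicit choices through."""
    if device != "auto":
        return device
    try:
        import torch  # only needed for auto-detection
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
    except Exception:
        pass  # torch missing, or backend probe unsupported on this build
    return "cpu"
```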
Global Manager
get_default_manager
from src.utils.embeddings.manager import get_default_manager
manager = get_default_manager()
vector = await manager.embed_text("Some text")
Returns a singleton EmbeddingManager instance with default configuration.
The default manager is lazily initialized on first call. Subsequent calls return the same instance.
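Lazy singleton initialization of this kind is usually a module-level global guarded by a lock. A sketch of the pattern (the stand-in `EmbeddingManager` class and the locking detail are assumptions, not the module's actual code):

```python
import threading

_default_manager = None
_manager_lock = threading.Lock()


class EmbeddingManager:  # stand-in for the real class
    pass


def get_default_manager() -> EmbeddingManager:
    """Create the shared manager on first call, then reuse it."""
    global _default_manager
    if _default_manager is None:
        with _manager_lock:  # double-checked locking for thread safety
            if _default_manager is None:
                _default_manager = EmbeddingManager()
    return _default_manager
```

The double check avoids taking the lock on the hot path once the instance exists.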
Configuration
Domain Config
Embedding configuration is typically loaded from domain YAML:
configs/guantanamo/config.yaml
embeddings:
  mode: auto  # auto | local | cloud | hybrid
  cloud:
    model: jinaai/jina-embeddings-v3
    batch_size: 100
    max_retries: 3
    timeout: 30
  local:
    model: jinaai/jina-embeddings-v3
    batch_size: 32
    device: auto

cache:
  enabled: true
  embeddings:
    lru_max_items: 4096  # Set to 0 to disable LRU cache
Environment Variables
EMBEDDING_MODE
Override the embedding mode: auto, local, cloud, or hybrid. Takes precedence over domain config.
export EMBEDDING_MODE=local
python process_and_extract.py
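The precedence rule (env var over domain config) can be sketched as a small resolver. The function name and the config-dict shape mirror the YAML above but are otherwise assumptions:

```python
import os


def resolve_embedding_mode(domain_config: dict) -> str:
    """EMBEDDING_MODE env var wins; otherwise fall back to domain config."""
    env_mode = os.environ.get("EMBEDDING_MODE")
    if env_mode:
        return env_mode.lower()
    return domain_config.get("embeddings", {}).get("mode", "auto")
```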
Advanced Usage
Hybrid Mode with Fallback
from src.utils.embeddings.manager import EmbeddingManager, EmbeddingMode
# Try cloud first, fallback to local if cloud fails
manager = EmbeddingManager(mode=EmbeddingMode.HYBRID)
try:
    # Cloud is tried first; on failure the manager falls back
    # to the local provider automatically.
    vector = await manager.embed_text("Important text")
except Exception:
    # Raised only when both providers fail
    pass
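Internally, hybrid fallback reduces to a try/except around the cloud call. A simplified sketch of that control flow (the standalone `embed_hybrid` function is illustrative; the manager implements this on its providers):

```python
import asyncio


async def embed_hybrid(text, cloud_embed, local_embed):
    """Try the cloud embed function first; fall back to local on any error."""
    try:
        return await cloud_embed(text)
    except Exception:
        # Cloud failed (network, quota, auth, ...): use the local provider
        return await local_embed(text)
```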
Custom Provider Configuration
from src.utils.embeddings.manager import EmbeddingManager, EmbeddingMode
from src.utils.embeddings.base import EmbeddingConfig
cloud_config = EmbeddingConfig(
    model_name="text-embedding-3-large",
    batch_size=50,
    metadata={"project": "hinbox", "domain": "custom"}
)
local_config = EmbeddingConfig(
    model_name="huggingface/sentence-transformers/all-MiniLM-L6-v2",
    batch_size=64,
    device="cuda"
)
manager = EmbeddingManager(
    mode=EmbeddingMode.HYBRID,
    cloud_config=cloud_config,
    local_config=local_config,
    domain="custom"
)
Cache Fingerprint Validation
from src.utils.embeddings.manager import EmbeddingManager
manager = EmbeddingManager()
# Embed and get fingerprint
result = await manager.embed_text_result("Entity name")
current_fingerprint = EmbeddingManager.fingerprint_from_result(result)
# Later: check if a stored embedding matches the current model
stored_fingerprint = "jinaai/jina-embeddings-v3:1024"
if stored_fingerprint != current_fingerprint:
    print("Re-embedding required: model changed")
See Also