Semantic search retrieves documents by meaning rather than by exact keyword overlap. Documents and queries are converted into dense vector embeddings by the same model, and similarity is measured with cosine similarity in the embedding space.
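As a minimal illustration of the similarity measure, here is cosine similarity computed over toy 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the
    vector magnitudes; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding outputs
query_vec = [0.2, 0.9, 0.1]
doc_close = [0.25, 0.85, 0.15]  # semantically similar document
doc_far = [0.9, 0.1, 0.4]       # unrelated document

# The semantically closer document scores higher
print(cosine_similarity(query_vec, doc_close) > cosine_similarity(query_vec, doc_far))
```

Real embeddings have hundreds of dimensions, but the ranking principle is the same: retrieval returns the documents whose vectors point in the most similar direction to the query vector.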
How it works
Indexing
Each document’s text is embedded using HuggingFaceEmbeddings (created via EmbedderHelper.create_embedder(config)). The resulting float vector and document metadata are stored in the target backend through the backend’s LangChain integration.
Query embedding
At search time, the same embedder model embeds the query string via EmbedderHelper.embed_query(embedder, query).
Nearest-neighbor retrieval
The LangChain retriever performs approximate nearest-neighbor search and returns the top-k most similar documents.
Optional generation
If rag.enabled: true, retrieved documents are formatted into a prompt using RAGHelper.format_prompt() and passed to a ChatGroq LLM for answer generation.
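The four steps above can be sketched end to end with a stub embedder standing in for the real model (a toy illustration of the flow, not the library's API; real pipelines use a dense HuggingFace model instead of bag-of-words counts):

```python
import math
from collections import Counter

def toy_embed(text: str) -> dict[str, int]:
    """Stand-in for an embedding model: a bag-of-words vector.
    The same function must embed both documents and queries."""
    return Counter(text.lower().split())

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1, indexing: embed each document once and store the vectors
corpus = [
    "the capital of france is paris",
    "python is a programming language",
    "the eiffel tower is in paris",
]
index = [(doc, toy_embed(doc)) for doc in corpus]

# Steps 2-3, query embedding and nearest-neighbor retrieval (top-k)
def search(query: str, top_k: int = 2) -> list[str]:
    q_vec = toy_embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(search("where is paris"))
```

Step 4 (optional generation) would then format the returned documents into a prompt and pass them to the LLM.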
Pipeline implementation
The semantic search pipeline is implemented as two classes per backend: one for indexing and one for search.
Indexing pipeline
src/vectordb/langchain/semantic_search/indexing/chroma.py
```python
from typing import Any

from vectordb.databases.chroma import ChromaVectorDB
from vectordb.dataloaders import DataloaderCatalog
from vectordb.langchain.utils import ConfigLoader, EmbedderHelper


class ChromaSemanticIndexingPipeline:
    """Chroma indexing pipeline for semantic search (LangChain).

    Loads documents from the configured data source, generates dense
    embeddings, and indexes them in a local Chroma collection for
    similarity retrieval.
    """

    def __init__(self, config_or_path: dict[str, Any] | str) -> None:
        self.config = ConfigLoader.load(config_or_path)
        ConfigLoader.validate(self.config, "chroma")

        # Create embedder from config
        self.embedder = EmbedderHelper.create_embedder(self.config)

        # Initialize Chroma database
        chroma_config = self.config["chroma"]
        self.db = ChromaVectorDB(
            path=chroma_config.get("path", "./chroma_data"),
        )
        self.collection_name = chroma_config.get("collection_name", "semantic_search")

    def run(self) -> dict[str, Any]:
        """Execute the indexing pipeline."""
        # Load documents, honoring an optional limit
        dl_config = self.config.get("dataloader", {})
        loader = DataloaderCatalog.create(
            dl_config.get("type", "triviaqa"),
            split=dl_config.get("split", "test"),
            limit=dl_config.get("limit"),
        )
        dataset = loader.load()
        documents = dataset.to_langchain()

        # Generate embeddings for all documents
        docs, embeddings = EmbedderHelper.embed_documents(self.embedder, documents)

        # Create or recreate the collection
        recreate = self.config.get("chroma", {}).get("recreate", False)
        self.db.create_collection(
            name=self.collection_name,
            recreate=recreate,
        )

        # Upsert documents with embeddings into Chroma
        num_indexed = self.db.upsert(
            documents=docs,
            embeddings=embeddings,
            collection_name=self.collection_name,
        )
        return {"documents_indexed": num_indexed}
```
Search pipeline
src/vectordb/langchain/semantic_search/search/chroma.py
```python
from typing import Any

from vectordb.databases.chroma import ChromaVectorDB
from vectordb.langchain.utils import ConfigLoader, EmbedderHelper, RAGHelper
from vectordb.utils.chroma_document_converter import ChromaDocumentConverter


class ChromaSemanticSearchPipeline:
    """Chroma semantic search pipeline (LangChain).

    Implements dense vector similarity search on Chroma collections.
    Queries are embedded and matched against stored document embeddings
    to find semantically similar documents.
    """

    def __init__(self, config_or_path: dict[str, Any] | str) -> None:
        self.config = ConfigLoader.load(config_or_path)
        ConfigLoader.validate(self.config, "chroma")

        self.embedder = EmbedderHelper.create_embedder(self.config)

        chroma_config = self.config["chroma"]
        self.db = ChromaVectorDB(
            path=chroma_config.get("path", "./chroma_data"),
        )
        self.collection_name = chroma_config.get("collection_name", "semantic_search")

        # None unless rag.enabled is true in the config
        self.llm = RAGHelper.create_llm(self.config)

    def search(
        self,
        query: str,
        top_k: int = 10,
        filters: dict[str, Any] | None = None,
    ) -> dict[str, Any]:
        """Execute semantic search against the Chroma collection."""
        # Embed the query for similarity search
        query_embedding = EmbedderHelper.embed_query(self.embedder, query)

        # Ensure the collection handle is loaded before querying
        self.db._get_collection(self.collection_name)
        results_dict = self.db.query(
            query_embedding=query_embedding,
            n_results=top_k,
            where=filters,
        )
        documents = ChromaDocumentConverter.convert_query_results_to_langchain_documents(
            results_dict
        )

        result = {
            "documents": documents,
            "query": query,
        }

        # Generate a RAG answer if an LLM is configured
        if self.llm is not None:
            result["answer"] = RAGHelper.generate(self.llm, query, documents)
        return result
```
Configuration
```yaml
chroma:
  path: "./chroma_data"         # Directory for local storage
  collection_name: "documents"  # Collection name
  recreate: false               # Whether to recreate the collection

embeddings:
  model: "sentence-transformers/all-MiniLM-L6-v2"
  device: "cpu"                 # or "cuda" for GPU
  batch_size: 32

dataloader:
  type: "triviaqa"
  split: "test"
  limit: 500

search:
  top_k: 10

rag:
  enabled: false
```
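The pipelines resolve every setting through chained `dict.get` calls, so a partial config falls back to the documented defaults. A minimal sketch of that precedence (the helper function here is illustrative, not part of the library):

```python
def resolve_chroma_settings(config: dict) -> dict:
    """Mirror the pipelines' fallback behavior: missing sections and
    missing keys resolve to the defaults used in the code."""
    chroma = config.get("chroma", {})
    return {
        "path": chroma.get("path", "./chroma_data"),
        "collection_name": chroma.get("collection_name", "semantic_search"),
        "recreate": chroma.get("recreate", False),
    }

# An empty config resolves entirely to defaults
print(resolve_chroma_settings({}))

# A partial config overrides only the keys it names
print(resolve_chroma_settings({"chroma": {"collection_name": "documents"}}))
```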
Embedding helper
The EmbedderHelper provides class methods for creating and using HuggingFace embedding models:
src/vectordb/langchain/utils/embeddings.py
```python
from typing import Any

from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings


class EmbedderHelper:
    """Helper class for HuggingFace embedding model operations."""

    @classmethod
    def create_embedder(cls, config: dict[str, Any]) -> HuggingFaceEmbeddings:
        """Create HuggingFaceEmbeddings from config."""
        embeddings_config = config.get("embeddings", {})
        model = embeddings_config.get("model", "sentence-transformers/all-MiniLM-L6-v2")
        device = embeddings_config.get("device", "cpu")
        batch_size = embeddings_config.get("batch_size", 32)
        return HuggingFaceEmbeddings(
            model_name=model,
            model_kwargs={"device": device},
            encode_kwargs={"batch_size": batch_size},
        )

    @classmethod
    def embed_documents(
        cls, embedder: HuggingFaceEmbeddings, documents: list[Document]
    ) -> tuple[list[Document], list[list[float]]]:
        """Embed documents and return them alongside their embeddings."""
        texts = [doc.page_content for doc in documents]
        embeddings = embedder.embed_documents(texts)
        return documents, embeddings

    @classmethod
    def embed_query(cls, embedder: HuggingFaceEmbeddings, query: str) -> list[float]:
        """Embed a single query string."""
        return embedder.embed_query(query)
```
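Because embed_documents simply pairs texts with vectors, its contract can be shown with a duck-typed stub in place of the real model. Both StubEmbedder and the simplified helper below are illustrative, not part of the library; real documents are LangChain Document objects rather than plain dicts:

```python
class StubEmbedder:
    """Duck-typed stand-in for HuggingFaceEmbeddings (illustrative only):
    exposes the same embed_documents / embed_query surface."""

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Toy vectors derived from text length; a real model returns
        # dense semantic vectors of a fixed dimension (e.g. 384)
        return [[float(len(t)), 1.0] for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return [float(len(text)), 1.0]


def embed_documents(embedder, documents: list[dict]) -> tuple[list[dict], list[list[float]]]:
    """Simplified mirror of EmbedderHelper.embed_documents: extract the
    texts, embed them, and return the documents with a parallel vector list."""
    texts = [doc["page_content"] for doc in documents]
    return documents, embedder.embed_documents(texts)


docs = [{"page_content": "short"}, {"page_content": "a longer document"}]
docs_out, vectors = embed_documents(StubEmbedder(), docs)
print(len(vectors))  # one vector per document, in the same order
```

The key invariant is the parallel ordering: `vectors[i]` is the embedding of `docs_out[i]`, which is what the upsert step relies on.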
RAG helper
The RAGHelper creates LLMs and formats prompts for answer generation:
src/vectordb/langchain/utils/rag.py
```python
import os
from typing import Any

from langchain_core.documents import Document
from langchain_groq import ChatGroq


class RAGHelper:
    """Helper for RAG-related operations."""

    DEFAULT_PROMPT_TEMPLATE = """{context}

Question: {query}

Answer:"""

    @classmethod
    def create_llm(cls, config: dict[str, Any]) -> ChatGroq | None:
        """Create a ChatGroq LLM from config; return None when RAG is disabled."""
        rag_config = config.get("rag", {})
        if not rag_config.get("enabled", False):
            return None
        model = rag_config.get("model", "llama-3.3-70b-versatile")
        api_key = rag_config.get("api_key") or os.environ.get("GROQ_API_KEY")
        temperature = rag_config.get("temperature", 0.7)
        max_tokens = rag_config.get("max_tokens", 2048)
        return ChatGroq(
            model=model,
            api_key=api_key,
            temperature=temperature,
            max_tokens=max_tokens,
        )

    @classmethod
    def generate(
        cls,
        llm: ChatGroq,
        query: str,
        documents: list[Document],
        template: str | None = None,
    ) -> str:
        """Generate a RAG answer with the LLM."""
        prompt = cls.format_prompt(query, documents, template)
        response = llm.invoke(prompt)
        return response.content
```
When to use it
Natural-language questions where query phrasing differs from document vocabulary
General-purpose RAG baseline before specializing with advanced features
Any corpus where exact keyword overlap between query and documents is unreliable
When not to use it
Strict compliance or legal workflows where exact terms must appear verbatim
Very small corpora where BM25 already saturates quality
Keyword-heavy technical workloads where semantic generalization is unhelpful
Tradeoffs
| Dimension | What to expect |
| --- | --- |
| Quality | Strong semantic recall; may miss exact terminology |
| Latency | Low to moderate; dominated by embedding inference |
| Cost | Embedding compute plus vector search cost per query |
Settings to tune first
`embeddings.model`
The primary quality lever; the model determines how semantically meaningful similarity scores are. Common choices:
sentence-transformers/all-MiniLM-L6-v2: fast, 384-dimensional
sentence-transformers/all-mpnet-base-v2: higher quality, 768-dimensional
BAAI/bge-small-en-v1.5: strong retrieval performance
`search.top_k`
Controls the number of returned candidates; too small misses evidence, too large increases downstream cost.
`dataloader.limit`
Corpus size for experiments; start small to validate the pipeline, then scale up.
Common pitfalls
Mismatched embedding models: using a different model for indexing and querying produces meaningless similarity scores.
Oversized chunks: large text chunks blur the embedding signal. Shorter, focused chunks typically produce better retrieval.
Too small top_k: if relevant evidence is rarely in the top 3 results, increase top_k and apply reranking rather than only tuning the embedding model.
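To make the oversized-chunks pitfall concrete, here is a minimal word-window chunker with overlap (a sketch; production pipelines typically use a library text splitter instead):

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows so that each chunk
    carries one focused topic instead of blurring several topics
    into a single embedding."""
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 120 words split into 50-word chunks with a 10-word overlap
text = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(text, chunk_size=50, overlap=10)
print(len(chunks))
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk; each chunk is then embedded and indexed as its own document.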
Backends supported
Chroma, Milvus, Pinecone, Qdrant, Weaviate.
Next steps
Add reranking: add two-stage retrieval for better final-result precision.
Hybrid search: switch to hybrid indexing if queries mix natural language with domain keywords.
Metadata filtering: add metadata filtering if the corpus has reliable structured attributes.
Components: explore reusable components for query enhancement and compression.