Two-stage retrieval first casts a wide net with fast vector search, then uses a cross-encoder model to precisely score the top candidates. Cross-encoders see query and document together, enabling much finer relevance judgments than embedding similarity alone.

How it works

Reranking implements a two-stage retrieval process to improve search quality over pure vector similarity.

Reranking process

  1. Candidate retrieval - Retrieve top-k candidates using fast ANN search
  2. Cross-encoder scoring - Apply cross-encoder to score query-document pairs
  3. Reranking - Sort candidates by cross-encoder scores
  4. Top-k selection - Return top rerank_k documents
  5. Optional RAG - Generate answer using reranked context
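The five steps above can be sketched end-to-end in plain Python. This is a minimal illustration with stand-in functions: `ann_search` and `cross_encoder_score` here are simple word-overlap heuristics, not the pipeline's real ANN index or cross-encoder model.

```python
# Minimal sketch of the two-stage retrieval process.
# ann_search and cross_encoder_score are stand-ins for a real
# vector index and cross-encoder model.

def ann_search(query, corpus, top_k):
    # Step 1: fast first-stage retrieval (here: naive word overlap).
    scored = [(doc, len(set(query.split()) & set(doc.split()))) for doc in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

def cross_encoder_score(query, doc):
    # Step 2: joint query-document scoring (stand-in Jaccard heuristic).
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q | d), 1)

def rerank_search(query, corpus, top_k=50, rerank_k=10):
    candidates = ann_search(query, corpus, top_k)      # 1. candidate retrieval
    scored = [(doc, cross_encoder_score(query, doc))   # 2. cross-encoder scoring
              for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # 3. reranking
    return scored[:rerank_k]                             # 4. top-k selection

corpus = [
    "solar panels convert sunlight",
    "wind turbines generate power",
    "coal mining history",
]
for doc, score in rerank_search("solar sunlight power", corpus, top_k=3, rerank_k=2):
    print(f"{score:.3f}  {doc}")
```

Step 5 (optional RAG) would pass the reranked documents to an LLM as generation context.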

Cross-encoder vs bi-encoder

Bi-encoders (used in vector search)
  • Embed query and documents independently
  • Enable fast approximate nearest neighbor search
  • Cannot capture query-document interactions
  • Used in first-stage retrieval
Cross-encoders (used in reranking)
  • Process query and document together
  • Compute attention across both inputs
  • Capture fine-grained semantic interactions
  • Higher accuracy but slower (one full forward pass per query-document pair)
Typical pipelines retrieve candidates with bi-encoders, then rerank top-k with cross-encoders.

Key features

  • Models: modern cross-encoder rerankers and lightweight scoring models
  • Integrated evaluation with contextual recall, precision, and faithfulness metrics
  • Configurable candidate pool size and final result count
  • Compatible with all vector databases

Implementation

from vectordb.langchain.reranking import PineconeRerankingSearchPipeline

pipeline = PineconeRerankingSearchPipeline("config.yaml")
results = pipeline.search(
    query="renewable energy technologies",
    top_k=50,  # Candidate pool size
    rerank_k=10,  # Final results
    filters={"category": "science"},
)

print(f"Query: {results['query']}")
for doc in results["documents"]:
    print(f"Score: {doc.score:.3f} - {doc.content[:80]}...")
if "answer" in results:
    print(f"RAG Answer: {results['answer']}")

Configuration

Required settings

  • pinecone.api_key (string, required) - Vector database API authentication
  • pinecone.index_name (string, required) - Target index for candidate retrieval
  • reranker.model (string, required) - Cross-encoder model for reranking

Optional settings

  • pinecone.namespace (string) - Namespace for document isolation
  • embedder (object) - Embedding model for candidate retrieval
  • llm (object) - Optional LLM for RAG answer generation

Example configuration

pinecone:
  api_key: "${PINECONE_API_KEY}"
  index_name: "reranking-search"
  namespace: "production"

embedder:
  model_name: "all-MiniLM-L6-v2"

reranker:
  model: "cross-encoder/ms-marco-MiniLM-L-6-v2"

rag:
  enabled: true
  generator_model: "gpt-4o-mini"

Search parameters

  • query (string, required) - Search query text to execute
  • top_k (integer, default: 10) - Number of candidates to retrieve before reranking. Higher values improve reranking quality but increase latency.
  • rerank_k (integer, default: 5) - Number of results to return after reranking. Should match your application's result display needs.
  • filters (dict, optional) - Metadata filters for pre-filtering candidates

Fast models (low latency)

MiniLM-L-6: fast, good accuracy. Best for production systems with latency requirements.
  • Layers: 6
  • Parameters: ~22M
  • Latency: ~10ms per pair
  • Use case: Default choice for most applications
TinyBERT-L-2: extremely fast, acceptable accuracy. For high-throughput scenarios.
  • Layers: 2
  • Parameters: ~4M
  • Latency: ~3ms per pair
  • Use case: High QPS, latency-critical

Accurate models (higher quality)

MiniLM-L-12: more accurate, slower. For offline evaluation or quality-critical applications.
  • Layers: 12
  • Parameters: ~33M
  • Latency: ~20ms per pair
  • Use case: Batch processing, high quality needs
Multilingual, high accuracy. For global applications.
  • Layers: 12
  • Parameters: ~568M
  • Latency: ~50ms per pair
  • Use case: Multilingual, maximum quality

Performance considerations

Candidate pool sizing

  • top_k controls the candidate pool size
  • Larger values improve reranking quality but increase latency
  • Recommended: 5-10x rerank_k
  • Example: top_k=50, rerank_k=10
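The sizing rule above can be captured in a small helper. `choose_top_k` is a hypothetical convenience function, not part of the library:

```python
def choose_top_k(rerank_k, ratio=5, max_pool=100):
    """Pick a candidate pool size as a multiple of rerank_k, capped at max_pool.

    ratio follows the recommended 5-10x guideline: higher ratios improve
    reranking quality at the cost of cross-encoder latency.
    """
    return min(rerank_k * ratio, max_pool)

print(choose_top_k(10))            # 50  (5x, the low end of the guideline)
print(choose_top_k(10, ratio=10))  # 100 (10x, the high end)
```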

Latency optimization

Cross-encoder latency scales linearly with top_k. For latency-sensitive applications:
  1. Use smaller cross-encoder models (6-layer or TinyBERT)
  2. Limit top_k to 20-30 candidates
  3. Cache reranking results for popular queries
Latency formula
total_latency = vector_search_latency + (top_k * cross_encoder_latency)
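Plugging rough numbers into the formula, assuming ~50ms for the vector search stage and the per-pair latencies listed above:

```python
def total_latency_ms(vector_search_ms, top_k, per_pair_ms):
    # total = vector search + one cross-encoder pass per candidate
    return vector_search_ms + top_k * per_pair_ms

# MiniLM-L-6 (~10ms/pair) over 50 candidates
print(total_latency_ms(50, top_k=50, per_pair_ms=10))  # 550

# TinyBERT-L-2 (~3ms/pair) over 20 candidates
print(total_latency_ms(50, top_k=20, per_pair_ms=3))   # 110
```

In practice cross-encoders score pairs in batches on GPU, so the per-pair cost amortizes; treat the formula as an upper bound for sequential scoring.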

Quality vs speed tradeoff

Model size    top_k  rerank_k  Quality  Latency
TinyBERT-L-2  20     5         Good     ~60ms
MiniLM-L-6    50     10        Better   ~500ms
MiniLM-L-12   100    20        Best     ~2000ms

Use cases

High-precision search

When precision matters more than recall:
results = pipeline.search(
    query="FDA approval process for novel therapeutics",
    top_k=100,  # Wide net
    rerank_k=5,  # High-precision top results
    filters={"source": "regulatory_docs"},
)

Complex semantic queries

Queries requiring deep understanding:
results = pipeline.search(
    query="Compare gradient descent optimization in neural networks vs traditional convex optimization",
    top_k=50,
    rerank_k=10,
)

Domain-specific retrieval

Specialized domains with nuanced relevance:
results = pipeline.search(
    query="differential diagnosis for acute chest pain with ST elevation",
    top_k=30,
    rerank_k=5,
    filters={"specialty": "cardiology"},
)

Evaluation metrics

Reranking performance can be measured with:
  • Contextual recall - Fraction of relevant docs in reranked results
  • Precision@k - Accuracy of top-k reranked results
  • NDCG@k - Normalized discounted cumulative gain
  • MRR - Mean reciprocal rank of first relevant result
  • Faithfulness - Alignment between reranked context and answers
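Several of the ranking metrics above are straightforward to compute from a ranked list given binary relevance labels. This is a minimal sketch, not the library's integrated evaluation:

```python
import math

def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k ranked docs that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def mrr(relevant, ranked):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

def ndcg_at_k(relevant, ranked, k):
    """Binary-relevance NDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(1 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

relevant = {"d1", "d3"}
ranked = ["d2", "d1", "d4", "d3"]
print(precision_at_k(relevant, ranked, 2))  # 0.5
print(mrr(relevant, ranked))                # 0.5
```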

Implementation details

How cross-encoders work

# Bi-encoder (vector search)
query_vec = embed(query)  # Independent
doc_vec = embed(document)  # Independent
score = cosine_similarity(query_vec, doc_vec)

# Cross-encoder (reranking)
input_text = f"[CLS] {query} [SEP] {document} [SEP]"
score = cross_encoder(input_text)  # Joint encoding
Cross-encoders process query and document together, enabling:
  • Query-document attention
  • Fine-grained token interactions
  • Context-aware relevance scoring

Reranking helper

from vectordb.langchain.utils import RerankerHelper

# Create reranker
reranker = RerankerHelper.create_reranker({
    "reranker": {"model": "cross-encoder/ms-marco-MiniLM-L-6-v2"}
})

# Rerank documents
reranked_docs = RerankerHelper.rerank(
    reranker,
    query="machine learning",
    documents=candidates,
    top_k=10,
)

# Rerank with scores
reranked_with_scores = RerankerHelper.rerank_with_scores(
    reranker,
    query="machine learning",
    documents=candidates,
    top_k=10,
)

for doc, score in reranked_with_scores:
    print(f"Score: {score:.3f} - {doc.page_content[:50]}")

Related

  • Semantic search - First-stage candidate retrieval
  • Hybrid search - Dense + sparse fusion before reranking
  • Contextual compression - Reduce context after reranking
  • MMR - Diversity-aware reranking
