How it works
Reranking implements a two-stage retrieval process to improve search quality over pure vector similarity.
Reranking process
- Candidate retrieval - Retrieve top-k candidates using fast ANN search
- Cross-encoder scoring - Apply cross-encoder to score query-document pairs
- Reranking - Sort candidates by cross-encoder scores
- Top-k selection - Return top rerank_k documents
- Optional RAG - Generate answer using reranked context
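The five steps above can be sketched end-to-end. In this sketch the ANN index and the cross-encoder are stand-in heuristics (term overlap), not real models; only the pipeline shape is the point, and all function names are illustrative.

```python
def ann_search(query, corpus, top_k):
    """Stage 1: fast candidate retrieval (mock similarity = term overlap)."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def cross_encoder_score(query, doc):
    """Stage 2: joint query-document scoring (mock: length-normalized overlap)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / (len(d_terms) ** 0.5 or 1.0)

def rerank(query, corpus, top_k=50, rerank_k=10):
    candidates = ann_search(query, corpus, top_k)            # candidate retrieval
    scored = [(cross_encoder_score(query, d), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)      # reranking
    return [doc for _, doc in scored[:rerank_k]]             # top-k selection
```

A real system would swap the two mock scorers for a vector index and a cross-encoder model; the control flow stays the same.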
Cross-encoder vs bi-encoder
Bi-encoders (used in vector search)
- Embed query and documents independently
- Enable fast approximate nearest neighbor search
- Cannot capture query-document interactions
- Used in first-stage retrieval
Cross-encoders (used in reranking)
- Process query and document together
- Compute attention across both inputs
- Capture fine-grained semantic interactions
- Higher accuracy but slower (one forward pass per candidate, so O(n) comparisons per query)
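The structural difference can be shown with toy scorers (these are illustrations, not real encoders): a bi-encoder maps each text to a vector independently, so document vectors can be precomputed and indexed, while a cross-encoder must see each (query, document) pair jointly at query time.

```python
def bi_encode(text):
    """Independent embedding (mock: 2-d bag-of-words stats)."""
    words = text.lower().split()
    return (len(words), len(set(words)))

def bi_score(query_vec, doc_vec):
    """Interaction is limited to a vector similarity (dot product)."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def cross_score(query, doc):
    """Joint scoring: can inspect term-level interactions directly."""
    overlap = set(query.lower().split()) & set(doc.lower().split())
    return len(overlap)

# Bi-encoder: document vectors are precomputed once, reused for every query.
doc_vecs = [bi_encode(d) for d in ["a b c", "a a b"]]
query_vec = bi_encode("a c")
bi_scores = [bi_score(query_vec, dv) for dv in doc_vecs]

# Cross-encoder: one pass per (query, doc) pair -> O(n) work at query time.
cross_scores = [cross_score("a c", d) for d in ["a b c", "a a b"]]
```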
Key features
- Models: modern cross-encoder rerankers and lightweight scoring models
- Integrated evaluation with contextual recall, precision, and faithfulness metrics
- Configurable candidate pool size and final result count
- Compatible with all vector databases
Implementation
Configuration
Required settings
- Vector database API authentication
- Target index for candidate retrieval
- Cross-encoder model for reranking
Optional settings
- Namespace for document isolation
- Embedding model for candidate retrieval
- LLM for RAG answer generation
Example configuration
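The original example is not reproduced here; the sketch below covers the required and optional settings listed above. Every key name and value is a hypothetical placeholder, not a documented schema.

```python
# Hypothetical configuration; key names and values are assumptions.
reranking_config = {
    # Required
    "vector_db_api_key": "YOUR_API_KEY",   # vector database authentication
    "index_name": "docs-index",            # target index for candidate retrieval
    "cross_encoder_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    # Optional
    "namespace": "tenant-a",               # document isolation
    "embedding_model": "all-MiniLM-L6-v2", # candidate-retrieval embeddings
    "llm_model": None,                     # set a model name to enable RAG answers
}
```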
Search parameters
- Search query text to execute
- top_k - Number of candidates to retrieve before reranking. Higher values improve reranking quality but increase latency.
- rerank_k - Number of results to return after reranking. Should match your application’s result display needs.
- Optional metadata filters for pre-filtering candidates
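A request carrying these parameters might look like the following; the field names are assumptions for illustration, not a documented request schema.

```python
# Hypothetical request shape for the search parameters above.
search_request = {
    "query": "treatment options for type 2 diabetes",
    "top_k": 50,       # candidates retrieved before reranking
    "rerank_k": 10,    # results returned after reranking
    "filter": {"source": "medical-guidelines"},  # optional metadata pre-filter
}
# top_k should always be at least rerank_k, since reranking only reorders
# and truncates the candidate pool.
```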
Recommended models
Fast models (low latency)
cross-encoder/ms-marco-MiniLM-L-6-v2
Fast, good accuracy. Best for production systems with latency requirements.
- Layers: 6
- Parameters: ~22M
- Latency: ~10ms per pair
- Use case: Default choice for most applications
cross-encoder/ms-marco-TinyBERT-L-2-v2
Extremely fast, acceptable accuracy. For high-throughput scenarios.
- Layers: 2
- Parameters: ~4M
- Latency: ~3ms per pair
- Use case: High QPS, latency-critical
Accurate models (higher quality)
cross-encoder/ms-marco-MiniLM-L-12-v2
More accurate, slower. For offline evaluation or quality-critical applications.
- Layers: 12
- Parameters: ~33M
- Latency: ~20ms per pair
- Use case: Batch processing, high quality needs
BAAI/bge-reranker-v2-m3
Multilingual, high accuracy. For global applications.
- Layers: 12
- Parameters: ~568M
- Latency: ~50ms per pair
- Use case: Multilingual, maximum quality
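One way to operationalize this list is to pick the most accurate model that fits a per-pair latency budget. The helper below is a sketch (the function is not part of any library); the latency figures come straight from the model descriptions above, and it assumes the list's fast-to-accurate ordering.

```python
# (model, approx. latency in ms per pair), ordered fastest to most accurate,
# using the figures quoted in this section.
MODELS_BY_LATENCY = [
    ("cross-encoder/ms-marco-TinyBERT-L-2-v2", 3),
    ("cross-encoder/ms-marco-MiniLM-L-6-v2", 10),
    ("cross-encoder/ms-marco-MiniLM-L-12-v2", 20),
    ("BAAI/bge-reranker-v2-m3", 50),
]

def pick_model(budget_ms_per_pair):
    """Return the most accurate model within the latency budget,
    falling back to the fastest model if nothing fits."""
    eligible = [m for m, ms in MODELS_BY_LATENCY if ms <= budget_ms_per_pair]
    return eligible[-1] if eligible else MODELS_BY_LATENCY[0][0]
```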
Performance considerations
Candidate pool sizing
top_k controls the candidate pool size
- Larger values improve reranking quality but increase latency
- Recommended: 5-10x rerank_k
- Example: top_k=50, rerank_k=10
Latency optimization
Latency formula
Total reranking latency is approximately top_k × per-pair scoring latency; first-stage ANN retrieval adds comparatively little.
Quality vs speed tradeoff
| Model size | top_k | rerank_k | Quality | Latency |
|---|---|---|---|---|
| TinyBERT-L-2 | 20 | 5 | Good | ~60ms |
| MiniLM-L-6 | 50 | 10 | Better | ~500ms |
| MiniLM-L-12 | 100 | 20 | Best | ~2000ms |
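The latency column in the table is simply top_k multiplied by each model's per-pair latency, which makes the tradeoff easy to estimate for other settings:

```python
# Approximate per-pair scoring latency in ms, from the model table above.
PER_PAIR_MS = {"TinyBERT-L-2": 3, "MiniLM-L-6": 10, "MiniLM-L-12": 20}

def rerank_latency_ms(model, top_k):
    """Estimated reranking latency: one cross-encoder pass per candidate."""
    return PER_PAIR_MS[model] * top_k

estimates = [
    rerank_latency_ms("TinyBERT-L-2", 20),   # ~60 ms
    rerank_latency_ms("MiniLM-L-6", 50),     # ~500 ms
    rerank_latency_ms("MiniLM-L-12", 100),   # ~2000 ms
]
```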
Use cases
High-precision search
When accuracy matters more than recall.
Complex semantic queries
Queries requiring deep understanding.
Domain-specific retrieval
Specialized domains with nuanced relevance.
Evaluation metrics
Reranking performance can be measured with:
- Contextual recall - Fraction of relevant docs in reranked results
- Precision@k - Accuracy of top-k reranked results
- NDCG@k - Normalized discounted cumulative gain
- MRR - Mean reciprocal rank of first relevant result
- Faithfulness - Alignment between reranked context and answers
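Three of these metrics are straightforward to compute from a list of binary relevance labels (1 = relevant) in reranked order; the sketch below shows standard formulations:

```python
import math

def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    top = relevance[:k]
    return sum(top) / len(top)

def mrr(relevance):
    """Reciprocal rank of the first relevant result (0 if none)."""
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def ndcg_at_k(relevance, k):
    """DCG of the top-k, normalized by the ideal (sorted) ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(relevance, reverse=True)[:k]
    return dcg(relevance[:k]) / dcg(ideal) if dcg(ideal) else 0.0

ranked = [1, 0, 1, 0, 0]  # relevance of reranked results, in rank order
```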
Implementation details
How cross-encoders work
- Query-document attention
- Fine-grained token interactions
- Context-aware relevance scoring
Reranking helper
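The original helper code is not shown here; below is a minimal sketch of what such a helper typically looks like. `score_fn` stands in for a real cross-encoder's batch scoring method, and the function name is illustrative.

```python
def rerank_candidates(query, candidates, score_fn, rerank_k=10):
    """Score each candidate against the query and keep the rerank_k best."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_fn(pairs)  # one relevance score per (query, doc) pair
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:rerank_k]]

# Usage with a toy scorer (term overlap); a real scorer would be a model.
def toy_scorer(pairs):
    return [len(set(q.split()) & set(d.split())) for q, d in pairs]

top = rerank_candidates("rust memory safety",
                        ["python tips", "memory safety in rust", "rust intro"],
                        toy_scorer, rerank_k=2)
```

Passing the scorer in as a callable keeps the helper independent of any particular cross-encoder library.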
Related features
Semantic search
First-stage candidate retrieval
Hybrid search
Dense + sparse fusion before reranking
Contextual compression
Reduce context after reranking
MMR
Diversity-aware reranking