Diversity filtering post-processes search results so that the returned documents cover different aspects of the query. When a search returns many near-duplicates, it selects representative documents instead, reducing redundancy and improving information coverage.

How it works

Diversity filtering over-fetches candidates from the vector database, then applies post-processing to select diverse results.

Pipeline architecture

  1. Query embedding - Convert query text to dense vector
  2. Over-fetch - Retrieve 3x top_k candidates from database
  3. Re-embedding - Generate embeddings for retrieved documents
  4. Diversity filtering - Apply MMR or clustering method
  5. Limit - Return top_k diverse documents
  6. Optional RAG - Generate answer using diverse documents
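The steps above can be compressed into a short sketch. Everything below is illustrative: the toy corpus, the bag-of-words `embed`, and the near-duplicate threshold stand in for the real embedder, vector database, and diversity method.

```python
import numpy as np

# Toy corpus standing in for an indexed vector database.
CORPUS = [
    "machine learning applications in healthcare",
    "machine learning applications in healthcare systems",  # near-duplicate
    "deep learning for image recognition",
    "economic impacts of automation",
    "reinforcement learning for robotics",
]
VOCAB = sorted({w for doc in CORPUS for w in doc.lower().split()})

def embed(text):
    """Deterministic bag-of-words embedding, L2-normalized (stand-in model)."""
    vec = np.zeros(len(VOCAB))
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB.index(word)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def diversity_search(query, top_k=2, candidate_multiplier=3):
    q = embed(query)                                        # 1. query embedding
    ranked = sorted(CORPUS, key=lambda d: -float(q @ embed(d)))
    candidates = ranked[: top_k * candidate_multiplier]     # 2. over-fetch
    vectors = [embed(d) for d in candidates]                # 3. re-embed
    selected, kept = [], []
    for doc, vec in zip(candidates, vectors):               # 4. diversity: skip
        if all(float(vec @ v) < 0.9 for v in kept):         #    near-duplicates
            selected.append(doc)
            kept.append(vec)
        if len(selected) == top_k:                          # 5. limit
            break
    return selected

results = diversity_search("machine learning healthcare", top_k=2)
# The near-duplicate "…healthcare systems" doc is skipped in favor of a
# document from a different topic.
```

Plain top-2 similarity would return the two near-identical healthcare documents; the diversity step swaps the duplicate for a distinct topic.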

Why diversity matters

Standard semantic search returns the k most similar documents, which often results in redundant information (e.g., 5 similar paragraphs from the same source). Diversity filtering ensures results cover different perspectives, sources, or aspects of the query topic.

Diversity methods

MMR (Maximal Marginal Relevance)

How it works: Balances query relevance with inter-document diversity using a lambda parameter.

Formula: MMR(d) = λ × sim(d, query) - (1-λ) × max_sim(d, selected)

Configuration:
  • max_documents - Maximum documents to return
  • lambda_param - Relevance-diversity trade-off (default: 0.5)

Best for: Retrieval where both relevance and diversity matter
Speed: Fast (greedy algorithm)
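The greedy loop behind MMR can be sketched directly from the formula above. This is an illustration rather than the pipeline's actual helper, and it assumes all vectors are L2-normalized so a dot product equals cosine similarity.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, max_documents=10, lambda_param=0.5):
    """Greedy MMR over L2-normalized vectors: each round picks the candidate
    maximizing lambda * sim(d, query) - (1 - lambda) * max_sim(d, selected)."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_sims = doc_vecs @ np.asarray(query_vec, dtype=float)
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < max_documents:
        if selected:
            chosen = doc_vecs[selected]
            def score(i):
                # Penalize similarity to the closest already-selected doc.
                redundancy = float(np.max(chosen @ doc_vecs[i]))
                return lambda_param * query_sims[i] - (1 - lambda_param) * redundancy
        else:
            def score(i):          # first pick: pure relevance
                return float(query_sims[i])
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into doc_vecs, in selection order
```

With lambda_param=1.0 this reduces to a plain similarity ranking; lowering it makes redundant candidates increasingly unattractive.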

Clustering-based

How it works: Groups retrieved documents into N clusters using their embeddings, then samples M documents from each cluster.

Configuration:
  • num_clusters - Number of topic clusters (default: 3)
  • samples_per_cluster - Docs per cluster (default: 2)

Best for: Ensuring coverage of distinct topic areas
Speed: Moderate (K-means clustering)
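A minimal sketch of the clustering method, assuming squared-L2 distances over the document embeddings. A real implementation would use a library k-means (e.g. scikit-learn) with random or k-means++ initialization; this version initializes from the first points only to keep the sketch deterministic.

```python
import numpy as np

def clustering_diversify(documents, embeddings, num_clusters=3,
                         samples_per_cluster=2, n_iter=10):
    """Toy clustering-based selection: plain k-means (Lloyd's algorithm)
    over the embeddings, then up to samples_per_cluster docs per cluster."""
    X = np.asarray(embeddings, dtype=float)
    centers = X[:num_clusters].copy()   # deterministic init for the sketch

    def nearest(centers):
        # Index of the closest center for every embedding.
        return np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)

    for _ in range(n_iter):
        labels = nearest(centers)
        for c in range(num_clusters):
            if np.any(labels == c):     # keep old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    labels = nearest(centers)

    picked = []
    for c in range(num_clusters):       # sample each cluster in document order
        members = np.flatnonzero(labels == c)
        picked.extend(members[:samples_per_cluster].tolist())
    return [documents[i] for i in sorted(picked)]
```

With three well-separated topic groups and samples_per_cluster=1, the selection returns one document per topic regardless of which topic dominates the raw similarity ranking.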

Key features

  • Two diversity methods: MMR and clustering-based selection
  • Over-fetching with configurable multiplier (default 3x)
  • Re-embedding ensures consistent similarity calculations
  • Works with all vector databases
  • Optional RAG integration

Implementation

from vectordb.langchain.diversity_filtering import PineconeDiversityFilteringSearchPipeline

pipeline = PineconeDiversityFilteringSearchPipeline("config.yaml")
results = pipeline.search(
    query="machine learning applications",
    top_k=5,
)

for doc in results["documents"]:
    print(f"Diverse result: {doc.page_content[:100]}...")

Configuration

Required settings

pinecone.api_key (string, required) - Vector database API key for authentication
pinecone.index_name (string, required) - Target index name for search

Diversity configuration

diversity.method (string, default: "mmr") - Diversity method: "mmr" or "clustering"
diversity.candidate_multiplier (integer, default: 3) - Over-fetch multiplier (retrieves top_k × multiplier candidates)

MMR-specific

diversity.max_documents (integer, default: 10) - Maximum documents to return for the MMR method
diversity.lambda_param (float, default: 0.5) - Relevance-diversity trade-off (0.0-1.0):
  • 1.0 = pure relevance
  • 0.5 = balanced
  • 0.0 = pure diversity
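To make the trade-off concrete, consider a hypothetical candidate with sim(d, query) = 0.9 that is highly redundant with an already-selected document (max_sim = 0.8); the numbers are illustrative only.

```python
def mmr_score(sim_query, max_sim_selected, lambda_param):
    # MMR(d) = lambda * sim(d, query) - (1 - lambda) * max_sim(d, selected)
    return lambda_param * sim_query - (1 - lambda_param) * max_sim_selected

for lam in (1.0, 0.5, 0.0):
    print(f"lambda={lam}: score={mmr_score(0.9, 0.8, lam):+.2f}")
# lambda=1.0 scores on relevance alone (+0.90); lambda=0.0 only penalizes
# redundancy (-0.80); lambda=0.5 nets out to roughly +0.05.
```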

Clustering-specific

diversity.num_clusters (integer, default: 3) - Number of clusters for the clustering method
diversity.samples_per_cluster (integer, default: 2) - Documents to sample from each cluster

Example configurations

pinecone:
  api_key: "${PINECONE_API_KEY}"
  index_name: "diversity-search"
  namespace: "production"

embedder:
  model_name: "all-MiniLM-L6-v2"

diversity:
  method: "mmr"
  lambda_param: 0.5
  max_documents: 10
  candidate_multiplier: 3

rag:
  enabled: true
  generator_model: "gpt-4o-mini"

Search parameters

query (string, required) - Search query text to embed and match against documents
top_k (integer, default: 10) - Number of diverse documents to return. The pipeline retrieves candidate_multiplier × top_k candidates (3× by default) for diversity selection.
filters (dict, optional) - Metadata filters to apply during retrieval

Use cases

Diverse perspectives

When users need to see different perspectives:
results = pipeline.search(
    query="climate change impacts",
    top_k=10,
)
# Returns diverse perspectives: economic, environmental, social, etc.

Multi-document summarization

Provide diverse context to LLMs:
pipeline = PineconeDiversityFilteringSearchPipeline({
    "pinecone": {"api_key": "...", "index_name": "..."},
    "diversity": {"method": "mmr", "lambda_param": 0.4},
    "rag": {"enabled": True},
})

results = pipeline.search(
    query="summarize AI safety research",
    top_k=5,
)
print(results["answer"])  # Summary based on diverse sources

News aggregation

Show articles from different sources:
results = pipeline.search(
    query="latest quantum computing breakthroughs",
    top_k=8,
    filters={"content_type": "news"},
)
# Returns articles from diverse sources, avoiding duplicate coverage

Research literature review

Cover different research approaches:
pipeline = QdrantDiversityFilteringSearchPipeline({
    "diversity": {
        "method": "clustering",
        "num_clusters": 5,
        "samples_per_cluster": 2,
    },
})

results = pipeline.search(
    query="neural architecture search methods",
    top_k=10,
)
# Returns papers covering different NAS approaches

Method comparison

Aspect            MMR                             Clustering
Query awareness   Yes (uses query similarity)     No (inter-document similarity only)
Speed             Fast (greedy)                   Moderate (K-means)
Parameters        lambda_param, max_documents     num_clusters, samples_per_cluster
Best for          Relevance + diversity balance   Topic coverage
Deterministic     Yes                             No (K-means random init)
Interpretability  High (clear trade-off)          Medium (cluster interpretation)

Choosing a method

  1. Default: use MMR - MMR is query-aware and provides explicit relevance-diversity control. Start with lambda_param=0.5.
  2. Topic coverage: use clustering - When you need guaranteed coverage of N distinct topics, use clustering with num_clusters=N.
  3. Tune parameters - MMR: adjust lambda_param (higher = more relevance, lower = more diversity). Clustering: adjust num_clusters and samples_per_cluster.
  4. Evaluate - Measure diversity with metrics like average pairwise similarity or topic coverage.

Diversity helpers

The diversity filtering pipeline uses helper methods that can be used independently:
from vectordb.langchain.diversity_filtering.helpers import DiversityFilteringHelper

# MMR diversification
diverse_docs = DiversityFilteringHelper.mmr_diversify(
    documents=retrieved_docs,
    embeddings=doc_embeddings,
    query_embedding=query_embedding,
    max_documents=10,
    lambda_param=0.5,
)

# Clustering diversification
diverse_docs = DiversityFilteringHelper.clustering_diversify(
    documents=retrieved_docs,
    embeddings=doc_embeddings,
    num_clusters=3,
    samples_per_cluster=2,
)

Over-fetching strategy

Over-fetching gives the diversity algorithm more candidates to choose from. A 3x multiplier is a good default: it provides enough candidates without adding excessive retrieval latency.
Example:
top_k = 10
candidate_multiplier = 3
retrieved_count = 30  # 10 × 3

# Retrieve 30 candidates, select 10 diverse ones
Trade-offs:
  • Higher multiplier - More diversity options, higher latency
  • Lower multiplier - Faster, but limited diversity options

Performance considerations

Time complexity

  • MMR: O(k × n) where k=top_k, n=candidates
  • Clustering: O(n × d × iterations) for K-means

Optimization tips

  1. Cache embeddings - Store document embeddings to avoid recomputation
  2. Limit over-fetch - Balance diversity quality with latency (3-5x multiplier)
  3. Use metadata filters - Reduce candidate pool before diversity filtering
  4. Batch processing - Process multiple queries together for efficiency

Evaluation metrics

Measure diversity effectiveness:

Average pairwise similarity

import numpy as np
from vectordb.langchain.diversity_filtering.helpers import DiversityFilteringHelper

# Calculate average cosine similarity between all pairs
similarities = []
for i in range(len(embeddings)):
    for j in range(i+1, len(embeddings)):
        sim = DiversityFilteringHelper.cosine_similarity(
            embeddings[i], embeddings[j]
        )
        similarities.append(sim)

avg_similarity = np.mean(similarities)
print(f"Average pairwise similarity: {avg_similarity:.3f}")
# Lower is more diverse

Topic coverage

Count distinct topics/sources in results:
topics = set(doc.metadata.get("topic") for doc in results["documents"])
print(f"Topic coverage: {len(topics)} distinct topics")

Related

  • MMR - Maximal marginal relevance algorithm details
  • Semantic search - Initial retrieval before diversity filtering
  • Hybrid search - Dense + sparse retrieval
  • Reranking - Cross-encoder second-stage scoring
