Contextual compression reduces the token count of retrieved documents by filtering or extracting only the most relevant content. This addresses LLM context window limitations and reduces generation costs.

The problem with raw retrieval

Standard RAG retrieves top-k documents and passes them to the LLM. This creates issues:
  • Token limits: Long documents may exceed LLM context windows
  • Irrelevant content: Retrieved documents often contain off-topic sections
  • Cost: More tokens mean higher API costs for generation
  • Quality: Irrelevant content can distract the LLM from the answer
Contextual compression solves this by over-fetching documents, then compressing them to only relevant passages.

Compression strategies

Reranking-based compression

Uses cross-encoder models to score document relevance, then filters to the top-k most relevant.
from vectordb.haystack.contextual_compression.search import (
    PineconeContextualCompressionSearchPipeline
)

pipeline = PineconeContextualCompressionSearchPipeline(
    "configs/pinecone_compression.yaml"
)

results = pipeline.search(
    query="What is quantum entanglement?",
    top_k=5  # Returns 5 documents after compression
)

for doc in results["documents"]:
    print(doc.content[:200])
How it works:
  1. Retrieve top_k * 2 documents (over-fetch)
  2. Score each document with cross-encoder
  3. Return only top-k highest-scoring documents
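The same score-and-keep-top-k idea can be sketched directly with a cross-encoder from sentence-transformers. The rerank_filter helper and the model name below are illustrative assumptions, not the pipeline's internal code:
from sentence_transformers import CrossEncoder

def rerank_filter(query, documents, top_k=5):
    # Score each (query, document) pair jointly with a cross-encoder
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc.content) for doc in documents])
    # Keep only the top_k highest-scoring documents
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]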

LLM-based extraction

Uses an LLM to extract only relevant passages from retrieved documents.
from langchain_groq import ChatGroq
from vectordb.langchain.components import ContextCompressor
from langchain_core.documents import Document

llm = ChatGroq(model="llama-3.3-70b-versatile")
compressor = ContextCompressor(mode="llm_extraction", llm=llm)

documents = [
    Document(page_content="Long document with relevant and irrelevant sections..."),
    Document(page_content="Another document with mixed content...")
]

compressed = compressor.compress(
    query="What is photosynthesis?",
    documents=documents
)

# Returns a single document with extracted passages
print(compressed[0].page_content)
How it works:
  1. Retrieve top_k documents
  2. LLM reads all documents and extracts relevant passages
  3. Returns extracted content (higher compression ratio)
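For reference, a minimal extraction step can be written by hand. The prompt and the extract_passages helper below are illustrative, not ContextCompressor's actual implementation:
from langchain_core.documents import Document
from langchain_groq import ChatGroq

def extract_passages(llm, query, documents):
    corpus = "\n\n".join(doc.page_content for doc in documents)
    prompt = (
        "Extract only the passages relevant to the question below. "
        "Return them verbatim and nothing else.\n\n"
        f"Question: {query}\n\nDocuments:\n{corpus}"
    )
    response = llm.invoke(prompt)  # chat models return a message with .content
    return [Document(page_content=response.content)]

llm = ChatGroq(model="llama-3.3-70b-versatile")
compressed = extract_passages(llm, "What is photosynthesis?", documents)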

Configuration

compression:
  type: reranking
  reranker:
    type: cohere
    api_key: ${COHERE_API_KEY}
    model: rerank-english-v3.0
    top_k: 5

pinecone:
  api_key: ${PINECONE_API_KEY}
  index_name: documents
  namespace: default

embedding:
  provider: sentence_transformers
  model: all-MiniLM-L6-v2
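
The ${...} placeholders suggest values are pulled from environment variables. If you load the config yourself, one way to expand them (the pipelines may handle this internally) is:
import os
import yaml

# Expand ${COHERE_API_KEY}-style placeholders before parsing the YAML
with open("configs/pinecone_compression.yaml") as f:
    config = yaml.safe_load(os.path.expandvars(f.read()))

print(config["compression"]["reranker"]["model"])  # rerank-english-v3.0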

Available rerankers

VectorDB supports multiple reranking backends. The Cohere reranker, for example:
reranker:
  type: cohere
  api_key: ${COHERE_API_KEY}
  model: rerank-english-v3.0
  top_k: 5
Best for: Production use, high quality
Cost: API-based, per-request pricing
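
This backend maps onto Cohere's rerank endpoint. The direct SDK call looks roughly like the following; retrieved_docs is a placeholder for documents from your retriever:
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])
response = co.rerank(
    model="rerank-english-v3.0",
    query="What is quantum entanglement?",
    documents=[doc.content for doc in retrieved_docs],  # placeholder list
    top_n=5,  # corresponds to the config's top_k
)
for result in response.results:
    print(result.index, result.relevance_score)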

Compression metrics

The Haystack implementation tracks compression effectiveness:
from vectordb.haystack.contextual_compression.compression_utils import (
    TokenCounter
)

# Estimate tokens before and after compression
# (original_docs: the retrieved documents; compressed_docs: the compressor's output)
original_tokens = sum(
    TokenCounter.estimate_tokens(doc.content) for doc in original_docs
)
compressed_tokens = sum(
    TokenCounter.estimate_tokens(doc.content) for doc in compressed_docs
)

compression_ratio = compressed_tokens / original_tokens
tokens_saved = original_tokens - compressed_tokens

print(f"Compression ratio: {compression_ratio:.2%}")
print(f"Tokens saved: {tokens_saved}")
Example output:
Compression ratio: 30.5%
Tokens saved: 2,847
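
TokenCounter provides estimates. For an exact cross-check you can count with a tokenizer such as tiktoken, which is not required by the library:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
original_tokens = sum(len(enc.encode(doc.content)) for doc in original_docs)
compressed_tokens = sum(len(enc.encode(doc.content)) for doc in compressed_docs)
print(f"Compression ratio: {compressed_tokens / original_tokens:.2%}")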

Implementation example

Here’s how Pinecone compression works under the hood (LangChain):
from langchain_groq import ChatGroq

from vectordb.databases.pinecone import PineconeVectorDB
from vectordb.langchain.components import ContextCompressor
# RerankerHelper is assumed to live alongside the other helpers
from vectordb.langchain.utils import EmbedderHelper, RAGHelper, RerankerHelper

class PineconeContextualCompressionSearchPipeline:
    def __init__(self, config):
        self.embedder = EmbedderHelper.create_embedder(config)
        self.namespace = config["pinecone"].get("namespace", "default")
        self.db = PineconeVectorDB(
            api_key=config["pinecone"]["api_key"],
            index_name=config["pinecone"]["index_name"]
        )
        
        # Initialize compressor based on mode
        if config["compression"]["mode"] == "reranking":
            reranker = RerankerHelper.create_reranker(config)
            self.compressor = ContextCompressor(
                mode="reranking",
                reranker=reranker
            )
        else:
            llm = ChatGroq(model=config["compression"]["llm"]["model"])
            self.compressor = ContextCompressor(
                mode="llm_extraction",
                llm=llm
            )
    
    def search(self, query, top_k=10):
        # Step 1: Over-fetch documents
        query_embedding = EmbedderHelper.embed_query(self.embedder, query)
        retrieved = self.db.query(
            query_embedding=query_embedding,
            top_k=top_k * 2,  # Over-fetch
            namespace=self.namespace
        )
        
        # Step 2: Compress using reranker or LLM
        compressed = self.compressor.compress(
            query=query,
            documents=retrieved,
            top_k=top_k
        )
        
        return {"documents": compressed, "query": query}

Reranking algorithms

The compression utilities module documents different reranking approaches:
# Cross-encoder reranking (from compression_utils.py)
# Architecture: Joint encoding of query+document pairs
# How it works:
#   1. Concatenate query and document with [SEP] token
#   2. Pass through transformer encoder (BERT-like)
#   3. Output layer predicts relevance score (0-1)
# Benefits: Captures query-document interactions directly
# Trade-offs: Slower than bi-encoders (O(n) forward passes)

# Cohere API reranking
# Architecture: Cloud-based neural reranking service
# How it works:
#   1. Send query + batch of documents to Cohere API
#   2. Cohere's model computes relevance scores server-side
#   3. Returns ranked list with relevance scores (0-1)
# Benefits: No local GPU needed; constantly updated models
# Trade-offs: API latency, rate limits, cost per request
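
To make the cross-encoder description concrete, here is the joint encoding done with raw transformers. The model name is an example; this is not code from compression_utils.py:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Query and document are encoded together: [CLS] query [SEP] document [SEP]
inputs = tokenizer(
    "What is quantum entanglement?",
    "Entanglement links the measured states of two particles...",
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():
    logit = model(**inputs).logits.squeeze()
print(torch.sigmoid(logit).item())  # relevance score in (0, 1)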

Cost comparison

Cohere API reranking
  • Pros: Fast (~100ms), high quality, no local GPU
  • Cons: $2 per 1000 queries (1000 docs each)
  • Best for: Production with moderate query volume

Local cross-encoder reranking
  • Pros: Zero API cost, data stays local
  • Cons: Requires GPU, slower on CPU
  • Best for: High query volume or privacy requirements

LLM extraction
  • Pros: Highest compression ratio (50-80%)
  • Cons: Adds LLM latency (~500ms), costs per query
  • Best for: Very long documents where token savings justify the cost

When to use compression

Use reranking when

  • Documents are moderately long (500-2000 tokens)
  • You need fast compression (under 100ms)
  • Quality matters more than compression ratio
  • You want to preserve full document text

Use LLM extraction when

  • Documents are very long (>2000 tokens)
  • You need maximum compression (50-80% reduction)
  • Latency is acceptable (~500ms)
  • Extracted passages are sufficient for answers

Skip compression when

  • Documents are already short (under 500 tokens)
  • You have sufficient context window
  • Generation cost is not a concern
  • You need complete document text for citations

Combine both when

  • First: Rerank to filter irrelevant docs (fast)
  • Second: LLM extract passages from top docs (quality)
  • Result: Best of both - high quality, maximum compression
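As a sketch (assuming the reranker and llm objects from the earlier examples are in scope), the two stages chain naturally through ContextCompressor:
# Stage 1 filters with the reranker; stage 2 extracts passages from the survivors
reranking = ContextCompressor(mode="reranking", reranker=reranker)
extraction = ContextCompressor(mode="llm_extraction", llm=llm)

def compress_two_stage(query, documents, top_k=5):
    shortlist = reranking.compress(query=query, documents=documents, top_k=top_k)
    return extraction.compress(query=query, documents=shortlist)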
