The problem with raw retrieval
Standard RAG retrieves top-k documents and passes them to the LLM. This creates issues:
- Token limits: Long documents may exceed LLM context windows
- Irrelevant content: Retrieved documents often contain off-topic sections
- Cost: More tokens mean higher API costs for generation
- Quality: Irrelevant content can distract the LLM from the answer
Compression strategies
Reranking-based compression
Uses cross-encoder models to score document relevance, then filters to the top-k most relevant.
- Retrieve `top_k * 2` documents (over-fetch)
- Score each document with the cross-encoder
- Return only the top-k highest-scoring documents
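The three steps above can be sketched as follows; `retrieve` and `score` are stand-ins for any vector retriever and cross-encoder scorer, not a specific API:

```python
def compress_by_reranking(query, retrieve, score, top_k):
    """Over-fetch, rescore with a cross-encoder, keep the best top_k."""
    candidates = retrieve(query, top_k * 2)                    # step 1: over-fetch
    scored = [(score(query, doc), doc) for doc in candidates]  # step 2: rescore
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]                  # step 3: keep top-k


# Toy usage: a static corpus and a word-overlap "cross-encoder".
corpus = ["cats purr", "dogs bark", "cats nap in the sun", "stocks fell"]
retrieve = lambda q, k: corpus[:k]
score = lambda q, d: len(set(q.split()) & set(d.split()))
print(compress_by_reranking("cats", retrieve, score, top_k=2))
# → ['cats purr', 'cats nap in the sun']
```

The over-fetch factor of 2 is a common default; a larger multiple trades retrieval cost for a better chance that relevant documents survive the first stage.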
LLM-based extraction
Uses an LLM to extract only relevant passages from retrieved documents.
- Retrieve `top_k` documents
- LLM reads all documents and extracts relevant passages
- Returns extracted content (higher compression ratio)
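A minimal sketch of that extraction loop; the prompt wording, the `NO_OUTPUT` sentinel, and the `llm` callable are illustrative (any function that maps prompt text to a completion would do):

```python
EXTRACT_PROMPT = (
    "Extract verbatim only the passages from the document that are relevant "
    "to the question. Reply NO_OUTPUT if nothing is relevant.\n\n"
    "Question: {question}\n\nDocument:\n{document}"
)

def extract_relevant_passages(question, docs, llm):
    """Ask the LLM to keep only on-topic passages from each retrieved doc."""
    passages = []
    for doc in docs:
        reply = llm(EXTRACT_PROMPT.format(question=question, document=doc))
        if reply.strip() != "NO_OUTPUT":   # drop docs with nothing relevant
            passages.append(reply.strip())
    return passages
```

One LLM call per document is the simple form; batching several documents into one prompt lowers latency at the cost of a harder extraction task for the model.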
Configuration
Available rerankers
VectorDB supports multiple reranking backends:
- Cohere
- Cross-encoder
- Voyage AI
- BGE
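How a backend is selected depends on VectorDB's own configuration API; purely as an illustration, a string-keyed registry is one common shape. The registry name, function, and default model ids below are assumptions, not VectorDB's actual interface:

```python
# Hypothetical registry mapping backend names to typical default model ids.
RERANKER_MODELS = {
    "cohere": "rerank-english-v3.0",
    "cross-encoder": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "voyage": "rerank-2",
    "bge": "BAAI/bge-reranker-base",
}

def resolve_reranker(backend):
    """Map a config string to its default model id, failing loudly on typos."""
    if backend not in RERANKER_MODELS:
        raise ValueError(f"unknown reranker backend: {backend!r}")
    return RERANKER_MODELS[backend]
```

Failing on an unknown name at configuration time is usually preferable to silently falling back to a default reranker.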
Compression metrics
The Haystack implementation tracks how effectively each strategy compresses retrieved context.
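A plausible sketch of that bookkeeping: token counts before and after compression plus the reduction ratio. Whitespace tokenization stands in for a real tokenizer, and the actual Haystack metrics may differ:

```python
def compression_stats(original_docs, compressed_docs):
    """Token counts before/after compression, plus the reduction ratio."""
    count = lambda docs: sum(len(d.split()) for d in docs)
    before, after = count(original_docs), count(compressed_docs)
    ratio = 1 - after / before if before else 0.0
    return {"tokens_before": before, "tokens_after": after, "reduction": ratio}

print(compression_stats(["one two three four"], ["one two"]))
# → {'tokens_before': 4, 'tokens_after': 2, 'reduction': 0.5}
```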
Implementation example
Here’s how Pinecone compression works under the hood in LangChain.
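In spirit, a compression retriever chains a Pinecone-style base retriever with a document compressor. This stripped-down sketch mirrors the idea behind LangChain's ContextualCompressionRetriever but is not its actual source; both constructor arguments are plain callables here:

```python
class CompressionRetriever:
    """Retrieve from the base store, then compress before returning."""

    def __init__(self, base_retriever, compressor):
        self.base_retriever = base_retriever  # e.g. a Pinecone-backed retriever
        self.compressor = compressor          # e.g. a reranker or LLM extractor

    def get_relevant_documents(self, query):
        docs = self.base_retriever(query)   # stage 1: vector search
        return self.compressor(query, docs) # stage 2: compression


# Toy usage with stand-in callables.
retriever = CompressionRetriever(
    base_retriever=lambda q: ["relevant doc", "noise"],
    compressor=lambda q, docs: [d for d in docs if "relevant" in d],
)
print(retriever.get_relevant_documents("query"))  # → ['relevant doc']
```

Because the compressor is just a callable of `(query, docs)`, swapping reranking for LLM extraction changes nothing else in the retrieval path.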
Reranking algorithms
The compression utilities module documents the different reranking approaches, which are compared below.
Cost comparison
Reranking (Cohere)
- Pros: Fast (~100ms), high quality, no local GPU
- Cons: $2 per 1000 queries (1000 docs each)
- Best for: Production with moderate query volume
Reranking (local cross-encoder)
- Pros: Zero API cost, data stays local
- Cons: Requires GPU, slower on CPU
- Best for: High query volume or privacy requirements
LLM extraction
- Pros: Highest compression ratio (50-80%)
- Cons: Adds LLM latency (~500ms), costs per query
- Best for: Very long documents where token savings justify cost
When to use compression
Use reranking when
- Documents are moderately long (500-2000 tokens)
- You need fast compression (under 100ms)
- Quality matters more than compression ratio
- You want to preserve full document text
Use LLM extraction when
- Documents are very long (>2000 tokens)
- You need maximum compression (50-80% reduction)
- Latency is acceptable (~500ms)
- Extracted passages are sufficient for answers
Skip compression when
- Documents are already short (under 500 tokens)
- You have sufficient context window
- Generation cost is not a concern
- You need complete document text for citations
Combine both when
- First: Rerank to filter irrelevant docs (fast)
- Second: LLM extract passages from top docs (quality)
- Result: Best of both - high quality, maximum compression
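The two-stage combination above can be sketched by composing the stages; `score` and `llm` are stand-ins for a cross-encoder and a chat model:

```python
def rerank_then_extract(query, docs, score, llm, top_k):
    """Stage 1: cheap rerank filter. Stage 2: LLM extraction on survivors."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_k]
    prompt = "Extract passages relevant to: {q}\n\n{d}"
    return [llm(prompt.format(q=query, d=d)) for d in ranked]


# Toy usage: an overlap scorer and an "LLM" that echoes the document back.
out = rerank_then_extract(
    "cats",
    ["cats purr softly", "markets fell"],
    score=lambda q, d: len(set(q.split()) & set(d.split())),
    llm=lambda p: p.split("\n\n", 1)[1],
    top_k=1,
)
print(out)  # → ['cats purr softly']
```

The expensive LLM calls run only on the `top_k` survivors of the cheap reranking pass, which is where the cost saving of the combined approach comes from.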
See also
- Query enhancement - Improve retrieval recall
- Reranking - Two-stage retrieval details
- Cost optimization - Budget-aware retrieval strategies