Traditional RAG vs. agentic RAG
Traditional RAG
Fixed pipeline: one retrieval pass followed by one generation step.
Limitations:
- Single retrieval pass (may miss information)
- No quality assessment
- Can’t adapt to complex queries
- No error correction
Core components
1. Query routing
The agent analyzes the query and selects the appropriate tool:
- Retrieval: Factual lookups in the vector database
- Web search: Current events or external information (placeholder)
- Calculation: Math or logic problems
- Reasoning: Complex multi-hop questions requiring synthesis
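A minimal sketch of the routing step, assuming the LLM is wrapped in a plain callable that takes a prompt and returns text. The `route_query` function, the prompt wording, and the fallback to retrieval on unexpected output are illustrative assumptions, not the project's actual implementation.

```python
from typing import Callable

# Tool names taken from the list above.
TOOLS = ("retrieval", "web_search", "calculation", "reasoning")

# Hypothetical prompt; the real routing prompt is not shown in this page.
ROUTING_PROMPT = (
    "Choose the best tool for the query. "
    "Answer with exactly one of: {tools}.\n"
    "Query: {query}\nTool:"
)

def route_query(query: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM to pick a tool; fall back to retrieval on unexpected output."""
    prompt = ROUTING_PROMPT.format(tools=", ".join(TOOLS), query=query)
    choice = llm(prompt).strip().lower()
    return choice if choice in TOOLS else "retrieval"

# Stand-in LLM for demonstration only; a real one would call a model API.
def fake_llm(prompt: str) -> str:
    return "Calculation" if "15% of 80" in prompt else "not sure"

print(route_query("What is 15% of 80?", fake_llm))
print(route_query("Who wrote the design doc?", fake_llm))
```

Normalizing the model output and validating it against a fixed tool list keeps a malformed routing response from breaking the rest of the loop.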
2. Self-reflection
After generating an answer, the agent evaluates its quality and decides whether to iterate.
3. Iterative refinement
The agent can loop up to max_iterations times, refining the answer on each pass.
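The reflect-then-refine loop can be sketched roughly as follows. The `generate`, `score`, and `refine` callables stand in for LLM calls and are hypothetical. The sketch also assumes a 0-100 score scale and that `max_iterations` counts refinement rounds after the first answer; the actual implementation may count differently.

```python
from typing import Callable, Tuple

def reflect_and_refine(
    query: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    refine: Callable[[str, str], str],
    quality_threshold: float = 75.0,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Return (answer, final score, number of refinement rounds used)."""
    answer = generate(query)
    current = score(query, answer)
    rounds = 0
    # Keep refining until the score clears the threshold or the budget runs out.
    while current < quality_threshold and rounds < max_iterations:
        answer = refine(query, answer)
        current = score(query, answer)
        rounds += 1
    return answer, current, rounds

# Stubs for demonstration: refinement appends text, score grows with word count.
answer, final_score, rounds = reflect_and_refine(
    "Why is the sky blue?",
    generate=lambda q: "Scattering.",
    score=lambda q, a: min(100.0, 10.0 * len(a.split())),
    refine=lambda q, a: a + " More detail added here.",
)
print(rounds, final_score)
```

The loop always returns the latest answer when the iteration budget is exhausted, even if the score is still below the threshold.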
Basic usage
Configuration
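As a rough illustration, the settings this page refers to (routing and reflection toggles, a quality threshold, and max_iterations) can be grouped into one config object. The class name, the field names other than `max_iterations`, the `True` defaults for the toggles, and the 0-100 score range are assumptions made for this sketch, not the project's API.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgenticConfig:
    # Hypothetical field names; defaults for the toggles are assumed.
    enable_routing: bool = True
    enable_reflection: bool = True
    quality_threshold: int = 75   # default given under "Tune quality threshold"
    max_iterations: int = 3       # default given under "Limit max iterations"

    def __post_init__(self) -> None:
        # Assumes scores are on a 0-100 scale.
        if not 0 <= self.quality_threshold <= 100:
            raise ValueError("quality_threshold must be between 0 and 100")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")

default = AgenticConfig()
# Per-query override: skip reflection for a latency-sensitive request.
fast = replace(default, enable_reflection=False, max_iterations=1)
print(fast)
```

Using an immutable config with `dataclasses.replace` makes per-query overrides explicit without mutating the shared defaults.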
Agentic workflow
The Haystack pipeline implements the agentic loop in three steps: it routes the query to a tool, runs that tool's handler, then scores and refines the answer.
Tool handlers
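One common way to organize handlers like the ones described in this section is a dispatch table keyed by tool name. The sketch below uses stub functions in place of real retrieval and LLM calls. All function names are hypothetical and do not reflect the project's actual Haystack components, and sending unknown tools (including the web search placeholder) to retrieval is an assumption.

```python
from typing import Callable, Dict

def handle_retrieval(query: str) -> str:
    # Stub: a real handler would run vector search, then RAG generation.
    return f"[retrieval] answer for: {query}"

def handle_calculation(query: str) -> str:
    # Stub: a real handler would send the query straight to the LLM.
    return f"[calculation] answer for: {query}"

def handle_reasoning(query: str) -> str:
    # Stub: a real handler would do several retrieval hops, then synthesize.
    return f"[reasoning] answer for: {query}"

HANDLERS: Dict[str, Callable[[str], str]] = {
    "retrieval": handle_retrieval,
    "calculation": handle_calculation,
    "reasoning": handle_reasoning,
}

def dispatch(tool: str, query: str) -> str:
    """Run the handler for `tool`, defaulting to retrieval for unknown tools."""
    handler = HANDLERS.get(tool, handle_retrieval)
    return handler(query)

print(dispatch("calculation", "What is 12 * 7?"))
```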
Retrieval tool
Standard vector search with RAG generation.
Calculation tool
Direct LLM reasoning for math/logic.
Reasoning tool
Multi-hop retrieval with structured reasoning.
Self-reflection implementation
The AgenticRouter scores answers and refines them iteratively.
Runtime control
You can enable or disable routing and reflection at query time.
Cost and latency
Routing overhead
- LLM calls: 1 per query for tool selection
- Latency: ~200ms
- Cost: ~$0.0001 per query (Groq)
Self-reflection overhead
- LLM calls: 1-3 for scoring + refinement per iteration
- Latency: ~500ms per iteration
- Cost: ~$0.0003 per iteration
- Trade-off: Higher quality answers justify cost for complex queries
Total cost example
Traditional RAG:
- 1 retrieval + 1 generation = ~$0.0002
Agentic RAG:
- 1 routing + 1 retrieval + 1 generation + 2 reflection iterations = ~$0.0008
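The arithmetic behind these totals can be reproduced from the per-component estimates quoted in this section. The helper name below is hypothetical, and every figure is a rough estimate. Summing the components for the agentic example gives about $0.0009, close to the ~$0.0008 quoted above.

```python
from typing import Tuple

# Rough per-call estimates quoted in this section (USD and milliseconds).
BASE_COST = 0.0002           # 1 retrieval + 1 generation
ROUTING_COST = 0.0001        # 1 routing call
REFLECTION_COST = 0.0003     # per reflection iteration
ROUTING_LATENCY_MS = 200
REFLECTION_LATENCY_MS = 500  # per reflection iteration

def estimate_overhead(use_routing: bool, reflection_iterations: int) -> Tuple[float, int]:
    """Return (total cost in USD, latency added on top of traditional RAG in ms)."""
    cost = BASE_COST
    added_latency = 0
    if use_routing:
        cost += ROUTING_COST
        added_latency += ROUTING_LATENCY_MS
    cost += reflection_iterations * REFLECTION_COST
    added_latency += reflection_iterations * REFLECTION_LATENCY_MS
    return round(cost, 6), added_latency

print(estimate_overhead(False, 0))  # traditional RAG
print(estimate_overhead(True, 2))   # routing plus 2 reflection iterations
```

With these numbers, two reflection iterations plus routing add roughly 1.2 seconds on top of the base retrieval and generation time.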
When to use agentic RAG
Use agentic RAG when
- Queries are complex and multi-hop
- Answer quality matters more than latency
- You need self-correction for errors
- Questions may require calculation or reasoning
Use traditional RAG when
- Queries are simple and factual
- Latency is critical (under 500ms)
- Cost must be minimized
- Single retrieval pass is sufficient
Enable routing when
- Mixed query types (factual, math, reasoning)
- You want to avoid unnecessary retrieval
- Tool selection improves answer quality
Enable reflection when
- High-stakes answers (legal, medical)
- Complex questions benefit from refinement
- Initial answers are often incomplete
- Quality threshold must be met
Production tips
Tune quality threshold
- Start with 75 (default)
- Lower (60-70) for faster responses with acceptable quality
- Higher (80-90) for critical applications
- Monitor score distribution to calibrate
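To calibrate the threshold against real traffic, one option is to log reflection scores and check what fraction of answers would pass at each candidate value. This is a minimal sketch; the `pass_rate` helper and the sample scores are made up for illustration.

```python
from statistics import median
from typing import List

def pass_rate(scores: List[float], threshold: float) -> float:
    """Fraction of logged answer scores that meet or exceed the threshold."""
    if not scores:
        raise ValueError("no scores logged")
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical scores, standing in for values collected from production logs.
logged = [62, 71, 78, 80, 83, 85, 88, 90, 74, 69]

for threshold in (60, 75, 90):
    print(threshold, pass_rate(logged, threshold))
print("median score:", median(logged))
```

A threshold that most first-pass answers already clear adds little refinement work, while one that few answers clear pushes most queries to the iteration limit.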
Limit max iterations
- Default: 3 iterations
- Lower (1-2) to control cost
- Higher (4-5) for very complex queries
- Stop after max iterations regardless of score
Cache routing decisions
- Similar queries often route to the same tool
- Cache query → tool mappings
- Reduces routing LLM calls by ~60%
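A simple version of this cache can be sketched as a dictionary keyed on normalized query text. Note the simplification: this only catches queries that match exactly after normalization, so catching genuinely similar queries would need something like embedding-based lookup. The `RoutingCache` class and the stub router are hypothetical.

```python
from typing import Callable, Dict, List

class RoutingCache:
    """Exact-match cache on normalized query text (not semantic similarity)."""

    def __init__(self, router: Callable[[str], str]) -> None:
        self._router = router
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key.
        return " ".join(query.lower().split())

    def route(self, query: str) -> str:
        key = self._normalize(query)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        tool = self._router(query)  # the only place the (costly) router runs
        self._cache[key] = tool
        return tool

calls: List[str] = []

def stub_router(query: str) -> str:
    calls.append(query)  # record each simulated routing LLM call
    return "calculation" if any(ch.isdigit() for ch in query) else "retrieval"

cache = RoutingCache(stub_router)
cache.route("What is 2 + 2?")
cache.route("  what is 2 + 2?  ")
print(cache.hits, cache.misses, len(calls))
```

The cache here is unbounded; a production version would likely cap its size or expire entries.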
See also
- Query enhancement - Improve retrieval within agentic loop
- Contextual compression - Reduce context before generation
- Cost optimization - Budget-aware agentic RAG strategies