Agentic RAG introduces an LLM-based decision-making loop where an agent controls the retrieval process. Instead of a fixed pipeline, the agent chooses actions based on query analysis, retrieved content quality, and answer evaluation.

Traditional RAG vs. agentic RAG

Fixed pipeline:
Query → Embed → Retrieve → Generate → Answer
Limitations:
  • Single retrieval pass (may miss information)
  • No quality assessment
  • Can’t adapt to complex queries
  • No error correction

Agentic loop:
Query → Route → Tool (retrieve / calculate / reason) → Generate → Reflect → iterate if needed → Answer

Core components

1. Query routing

The agent analyzes the query and selects the appropriate tool:
  • Retrieval: Factual lookups in vector database
  • Web search: Current events or external information (placeholder)
  • Calculation: Math or logic problems
  • Reasoning: Complex multi-hop questions requiring synthesis
```python
import os

from vectordb.haystack.components import AgenticRouter

router = AgenticRouter(
    model="llama-3.3-70b-versatile",
    api_key=os.getenv("GROQ_API_KEY")
)

# Agent decides which tool to use
tool = router.select_tool("What is 15% of 240?")
print(tool)  # "calculation"

tool = router.select_tool("What is photosynthesis?")
print(tool)  # "retrieval"
```
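Under the hood, tool selection is typically a single LLM classification call plus defensive parsing of the reply. The sketch below shows one plausible shape of that logic; the prompt wording and the helper names (`build_routing_prompt`, `parse_tool`) are illustrative assumptions, not the library's actual implementation:

```python
VALID_TOOLS = {"retrieval", "web_search", "calculation", "reasoning"}

def build_routing_prompt(query: str) -> str:
    """Single-shot classification prompt asking the LLM to name one tool."""
    return (
        "Select the single best tool for the query below.\n"
        "Tools: retrieval (factual lookup), web_search (current events), "
        "calculation (math/logic), reasoning (multi-hop synthesis).\n"
        f"Query: {query}\n"
        "Reply with the tool name only."
    )

def parse_tool(reply: str) -> str:
    """Normalize the model's reply; fall back to retrieval on anything unexpected."""
    tokens = reply.strip().lower().split()
    tool = tokens[0].strip(".,'\"") if tokens else ""
    return tool if tool in VALID_TOOLS else "retrieval"
```

Falling back to `retrieval` on unparseable replies keeps the pipeline functional even when the router LLM returns free-form text.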

2. Self-reflection

After generating an answer, the agent evaluates quality and decides whether to iterate:
```python
quality_threshold = 75  # minimum acceptable score (0-100)

# Agent scores answer quality (0-100)
score = router.score_answer(
    query="What is quantum entanglement?",
    answer="Quantum entanglement is a phenomenon where particles...",
    context="Document 1: ... Document 2: ..."
)

print(f"Quality score: {score}")

if score < quality_threshold:
    # Retrieve more documents or refine answer
    print("Score too low, iterating...")
```

3. Iterative refinement

The agent can loop up to max_iterations times, refining the answer:
```python
refined_answer = router.self_reflect_loop(
    query=query,
    answer=initial_answer,
    context=retrieved_docs,
    max_iterations=3,
    quality_threshold=75
)
```

Basic usage

```python
from vectordb.haystack.agentic_rag import PineconeAgenticRAGPipeline

pipeline = PineconeAgenticRAGPipeline("configs/pinecone_agentic.yaml")

# Load and index dataset
pipeline.load_dataset(dataset_type="triviaqa", limit=1000)
pipeline.index_documents()

# Run agentic RAG
result = pipeline.run(
    query="Explain how neural networks learn from data",
    top_k=10,
    enable_routing=True,
    enable_self_reflection=True
)

print(f"Tool used: {result['tool']}")
print(f"Answer: {result['answer']}")
print(f"Refined: {result.get('refined', False)}")
```

Configuration

```yaml
pinecone:
  api_key: ${PINECONE_API_KEY}
  index_name: agentic-rag
  namespace: default

embeddings:
  model: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 32

agentic_rag:
  model: llama-3.3-70b-versatile
  api_key: ${GROQ_API_KEY}
  routing_enabled: true
  self_reflection_enabled: true
  max_iterations: 3
  quality_threshold: 75
  reflection_context_top_k: 3

generator:
  model: llama-3.3-70b-versatile
  api_key: ${GROQ_API_KEY}
  api_base_url: https://api.groq.com/openai/v1
  max_tokens: 2048

retrieval:
  top_k_default: 10
  context_top_k: 5

dataloader:
  type: triviaqa
  split: test
  limit: 1000
```
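The `${VAR}` placeholders in the config are expanded from environment variables when the YAML is loaded. A minimal sketch of that expansion step (the exact mechanism the pipeline uses is an assumption; `expand_env` is an illustrative helper):

```python
import os
import re

_VAR = re.compile(r"\$\{(\w+)\}")

def expand_env(text: str) -> str:
    """Replace ${NAME} placeholders with environment variable values."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), ""), text)

# Example: expand the raw YAML text before handing it to the parser
os.environ["PINECONE_API_KEY"] = "pc-test-key"
raw = "pinecone:\n  api_key: ${PINECONE_API_KEY}\n  index_name: agentic-rag\n"
print(expand_env(raw))
```

Expanding before parsing keeps secrets out of the YAML file itself; unset variables resolve to empty strings here, so validate required keys after loading.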

Agentic workflow

Here’s how the Haystack pipeline implements the agentic loop:
```python
class BaseAgenticRAGPipeline:
    def run(self, query, top_k=10, enable_routing=None, enable_self_reflection=None):
        # Resolve runtime flags against configured defaults
        routing = enable_routing if enable_routing is not None else self._get_routing_enabled()
        reflection = (
            enable_self_reflection
            if enable_self_reflection is not None
            else self._get_reflection_enabled()
        )

        # Step 1: Route query to appropriate tool
        tool = self.router.select_tool(query) if routing else "retrieval"

        # Step 2: Execute selected tool
        if tool == "retrieval":
            result = self._handle_retrieval(query, top_k)
        elif tool == "calculation":
            result = self._handle_calculation(query)
        elif tool == "reasoning":
            result = self._handle_reasoning(query, top_k)
        else:
            result = self._handle_retrieval(query, top_k)  # Fallback

        # Step 3: Self-reflection and iterative refinement
        if reflection and result.get("answer"):
            context = "\n".join([
                doc.content for doc in result["documents"][:3]
            ])

            refined_answer = self.router.self_reflect_loop(
                query=query,
                answer=result["answer"],
                context=context,
                max_iterations=self._get_max_iterations(),
                quality_threshold=self._get_quality_threshold()
            )

            result["answer"] = refined_answer
            result["refined"] = True

        return result
```

Tool handlers

Retrieval tool

Standard vector search with RAG generation:
```python
def _handle_retrieval(self, query: str, top_k: int):
    # Retrieve relevant documents from vector database
    documents = self._retrieve(query, top_k)

    # Generate answer using LLM with retrieved context
    answer = self._generate_answer(query, documents)

    return {"documents": documents, "answer": answer, "tool": "retrieval"}
```
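The `_generate_answer` helper is not shown above; it typically concatenates the top retrieved documents into a grounded prompt before calling the generator. A sketch of that prompt-building step (`build_rag_prompt` and its wording are assumptions, not the library's code):

```python
def build_rag_prompt(query: str, contexts: list, context_top_k: int = 5) -> str:
    """Concatenate the top retrieved document texts into a grounded prompt."""
    context = "\n\n".join(contexts[:context_top_k])
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Capping the context at `context_top_k` documents (5 in the config above) keeps the prompt within the generator's token budget.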

Calculation tool

Direct LLM reasoning for math/logic:
```python
def _handle_calculation(self, query: str):
    prompt = f"""Solve this problem step by step:

{query}

Provide the calculation steps and final answer."""

    result = self.generator.run(prompt=prompt)
    answer = result.get("replies", [""])[0]

    return {"documents": [], "answer": answer, "tool": "calculation"}
```

Reasoning tool

Multi-hop retrieval with structured reasoning:
```python
def _handle_reasoning(self, query: str, top_k: int):
    # Retrieve context for reasoning
    documents = self._retrieve(query, top_k)
    context = "\n\n".join([doc.content for doc in documents[:5]])

    # Structured reasoning prompt
    prompt = f"""You are a helpful assistant that answers questions using step-by-step reasoning.

Context:
{context}

Question: {query}

Think through this step-by-step:
1. First, identify what information is relevant
2. Then, analyze the key points
3. Finally, synthesize into a comprehensive answer

Answer:"""

    result = self.generator.run(prompt=prompt)
    answer = result.get("replies", [""])[0]

    return {"documents": documents, "answer": answer, "tool": "reasoning"}
```

Self-reflection implementation

The AgenticRouter scores answers and refines them iteratively:
```python
class AgenticRouter:
    def self_reflect_loop(
        self,
        query: str,
        answer: str,
        context: str,
        max_iterations: int = 3,
        quality_threshold: int = 75
    ) -> str:
        current_answer = answer

        for iteration in range(max_iterations):
            # Score current answer
            score = self.score_answer(query, current_answer, context)

            if score >= quality_threshold:
                return current_answer  # Good enough

            # Refine answer
            current_answer = self._refine_answer(
                query, current_answer, context, score
            )

        return current_answer  # Return best attempt

    def score_answer(self, query: str, answer: str, context: str) -> int:
        prompt = f"""Rate the quality of this answer on a scale of 0-100:

Question: {query}
Context: {context}
Answer: {answer}

Consider:
- Completeness: Does it address all parts of the question?
- Accuracy: Is it supported by the context?
- Clarity: Is it well-structured?

Score (0-100):"""

        response = self.generator.run(prompt=prompt)
        # Parse score from response
        score = self._parse_score(response)
        return score
```
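The `_parse_score` helper is not shown; LLMs rarely return a bare integer, so a robust version extracts the first number in the reply and clamps it to the 0-100 scale. A minimal sketch (an assumption about the implementation, not the library's code):

```python
import re

def parse_score(reply: str, default: int = 0) -> int:
    """Pull the first integer out of the LLM reply and clamp it to [0, 100]."""
    match = re.search(r"\d+", reply)
    if match is None:
        return default  # unparseable reply: treat as lowest quality
    return max(0, min(100, int(match.group())))
```

Defaulting an unparseable reply to 0 biases the loop toward another refinement pass rather than silently accepting a bad answer.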

Runtime control

You can enable/disable routing and reflection at query time:
```python
# No routing, no reflection (traditional RAG)
result = pipeline.run(
    query="What is AI?",
    enable_routing=False,
    enable_self_reflection=False
)

# Routing only (single-pass with tool selection)
result = pipeline.run(
    query="Calculate 15% of 240",
    enable_routing=True,
    enable_self_reflection=False
)

# Full agentic (routing + reflection)
result = pipeline.run(
    query="Explain quantum computing",
    enable_routing=True,
    enable_self_reflection=True
)
```

Cost and latency

Routing:
  • LLM calls: 1 per query for tool selection
  • Latency: ~200ms
  • Cost: ~$0.0001 per query (Groq)

Self-reflection:
  • LLM calls: 1-3 for scoring + refinement per iteration
  • Latency: ~500ms per iteration
  • Cost: ~$0.0003 per iteration
  • Trade-off: higher-quality answers justify the cost for complex queries

Example comparison:
Traditional RAG:
  • 1 retrieval + 1 generation ≈ $0.0002
Agentic RAG (with reflection):
  • 1 routing + 1 retrieval + 1 generation + 2 reflection iterations ≈ $0.0009
Roughly 4-5x the cost, but significantly better quality for complex queries

When to use agentic RAG

Use agentic RAG when

  • Queries are complex and multi-hop
  • Answer quality matters more than latency
  • You need self-correction for errors
  • Questions may require calculation or reasoning

Use traditional RAG when

  • Queries are simple and factual
  • Latency is critical (under 500ms)
  • Cost must be minimized
  • Single retrieval pass is sufficient

Enable routing when

  • Mixed query types (factual, math, reasoning)
  • You want to avoid unnecessary retrieval
  • Tool selection improves answer quality

Enable reflection when

  • High-stakes answers (legal, medical)
  • Complex questions benefit from refinement
  • Initial answers are often incomplete
  • Quality threshold must be met

Production tips

1. Tune quality threshold

  • Start with 75 (default)
  • Lower (60-70) for faster responses with acceptable quality
  • Higher (80-90) for critical applications
  • Monitor score distribution to calibrate
2. Limit max iterations

  • Default: 3 iterations
  • Lower (1-2) to control cost
  • Higher (4-5) for very complex queries
  • Timeout after max iterations regardless of score
3. Cache routing decisions

  • Similar queries often route to the same tool
  • Cache query → tool mappings
  • Reduces routing LLM calls by ~60%
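The caching idea can be sketched with a normalized-key dictionary in front of the router (an illustration only; a production system might use a persistent or embedding-based semantic cache instead):

```python
routing_cache: dict = {}

def normalize(query: str) -> str:
    """Cheap normalization so trivially similar queries share a cache entry."""
    return " ".join(query.lower().split())

def cached_select_tool(query: str, select_tool) -> str:
    """Route via the cache; fall back to the (LLM-backed) select_tool on a miss."""
    key = normalize(query)
    if key not in routing_cache:
        routing_cache[key] = select_tool(query)
    return routing_cache[key]

# Usage with a stand-in router: the second call is a cache hit (no LLM call)
calls = []
fake_select = lambda q: (calls.append(q), "calculation")[1]
print(cached_select_tool("What is 15% of 240?", fake_select))   # "calculation"
print(cached_select_tool("what is  15% of 240?", fake_select))  # "calculation" (cached)
print(f"LLM calls made: {len(calls)}")                          # 1
```

Case-folding and whitespace collapsing only catch exact-duplicate phrasings; the ~60% reduction cited above would require fuzzier matching, such as caching on query embeddings.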
4. Monitor tool usage

  • Track which tools are used most
  • Optimize retrieval pipeline for high-usage tools
  • Disable unused tools to simplify routing
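A lightweight way to track tool usage is a counter keyed on the `tool` field each handler returns (a sketch, not part of the library):

```python
from collections import Counter

tool_usage = Counter()

def record_result(result: dict) -> None:
    """Tally which tool the agent selected for each answered query."""
    tool_usage[result.get("tool", "unknown")] += 1

for r in [{"tool": "retrieval"}, {"tool": "calculation"}, {"tool": "retrieval"}]:
    record_result(r)

print(tool_usage.most_common())  # [('retrieval', 2), ('calculation', 1)]
```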
