Agentic RAG introduces an LLM-based decision-making loop where an agent controls the retrieval process. Instead of a fixed pipeline, the agent chooses actions based on query analysis, retrieved content quality, and answer evaluation.

Traditional RAG vs. agentic RAG

Fixed pipeline:
Query → Embed → Retrieve → Generate → Answer
Limitations:
  • Single retrieval pass (may miss information)
  • No quality assessment
  • Can’t adapt to complex queries
  • No error correction

Agentic loop:
Query → Route → Tool (retrieve / calculate / reason) → Generate → Reflect → iterate if needed → Answer

Core components

1. Query routing

The agent analyzes the query and selects the appropriate tool:
  • Retrieval: Factual lookups in vector database
  • Web search: Current events or external information (placeholder)
  • Calculation: Math or logic problems
  • Reasoning: Complex multi-hop questions requiring synthesis
```python
import os

from vectordb.haystack.components import AgenticRouter

router = AgenticRouter(
    model="llama-3.3-70b-versatile",
    api_key=os.getenv("GROQ_API_KEY")
)

# Agent decides which tool to use
tool = router.select_tool("What is 15% of 240?")
print(tool)  # "calculation"

tool = router.select_tool("What is photosynthesis?")
print(tool)  # "retrieval"
```
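Under the hood, tool selection is typically a single LLM classification call plus defensive parsing of the reply. The sketch below shows one plausible shape of that logic; the prompt wording and the helper names (`build_routing_prompt`, `parse_tool`) are illustrative assumptions, not the library's actual implementation:

```python
VALID_TOOLS = {"retrieval", "web_search", "calculation", "reasoning"}

def build_routing_prompt(query: str) -> str:
    """Single-shot classification prompt asking the LLM to name one tool."""
    return (
        "Select the single best tool for the query below.\n"
        "Tools: retrieval (factual lookup), web_search (current events), "
        "calculation (math/logic), reasoning (multi-hop synthesis).\n"
        f"Query: {query}\n"
        "Reply with the tool name only."
    )

def parse_tool(reply: str) -> str:
    """Normalize the model's reply; fall back to retrieval on anything unexpected."""
    tokens = reply.strip().lower().split()
    tool = tokens[0].strip(".,'\"") if tokens else ""
    return tool if tool in VALID_TOOLS else "retrieval"
```

Falling back to `retrieval` on unparseable replies keeps the pipeline functional even when the router LLM returns free-form text.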

2. Self-reflection

After generating an answer, the agent evaluates quality and decides whether to iterate:
```python
quality_threshold = 75  # minimum acceptable score (0-100)

# Agent scores answer quality (0-100)
score = router.score_answer(
    query="What is quantum entanglement?",
    answer="Quantum entanglement is a phenomenon where particles...",
    context="Document 1: ... Document 2: ..."
)

print(f"Quality score: {score}")

if score < quality_threshold:
    # Retrieve more documents or refine answer
    print("Score too low, iterating...")
```

3. Iterative refinement

The agent can loop up to max_iterations times, refining the answer:
```python
refined_answer = router.self_reflect_loop(
    query=query,
    answer=initial_answer,
    context=retrieved_docs,
    max_iterations=3,
    quality_threshold=75
)
```

Basic usage

```python
from vectordb.haystack.agentic_rag import PineconeAgenticRAGPipeline

pipeline = PineconeAgenticRAGPipeline("configs/pinecone_agentic.yaml")

# Load and index dataset
pipeline.load_dataset(dataset_type="triviaqa", limit=1000)
pipeline.index_documents()

# Run agentic RAG
result = pipeline.run(
    query="Explain how neural networks learn from data",
    top_k=10,
    enable_routing=True,
    enable_self_reflection=True
)

print(f"Tool used: {result['tool']}")
print(f"Answer: {result['answer']}")
print(f"Refined: {result.get('refined', False)}")
```

Configuration

```yaml
pinecone:
  api_key: ${PINECONE_API_KEY}
  index_name: agentic-rag
  namespace: default

embeddings:
  model: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 32

agentic_rag:
  model: llama-3.3-70b-versatile
  api_key: ${GROQ_API_KEY}
  routing_enabled: true
  self_reflection_enabled: true
  max_iterations: 3
  quality_threshold: 75
  reflection_context_top_k: 3

generator:
  model: llama-3.3-70b-versatile
  api_key: ${GROQ_API_KEY}
  api_base_url: https://api.groq.com/openai/v1
  max_tokens: 2048

retrieval:
  top_k_default: 10
  context_top_k: 5

dataloader:
  type: triviaqa
  split: test
  limit: 1000
```
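The `${VAR}` placeholders in the config are expanded from environment variables when the YAML is loaded. A minimal sketch of that expansion step (the exact mechanism the pipeline uses is an assumption; `expand_env` is an illustrative helper):

```python
import os
import re

_VAR = re.compile(r"\$\{(\w+)\}")

def expand_env(text: str) -> str:
    """Replace ${NAME} placeholders with environment variable values."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), ""), text)

# Example: expand the raw YAML text before handing it to the parser
os.environ["PINECONE_API_KEY"] = "pc-test-key"
raw = "pinecone:\n  api_key: ${PINECONE_API_KEY}\n  index_name: agentic-rag\n"
print(expand_env(raw))
```

Expanding before parsing keeps secrets out of the YAML file itself; unset variables resolve to empty strings here, so validate required keys after loading.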

Agentic workflow

Here’s how the Haystack pipeline implements the agentic loop:
```python
class BaseAgenticRAGPipeline:
    def run(self, query, top_k=10, enable_routing=None, enable_self_reflection=None):
        # Resolve runtime flags against configured defaults
        routing = enable_routing if enable_routing is not None else self._get_routing_enabled()
        reflection = (
            enable_self_reflection
            if enable_self_reflection is not None
            else self._get_reflection_enabled()
        )

        # Step 1: Route query to appropriate tool
        tool = self.router.select_tool(query) if routing else "retrieval"

        # Step 2: Execute selected tool
        if tool == "retrieval":
            result = self._handle_retrieval(query, top_k)
        elif tool == "calculation":
            result = self._handle_calculation(query)
        elif tool == "reasoning":
            result = self._handle_reasoning(query, top_k)
        else:
            result = self._handle_retrieval(query, top_k)  # Fallback

        # Step 3: Self-reflection and iterative refinement
        if reflection and result.get("answer"):
            context = "\n".join([
                doc.content for doc in result["documents"][:3]
            ])

            refined_answer = self.router.self_reflect_loop(
                query=query,
                answer=result["answer"],
                context=context,
                max_iterations=self._get_max_iterations(),
                quality_threshold=self._get_quality_threshold()
            )

            result["answer"] = refined_answer
            result["refined"] = True

        return result
```

Tool handlers

Retrieval tool

Standard vector search with RAG generation:
```python
def _handle_retrieval(self, query: str, top_k: int):
    # Retrieve relevant documents from vector database
    documents = self._retrieve(query, top_k)

    # Generate answer using LLM with retrieved context
    answer = self._generate_answer(query, documents)

    return {"documents": documents, "answer": answer, "tool": "retrieval"}
```
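The `_generate_answer` helper is not shown above; it typically concatenates the top retrieved documents into a grounded prompt before calling the generator. A sketch of that prompt-building step (`build_rag_prompt` and its wording are assumptions, not the library's code):

```python
def build_rag_prompt(query: str, contexts: list, context_top_k: int = 5) -> str:
    """Concatenate the top retrieved document texts into a grounded prompt."""
    context = "\n\n".join(contexts[:context_top_k])
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Capping the context at `context_top_k` documents (5 in the config above) keeps the prompt within the generator's token budget.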

Calculation tool

Direct LLM reasoning for math/logic:
```python
def _handle_calculation(self, query: str):
    prompt = f"""Solve this problem step by step:

{query}

Provide the calculation steps and final answer."""

    result = self.generator.run(prompt=prompt)
    answer = result.get("replies", [""])[0]

    return {"documents": [], "answer": answer, "tool": "calculation"}
```

Reasoning tool

Multi-hop retrieval with structured reasoning:
```python
def _handle_reasoning(self, query: str, top_k: int):
    # Retrieve context for reasoning
    documents = self._retrieve(query, top_k)
    context = "\n\n".join([doc.content for doc in documents[:5]])

    # Structured reasoning prompt
    prompt = f"""You are a helpful assistant that answers questions using step-by-step reasoning.

Context:
{context}

Question: {query}

Think through this step-by-step:
1. First, identify what information is relevant
2. Then, analyze the key points
3. Finally, synthesize into a comprehensive answer

Answer:"""

    result = self.generator.run(prompt=prompt)
    answer = result.get("replies", [""])[0]

    return {"documents": documents, "answer": answer, "tool": "reasoning"}
```

Self-reflection implementation

The AgenticRouter scores answers and refines them iteratively:
```python
class AgenticRouter:
    def self_reflect_loop(
        self,
        query: str,
        answer: str,
        context: str,
        max_iterations: int = 3,
        quality_threshold: int = 75
    ) -> str:
        current_answer = answer

        for iteration in range(max_iterations):
            # Score current answer
            score = self.score_answer(query, current_answer, context)

            if score >= quality_threshold:
                return current_answer  # Good enough

            # Refine answer
            current_answer = self._refine_answer(
                query, current_answer, context, score
            )

        return current_answer  # Return best attempt

    def score_answer(self, query: str, answer: str, context: str) -> int:
        prompt = f"""Rate the quality of this answer on a scale of 0-100:

Question: {query}
Context: {context}
Answer: {answer}

Consider:
- Completeness: Does it address all parts of the question?
- Accuracy: Is it supported by the context?
- Clarity: Is it well-structured?

Score (0-100):"""

        response = self.generator.run(prompt=prompt)
        # Parse score from response
        score = self._parse_score(response)
        return score
```
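The `_parse_score` helper is not shown; LLMs rarely return a bare integer, so a robust version extracts the first number in the reply and clamps it to the 0-100 scale. A minimal sketch (an assumption about the implementation, not the library's code):

```python
import re

def parse_score(reply: str, default: int = 0) -> int:
    """Pull the first integer out of the LLM reply and clamp it to [0, 100]."""
    match = re.search(r"\d+", reply)
    if match is None:
        return default  # unparseable reply: treat as lowest quality
    return max(0, min(100, int(match.group())))
```

Defaulting an unparseable reply to 0 biases the loop toward another refinement pass rather than silently accepting a bad answer.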

Runtime control

You can enable/disable routing and reflection at query time:
```python
# No routing, no reflection (traditional RAG)
result = pipeline.run(
    query="What is AI?",
    enable_routing=False,
    enable_self_reflection=False
)

# Routing only (single-pass with tool selection)
result = pipeline.run(
    query="Calculate 15% of 240",
    enable_routing=True,
    enable_self_reflection=False
)

# Full agentic (routing + reflection)
result = pipeline.run(
    query="Explain quantum computing",
    enable_routing=True,
    enable_self_reflection=True
)
```

Cost and latency

Routing:
  • LLM calls: 1 per query for tool selection
  • Latency: ~200ms
  • Cost: ~$0.0001 per query (Groq)

Self-reflection:
  • LLM calls: 1-3 for scoring + refinement per iteration
  • Latency: ~500ms per iteration
  • Cost: ~$0.0003 per iteration
  • Trade-off: higher-quality answers justify the cost for complex queries

Example comparison:
Traditional RAG:
  • 1 retrieval + 1 generation ≈ $0.0002
Agentic RAG (with reflection):
  • 1 routing + 1 retrieval + 1 generation + 2 reflection iterations ≈ $0.0009
Roughly 4-5x the cost, but significantly better quality for complex queries

When to use agentic RAG

Use agentic RAG when

  • Queries are complex and multi-hop
  • Answer quality matters more than latency
  • You need self-correction for errors
  • Questions may require calculation or reasoning

Use traditional RAG when

  • Queries are simple and factual
  • Latency is critical (under 500ms)
  • Cost must be minimized
  • Single retrieval pass is sufficient

Enable routing when

  • Mixed query types (factual, math, reasoning)
  • You want to avoid unnecessary retrieval
  • Tool selection improves answer quality

Enable reflection when

  • High-stakes answers (legal, medical)
  • Complex questions benefit from refinement
  • Initial answers are often incomplete
  • Quality threshold must be met

Production tips

1. Tune quality threshold

  • Start with 75 (default)
  • Lower (60-70) for faster responses with acceptable quality
  • Higher (80-90) for critical applications
  • Monitor score distribution to calibrate
2. Limit max iterations

  • Default: 3 iterations
  • Lower (1-2) to control cost
  • Higher (4-5) for very complex queries
  • Timeout after max iterations regardless of score
3. Cache routing decisions

  • Similar queries often route to the same tool
  • Cache query → tool mappings
  • Reduces routing LLM calls by ~60%
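The caching idea can be sketched with a normalized-key dictionary in front of the router (an illustration only; a production system might use a persistent or embedding-based semantic cache instead):

```python
routing_cache: dict = {}

def normalize(query: str) -> str:
    """Cheap normalization so trivially similar queries share a cache entry."""
    return " ".join(query.lower().split())

def cached_select_tool(query: str, select_tool) -> str:
    """Route via the cache; fall back to the (LLM-backed) select_tool on a miss."""
    key = normalize(query)
    if key not in routing_cache:
        routing_cache[key] = select_tool(query)
    return routing_cache[key]

# Usage with a stand-in router: the second call is a cache hit (no LLM call)
calls = []
fake_select = lambda q: (calls.append(q), "calculation")[1]
print(cached_select_tool("What is 15% of 240?", fake_select))   # "calculation"
print(cached_select_tool("what is  15% of 240?", fake_select))  # "calculation" (cached)
print(f"LLM calls made: {len(calls)}")                          # 1
```

Case-folding and whitespace collapsing only catch exact-duplicate phrasings; the ~60% reduction cited above would require fuzzier matching, such as caching on query embeddings.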
4. Monitor tool usage

  • Track which tools are used most
  • Optimize retrieval pipeline for high-usage tools
  • Disable unused tools to simplify routing
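A lightweight way to track tool usage is a counter keyed on the `tool` field each handler returns (a sketch, not part of the library):

```python
from collections import Counter

tool_usage = Counter()

def record_result(result: dict) -> None:
    """Tally which tool the agent selected for each answered query."""
    tool_usage[result.get("tool", "unknown")] += 1

for r in [{"tool": "retrieval"}, {"tool": "calculation"}, {"tool": "retrieval"}]:
    record_result(r)

print(tool_usage.most_common())  # [('retrieval', 2), ('calculation', 1)]
```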
