Agentic RAG introduces iterative decision-making into the retrieval pipeline. Instead of a single retrieve-then-generate pass, the system evaluates its candidate answer and decides whether to retrieve more evidence, reflect on the answer, or finalize — repeating until quality is acceptable or the iteration limit is reached.

How it works

The pipeline uses AgenticRouter (from components/agentic_router.py) as the central decision-maker. The router is a ChatGroq LLM that returns structured JSON decisions at each step.

Decision loop

Router decisions

The AgenticRouter.route() method receives:
  • query: the original user question
  • has_documents: whether documents have already been retrieved
  • current_answer: the current draft answer (if any)
  • iteration: current iteration number (1-indexed)
  • max_iterations: the hard loop limit
It returns {"action": "search|reflect|generate", "reasoning": "..."}. When iteration >= max_iterations, the router automatically returns "generate" regardless of quality — this is the safety mechanism that prevents infinite loops.
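The guard and the JSON validation can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the function name `route_decision` and the inline LLM call are assumptions for the sketch.

```python
import json

VALID_ACTIONS = {"search", "reflect", "generate"}

def route_decision(llm, prompt: str, iteration: int, max_iterations: int) -> dict:
    """One routing step with a hard iteration guard (illustrative sketch)."""
    # Safety mechanism: at the iteration cap, always finalize regardless of quality.
    if iteration >= max_iterations:
        return {"action": "generate", "reasoning": "Iteration limit reached"}

    raw = llm.invoke(prompt).content
    decision = json.loads(raw)  # invalid JSON raises JSONDecodeError (a ValueError)
    if decision.get("action") not in VALID_ACTIONS:
        raise ValueError(f"Unrecognized action: {decision.get('action')!r}")
    return decision
```

Note that the guard fires before the LLM is consulted, so the final iteration never spends a routing call.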

State machine

Pipeline implementation

The agentic RAG pipeline orchestrates the router, retriever, and generator:
src/vectordb/langchain/agentic_rag/base.py
from langchain_groq import ChatGroq
from vectordb.langchain.components import AgenticRouter
from vectordb.langchain.utils import EmbedderHelper, RAGHelper

class AgenticRAGPipeline:
    """Agentic RAG pipeline with iterative retrieval and reflection."""

    def __init__(self, config: dict) -> None:
        self.config = config
        
        # Initialize components
        self.embedder = EmbedderHelper.create_embedder(config)
        self.llm = RAGHelper.create_llm(config)
        
        # Create router with deterministic LLM
        router_llm = ChatGroq(
            model=config.get("agentic_rag", {}).get("model", "llama-3.3-70b-versatile"),
            temperature=0.0,  # Deterministic routing
        )
        self.router = AgenticRouter(router_llm)
        
        self.max_iterations = config.get("agentic_rag", {}).get("max_iterations", 3)

    def run(self, query: str) -> dict:
        """Execute agentic RAG pipeline.
        
        Args:
            query: User's question.
        
        Returns:
            Dictionary with final answer and iteration history.
        """
        documents = []
        current_answer = None
        iteration = 1
        history = []

        while iteration <= self.max_iterations:
            # Route to next action
            decision = self.router.route(
                query=query,
                has_documents=len(documents) > 0,
                current_answer=current_answer,
                iteration=iteration,
                max_iterations=self.max_iterations,
            )

            action = decision["action"]
            reasoning = decision["reasoning"]
            history.append({"iteration": iteration, "action": action, "reasoning": reasoning})

            if action == "search":
                # Retrieve documents from vector store
                new_docs = self.retrieve(query)
                documents.extend(new_docs)
                
                # Generate draft answer
                if self.llm:
                    current_answer = RAGHelper.generate(self.llm, query, documents)

            elif action == "reflect":
                # Evaluate and improve current answer
                if self.llm and current_answer:
                    current_answer = self.reflect_and_improve(
                        query, current_answer, documents
                    )

            elif action == "generate":
                # Finalize and return answer
                break

            iteration += 1

        # Generate final answer if not already done
        if not current_answer and self.llm and documents:
            current_answer = RAGHelper.generate(self.llm, query, documents)

        return {
            "query": query,
            "answer": current_answer or "No answer generated",
            "documents": documents,
            "iterations": iteration,
            "history": history,
        }

    def retrieve(self, query: str) -> list:
        """Retrieve documents from the vector store."""
        query_embedding = EmbedderHelper.embed_query(self.embedder, query)
        # Backend-specific vector store lookup goes here (see supported backends)
        return []

    def reflect_and_improve(self, query: str, answer: str, documents: list) -> str:
        """Reflect on answer quality and improve."""
        reflection_prompt = f"""Given the question and current answer, identify gaps or inaccuracies.

Question: {query}
Current Answer: {answer}

What is missing or incorrect?"""
        
        reflection = self.llm.invoke(reflection_prompt).content
        
        improvement_prompt = f"""Improve the answer based on the reflection.

Question: {query}
Current Answer: {answer}
Reflection: {reflection}

Improved Answer:"""
        
        improved = self.llm.invoke(improvement_prompt).content
        return improved

Configuration

agentic_rag:
  max_iterations: 3               # Hard iteration cap
  model: "llama-3.3-70b-versatile"

search:
  top_k: 10

llm:
  api_key: "${GROQ_API_KEY}"
  temperature: 0.0                # Deterministic routing

rag:
  enabled: true
  temperature: 0.7                # Creative generation
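For reference, the YAML above corresponds to a plain nested dict like the one below, which is what `AgenticRAGPipeline.__init__` reads with `config.get(...)`. The `expand_env` helper is an illustrative assumption; the actual library may resolve `${GROQ_API_KEY}` references differently.

```python
import os

def expand_env(value: str) -> str:
    """Resolve "${VAR}"-style placeholders from the environment (assumption)."""
    return os.path.expandvars(value)

config = {
    "agentic_rag": {"max_iterations": 3, "model": "llama-3.3-70b-versatile"},
    "search": {"top_k": 10},
    "llm": {"api_key": expand_env("${GROQ_API_KEY}"), "temperature": 0.0},
    "rag": {"enabled": True, "temperature": 0.7},
}

# The pipeline falls back to defaults when keys are absent:
max_iters = config.get("agentic_rag", {}).get("max_iterations", 3)
```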

When to use it

  • Complex multi-hop questions where the answer requires combining information from multiple retrieval passes
  • High-stakes applications where answer completeness and grounding must be verified before delivery
  • Workflows where single-pass retrieval and generation consistently underperform on hard questions

When not to use it

  • Strict low-latency endpoints where each LLM call (routing + reflection) adds hundreds of milliseconds
  • Simple factual lookups where one-shot retrieval consistently finds the answer

Tradeoffs

  • Quality: Potentially the highest on complex tasks requiring multiple evidence sources
  • Latency: Highest of all features; each iteration adds two or more LLM calls
  • Cost: Highest; routing, retrieval, and reflection each consume LLM tokens

Settings to tune first

agentic_rag.max_iterations (integer, default: 3)
Start with 2; increase only if quality does not converge at lower values. Without a hard loop cap, the router can cycle indefinitely, so always set a finite limit.

llm.temperature (float, default: 0.0)
Use 0.0 for routing (deterministic decisions); 0.7 for answer generation (creative).

agentic_rag.model (string, default: "llama-3.3-70b-versatile")
The routing model directly determines how well the LLM navigates the state machine. Recommended: llama-3.3-70b-versatile or gpt-4.

Router prompt template

The AgenticRouter.ROUTING_TEMPLATE is a structured prompt that includes the current query, has_documents flag, current_answer, and iteration count. The LLM must return a JSON object with "action" and "reasoning" keys.
ROUTING_TEMPLATE = """You are a query routing agent. Given a query and optional current answer, decide what action to take next.

Current State:
- Query: {query}
- Has Retrieved Documents: {has_documents}
- Current Answer: {current_answer}
- Iteration: {iteration}/{max_iterations}

Your task is to decide ONE of the following actions:
1. 'search': Retrieve documents from vector database (choose this if you need more information)
2. 'reflect': Verify and improve the current answer (choose this to validate answer quality)
3. 'generate': Create final answer (choose this when you have enough information)

Return a JSON object with this exact format:
{{"action": "search|reflect|generate", "reasoning": "brief explanation"}}

Do NOT include any other text. Return ONLY the JSON object."""
Invalid JSON or unrecognized action values raise ValueError for fast debugging.

Example execution trace

from vectordb.langchain.agentic_rag import AgenticRAGPipeline

pipeline = AgenticRAGPipeline(config)
result = pipeline.run("What are the three laws of thermodynamics?")

print(result["history"])
# [
#   {"iteration": 1, "action": "search", "reasoning": "No documents retrieved yet"},
#   {"iteration": 2, "action": "reflect", "reasoning": "Need to verify completeness"},
#   {"iteration": 3, "action": "generate", "reasoning": "Sufficient information gathered"}
# ]

print(result["answer"])
# "The three laws of thermodynamics are: 1) Energy cannot be created or destroyed..."

Common pitfalls

No hard loop cap: Without max_iterations, the loop can cycle indefinitely. Always set a finite limit.
Ambiguous routing prompts: If the prompt does not clearly define when to choose "search" vs "reflect" vs "generate", the router oscillates without making progress.
Missing observability: Log every routing decision, action, and reasoning string. The AgenticRouter logs at INFO level for decisions and DEBUG level for full prompts.
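A minimal observability sketch using Python's standard logging module, matching the INFO/DEBUG split described above; the helper name `log_decision` is illustrative, not part of the library:

```python
import logging

logger = logging.getLogger("agentic_rag")

def log_decision(iteration: int, decision: dict, prompt: str) -> None:
    """Log each routing decision at INFO and the full prompt at DEBUG."""
    logger.info(
        "iteration=%d action=%s reasoning=%s",
        iteration,
        decision["action"],
        decision["reasoning"],
    )
    logger.debug("routing prompt:\n%s", prompt)
```

Logging the reasoning string alongside the action makes oscillating routers (e.g. alternating search/reflect without progress) easy to spot in traces.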

Backends supported

Chroma, Milvus, Pinecone, Qdrant, Weaviate.

Next steps

Query enhancement

Use query enhancement for lighter-weight recall improvements without the full agentic loop

Reranking

Use reranking when a faster single-pass pipeline is sufficient for simpler queries

Components

Explore the AgenticRouter and QueryEnhancer components in detail

Semantic search

Start with baseline semantic search before adding agentic patterns
