Agentic RAG introduces iterative decision-making into the retrieval pipeline. Instead of a single retrieve-then-generate pass, the system evaluates its candidate answer and decides whether to retrieve more evidence, reflect on the answer, or finalize — repeating until quality is acceptable or the iteration limit is reached.

How it works

The pipeline uses AgenticRouter (from components/agentic_router.py) as the central decision-maker. The router is a ChatGroq LLM that returns structured JSON decisions at each step.

Decision loop

Router decisions

The AgenticRouter.route() method receives:
  • query: the original user question
  • has_documents: whether documents have already been retrieved
  • current_answer: the current draft answer (if any)
  • iteration: current iteration number (1-indexed)
  • max_iterations: the hard loop limit
It returns {"action": "search|reflect|generate", "reasoning": "..."}. When iteration >= max_iterations, the router automatically returns "generate" regardless of quality — this is the safety mechanism that prevents infinite loops.
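The guard and the JSON validation can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the function name `route_decision` and the inline LLM call are assumptions for the sketch.

```python
import json

VALID_ACTIONS = {"search", "reflect", "generate"}

def route_decision(llm, prompt: str, iteration: int, max_iterations: int) -> dict:
    """One routing step with a hard iteration guard (illustrative sketch)."""
    # Safety mechanism: at the iteration cap, always finalize regardless of quality.
    if iteration >= max_iterations:
        return {"action": "generate", "reasoning": "Iteration limit reached"}

    raw = llm.invoke(prompt).content
    decision = json.loads(raw)  # invalid JSON raises JSONDecodeError (a ValueError)
    if decision.get("action") not in VALID_ACTIONS:
        raise ValueError(f"Unrecognized action: {decision.get('action')!r}")
    return decision
```

Note that the guard fires before the LLM is consulted, so the final iteration never spends a routing call.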

State machine

Pipeline implementation

The agentic RAG pipeline orchestrates the router, retriever, and generator:
src/vectordb/langchain/agentic_rag/base.py
from langchain_groq import ChatGroq
from vectordb.langchain.components import AgenticRouter
from vectordb.langchain.utils import EmbedderHelper, RAGHelper

class AgenticRAGPipeline:
    """Agentic RAG pipeline with iterative retrieval and reflection."""

    def __init__(self, config: dict) -> None:
        self.config = config
        
        # Initialize components
        self.embedder = EmbedderHelper.create_embedder(config)
        self.llm = RAGHelper.create_llm(config)
        
        # Create router with deterministic LLM
        router_llm = ChatGroq(
            model=config.get("agentic_rag", {}).get("model", "llama-3.3-70b-versatile"),
            temperature=0.0,  # Deterministic routing
        )
        self.router = AgenticRouter(router_llm)
        
        self.max_iterations = config.get("agentic_rag", {}).get("max_iterations", 3)

    def run(self, query: str) -> dict:
        """Execute agentic RAG pipeline.
        
        Args:
            query: User's question.
        
        Returns:
            Dictionary with final answer and iteration history.
        """
        documents = []
        current_answer = None
        iteration = 1
        history = []

        while iteration <= self.max_iterations:
            # Route to next action
            decision = self.router.route(
                query=query,
                has_documents=len(documents) > 0,
                current_answer=current_answer,
                iteration=iteration,
                max_iterations=self.max_iterations,
            )

            action = decision["action"]
            reasoning = decision["reasoning"]
            history.append({"iteration": iteration, "action": action, "reasoning": reasoning})

            if action == "search":
                # Retrieve documents from vector store
                new_docs = self.retrieve(query)
                documents.extend(new_docs)
                
                # Generate draft answer
                if self.llm:
                    current_answer = RAGHelper.generate(self.llm, query, documents)

            elif action == "reflect":
                # Evaluate and improve current answer
                if self.llm and current_answer:
                    current_answer = self.reflect_and_improve(
                        query, current_answer, documents
                    )

            elif action == "generate":
                # Finalize and return answer
                break

            iteration += 1

        # Generate final answer if not already done
        if not current_answer and self.llm and documents:
            current_answer = RAGHelper.generate(self.llm, query, documents)

        return {
            "query": query,
            "answer": current_answer or "No answer generated",
            "documents": documents,
            "iterations": iteration,
            "history": history,
        }

    def retrieve(self, query: str) -> list:
        """Retrieve documents from the vector store."""
        query_embedding = EmbedderHelper.embed_query(self.embedder, query)
        # Backend-specific vector store lookup goes here (see supported backends)
        return []

    def reflect_and_improve(self, query: str, answer: str, documents: list) -> str:
        """Reflect on answer quality and improve."""
        reflection_prompt = f"""Given the question and current answer, identify gaps or inaccuracies.

Question: {query}
Current Answer: {answer}

What is missing or incorrect?"""
        
        reflection = self.llm.invoke(reflection_prompt).content
        
        improvement_prompt = f"""Improve the answer based on the reflection.

Question: {query}
Current Answer: {answer}
Reflection: {reflection}

Improved Answer:"""
        
        improved = self.llm.invoke(improvement_prompt).content
        return improved

Configuration

agentic_rag:
  max_iterations: 3               # Hard iteration cap
  model: "llama-3.3-70b-versatile"

search:
  top_k: 10

llm:
  api_key: "${GROQ_API_KEY}"
  temperature: 0.0                # Deterministic routing

rag:
  enabled: true
  temperature: 0.7                # Creative generation
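For reference, the YAML above corresponds to a plain nested dict like the one below, which is what `AgenticRAGPipeline.__init__` reads with `config.get(...)`. The `expand_env` helper is an illustrative assumption; the actual library may resolve `${GROQ_API_KEY}` references differently.

```python
import os

def expand_env(value: str) -> str:
    """Resolve "${VAR}"-style placeholders from the environment (assumption)."""
    return os.path.expandvars(value)

config = {
    "agentic_rag": {"max_iterations": 3, "model": "llama-3.3-70b-versatile"},
    "search": {"top_k": 10},
    "llm": {"api_key": expand_env("${GROQ_API_KEY}"), "temperature": 0.0},
    "rag": {"enabled": True, "temperature": 0.7},
}

# The pipeline falls back to defaults when keys are absent:
max_iters = config.get("agentic_rag", {}).get("max_iterations", 3)
```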

When to use it

  • Complex multi-hop questions where the answer requires combining information from multiple retrieval passes
  • High-stakes applications where answer completeness and grounding must be verified before delivery
  • Workflows where single-pass retrieval and generation consistently underperform on hard questions

When not to use it

  • Strict low-latency endpoints where each LLM call (routing + reflection) adds hundreds of milliseconds
  • Simple factual lookups where one-shot retrieval consistently finds the answer

Tradeoffs

  • Quality: Potentially the highest on complex tasks requiring multiple evidence sources
  • Latency: Highest of all features; each iteration adds two or more LLM calls
  • Cost: Highest; routing, retrieval, and reflection each consume LLM tokens

Settings to tune first

agentic_rag.max_iterations (integer, default: 3)
Start with 2; increase only if quality does not converge at lower values. Without a hard loop cap, the router can cycle indefinitely, so always set a finite limit.

llm.temperature (float, default: 0.0)
Use 0.0 for routing (deterministic decisions); 0.7 for answer generation (creative).

agentic_rag.model (string, default: "llama-3.3-70b-versatile")
The routing model directly determines how well the LLM navigates the state machine. Recommended: llama-3.3-70b-versatile or gpt-4.

Router prompt template

The AgenticRouter.ROUTING_TEMPLATE is a structured prompt that includes the current query, has_documents flag, current_answer, and iteration count. The LLM must return a JSON object with "action" and "reasoning" keys.
ROUTING_TEMPLATE = """You are a query routing agent. Given a query and optional current answer, decide what action to take next.

Current State:
- Query: {query}
- Has Retrieved Documents: {has_documents}
- Current Answer: {current_answer}
- Iteration: {iteration}/{max_iterations}

Your task is to decide ONE of the following actions:
1. 'search': Retrieve documents from vector database (choose this if you need more information)
2. 'reflect': Verify and improve the current answer (choose this to validate answer quality)
3. 'generate': Create final answer (choose this when you have enough information)

Return a JSON object with this exact format:
{{"action": "search|reflect|generate", "reasoning": "brief explanation"}}

Do NOT include any other text. Return ONLY the JSON object."""
Invalid JSON or unrecognized action values raise ValueError for fast debugging.

Example execution trace

from vectordb.langchain.agentic_rag import AgenticRAGPipeline

pipeline = AgenticRAGPipeline(config)
result = pipeline.run("What are the three laws of thermodynamics?")

print(result["history"])
# [
#   {"iteration": 1, "action": "search", "reasoning": "No documents retrieved yet"},
#   {"iteration": 2, "action": "reflect", "reasoning": "Need to verify completeness"},
#   {"iteration": 3, "action": "generate", "reasoning": "Sufficient information gathered"}
# ]

print(result["answer"])
# "The three laws of thermodynamics are: 1) Energy cannot be created or destroyed..."

Common pitfalls

No hard loop cap: Without max_iterations, the loop can cycle indefinitely. Always set a finite limit.
Ambiguous routing prompts: If the prompt does not clearly define when to choose "search" vs "reflect" vs "generate", the router oscillates without making progress.
Missing observability: Log every routing decision, action, and reasoning string. The AgenticRouter logs at INFO level for decisions and DEBUG level for full prompts.
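A minimal observability sketch using Python's standard logging module, matching the INFO/DEBUG split described above; the helper name `log_decision` is illustrative, not part of the library:

```python
import logging

logger = logging.getLogger("agentic_rag")

def log_decision(iteration: int, decision: dict, prompt: str) -> None:
    """Log each routing decision at INFO and the full prompt at DEBUG."""
    logger.info(
        "iteration=%d action=%s reasoning=%s",
        iteration,
        decision["action"],
        decision["reasoning"],
    )
    logger.debug("routing prompt:\n%s", prompt)
```

Logging the reasoning string alongside the action makes oscillating routers (e.g. alternating search/reflect without progress) easy to spot in traces.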

Backends supported

Chroma, Milvus, Pinecone, Qdrant, Weaviate.

Next steps

Query enhancement

Use query enhancement for lighter-weight recall improvements without the full agentic loop

Reranking

Use reranking when a faster single-pass pipeline is sufficient for simpler queries

Components

Explore the AgenticRouter and QueryEnhancer components in detail

Semantic search

Start with baseline semantic search before adding agentic patterns
