Retrieval-Augmented Generation (RAG) systems combine search with LLM generation. GEPA optimizes every component of RAG pipelines: query reformulation, retrieval strategies, reranking, context synthesis, and answer generation.

Key Results

HotpotQA Multi-Hop

Optimized query generation for second-hop retrieval with detailed strategies

Healthcare RAG

Multi-agent system for diabetes and COPD with specialized retrievers

Vector Store Agnostic

Works with ChromaDB, Weaviate, Qdrant, Pinecone, and more

Weaviate Integration

Official tutorial for reranker optimization in RAG pipelines

What Can Be Optimized?

In a RAG pipeline, GEPA can optimize:
  1. Query Reformulation: Transform user queries for better retrieval
  2. Retrieval Prompts: Instructions for what to retrieve and why
  3. Reranking Strategies: How to prioritize retrieved documents
  4. Context Synthesis: How to combine multiple documents
  5. Answer Generation: Prompts for generating final answers
  6. Multi-Hop Logic: Strategies for iterative retrieval
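One way to picture this list: in GEPA's text-evolution framing, each component is just a named text field in the candidate being evolved. A minimal illustrative candidate (all prompt texts below are placeholders, not GEPA output):

```python
# Illustrative only: each optimizable RAG component as a named text field.
# GEPA mutates these strings based on reflective feedback from evaluation.
seed_candidate = {
    "query_reformulation": "Rewrite the question as a search query: {question}",
    "retrieval_prompt": "Retrieve passages that directly answer: {query}",
    "reranking_strategy": "Rank passages by how completely they answer the question.",
    "context_synthesis": "Merge the passages, dropping redundant sentences.",
    "answer_generation": "Answer using only the context:\n{context}\n\nQ: {question}",
    "multi_hop_logic": "If the answer needs a second entity, issue a follow-up query.",
}
```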

RAG Optimization with DSPy

The most powerful way to optimize RAG pipelines is with DSPy:
import dspy

# Configure a retrieval model first, e.g.:
# dspy.settings.configure(rm=dspy.ColBERTv2(url="http://..."))

# Define RAG program
class RAGProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Optimize with GEPA
optimizer = dspy.GEPA(
    metric=answer_correctness,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

optimized_rag = optimizer.compile(
    student=RAGProgram(),
    trainset=train_questions,
    valset=val_questions,
)

What Gets Optimized?

DSPy+GEPA automatically optimizes:
  • Retrieval query formulation
  • Number of passages to retrieve (adaptive k)
  • Instructions for context usage in generation
  • Chain-of-thought reasoning strategies

Generic RAG Adapter

For non-DSPy RAG systems, use the Generic RAG Adapter:
import gepa
from gepa.adapters.generic_rag_adapter import GenericRAGAdapter
import chromadb

# Initialize vector store
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("my_documents")

# Create adapter
adapter = GenericRAGAdapter(
    collection=collection,
    task_lm="openai/gpt-4.5",
    embedding_model="openai/text-embedding-3-large",
)

# Optimize
result = gepa.optimize(
    adapter=adapter,
    trainset=train_questions,
    valset=val_questions,
    max_metric_calls=100,
)

Supported Vector Stores

The Generic RAG Adapter works with:
  • ChromaDB: Local and client-server
  • Weaviate: Open-source vector database
  • Qdrant: High-performance vector search
  • Pinecone: Managed vector database
  • Custom: Implement the vector store interface
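For the Custom option, the store only needs to expose a small query surface. The adapter's actual interface is defined in its source; as a hypothetical sketch (names and signatures here are illustrative, not the adapter's real API), something shaped like this is enough:

```python
from typing import Protocol

class VectorStore(Protocol):
    """Hypothetical minimal interface a custom store could satisfy."""
    def query(self, text: str, n_results: int) -> list[str]: ...

class InMemoryStore:
    """Toy implementation: ranks documents by shared-word count."""
    def __init__(self, documents: list[str]):
        self.documents = documents

    def query(self, text: str, n_results: int) -> list[str]:
        words = set(text.lower().split())
        scored = sorted(
            self.documents,
            key=lambda doc: len(words & set(doc.lower().split())),
            reverse=True,
        )
        return scored[:n_results]

store: VectorStore = InMemoryStore(["GEPA evolves prompts", "Cats sleep a lot"])
```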

Use Case 1: Multi-Hop Question Answering

Problem: HotpotQA requires retrieving information across multiple documents to answer complex questions.

Example: Second-Hop Query Generation

Given:
  • Original question: “What is the population of the archipelago containing Arco da Calheta?”
  • First-hop summary: “Arco da Calheta is a civil parish in Madeira with population 3,226 in 2011”
Naive approach:
  • Just use the original question → retrieves more documents about Arco da Calheta
GEPA-optimized approach:
  • Generate second-hop query: “Madeira archipelago population in 2011”
  • This retrieves documents about the wider region, not just the parish

Evolved Strategy (Excerpt)

Your task is to generate a new search query optimized for the **second hop** 
of a multi-hop retrieval system.

Key Observations from Examples and Feedback:

- First-hop documents often cover one entity or aspect in the question
- Remaining relevant documents often involve connected or higher-level concepts 
  mentioned in summary_1 but not explicitly asked in the original question
- The query should be formulated to explicitly target these *missing*, but 
  logically linked, documents

How to Build the Query:

1. Identify the entities or topics mentioned in summary_1 that appear related 
   but different from first-hop documents
2. Reframe the query to explicitly mention these broader or related entities 
   connected to the original question
3. Include relevant key context from the question to maintain specificity, but 
   shift focus to the missing piece
4. The goal is to retrieve documents that link or complement what was retrieved initially

[...]
See full prompt in README.md:202-247.

Implementation

import dspy

class MultiHopRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
        self.generate_query = dspy.ChainOfThought(
            "question, summary_1 -> query"  # Second-hop query
        )
        self.answer = dspy.ChainOfThought(
            "question, context -> answer"
        )
    
    def forward(self, question):
        # First hop: use original question
        passages_1 = self.retrieve(question).passages
        summary_1 = summarize(passages_1)  # summarize: your passage-summarization helper
        
        # Second hop: optimized query generation
        second_hop_query = self.generate_query(
            question=question,
            summary_1=summary_1
        ).query
        
        passages_2 = self.retrieve(second_hop_query).passages
        
        # Combine and answer
        all_context = passages_1 + passages_2
        return self.answer(question=question, context=all_context)

# Optimize with GEPA
optimizer = dspy.GEPA(metric=f1_score, max_metric_calls=150)
optimized_rag = optimizer.compile(
    student=MultiHopRAG(),
    trainset=hotpotqa_train,
    valset=hotpotqa_val,
)

Use Case 2: Healthcare Multi-Agent RAG

Problem: General medical RAG systems struggle with specialized disease knowledge. Solution: GEPA discovers a multi-agent architecture with disease-specific experts.

Architecture

class HealthcareRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Specialized sub-agents
        self.diabetes_expert = DiabetesExpert()
        self.copd_expert = COPDExpert()
        self.lead_agent = LeadAgent()
    
    def forward(self, question):
        # Classify query
        disease = self.lead_agent.classify(question)
        
        # Route to expert
        if disease == "diabetes":
            expert = self.diabetes_expert
        elif disease == "copd":
            expert = self.copd_expert
        else:
            expert = self.lead_agent  # General fallback
        
        # Expert does specialized retrieval
        context = expert.retrieve(question)
        answer = expert.reason(question, context)
        
        # Lead agent validates and synthesizes
        return self.lead_agent.synthesize(question, answer, disease)


class DiabetesExpert(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retriever = dspy.Retrieve(k=5)
        self.reason = dspy.ChainOfThought(
            "question, context -> answer"
        )
    
    def retrieve(self, question):
        # Diabetes-specific retrieval strategy (optimized by GEPA);
        # enhance_query_for_diabetes is a domain-specific query rewriter
        enhanced_query = self.enhance_query_for_diabetes(question)
        return self.retriever(enhanced_query).passages

Results

  • Improved retrieval precision for disease-specific queries
  • Better answer quality through specialized reasoning
  • Graceful fallback for general medical questions
Read the full case study →

Use Case 3: Reranking Optimization

Reranking retrieved documents is crucial for RAG quality. GEPA optimizes reranking prompts.

Weaviate Tutorial

Official tutorial from Weaviate on optimizing listwise rerankers:
import dspy
from dspy.retrieve.weaviate_rm import WeaviateRM  # configure via dspy.settings.configure(rm=...)

class RerankedRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=20)  # Over-retrieve
        self.rerank = dspy.ChainOfThought(
            "question, passages -> ranked_passages"
        )
        self.answer = dspy.ChainOfThought(
            "question, context -> answer"
        )
    
    def forward(self, question):
        # Retrieve many candidates
        passages = self.retrieve(question).passages
        
        # Rerank to top-5
        ranked = self.rerank(
            question=question,
            passages=passages
        ).ranked_passages[:5]
        
        return self.answer(question=question, context=ranked)

# Optimize reranking strategy
optimizer = dspy.GEPA(metric=answer_f1, max_metric_calls=100)
optimized_rag = optimizer.compile(
    student=RerankedRAG(),
    trainset=train_data,
    valset=val_data,
)
Watch the video tutorial → | View the notebook →

optimize_anything for RAG

For more control, use optimize_anything directly:
import gepa.optimize_anything as oa
from chromadb import Client

# Assumes `llm` (a text-generation client) and `vector_store` (a vector-store
# client returning Chroma-style results) are initialized elsewhere

def evaluate_rag(candidate: dict, example: dict) -> tuple[float, dict]:
    """
    Evaluate RAG system with custom retrieval and generation prompts.
    
    candidate: {
        "query_prompt": "...",
        "answer_prompt": "..."
    }
    """
    # Reformulate query using optimized prompt
    query = llm.generate(
        candidate["query_prompt"].format(question=example["question"])
    )
    
    # Retrieve
    results = vector_store.query(query, n_results=5)
    context = "\n\n".join(results["documents"][0])
    
    # Generate answer using optimized prompt
    answer = llm.generate(
        candidate["answer_prompt"].format(
            context=context,
            question=example["question"]
        )
    )
    
    # Score
    f1 = compute_f1(answer, example["answer"])
    
    return f1, {
        "ReformulatedQuery": query,
        "RetrievedDocs": len(results["documents"][0]),
        "GeneratedAnswer": answer,
        "GoldAnswer": example["answer"],
    }

# Initial prompts
seed_prompts = {
    "query_prompt": "Reformulate this question for search: {question}",
    "answer_prompt": "Context: {context}\n\nQuestion: {question}\nAnswer:",
}

result = oa.optimize_anything(
    seed_candidate=seed_prompts,
    evaluator=evaluate_rag,
    dataset=train_questions,
    valset=val_questions,
    objective="Optimize RAG prompts for accuracy and relevance.",
    config=oa.GEPAConfig(
        engine=oa.EngineConfig(max_metric_calls=150),
        reflection=oa.ReflectionConfig(reflection_lm="openai/gpt-5"),
    ),
)

Common RAG Failure Modes and GEPA Solutions

Problem: User queries don’t match document phrasing.
GEPA Solution: Learns query reformulation strategies that bridge the vocabulary gap.
Example: “How do I fix X?” → “X troubleshooting guide error solutions”

Problem: Top-k retrieval returns off-topic documents.
GEPA Solution: Optimizes both the retrieval prompt and reranking logic to surface relevant content.

Problem: Too many documents overwhelm the LLM’s context window.
GEPA Solution: Discovers strategies for context synthesis and compression.

Problem: Complex questions require multiple retrieval steps.
GEPA Solution: Evolves iterative retrieval strategies that know when to do second or third hops.

Problem: General retrieval strategies fail on specialized domains.
GEPA Solution: Learns domain-specific retrieval and reasoning patterns (see Healthcare RAG example).
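For the vocabulary-gap failure mode, the evolved fix is typically a query-reformulation prompt. A hypothetical template of that shape (illustrative only, not a prompt GEPA actually produced):

```python
# Hypothetical reformulation template; a real one would be evolved by GEPA
QUERY_REFORMULATION_PROMPT = """\
Rewrite the user's question as a keyword-dense search query that matches
how documentation is phrased. Drop conversational filler; keep entities,
product names, and error terms.

Question: {question}
Search query:"""

# The filled template is what the task LM would receive
prompt = QUERY_REFORMULATION_PROMPT.format(question="How do I fix X?")
```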

Best Practices

  • Start simple: Even a simple retrieve-then-generate pipeline is enough. GEPA will evolve sophistication.
  • Pick a reliable metric: F1 score, exact match, or ROUGE for QA tasks. For open-ended generation, use LLM-as-judge.
  • Optimize components jointly: Retrieve + rerank + generate should be optimized together so they co-adapt.
  • Expose retrieval in feedback: Return the actual retrieved documents in side info so GEPA can see what was retrieved.
  • Hold out validation data: RAG systems must generalize to unseen questions. Always use a valset.
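The `compute_f1` used in the metric examples on this page is not defined here; a standard token-overlap (SQuAD-style) version looks like this, assuming whitespace tokenization:

```python
from collections import Counter

def compute_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Production metrics usually also normalize punctuation and articles before comparing; this sketch omits that step.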

Integration Examples

LangChain RAG

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
import gepa.optimize_anything as oa

def evaluate_langchain_rag(candidate: dict, example: dict) -> tuple[float, dict]:
    # Create chain with optimized prompt
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        chain_type_kwargs={"prompt": candidate["qa_prompt"]}
    )
    
    answer = qa_chain.run(example["question"])
    score = compute_score(answer, example["answer"])
    
    return score, {"Answer": answer}

result = oa.optimize_anything(
    seed_candidate={"qa_prompt": default_prompt},
    evaluator=evaluate_langchain_rag,
    dataset=train_data,
    valset=val_data,
)

LlamaIndex RAG

from llama_index import VectorStoreIndex, ServiceContext
import gepa.optimize_anything as oa

def evaluate_llamaindex_rag(candidate: dict, example: dict) -> tuple[float, dict]:
    # Build index with optimized query prompt
    service_context = ServiceContext.from_defaults(
        llm=llm,
        system_prompt=candidate["system_prompt"]
    )
    
    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context
    )
    
    query_engine = index.as_query_engine()
    response = query_engine.query(example["question"])
    
    score = compute_score(response.response, example["answer"])
    
    return score, {"Response": response.response}

result = oa.optimize_anything(
    seed_candidate={"system_prompt": "You are a helpful assistant."},
    evaluator=evaluate_llamaindex_rag,
    dataset=train_data,
    valset=val_data,
)

Production Deployments

OCR Document Understanding

Intrinsic Labs achieved up to 38% OCR error reduction using GEPA-optimized prompts for document extraction:
  • Gemini 2.5 Pro
  • Gemini 2.5 Flash
  • Gemini 2.0 Flash
Read the research paper →

Auto-Analyst Platform

FireBird Technologies optimized their Auto-Analyst platform with GEPA:
  • 4 specialized agents: Pre-processing, Statistical Analytics, ML, Visualization
  • Optimized 4 primary signatures covering 90% of code runs
  • Tested across multiple model providers to avoid overfitting
Read the case study →

Metrics for RAG Evaluation

Answer Correctness

F1, Exact Match, ROUGE-L for QA tasks

Retrieval Quality

Precision@k, Recall@k, MRR for retrieved documents

Relevance

LLM-as-judge scoring answer relevance and faithfulness

Latency

End-to-end response time including retrieval and generation
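The retrieval-quality metrics above have short reference implementations. A sketch, treating each retrieved item as a document ID and `relevant` as the gold set:

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def mrr(retrieved: list, relevant: set) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```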

Example Metric Implementation

def rag_metric(example, prediction, trace=None) -> float:
    """
    Composite RAG metric combining correctness and relevance.
    """
    # Answer correctness (F1)
    f1 = compute_f1(prediction.answer, example.answer)
    
    # Retrieval precision (are retrieved docs relevant?)
    precision = evaluate_retrieval_precision(
        retrieved=prediction.retrieved_docs,
        gold_docs=example.relevant_docs
    )
    
    # Answer faithfulness (is answer grounded in context?)
    faithfulness = llm_judge(
        context=prediction.context,
        answer=prediction.answer
    )
    
    # Weighted combination
    return 0.5 * f1 + 0.25 * precision + 0.25 * faithfulness

Next Steps

DSPy + GEPA Tutorial

Complete RAG optimization walkthrough

Generic RAG Adapter

Use GEPA with any vector store

Prompt Optimization

Optimize individual prompts

Agent Architecture

Discover multi-agent architectures
