Retrieval-Augmented Generation (RAG) systems combine search with LLM generation. GEPA optimizes every component of RAG pipelines: query reformulation, retrieval strategies, reranking, context synthesis, and answer generation.

Key Results

HotpotQA Multi-Hop

Optimized query generation for second-hop retrieval with detailed strategies

Healthcare RAG

Multi-agent system for diabetes and COPD with specialized retrievers

Vector Store Agnostic

Works with ChromaDB, Weaviate, Qdrant, Pinecone, and more

Weaviate Integration

Official tutorial for reranker optimization in RAG pipelines

What Can Be Optimized?

In a RAG pipeline, GEPA can optimize:
  1. Query Reformulation: Transform user queries for better retrieval
  2. Retrieval Prompts: Instructions for what to retrieve and why
  3. Reranking Strategies: How to prioritize retrieved documents
  4. Context Synthesis: How to combine multiple documents
  5. Answer Generation: Prompts for generating final answers
  6. Multi-Hop Logic: Strategies for iterative retrieval
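One way to picture this list: in GEPA's text-evolution framing, each component is just a named text field in the candidate being evolved. A minimal illustrative candidate (all prompt texts below are placeholders, not GEPA output):

```python
# Illustrative only: each optimizable RAG component as a named text field.
# GEPA mutates these strings based on reflective feedback from evaluation.
seed_candidate = {
    "query_reformulation": "Rewrite the question as a search query: {question}",
    "retrieval_prompt": "Retrieve passages that directly answer: {query}",
    "reranking_strategy": "Rank passages by how completely they answer the question.",
    "context_synthesis": "Merge the passages, dropping redundant sentences.",
    "answer_generation": "Answer using only the context:\n{context}\n\nQ: {question}",
    "multi_hop_logic": "If the answer needs a second entity, issue a follow-up query.",
}
```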

RAG Optimization with DSPy

The most powerful way to optimize RAG pipelines is with DSPy:
import dspy

# Configure a retrieval model first, e.g.:
# dspy.settings.configure(rm=dspy.ColBERTv2(url="http://..."))

# Define RAG program
class RAGProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Optimize with GEPA
optimizer = dspy.GEPA(
    metric=answer_correctness,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

optimized_rag = optimizer.compile(
    student=RAGProgram(),
    trainset=train_questions,
    valset=val_questions,
)

What Gets Optimized?

DSPy+GEPA automatically optimizes:
  • Retrieval query formulation
  • Number of passages to retrieve (adaptive k)
  • Instructions for context usage in generation
  • Chain-of-thought reasoning strategies

Generic RAG Adapter

For non-DSPy RAG systems, use the Generic RAG Adapter:
import gepa
from gepa.adapters.generic_rag_adapter import GenericRAGAdapter
import chromadb

# Initialize vector store
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("my_documents")

# Create adapter
adapter = GenericRAGAdapter(
    collection=collection,
    task_lm="openai/gpt-4.5",
    embedding_model="openai/text-embedding-3-large",
)

# Optimize
result = gepa.optimize(
    adapter=adapter,
    trainset=train_questions,
    valset=val_questions,
    max_metric_calls=100,
)

Supported Vector Stores

The Generic RAG Adapter works with:
  • ChromaDB: Local and client-server
  • Weaviate: Open-source vector database
  • Qdrant: High-performance vector search
  • Pinecone: Managed vector database
  • Custom: Implement the vector store interface
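For the Custom option, the store only needs to expose a small query surface. The adapter's actual interface is defined in its source; as a hypothetical sketch (names and signatures here are illustrative, not the adapter's real API), something shaped like this is enough:

```python
from typing import Protocol

class VectorStore(Protocol):
    """Hypothetical minimal interface a custom store could satisfy."""
    def query(self, text: str, n_results: int) -> list[str]: ...

class InMemoryStore:
    """Toy implementation: ranks documents by shared-word count."""
    def __init__(self, documents: list[str]):
        self.documents = documents

    def query(self, text: str, n_results: int) -> list[str]:
        words = set(text.lower().split())
        scored = sorted(
            self.documents,
            key=lambda doc: len(words & set(doc.lower().split())),
            reverse=True,
        )
        return scored[:n_results]

store: VectorStore = InMemoryStore(["GEPA evolves prompts", "Cats sleep a lot"])
```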

Use Case 1: Multi-Hop Question Answering

Problem: HotpotQA requires retrieving information across multiple documents to answer complex questions.

Example: Second-Hop Query Generation

Given:
  • Original question: “What is the population of the archipelago containing Arco da Calheta?”
  • First-hop summary: “Arco da Calheta is a civil parish in Madeira with population 3,226 in 2011”
Naive approach:
  • Just use the original question → retrieves more documents about Arco da Calheta
GEPA-optimized approach:
  • Generate second-hop query: “Madeira archipelago population in 2011”
  • This retrieves documents about the wider region, not just the parish

Evolved Strategy (Excerpt)

Your task is to generate a new search query optimized for the **second hop** 
of a multi-hop retrieval system.

Key Observations from Examples and Feedback:

- First-hop documents often cover one entity or aspect in the question
- Remaining relevant documents often involve connected or higher-level concepts 
  mentioned in summary_1 but not explicitly asked in the original question
- The query should be formulated to explicitly target these *missing*, but 
  logically linked, documents

How to Build the Query:

1. Identify the entities or topics mentioned in summary_1 that appear related 
   but different from first-hop documents
2. Reframe the query to explicitly mention these broader or related entities 
   connected to the original question
3. Include relevant key context from the question to maintain specificity, but 
   shift focus to the missing piece
4. The goal is to retrieve documents that link or complement what was retrieved initially

[...]
See full prompt in README.md:202-247.

Implementation

import dspy

class MultiHopRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
        self.generate_query = dspy.ChainOfThought(
            "question, summary_1 -> query"  # Second-hop query
        )
        self.answer = dspy.ChainOfThought(
            "question, context -> answer"
        )
    
    def forward(self, question):
        # First hop: use original question
        passages_1 = self.retrieve(question).passages
        summary_1 = summarize(passages_1)  # summarize: your passage-summarization helper
        
        # Second hop: optimized query generation
        second_hop_query = self.generate_query(
            question=question,
            summary_1=summary_1
        ).query
        
        passages_2 = self.retrieve(second_hop_query).passages
        
        # Combine and answer
        all_context = passages_1 + passages_2
        return self.answer(question=question, context=all_context)

# Optimize with GEPA
optimizer = dspy.GEPA(metric=f1_score, max_metric_calls=150)
optimized_rag = optimizer.compile(
    student=MultiHopRAG(),
    trainset=hotpotqa_train,
    valset=hotpotqa_val,
)

Use Case 2: Healthcare Multi-Agent RAG

Problem: General medical RAG systems struggle with specialized disease knowledge. Solution: GEPA discovers a multi-agent architecture with disease-specific experts.

Architecture

class HealthcareRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Specialized sub-agents
        self.diabetes_expert = DiabetesExpert()
        self.copd_expert = COPDExpert()
        self.lead_agent = LeadAgent()
    
    def forward(self, question):
        # Classify query
        disease = self.lead_agent.classify(question)
        
        # Route to expert
        if disease == "diabetes":
            expert = self.diabetes_expert
        elif disease == "copd":
            expert = self.copd_expert
        else:
            expert = self.lead_agent  # General fallback
        
        # Expert does specialized retrieval
        context = expert.retrieve(question)
        answer = expert.reason(question, context)
        
        # Lead agent validates and synthesizes
        return self.lead_agent.synthesize(question, answer, disease)


class DiabetesExpert(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retriever = dspy.Retrieve(k=5)
        self.reason = dspy.ChainOfThought(
            "question, context -> answer"
        )
    
    def retrieve(self, question):
        # Diabetes-specific retrieval strategy (optimized by GEPA);
        # enhance_query_for_diabetes is a domain-specific query rewriter
        enhanced_query = self.enhance_query_for_diabetes(question)
        return self.retriever(enhanced_query).passages

Results

  • Improved retrieval precision for disease-specific queries
  • Better answer quality through specialized reasoning
  • Graceful fallback for general medical questions
Read the full case study →

Use Case 3: Reranking Optimization

Reranking retrieved documents is crucial for RAG quality. GEPA optimizes reranking prompts.

Weaviate Tutorial

Official tutorial from Weaviate on optimizing listwise rerankers:
import dspy
from dspy.retrieve.weaviate_rm import WeaviateRM  # configure via dspy.settings.configure(rm=...)

class RerankedRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=20)  # Over-retrieve
        self.rerank = dspy.ChainOfThought(
            "question, passages -> ranked_passages"
        )
        self.answer = dspy.ChainOfThought(
            "question, context -> answer"
        )
    
    def forward(self, question):
        # Retrieve many candidates
        passages = self.retrieve(question).passages
        
        # Rerank to top-5
        ranked = self.rerank(
            question=question,
            passages=passages
        ).ranked_passages[:5]
        
        return self.answer(question=question, context=ranked)

# Optimize reranking strategy
optimizer = dspy.GEPA(metric=answer_f1, max_metric_calls=100)
optimized_rag = optimizer.compile(
    student=RerankedRAG(),
    trainset=train_data,
    valset=val_data,
)
Watch the video tutorial → | View the notebook →

optimize_anything for RAG

For more control, use optimize_anything directly:
import gepa.optimize_anything as oa
from chromadb import Client

# Assumes `llm` (a text-generation client) and `vector_store` (a vector-store
# client returning Chroma-style results) are initialized elsewhere

def evaluate_rag(candidate: dict, example: dict) -> tuple[float, dict]:
    """
    Evaluate RAG system with custom retrieval and generation prompts.
    
    candidate: {
        "query_prompt": "...",
        "answer_prompt": "..."
    }
    """
    # Reformulate query using optimized prompt
    query = llm.generate(
        candidate["query_prompt"].format(question=example["question"])
    )
    
    # Retrieve
    results = vector_store.query(query, n_results=5)
    context = "\n\n".join(results["documents"][0])
    
    # Generate answer using optimized prompt
    answer = llm.generate(
        candidate["answer_prompt"].format(
            context=context,
            question=example["question"]
        )
    )
    
    # Score
    f1 = compute_f1(answer, example["answer"])
    
    return f1, {
        "ReformulatedQuery": query,
        "RetrievedDocs": len(results["documents"][0]),
        "GeneratedAnswer": answer,
        "GoldAnswer": example["answer"],
    }

# Initial prompts
seed_prompts = {
    "query_prompt": "Reformulate this question for search: {question}",
    "answer_prompt": "Context: {context}\n\nQuestion: {question}\nAnswer:",
}

result = oa.optimize_anything(
    seed_candidate=seed_prompts,
    evaluator=evaluate_rag,
    dataset=train_questions,
    valset=val_questions,
    objective="Optimize RAG prompts for accuracy and relevance.",
    config=oa.GEPAConfig(
        engine=oa.EngineConfig(max_metric_calls=150),
        reflection=oa.ReflectionConfig(reflection_lm="openai/gpt-5"),
    ),
)

Common RAG Failure Modes and GEPA Solutions

Problem: User queries don’t match document phrasing.
GEPA Solution: Learns query reformulation strategies that bridge the vocabulary gap.
Example: “How do I fix X?” → “X troubleshooting guide error solutions”

Problem: Top-k retrieval returns off-topic documents.
GEPA Solution: Optimizes both the retrieval prompt and reranking logic to surface relevant content.

Problem: Too many documents overwhelm the LLM’s context window.
GEPA Solution: Discovers strategies for context synthesis and compression.

Problem: Complex questions require multiple retrieval steps.
GEPA Solution: Evolves iterative retrieval strategies that know when to do second or third hops.

Problem: General retrieval strategies fail on specialized domains.
GEPA Solution: Learns domain-specific retrieval and reasoning patterns (see Healthcare RAG example).
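For the vocabulary-gap failure mode, the evolved fix is typically a query-reformulation prompt. A hypothetical template of that shape (illustrative only, not a prompt GEPA actually produced):

```python
# Hypothetical reformulation template; a real one would be evolved by GEPA
QUERY_REFORMULATION_PROMPT = """\
Rewrite the user's question as a keyword-dense search query that matches
how documentation is phrased. Drop conversational filler; keep entities,
product names, and error terms.

Question: {question}
Search query:"""

# The filled template is what the task LM would receive
prompt = QUERY_REFORMULATION_PROMPT.format(question="How do I fix X?")
```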

Best Practices

  • Start simple: Even a simple retrieve-then-generate pipeline is enough. GEPA will evolve sophistication.
  • Pick a reliable metric: F1 score, exact match, or ROUGE for QA tasks. For open-ended generation, use LLM-as-judge.
  • Optimize components jointly: Retrieve + rerank + generate should be optimized together so they co-adapt.
  • Expose retrieval in feedback: Return the actual retrieved documents in side info so GEPA can see what was retrieved.
  • Hold out validation data: RAG systems must generalize to unseen questions. Always use a valset.
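The `compute_f1` used in the metric examples on this page is not defined here; a standard token-overlap (SQuAD-style) version looks like this, assuming whitespace tokenization:

```python
from collections import Counter

def compute_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Production metrics usually also normalize punctuation and articles before comparing; this sketch omits that step.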

Integration Examples

LangChain RAG

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
import gepa.optimize_anything as oa

def evaluate_langchain_rag(candidate: dict, example: dict) -> tuple[float, dict]:
    # Create chain with optimized prompt
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        chain_type_kwargs={"prompt": candidate["qa_prompt"]}
    )
    
    answer = qa_chain.run(example["question"])
    score = compute_score(answer, example["answer"])
    
    return score, {"Answer": answer}

result = oa.optimize_anything(
    seed_candidate={"qa_prompt": default_prompt},
    evaluator=evaluate_langchain_rag,
    dataset=train_data,
    valset=val_data,
)

LlamaIndex RAG

from llama_index import VectorStoreIndex, ServiceContext
import gepa.optimize_anything as oa

def evaluate_llamaindex_rag(candidate: dict, example: dict) -> tuple[float, dict]:
    # Build index with optimized query prompt
    service_context = ServiceContext.from_defaults(
        llm=llm,
        system_prompt=candidate["system_prompt"]
    )
    
    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context
    )
    
    query_engine = index.as_query_engine()
    response = query_engine.query(example["question"])
    
    score = compute_score(response.response, example["answer"])
    
    return score, {"Response": response.response}

result = oa.optimize_anything(
    seed_candidate={"system_prompt": "You are a helpful assistant."},
    evaluator=evaluate_llamaindex_rag,
    dataset=train_data,
    valset=val_data,
)

Production Deployments

OCR Document Understanding

Intrinsic Labs achieved up to 38% OCR error reduction using GEPA-optimized prompts for document extraction:
  • Gemini 2.5 Pro
  • Gemini 2.5 Flash
  • Gemini 2.0 Flash
Read the research paper →

Auto-Analyst Platform

FireBird Technologies optimized their Auto-Analyst platform with GEPA:
  • 4 specialized agents: Pre-processing, Statistical Analytics, ML, Visualization
  • Optimized 4 primary signatures covering 90% of code runs
  • Tested across multiple model providers to avoid overfitting
Read the case study →

Metrics for RAG Evaluation

Answer Correctness

F1, Exact Match, ROUGE-L for QA tasks

Retrieval Quality

Precision@k, Recall@k, MRR for retrieved documents

Relevance

LLM-as-judge scoring answer relevance and faithfulness

Latency

End-to-end response time including retrieval and generation
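The retrieval-quality metrics above have short reference implementations. A sketch, treating each retrieved item as a document ID and `relevant` as the gold set:

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def mrr(retrieved: list, relevant: set) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```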

Example Metric Implementation

def rag_metric(example, prediction, trace=None) -> float:
    """
    Composite RAG metric combining correctness and relevance.
    """
    # Answer correctness (F1)
    f1 = compute_f1(prediction.answer, example.answer)
    
    # Retrieval precision (are retrieved docs relevant?)
    precision = evaluate_retrieval_precision(
        retrieved=prediction.retrieved_docs,
        gold_docs=example.relevant_docs
    )
    
    # Answer faithfulness (is answer grounded in context?)
    faithfulness = llm_judge(
        context=prediction.context,
        answer=prediction.answer
    )
    
    # Weighted combination
    return 0.5 * f1 + 0.25 * precision + 0.25 * faithfulness

Next Steps

DSPy + GEPA Tutorial

Complete RAG optimization walkthrough

Generic RAG Adapter

Use GEPA with any vector store

Prompt Optimization

Optimize individual prompts

Agent Architecture

Discover multi-agent architectures
