RAG Pipeline Optimization Tutorial

Learn how to optimize Retrieval-Augmented Generation (RAG) systems using GEPA’s Generic RAG Adapter. This tutorial covers optimization across multiple vector stores including ChromaDB, Weaviate, Qdrant, Milvus, and LanceDB.

Overview

The Generic RAG Adapter enables you to:
  • Optimize query reformulation, context synthesis, and answer generation prompts
  • Switch between vector stores with a single flag
  • Evaluate both retrieval quality (precision, recall, MRR) and generation quality (F1, BLEU, faithfulness)
  • Deploy optimized prompts in production

Step 1: Install Dependencies

Install GEPA and vector store dependencies:
# Install GEPA core
pip install gepa

# Install all vector store dependencies (recommended)
pip install chromadb lancedb pyarrow pymilvus qdrant-client weaviate-client litellm

# Or install specific vector stores:
pip install litellm chromadb                    # ChromaDB (local, no Docker)
pip install litellm lancedb pyarrow             # LanceDB (serverless)
pip install litellm pymilvus                    # Milvus (local Lite mode)
pip install litellm qdrant-client               # Qdrant (in-memory)
pip install litellm weaviate-client             # Weaviate (requires Docker)
For local models:
# Install and pull Ollama models
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:8b
ollama pull llama3.1:8b
ollama pull nomic-embed-text:latest

Step 2: Set Up the Vector Store

Choose and initialize a vector store. ChromaDB is easiest to start with:
from gepa.adapters.generic_rag_adapter import ChromaVectorStore
import tempfile

# Create ChromaDB instance (no Docker required)
temp_dir = tempfile.mkdtemp()
vector_store = ChromaVectorStore.create_local(
    persist_directory=temp_dir,
    collection_name="ai_ml_knowledge"
)

print(f"Created ChromaDB in {temp_dir}")
Other vector store options:
# LanceDB (serverless, no Docker)
from gepa.adapters.generic_rag_adapter import LanceDBVectorStore
vector_store = LanceDBVectorStore.create_local("./lancedb", "documents")

# Qdrant (in-memory, no Docker)
from gepa.adapters.generic_rag_adapter import QdrantVectorStore
vector_store = QdrantVectorStore.create_memory("documents")

# Milvus (local Lite mode, no Docker)
from gepa.adapters.generic_rag_adapter import MilvusVectorStore
vector_store = MilvusVectorStore.create_local("documents")

# Weaviate (requires Docker running)
from gepa.adapters.generic_rag_adapter import WeaviateVectorStore
vector_store = WeaviateVectorStore.create_local(
    host="localhost", port=8080, collection_name="Documents"
)

Step 3: Add Knowledge Base Documents

Populate your vector store with documents:
documents = [
    {
        "content": "Machine Learning is a subset of artificial intelligence that "
                   "enables computers to learn and improve from experience without "
                   "being explicitly programmed.",
        "metadata": {
            "doc_id": "ml_basics",
            "topic": "machine_learning",
            "difficulty": "beginner"
        }
    },
    {
        "content": "Deep Learning is a subset of machine learning based on artificial "
                   "neural networks with representation learning. It can learn from data "
                   "that is unstructured or unlabeled.",
        "metadata": {
            "doc_id": "dl_basics",
            "topic": "deep_learning",
            "difficulty": "intermediate"
        }
    },
    {
        "content": "Natural Language Processing (NLP) is a branch of artificial "
                   "intelligence that helps computers understand, interpret and "
                   "manipulate human language.",
        "metadata": {
            "doc_id": "nlp_basics",
            "topic": "nlp",
            "difficulty": "intermediate"
        }
    },
    # Add more documents...
]

# Add to ChromaDB
vector_store.collection.add(
    documents=[doc["content"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
    ids=[doc["metadata"]["doc_id"] for doc in documents],
)

print(f"Added {len(documents)} documents to vector store")

Step 4: Create Training Data

Define training and validation examples:
from gepa.adapters.generic_rag_adapter import RAGDataInst

train_data = [
    RAGDataInst(
        query="What is machine learning?",
        ground_truth_answer="Machine Learning is a method of data analysis that "
                           "automates analytical model building. It is a branch of AI "
                           "based on the idea that systems can learn from data.",
        relevant_doc_ids=["ml_basics"],
        metadata={"category": "definition", "difficulty": "beginner"}
    ),
    RAGDataInst(
        query="How does deep learning work?",
        ground_truth_answer="Deep Learning is a subset of machine learning based on "
                           "artificial neural networks with representation learning.",
        relevant_doc_ids=["dl_basics"],
        metadata={"category": "explanation", "difficulty": "intermediate"}
    ),
    # Add more training examples...
]

val_data = [
    RAGDataInst(
        query="What is natural language processing?",
        ground_truth_answer="NLP is a branch of AI that helps computers understand, "
                           "interpret and manipulate human language.",
        relevant_doc_ids=["nlp_basics"],
        metadata={"category": "definition", "difficulty": "intermediate"}
    ),
    # Add more validation examples...
]

print(f"Training examples: {len(train_data)}")
print(f"Validation examples: {len(val_data)}")

Step 5: Create Initial Prompts

Define baseline prompts to optimize:
initial_prompts = {
    "answer_generation": """You are an AI expert providing accurate technical explanations.

Based on the retrieved context, provide a clear and informative answer to the user's question.

Guidelines:
- Use information from the provided context
- Be accurate and concise
- Include key technical details
- Structure your response clearly

Context: {context}

Question: {query}

Answer:"""
}
GEPA will evolve this into task-specific, highly effective prompts.

Step 6: Set Up the RAG Adapter

Create the GenericRAGAdapter with your configuration:
from gepa.adapters.generic_rag_adapter import GenericRAGAdapter
import litellm

# Create LLM client
def create_llm_client(model_name):
    litellm.drop_params = True
    litellm.set_verbose = False
    
    def llm_client(messages_or_prompt, **kwargs):
        if isinstance(messages_or_prompt, str):
            messages = [{"role": "user", "content": messages_or_prompt}]
        else:
            messages = messages_or_prompt
        
        response = litellm.completion(
            model=model_name,
            messages=messages,
            max_tokens=kwargs.get("max_tokens", 400),
            temperature=kwargs.get("temperature", 0.1),
        )
        return response.choices[0].message.content.strip()
    
    return llm_client

llm_client = create_llm_client("ollama/qwen3:8b")  # Local model
# llm_client = create_llm_client("gpt-4o-mini")    # Or cloud model

# Configure RAG pipeline
rag_config = {
    "retrieval_strategy": "similarity",  # "similarity", "hybrid", "vector"
    "top_k": 3,                           # Number of documents to retrieve
    "retrieval_weight": 0.3,              # Weight for retrieval metrics
    "generation_weight": 0.7,             # Weight for generation metrics
}

# Create adapter
rag_adapter = GenericRAGAdapter(
    vector_store=vector_store,
    llm_model=llm_client,
    embedding_model="ollama/nomic-embed-text:latest",
    rag_config=rag_config,
)

print("RAG adapter initialized")

Step 7: Test Initial Performance

Evaluate baseline prompts before optimization:
eval_result = rag_adapter.evaluate(
    batch=val_data[:1],
    candidate=initial_prompts,
    capture_traces=True
)

initial_score = eval_result.scores[0]
print(f"Initial validation score: {initial_score:.3f}")
print(f"Sample answer: {eval_result.outputs[0]['final_answer'][:200]}...")

Step 8: Run GEPA Optimization

Optimize RAG prompts using GEPA:
import gepa

result = gepa.optimize(
    seed_candidate=initial_prompts,
    trainset=train_data,
    valset=val_data,
    adapter=rag_adapter,
    reflection_lm=llm_client,  # Can use different model for reflection
    max_metric_calls=10,        # Start small, increase for production
)

best_score = result.val_aggregate_scores[result.best_idx]
print(f"\nOptimization complete!")
print(f"Best validation score: {best_score:.3f}")
print(f"Improvement: {best_score - initial_score:+.3f}")
print(f"Total metric calls: {result.total_metric_calls}")
Typical improvements:
  • Initial score: 0.35-0.50
  • Optimized score: 0.60-0.80
  • Improvement: +0.1 to +0.4 points

Step 9: Review Optimized Prompts

Examine the optimized prompts:
print("\nOptimized Answer Generation Prompt:")
print("=" * 60)
print(result.best_candidate["answer_generation"])

# Test optimized prompts
optimized_result = rag_adapter.evaluate(
    batch=val_data[:1],
    candidate=result.best_candidate,
    capture_traces=False
)

print(f"\nOptimized answer: {optimized_result.outputs[0]['final_answer']}")

Complete Working Example

Use the unified RAG optimization script:
# Navigate to examples directory
cd src/gepa/examples/rag_adapter

# ChromaDB (default, no Docker required)
python rag_optimization.py --vector-store chromadb --max-iterations 10

# LanceDB (serverless, no Docker)
python rag_optimization.py --vector-store lancedb --max-iterations 10

# Qdrant (in-memory, no Docker)
python rag_optimization.py --vector-store qdrant --max-iterations 10

# Milvus (local Lite mode, no Docker)
python rag_optimization.py --vector-store milvus --max-iterations 10

# Weaviate (requires Docker running)
python rag_optimization.py --vector-store weaviate --max-iterations 10

# With cloud models
python rag_optimization.py --vector-store chromadb --model gpt-4o-mini --max-iterations 20

Vector Store Comparison

ChromaDB

Best for: Local development, prototyping
  • ✅ No Docker required
  • ✅ Simple setup
  • ✅ Local persistent storage
  • Use: --vector-store chromadb

LanceDB

Best for: Serverless deployments
  • ✅ No Docker required
  • ✅ Developer-friendly
  • ✅ Columnar format performance
  • Use: --vector-store lancedb

Qdrant

Best for: High performance, filtering
  • ✅ No Docker (in-memory mode)
  • ✅ Advanced metadata filtering
  • ✅ Payload search
  • Use: --vector-store qdrant

Weaviate

Best for: Production, hybrid search
  • ⚠️ Requires Docker
  • ✅ Hybrid semantic + keyword search
  • ✅ Production-ready clustering
  • Use: --vector-store weaviate

Evaluation Metrics

The RAG adapter tracks comprehensive metrics:

Retrieval Quality

  • Precision: Fraction of retrieved documents that are relevant
  • Recall: Fraction of relevant documents that were retrieved
  • F1 Score: Harmonic mean of precision and recall
  • MRR: Mean Reciprocal Rank for ranking quality
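These are the standard definitions. As a quick illustration (not the adapter's internal code), they can be computed from the ranked list of retrieved document IDs and the set of relevant IDs:

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Compute precision, recall, F1, and MRR for one query."""
    hits = [doc_id for doc_id in retrieved if doc_id in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # MRR: reciprocal rank of the first relevant document (0 if none found)
    mrr = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            mrr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "f1": f1, "mrr": mrr}

# One relevant doc, retrieved at rank 2 out of 3:
metrics = retrieval_metrics(["dl_basics", "ml_basics", "nlp_basics"], {"ml_basics"})
# precision = 1/3, recall = 1.0, f1 = 0.5, mrr = 0.5
```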

Generation Quality

  • Token F1: Token overlap with ground truth
  • BLEU Score: N-gram similarity measure
  • Answer Relevance: How well answer relates to context
  • Faithfulness: How well answer is supported by context
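Token F1 can be sketched in a few lines (a simplified illustration; the adapter's exact implementation may tokenize and normalize differently):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Overlap counted with multiplicity
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1(
    "NLP helps computers understand language",
    "NLP is a branch of AI that helps computers understand language",
)
# precision = 1.0, recall = 5/11, F1 = 0.625
```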

Combined Score

final_score = (retrieval_weight × retrieval_f1) + (generation_weight × generation_score)
Configure weights in rag_config:
rag_config = {
    "retrieval_weight": 0.3,  # Emphasis on retrieval
    "generation_weight": 0.7,  # Emphasis on generation
}
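The weighted combination itself is plain arithmetic, for example:

```python
def combined_score(retrieval_f1: float, generation_score: float,
                   rag_config: dict) -> float:
    """Weighted sum of retrieval and generation quality."""
    return (rag_config["retrieval_weight"] * retrieval_f1
            + rag_config["generation_weight"] * generation_score)

score = combined_score(0.5, 0.8,
                       {"retrieval_weight": 0.3, "generation_weight": 0.7})
# 0.3 * 0.5 + 0.7 * 0.8 = 0.71
```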

Optimizable Components

You can optimize multiple RAG components simultaneously:
initial_prompts = {
    # Query reformulation
    "query_reformulation": """Enhance the user query for better retrieval:
    - Add relevant technical terms
    - Make query more specific
    - Preserve original intent
    
    Query: {query}""",
    
    # Context synthesis
    "context_synthesis": """Synthesize retrieved documents into coherent context:
    - Focus on query-relevant information
    - Integrate information from multiple sources
    - Remove redundant content
    
    Documents: {documents}""",
    
    # Answer generation (shown earlier)
    "answer_generation": "...",
    
    # Document reranking
    "reranking_criteria": """Rank documents by relevance:
    - Direct answers get highest priority
    - Comprehensive explanations rank second
    - Supporting examples rank third
    
    Query: {query}
    Documents: {documents}""",
}
GEPA will optimize all components together for maximum effectiveness.

Advanced Configuration

Hybrid Search (Weaviate)

rag_config = {
    "retrieval_strategy": "hybrid",
    "hybrid_alpha": 0.7,  # 0.0=keyword, 1.0=semantic, 0.5=balanced
    "top_k": 5,
}

Metadata Filtering

rag_config = {
    "retrieval_strategy": "similarity",
    "top_k": 3,
    "filters": {"difficulty": "beginner"}  # Filter by metadata
}

Production Deployment

import os

import weaviate

def create_production_adapter(env: str):
    if env == "development":
        # Local ChromaDB with a small local model
        vector_store = ChromaVectorStore.create_local("./local_kb", "docs")
        config = {"retrieval_strategy": "similarity", "top_k": 3}
        llm_model = "ollama/llama3.2:1b"
    elif env == "production":
        # Managed Weaviate cluster with hybrid search and a stronger model
        vector_store = WeaviateVectorStore.create_cloud(
            cluster_url=os.getenv("WEAVIATE_URL"),
            auth_credentials=weaviate.AuthApiKey(os.getenv("WEAVIATE_KEY")),
            collection_name="ProductionKB"
        )
        config = {
            "retrieval_strategy": "hybrid",
            "hybrid_alpha": 0.75,
            "top_k": 5,
        }
        llm_model = "gpt-4o"
    else:
        raise ValueError(f"Unknown environment: {env}")

    return GenericRAGAdapter(
        vector_store=vector_store,
        llm_model=llm_model,
        rag_config=config
    )

Troubleshooting

ChromaDB: No external dependencies; it should work out of the box.

Weaviate: Ensure Docker is running:
docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.26.1
curl http://localhost:8080/v1/meta

Qdrant: In-memory mode requires no setup; for server mode:
docker run -p 6333:6333 qdrant/qdrant

If retrieval scores are low:
  • Increase top_k to retrieve more documents
  • Check that relevant_doc_ids in training data are correct
  • Ensure documents are properly indexed in the vector store
  • Try different retrieval strategies (similarity vs. hybrid)

If generation scores are low:
  • Verify ground truth answers are high quality
  • Increase generation_weight in config
  • Use a stronger LLM for generation
  • Add more diverse training examples

Next Steps

RAG Adapter API

Complete API reference for the Generic RAG Adapter

Vector Store Guide

Detailed setup for all supported vector stores

Agent Architecture

Optimize entire agent systems beyond RAG

Production Examples

Real-world RAG deployments using GEPA
