GEPA provides seamless integration with DSPy through the DspyAdapter, enabling you to optimize instructions and prompts in your DSPy programs using LLM-guided evolution.

Overview

The DspyAdapter allows GEPA to:
  • Optimize DSPy predictor instructions
  • Provide per-predictor feedback for targeted improvements
  • Support tool-using modules (ReAct)
  • Capture DSPy traces for reflection
  • Leverage DSPy’s evaluation framework

Quick Start

import dspy
from gepa.adapters.dspy_adapter import DspyAdapter
from gepa import optimize

# 1. Define your DSPy program
class MathSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought("question -> answer")
    
    def forward(self, question):
        return self.solve(question=question)

# 2. Create metric and feedback functions
def metric(example, prediction, trace=None):
    return 1.0 if example.answer in prediction.answer else 0.0

def provide_feedback(predictor_output, predictor_inputs, 
                     module_inputs, module_outputs, captured_trace):
    correct = module_inputs.answer in predictor_output.get("answer", "")
    if correct:
        return {"score": 1.0, "feedback": "Correct answer!"}
    else:
        return {
            "score": 0.0,
            "feedback": f"Wrong. Expected: {module_inputs.answer}"
        }

# 3. Setup adapter
student = MathSolver()
feedback_map = {"solve": provide_feedback}

adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
)

# 4. Optimize
result = optimize(
    trainset=train_examples,
    valset=val_examples,
    adapter=adapter,
    reflection_lm=dspy.LM("openai/gpt-4o"),
    max_metric_calls=50,
)

# 5. Use optimized program
optimized_program = adapter.build_program(result.best_candidate)

DspyAdapter API

Constructor

DspyAdapter(
    student_module: dspy.Module,
    metric_fn: Callable,
    feedback_map: dict[str, Callable],
    failure_score: float = 0.0,
    num_threads: int | None = None,
    add_format_failure_as_feedback: bool = False,
    rng: random.Random | None = None,
    reflection_lm: dspy.LM | None = None,
    custom_instruction_proposer: ProposalFn | None = None,
    warn_on_score_mismatch: bool = True,
    enable_tool_optimization: bool = False,
    reflection_minibatch_size: int | None = None,
)

Parameters

1. student_module

The DSPy program to optimize.
class QASystem(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

student = QASystem()
2. metric_fn

Program-level metric function. Must return a numeric score (higher is better), or a dict with a "score" key for multi-objective metrics.
def metric(example, prediction, trace=None):
    """Simple exact match metric."""
    return 1.0 if example.answer == prediction.answer else 0.0

# Multi-objective metric
def advanced_metric(example, prediction, trace=None):
    exact = 1.0 if example.answer == prediction.answer else 0.0
    contains = 1.0 if example.answer in prediction.answer else 0.0
    
    return {
        "score": exact,
        "subscores": {
            "exact_match": exact,
            "contains": contains,
        }
    }
3. feedback_map

Mapping from predictor names to feedback functions. Each feedback function provides per-predictor diagnostic information.
def generate_feedback(predictor_output, predictor_inputs,
                      module_inputs, module_outputs, captured_trace):
    """
    Args:
        predictor_output: Output of this specific predictor
        predictor_inputs: Inputs to this specific predictor
        module_inputs: Original program inputs (Example)
        module_outputs: Final program outputs (Prediction)
        captured_trace: Full execution trace
    
    Returns:
        dict with 'score' and 'feedback' keys
    """
    is_correct = module_inputs.answer in predictor_output.get("answer", "")
    
    if is_correct:
        return {
            "score": 1.0,
            "feedback": "Generated correct answer."
        }
    else:
        return {
            "score": 0.0,
            "feedback": f"Expected '{module_inputs.answer}' but got '{predictor_output.get('answer', '')}'."
        }

feedback_map = {
    "generate": generate_feedback,
    # Add feedback for each predictor you want to optimize
}
4. Additional Parameters

  • failure_score: Default score for failed predictions (default: 0.0)
  • num_threads: Parallel evaluation threads for DSPy’s Evaluate
  • add_format_failure_as_feedback: Include parsing failures in feedback
  • reflection_lm: LM for proposing new instructions (defaults to dspy.settings.lm)
  • enable_tool_optimization: Enable optimization of tool descriptions in ReAct modules
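The optional parameters above can be combined as in this sketch; the values shown are illustrative, and `student`, `metric`, and `feedback_map` are assumed to be defined as in the Quick Start.

```python
import random

import dspy
from gepa.adapters.dspy_adapter import DspyAdapter

adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
    failure_score=0.0,                     # score assigned to failed predictions
    num_threads=8,                         # parallel threads for DSPy's Evaluate
    add_format_failure_as_feedback=True,   # surface parsing failures to the reflection LM
    rng=random.Random(0),                  # seed for reproducible optimization runs
    reflection_lm=dspy.LM("openai/gpt-4o"),  # LM used to propose new instructions
)
```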

Feedback Functions

Feedback functions are the key to effective DSPy optimization. They provide per-predictor guidance to the reflection LM.

Basic Feedback

def basic_feedback(predictor_output, predictor_inputs,
                   module_inputs, module_outputs, captured_trace):
    correct = module_inputs.expected in predictor_output.get("answer", "")
    
    return {
        "score": 1.0 if correct else 0.0,
        "feedback": "Correct!" if correct else f"Expected: {module_inputs.expected}"
    }

Detailed Feedback

def detailed_feedback(predictor_output, predictor_inputs,
                      module_inputs, module_outputs, captured_trace):
    answer = predictor_output.get("answer", "")
    expected = module_inputs.answer
    
    if expected in answer:
        feedback = f"✓ Correct answer found: {expected}"
        score = 1.0
    else:
        feedback = f"✗ Wrong answer.\nExpected: {expected}\nGot: {answer}"
        score = 0.0
    
    # Add reasoning quality assessment
    reasoning = predictor_output.get("reasoning", "")
    if len(reasoning) < 50:
        feedback += "\nReasoning is too brief. Provide more detailed steps."
    
    return {"score": score, "feedback": feedback}

Context-Aware Feedback

def context_aware_feedback(predictor_output, predictor_inputs,
                           module_inputs, module_outputs, captured_trace):
    # Access the full trace to understand context
    all_steps = [(p.signature, inputs, outputs) 
                 for p, inputs, outputs in captured_trace]
    
    answer = predictor_output.get("answer", "")
    expected = module_inputs.answer
    
    # Check if retrieval provided relevant context
    context_str = str(predictor_inputs.get("context", ""))
    has_relevant_context = expected in context_str
    
    if expected in answer:
        feedback = "Correct answer."
        score = 1.0
    else:
        if not has_relevant_context:
            feedback = f"The retrieved context didn't contain the answer '{expected}'. Consider using different keywords."
        else:
            feedback = f"The answer '{expected}' was in the context but not extracted correctly."
        score = 0.0
    
    return {"score": score, "feedback": feedback}

Complete Example

Multi-Hop Question Answering

import dspy
from gepa.adapters.dspy_adapter import DspyAdapter
from gepa import optimize

# Configure DSPy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Define program
class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate_query = dspy.ChainOfThought(
            "question -> search_query"
        )
        self.answer = dspy.ChainOfThought(
            "context, question -> answer"
        )
    
    def forward(self, question):
        # First hop: generate search query
        query_pred = self.generate_query(question=question)
        
        # Retrieve context
        context = self.retrieve(query_pred.search_query).passages
        
        # Second hop: answer based on context
        return self.answer(context=context, question=question)

# Metric function
def multihop_metric(example, prediction, trace=None):
    if example.answer.lower() in prediction.answer.lower():
        return 1.0
    return 0.0

# Feedback functions for each predictor
def query_feedback(predictor_output, predictor_inputs,
                   module_inputs, module_outputs, captured_trace):
    query = predictor_output.get("search_query", "")
    
    # Check if final answer was correct
    final_correct = module_inputs.answer.lower() in module_outputs.answer.lower()
    
    if final_correct:
        return {
            "score": 1.0,
            "feedback": f"Good query: '{query}' led to correct answer."
        }
    else:
        return {
            "score": 0.0,
            "feedback": f"Query '{query}' didn't retrieve relevant info for answer: {module_inputs.answer}"
        }

def answer_feedback(predictor_output, predictor_inputs,
                    module_inputs, module_outputs, captured_trace):
    answer = predictor_output.get("answer", "")
    expected = module_inputs.answer
    context = str(predictor_inputs.get("context", ""))
    
    if expected.lower() in answer.lower():
        return {"score": 1.0, "feedback": "Correct answer extracted."}
    elif expected.lower() in context.lower():
        return {
            "score": 0.0,
            "feedback": f"Answer '{expected}' was in context but not extracted."
        }
    else:
        return {
            "score": 0.0,
            "feedback": f"Answer '{expected}' not in retrieved context."
        }

# Create dataset
train_examples = [
    dspy.Example(
        question="What is the capital of France?",
        answer="Paris"
    ).with_inputs("question"),
    # ... more examples
]

val_examples = [
    dspy.Example(
        question="Who wrote Romeo and Juliet?",
        answer="Shakespeare"
    ).with_inputs("question"),
    # ... more examples
]

# Setup and optimize
student = MultiHopQA()

adapter = DspyAdapter(
    student_module=student,
    metric_fn=multihop_metric,
    feedback_map={
        "generate_query": query_feedback,
        "answer": answer_feedback,
    },
    num_threads=4,
)

result = optimize(
    trainset=train_examples,
    valset=val_examples,
    adapter=adapter,
    reflection_lm=dspy.LM("openai/gpt-4o"),
    max_metric_calls=100,
)

# Get optimized program
optimized = adapter.build_program(result.best_candidate)

# Test it
test_question = "What is machine learning?"
prediction = optimized(question=test_question)
print(prediction.answer)

Tool Optimization (ReAct)

GEPA can optimize tool descriptions in ReAct modules:
adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
    enable_tool_optimization=True,  # Enable tool optimization
)
Tool optimization improves:
  • Tool descriptions
  • Argument descriptions
  • When to use each tool
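In DSPy, tools are typically plain Python callables whose names and docstrings supply the tool name and description, which is exactly what `enable_tool_optimization=True` targets for rewriting. The snippet below is a minimal sketch of two such tools; the function bodies are hypothetical stand-ins, and the `dspy.ReAct` wiring shown in the final comment follows common DSPy usage.

```python
def search_wikipedia(query: str) -> str:
    """Search Wikipedia and return the top matching passage."""
    # Hypothetical stand-in for a real retrieval backend.
    return f"Top passage for: {query}"

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result."""
    # Restricted eval for illustration only; use a real parser in production.
    return str(eval(expression, {"__builtins__": {}}, {}))

# In a real program these would be passed to a ReAct module, e.g.:
#   agent = dspy.ReAct("question -> answer", tools=[search_wikipedia, calculate])
```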

Custom Instruction Proposers

You can provide custom logic for proposing new instructions:
from gepa.core.adapter import ProposalFn

def custom_proposer(
    candidate: dict[str, str],
    reflective_dataset: dict[str, list[dict]],
    components_to_update: list[str]
) -> dict[str, str]:
    """
    Custom logic to propose improved instructions.
    
    Args:
        candidate: Current instruction values
        reflective_dataset: Feedback data per component
        components_to_update: Which components to update
    
    Returns:
        dict mapping component names to new instructions
    """
    # Your custom proposal logic here
    new_instructions = {}
    for comp in components_to_update:
        feedback = reflective_dataset[comp]
        # Analyze feedback and generate a new instruction with your own
        # helper (generate_improved_instruction is a placeholder you
        # implement, not a GEPA API)
        new_instructions[comp] = generate_improved_instruction(feedback)
    return new_instructions

adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
    custom_instruction_proposer=custom_proposer,
)

Reflective Dataset Structure

The adapter creates reflective examples in this format:
{
    "Inputs": {
        "question": "What is ML?",
        "context": "Machine learning is...",
    },
    "Generated Outputs": {
        "answer": "ML is a subset of AI",
        "reasoning": "Based on the context...",
    },
    "Feedback": "Correct answer. Good reasoning."
}
For format failures:
{
    "Inputs": {...},
    "Generated Outputs": "Couldn't parse the output...",
    "Feedback": "Your output failed to parse. Follow this structure:\n..."
}

Best Practices

Provide specific, actionable feedback. Generic feedback like “Wrong answer” doesn’t help the LLM improve. Explain why it’s wrong and how to fix it.
  1. Create feedback for all predictors you want to optimize
  2. Use the full trace in feedback functions to provide context
  3. Include expected outputs in feedback when predictions are wrong
  4. Test your metric independently before optimization
  5. Start with a small dataset to iterate quickly
  6. Monitor progress by checking intermediate candidates
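Practice 4 above can be done without running the adapter at all. The following sketch sanity-checks a metric using `types.SimpleNamespace` as a lightweight stand-in for `dspy.Example` and `dspy.Prediction`, both of which expose their fields as attributes.

```python
from types import SimpleNamespace

# The metric under test (same shape as the examples earlier on this page).
def metric(example, prediction, trace=None):
    return 1.0 if example.answer.lower() in prediction.answer.lower() else 0.0

# SimpleNamespace mimics attribute access on Example/Prediction objects.
example = SimpleNamespace(question="What is the capital of France?", answer="Paris")

assert metric(example, SimpleNamespace(answer="The capital is Paris.")) == 1.0
assert metric(example, SimpleNamespace(answer="The capital is Lyon.")) == 0.0
print("metric sanity checks passed")
```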

Troubleshooting

Score Mismatch Warning

If you see warnings about score mismatches:
adapter = DspyAdapter(
    ...,
    warn_on_score_mismatch=False,  # Disable if using LLM-as-judge
)
This is normal when:
  • Using non-deterministic metrics (LLM-as-judge)
  • Providing predictor-specific scores that differ from program-level scores

No Valid Predictions

If you get “No valid predictions found”:
  1. Check your feedback functions return correct format
  2. Enable format failure feedback:
    adapter = DspyAdapter(..., add_format_failure_as_feedback=True)
    
  3. Verify your program actually calls the predictors you’re optimizing

Next Steps

Adapter System

Learn about the adapter architecture

Custom Adapters

Create adapters for other frameworks

Evaluation Metrics

Design better feedback functions

Configuration

Fine-tune optimization parameters
