GEPA provides seamless integration with DSPy through the DspyAdapter, enabling you to optimize instructions and prompts in your DSPy programs using LLM-guided evolution.

Overview

The DspyAdapter allows GEPA to:
  • Optimize DSPy predictor instructions
  • Provide per-predictor feedback for targeted improvements
  • Support tool-using modules (ReAct)
  • Capture DSPy traces for reflection
  • Leverage DSPy’s evaluation framework

Quick Start

import dspy
from gepa.adapters.dspy_adapter import DspyAdapter
from gepa import optimize

# 1. Define your DSPy program
class MathSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought("question -> answer")
    
    def forward(self, question):
        return self.solve(question=question)

# 2. Create metric and feedback functions
def metric(example, prediction, trace=None):
    return 1.0 if example.answer in prediction.answer else 0.0

def provide_feedback(predictor_output, predictor_inputs, 
                     module_inputs, module_outputs, captured_trace):
    correct = module_inputs.answer in predictor_output.get("answer", "")
    if correct:
        return {"score": 1.0, "feedback": "Correct answer!"}
    else:
        return {
            "score": 0.0,
            "feedback": f"Wrong. Expected: {module_inputs.answer}"
        }

# 3. Setup adapter
student = MathSolver()
feedback_map = {"solve": provide_feedback}

adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
)

# 4. Optimize
result = optimize(
    trainset=train_examples,
    valset=val_examples,
    adapter=adapter,
    reflection_lm=dspy.LM("openai/gpt-4o"),
    max_metric_calls=50,
)

# 5. Use optimized program
optimized_program = adapter.build_program(result.best_candidate)

DspyAdapter API

Constructor

DspyAdapter(
    student_module: dspy.Module,
    metric_fn: Callable,
    feedback_map: dict[str, Callable],
    failure_score: float = 0.0,
    num_threads: int | None = None,
    add_format_failure_as_feedback: bool = False,
    rng: random.Random | None = None,
    reflection_lm: dspy.LM | None = None,
    custom_instruction_proposer: ProposalFn | None = None,
    warn_on_score_mismatch: bool = True,
    enable_tool_optimization: bool = False,
    reflection_minibatch_size: int | None = None,
)

Parameters

1. student_module

The DSPy program to optimize.
class QASystem(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

student = QASystem()
2. metric_fn

Program-level metric function. Must return a numeric score (higher is better), or a dict with a "score" key for multi-objective metrics.
def metric(example, prediction, trace=None):
    """Simple exact match metric."""
    return 1.0 if example.answer == prediction.answer else 0.0

# Multi-objective metric
def advanced_metric(example, prediction, trace=None):
    exact = 1.0 if example.answer == prediction.answer else 0.0
    contains = 1.0 if example.answer in prediction.answer else 0.0
    
    return {
        "score": exact,
        "subscores": {
            "exact_match": exact,
            "contains": contains,
        }
    }
3. feedback_map

Mapping from predictor names to feedback functions. Each feedback function provides per-predictor diagnostic information.
def generate_feedback(predictor_output, predictor_inputs,
                      module_inputs, module_outputs, captured_trace):
    """
    Args:
        predictor_output: Output of this specific predictor
        predictor_inputs: Inputs to this specific predictor
        module_inputs: Original program inputs (Example)
        module_outputs: Final program outputs (Prediction)
        captured_trace: Full execution trace
    
    Returns:
        dict with 'score' and 'feedback' keys
    """
    is_correct = module_inputs.answer in predictor_output.get("answer", "")
    
    if is_correct:
        return {
            "score": 1.0,
            "feedback": "Generated correct answer."
        }
    else:
        return {
            "score": 0.0,
            "feedback": f"Expected '{module_inputs.answer}' but got '{predictor_output.get('answer', '')}'."
        }

feedback_map = {
    "generate": generate_feedback,
    # Add feedback for each predictor you want to optimize
}
4. Additional Parameters

  • failure_score: Default score for failed predictions (default: 0.0)
  • num_threads: Parallel evaluation threads for DSPy’s Evaluate
  • add_format_failure_as_feedback: Include parsing failures in feedback
  • reflection_lm: LM for proposing new instructions (defaults to dspy.settings.lm)
  • enable_tool_optimization: Enable optimization of tool descriptions in ReAct modules
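The optional parameters above can be combined as in this sketch; the values shown are illustrative, and `student`, `metric`, and `feedback_map` are assumed to be defined as in the Quick Start.

```python
import random

import dspy
from gepa.adapters.dspy_adapter import DspyAdapter

adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
    failure_score=0.0,                     # score assigned to failed predictions
    num_threads=8,                         # parallel threads for DSPy's Evaluate
    add_format_failure_as_feedback=True,   # surface parsing failures to the reflection LM
    rng=random.Random(0),                  # seed for reproducible optimization runs
    reflection_lm=dspy.LM("openai/gpt-4o"),  # LM used to propose new instructions
)
```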

Feedback Functions

Feedback functions are the key to effective DSPy optimization. They provide per-predictor guidance to the reflection LM.

Basic Feedback

def basic_feedback(predictor_output, predictor_inputs,
                   module_inputs, module_outputs, captured_trace):
    correct = module_inputs.expected in predictor_output.get("answer", "")
    
    return {
        "score": 1.0 if correct else 0.0,
        "feedback": "Correct!" if correct else f"Expected: {module_inputs.expected}"
    }

Detailed Feedback

def detailed_feedback(predictor_output, predictor_inputs,
                      module_inputs, module_outputs, captured_trace):
    answer = predictor_output.get("answer", "")
    expected = module_inputs.answer
    
    if expected in answer:
        feedback = f"✓ Correct answer found: {expected}"
        score = 1.0
    else:
        feedback = f"✗ Wrong answer.\nExpected: {expected}\nGot: {answer}"
        score = 0.0
    
    # Add reasoning quality assessment
    reasoning = predictor_output.get("reasoning", "")
    if len(reasoning) < 50:
        feedback += "\nReasoning is too brief. Provide more detailed steps."
    
    return {"score": score, "feedback": feedback}

Context-Aware Feedback

def context_aware_feedback(predictor_output, predictor_inputs,
                           module_inputs, module_outputs, captured_trace):
    # Access the full trace to understand context
    all_steps = [(p.signature, inputs, outputs) 
                 for p, inputs, outputs in captured_trace]
    
    answer = predictor_output.get("answer", "")
    expected = module_inputs.answer
    
    # Check if retrieval provided relevant context
    context_str = str(predictor_inputs.get("context", ""))
    has_relevant_context = expected in context_str
    
    if expected in answer:
        feedback = "Correct answer."
        score = 1.0
    else:
        if not has_relevant_context:
            feedback = f"The retrieved context didn't contain the answer '{expected}'. Consider using different keywords."
        else:
            feedback = f"The answer '{expected}' was in the context but not extracted correctly."
        score = 0.0
    
    return {"score": score, "feedback": feedback}

Complete Example

Multi-Hop Question Answering

import dspy
from gepa.adapters.dspy_adapter import DspyAdapter
from gepa import optimize

# Configure DSPy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Define program
class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate_query = dspy.ChainOfThought(
            "question -> search_query"
        )
        self.answer = dspy.ChainOfThought(
            "context, question -> answer"
        )
    
    def forward(self, question):
        # First hop: generate search query
        query_pred = self.generate_query(question=question)
        
        # Retrieve context
        context = self.retrieve(query_pred.search_query).passages
        
        # Second hop: answer based on context
        return self.answer(context=context, question=question)

# Metric function
def multihop_metric(example, prediction, trace=None):
    if example.answer.lower() in prediction.answer.lower():
        return 1.0
    return 0.0

# Feedback functions for each predictor
def query_feedback(predictor_output, predictor_inputs,
                   module_inputs, module_outputs, captured_trace):
    query = predictor_output.get("search_query", "")
    
    # Check if final answer was correct
    final_correct = module_inputs.answer.lower() in module_outputs.answer.lower()
    
    if final_correct:
        return {
            "score": 1.0,
            "feedback": f"Good query: '{query}' led to correct answer."
        }
    else:
        return {
            "score": 0.0,
            "feedback": f"Query '{query}' didn't retrieve relevant info for answer: {module_inputs.answer}"
        }

def answer_feedback(predictor_output, predictor_inputs,
                    module_inputs, module_outputs, captured_trace):
    answer = predictor_output.get("answer", "")
    expected = module_inputs.answer
    context = str(predictor_inputs.get("context", ""))
    
    if expected.lower() in answer.lower():
        return {"score": 1.0, "feedback": "Correct answer extracted."}
    elif expected.lower() in context.lower():
        return {
            "score": 0.0,
            "feedback": f"Answer '{expected}' was in context but not extracted."
        }
    else:
        return {
            "score": 0.0,
            "feedback": f"Answer '{expected}' not in retrieved context."
        }

# Create dataset
train_examples = [
    dspy.Example(
        question="What is the capital of France?",
        answer="Paris"
    ).with_inputs("question"),
    # ... more examples
]

val_examples = [
    dspy.Example(
        question="Who wrote Romeo and Juliet?",
        answer="Shakespeare"
    ).with_inputs("question"),
    # ... more examples
]

# Setup and optimize
student = MultiHopQA()

adapter = DspyAdapter(
    student_module=student,
    metric_fn=multihop_metric,
    feedback_map={
        "generate_query": query_feedback,
        "answer": answer_feedback,
    },
    num_threads=4,
)

result = optimize(
    trainset=train_examples,
    valset=val_examples,
    adapter=adapter,
    reflection_lm=dspy.LM("openai/gpt-4o"),
    max_metric_calls=100,
)

# Get optimized program
optimized = adapter.build_program(result.best_candidate)

# Test it
test_question = "What is machine learning?"
prediction = optimized(question=test_question)
print(prediction.answer)

Tool Optimization (ReAct)

GEPA can optimize tool descriptions in ReAct modules:
adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
    enable_tool_optimization=True,  # Enable tool optimization
)
Tool optimization improves:
  • Tool descriptions
  • Argument descriptions
  • When to use each tool
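In DSPy, tools are typically plain Python callables whose names and docstrings supply the tool name and description, which is exactly what `enable_tool_optimization=True` targets for rewriting. The snippet below is a minimal sketch of two such tools; the function bodies are hypothetical stand-ins, and the `dspy.ReAct` wiring shown in the final comment follows common DSPy usage.

```python
def search_wikipedia(query: str) -> str:
    """Search Wikipedia and return the top matching passage."""
    # Hypothetical stand-in for a real retrieval backend.
    return f"Top passage for: {query}"

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result."""
    # Restricted eval for illustration only; use a real parser in production.
    return str(eval(expression, {"__builtins__": {}}, {}))

# In a real program these would be passed to a ReAct module, e.g.:
#   agent = dspy.ReAct("question -> answer", tools=[search_wikipedia, calculate])
```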

Custom Instruction Proposers

You can provide custom logic for proposing new instructions:
from gepa.core.adapter import ProposalFn

def custom_proposer(
    candidate: dict[str, str],
    reflective_dataset: dict[str, list[dict]],
    components_to_update: list[str]
) -> dict[str, str]:
    """
    Custom logic to propose improved instructions.
    
    Args:
        candidate: Current instruction values
        reflective_dataset: Feedback data per component
        components_to_update: Which components to update
    
    Returns:
        dict mapping component names to new instructions
    """
    # Your custom proposal logic here
    new_instructions = {}
    for comp in components_to_update:
        feedback = reflective_dataset[comp]
        # Analyze feedback and generate a new instruction with your own
        # helper (generate_improved_instruction is a placeholder you
        # implement, not a GEPA API)
        new_instructions[comp] = generate_improved_instruction(feedback)
    return new_instructions

adapter = DspyAdapter(
    student_module=student,
    metric_fn=metric,
    feedback_map=feedback_map,
    custom_instruction_proposer=custom_proposer,
)

Reflective Dataset Structure

The adapter creates reflective examples in this format:
{
    "Inputs": {
        "question": "What is ML?",
        "context": "Machine learning is...",
    },
    "Generated Outputs": {
        "answer": "ML is a subset of AI",
        "reasoning": "Based on the context...",
    },
    "Feedback": "Correct answer. Good reasoning."
}
For format failures:
{
    "Inputs": {...},
    "Generated Outputs": "Couldn't parse the output...",
    "Feedback": "Your output failed to parse. Follow this structure:\n..."
}

Best Practices

Provide specific, actionable feedback. Generic feedback like “Wrong answer” doesn’t help the LLM improve. Explain why it’s wrong and how to fix it.
  1. Create feedback for all predictors you want to optimize
  2. Use the full trace in feedback functions to provide context
  3. Include expected outputs in feedback when predictions are wrong
  4. Test your metric independently before optimization
  5. Start with a small dataset to iterate quickly
  6. Monitor progress by checking intermediate candidates
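Practice 4 above can be done without running the adapter at all. The following sketch sanity-checks a metric using `types.SimpleNamespace` as a lightweight stand-in for `dspy.Example` and `dspy.Prediction`, both of which expose their fields as attributes.

```python
from types import SimpleNamespace

# The metric under test (same shape as the examples earlier on this page).
def metric(example, prediction, trace=None):
    return 1.0 if example.answer.lower() in prediction.answer.lower() else 0.0

# SimpleNamespace mimics attribute access on Example/Prediction objects.
example = SimpleNamespace(question="What is the capital of France?", answer="Paris")

assert metric(example, SimpleNamespace(answer="The capital is Paris.")) == 1.0
assert metric(example, SimpleNamespace(answer="The capital is Lyon.")) == 0.0
print("metric sanity checks passed")
```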

Troubleshooting

Score Mismatch Warning

If you see warnings about score mismatches:
adapter = DspyAdapter(
    ...,
    warn_on_score_mismatch=False,  # Disable if using LLM-as-judge
)
This is normal when:
  • Using non-deterministic metrics (LLM-as-judge)
  • Providing predictor-specific scores that differ from program-level scores

No Valid Predictions

If you get “No valid predictions found”:
  1. Check your feedback functions return correct format
  2. Enable format failure feedback:
    adapter = DspyAdapter(..., add_format_failure_as_feedback=True)
    
  3. Verify your program actually calls the predictors you’re optimizing

Next Steps

Adapter System

Learn about the adapter architecture

Custom Adapters

Create adapters for other frameworks

Evaluation Metrics

Design better feedback functions

Configuration

Fine-tune optimization parameters
