The Core Insight

Traditional optimizers receive only scalar feedback:
# Traditional optimizer
candidate = "def solve(x): return x + 1"
score = evaluate(candidate)  # Returns: 0.3
# Now what? We know it's bad, but not why.
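With only a scalar, the optimizer's best option is blind trial and error. A toy sketch of that search (the evaluator and mutation operator here are hypothetical stand-ins, not GEPA code):

```python
import random

def evaluate(candidate: str) -> float:
    # Toy scalar evaluator: full credit only for the right fix.
    return 1.0 if "x + 2" in candidate else 0.3

def mutate(candidate: str, rng: random.Random) -> str:
    # Blind mutation: swap one random character - no diagnosis of
    # why the score was low.
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice("x+-*/ 0123456789") + candidate[i + 1:]

rng = random.Random(0)
best = "def solve(x): return x + 1"
best_score = evaluate(best)
for _ in range(500):  # hill-climb on the scalar alone
    child = mutate(best, rng)
    if evaluate(child) > best_score:
        best, best_score = child, evaluate(child)
```

Most mutations are wasted even on this toy problem; reflective mutation replaces this blind search with diagnosis-driven edits.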
GEPA receives rich diagnostic feedback that explains failures:
# GEPA evaluator
def evaluate(candidate, example):
    result = run_code(candidate, example["input"])
    score = 1.0 if result == example["expected"] else 0.0
    
    return score, {
        "Input": example["input"],
        "Output": result,
        "Expected": example["expected"],
        "Error": result.error if hasattr(result, 'error') else None
    }
This diagnostic information — called Actionable Side Information (ASI) — enables an LLM to:
  • Diagnose why the candidate failed
  • Identify patterns across multiple failures
  • Propose targeted fixes rather than random mutations

The Reflective Mutation Process

Reflective mutation is GEPA’s primary candidate improvement strategy. It consists of four phases:

Phase 1: Candidate Selection

Select a candidate from the Pareto frontier to evolve from:
# From candidate_selector.py
class ParetoCandidateSelector:
    def select_candidate_idx(self, state: GEPAState) -> int:
        # Get all candidates on the Pareto front
        pareto_programs = state.get_pareto_front_programs()
        
        # Randomly select one
        return self.rng.choice(list(pareto_programs))
Strategy options:
  • pareto (default): Randomly select from Pareto front for diverse exploration
  • current_best: Always select the single best-scoring candidate (greedy)
  • epsilon_greedy: Select best with probability (1-ε), random from front otherwise
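As an illustration, the epsilon_greedy strategy amounts to roughly the following (a minimal sketch, not the exact class from candidate_selector.py; scores and the Pareto front are passed in directly here for simplicity):

```python
import random

class EpsilonGreedySelector:
    """Exploit the best-scoring candidate with probability 1 - epsilon,
    otherwise explore a random Pareto-front candidate (sketch)."""

    def __init__(self, epsilon: float = 0.2, seed: int = 0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select_candidate_idx(self, scores: list, pareto_front: list) -> int:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(pareto_front)  # explore
        return max(range(len(scores)), key=scores.__getitem__)  # exploit

selector = EpsilonGreedySelector(epsilon=0.2)
idx = selector.select_candidate_idx([0.4, 0.9, 0.7], pareto_front=[1, 2])
```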

Phase 2: Trace Capture & Reflection

Evaluate the selected candidate on a small minibatch (1-3 examples) with full trace capture:
# From reflective_mutation.py
subsample_ids = batch_sampler.next_minibatch_ids(trainset, state)
minibatch = trainset.fetch(subsample_ids)

# Evaluate with traces
eval_result = adapter.evaluate(
    batch=minibatch,
    candidate=current_candidate,
    capture_traces=True  # Crucial: captures execution details
)
The adapter’s evaluate() method returns:
  • outputs: Raw outputs for each example
  • scores: Numeric scores (higher is better)
  • trajectories: Rich execution traces (adapter-defined structure)
  • objective_scores (optional): Multi-objective metrics
Trace capture is more expensive than scoring alone. GEPA only captures traces for the minibatch during reflection, not during full validation evaluation.
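A minimal adapter sketch that honors the capture_traces flag might look like this (all names are illustrative rather than the exact gepa adapter API; run_candidate is a hypothetical executor):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    outputs: list
    scores: list
    trajectories: Optional[list]  # populated only when capture_traces=True

def run_candidate(candidate: str, example: dict):
    # Hypothetical executor: treats the candidate as an expression in x.
    return eval(candidate, {"x": example["input"]})

def evaluate(batch: list, candidate: str, capture_traces: bool = False) -> EvalResult:
    outputs, scores, traces = [], [], []
    for example in batch:
        out = run_candidate(candidate, example)
        outputs.append(out)
        scores.append(1.0 if out == example["expected"] else 0.0)
        if capture_traces:  # traces are expensive; gather only when asked
            traces.append({"input": example["input"], "output": out,
                           "expected": example["expected"]})
    return EvalResult(outputs, scores, traces if capture_traces else None)

result = evaluate([{"input": 2, "expected": 3}], "x + 1", capture_traces=True)
```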

Phase 3: Build Reflective Dataset

The adapter extracts meaningful feedback from traces:
# From adapter.py protocol
reflective_dataset = adapter.make_reflective_dataset(
    candidate=current_candidate,
    eval_batch=eval_result,
    components_to_update=["system_prompt"]  # Which params to evolve
)

# Returns a structured dataset, e.g.:
{
    "system_prompt": [
        {
            "Inputs": {
                "question": "What is 2+2?",
                "context": "Elementary arithmetic"
            },
            "Generated Outputs": "The answer is 5",
            "Feedback": "Incorrect. Expected: 4. Error: Basic arithmetic failure."
        },
        {
            "Inputs": {"question": "What is 10*10?"},
            "Generated Outputs": "100",
            "Feedback": "Correct!"
        },
        # ... more examples
    ]
}
Best practices for reflective datasets:
  1. Include both failures and successes: Show what works and what doesn’t
  2. Provide specific feedback: “Wrong answer: expected 42, got 17” beats “Incorrect”
  3. Include error messages: Compiler errors, exceptions, validation failures
  4. Add context: Inputs, intermediate steps, expected outputs
  5. Keep it concise: 3-5 examples usually suffice; quality over quantity
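Put together, a reflective-dataset builder following these practices might look like this (a sketch; the per-example record fields are assumptions, not the exact gepa protocol):

```python
def make_reflective_dataset(eval_records: list,
                            components_to_update: list,
                            max_examples: int = 5) -> dict:
    """Build concise, specific feedback records from evaluation results."""
    records = []
    for ex in eval_records:
        if ex["score"] == 1.0:
            feedback = "Correct!"  # include successes: show what works
        else:
            # Be specific: state what was expected and what came out
            feedback = f"Incorrect. Expected: {ex['expected']}. Got: {ex['output']}."
            if ex.get("error"):  # surface exceptions / validation failures
                feedback += f" Error: {ex['error']}"
        records.append({
            "Inputs": ex["input"],
            "Generated Outputs": ex["output"],
            "Feedback": feedback,
        })
    # Quality over quantity: cap the number of examples per component
    return {name: records[:max_examples] for name in components_to_update}

dataset = make_reflective_dataset(
    [{"input": "2+2?", "output": "5", "expected": "4", "score": 0.0,
      "error": "Basic arithmetic failure"},
     {"input": "10*10?", "output": "100", "expected": "100", "score": 1.0}],
    components_to_update=["system_prompt"],
)
```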

Phase 4: LLM-Based Proposal

The reflection LM generates an improved candidate:
# From instruction_proposal.py
def propose_new_texts(
    candidate: dict[str, str],
    reflective_dataset: dict[str, list[dict]],
    components_to_update: list[str]
) -> dict[str, str]:
    new_texts = {}
    
    for component_name in components_to_update:
        # Build prompt with current text + reflective dataset
        prompt = build_reflection_prompt(
            current_text=candidate[component_name],
            examples=reflective_dataset[component_name]
        )
        
        # LLM proposes improvement
        new_text = reflection_lm(prompt)
        new_texts[component_name] = new_text
    
    return new_texts
The default reflection prompt template (from optimize_anything.py):
I am optimizing a parameter in my system. The current parameter value is:
<curr_param>

Below is evaluation data showing how this parameter performed:
<side_info>

Your task is to propose a new, improved parameter value.

Carefully analyze all evaluation data. Look for patterns:
- Performance metrics and correlations
- Recurring issues or failure patterns
- Successful behaviors to preserve
- Domain-specific constraints

Based on your analysis, propose a new parameter value that addresses
the identified issues while maintaining what works well.

Provide the new parameter value within ``` blocks.
You can customize the reflection prompt template via ReflectionConfig.reflection_prompt_template. Use <curr_param> and <side_info> as placeholders.
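Placeholder substitution is plain string replacement, roughly as follows (an illustrative sketch, not the library's internal code):

```python
def render_reflection_prompt(template: str, curr_param: str, side_info: str) -> str:
    # Fill the two required placeholders in a custom template.
    return (template
            .replace("<curr_param>", curr_param)
            .replace("<side_info>", side_info))

prompt = render_reflection_prompt(
    "Current value:\n<curr_param>\n\nEvaluation data:\n<side_info>",
    curr_param="You are a helpful assistant.",
    side_info="3/5 correct; fails on multi-step arithmetic.",
)
```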

Component Selection Strategies

When optimizing multiple parameters simultaneously, GEPA selects which components to update each iteration:

Round Robin (Default)

# From component_selector.py
class RoundRobinReflectionComponentSelector:
    def select_components(
        self, candidate: dict[str, str], state: GEPAState
    ) -> list[str]:
        # Cycle through components in order.
        # candidate_idx (this candidate's index in state) is supplied
        # by the caller in the full source
        predictor_names = list(candidate.keys())
        next_id = state.named_predictor_id_to_update_next_for_program_candidate[
            candidate_idx
        ]
        selected = [predictor_names[next_id]]
        # Update for next iteration
        state.named_predictor_id_to_update_next_for_program_candidate[
            candidate_idx
        ] = (next_id + 1) % len(predictor_names)
        return selected
When to use: Multi-component systems where each parameter has independent impact.

All Components

class AllReflectionComponentSelector:
    def select_components(
        self, candidate: dict[str, str], state: GEPAState
    ) -> list[str]:
        # Update all components together
        return list(candidate.keys())
When to use: Tightly coupled parameters that should co-evolve (e.g., prompt + few-shot examples).

Custom Selector

Implement your own selection logic:
from gepa.proposer.reflective_mutation.base import ReflectionComponentSelector

class CustomSelector(ReflectionComponentSelector):
    def select_components(
        self, candidate: dict[str, str], state: GEPAState
    ) -> list[str]:
        # Your custom logic
        # Example: prioritize components with lower scores
        worst_performing = find_worst_component(state)
        return [worst_performing]

config = GEPAConfig(
    reflection=ReflectionConfig(
        module_selector=CustomSelector()
    )
)

Advanced Reflection Features

Skip Perfect Scores

Avoid wasting budget on candidates that already achieve perfect scores:
config = GEPAConfig(
    reflection=ReflectionConfig(
        skip_perfect_score=True,
        perfect_score=1.0  # What counts as "perfect"
    )
)
When enabled, if all minibatch examples score perfectly, GEPA skips reflection and samples a new minibatch.
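The check itself is conceptually simple (an illustrative sketch of the behavior, not the library's internal code):

```python
def should_skip_reflection(minibatch_scores: list,
                           perfect_score: float = 1.0,
                           skip_perfect_score: bool = True) -> bool:
    # Skip reflection only when every minibatch example already scores
    # perfectly; otherwise there is something for the LLM to diagnose.
    return skip_perfect_score and all(s >= perfect_score for s in minibatch_scores)
```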

Custom Reflection Prompts

Use different prompts for different parameters:
config = GEPAConfig(
    reflection=ReflectionConfig(
        reflection_prompt_template={
            "system_prompt": """
                You are optimizing a system prompt for math reasoning.
                Current prompt:
                <curr_param>
                
                Performance data:
                <side_info>
                
                Propose an improved prompt that fixes arithmetic errors
                while preserving chain-of-thought reasoning.
            """,
            "few_shot_examples": """
                You are curating few-shot examples.
                Current examples:
                <curr_param>
                
                Performance data:
                <side_info>
                
                Propose better examples that cover edge cases.
            """
        }
    )
)

Custom Candidate Proposer

Replace the default LLM-based proposer entirely:
from collections.abc import Mapping, Sequence

def my_proposer(
    candidate: dict[str, str],
    reflective_dataset: Mapping[str, Sequence[dict]],
    components_to_update: list[str]
) -> dict[str, str]:
    """
    Custom proposal logic.
    Could use: DSPy signatures, fine-tuned models,
    rule-based systems, genetic operators, etc.
    """
    new_texts = {}
    for component in components_to_update:
        # Your custom logic
        new_texts[component] = generate_improvement(
            candidate[component],
            reflective_dataset[component]
        )
    return new_texts

config = GEPAConfig(
    reflection=ReflectionConfig(
        custom_candidate_proposer=my_proposer
    )
)

Reflection vs Traditional Mutation

Traditional evolutionary algorithms use blind mutation:
Traditional EA                  | GEPA Reflective Mutation
--------------------------------|---------------------------------------------------------
Random character flips          | Targeted fixes based on error analysis
"the cat sat" → "tze cat sat"   | "Fix: 'sat' should be 'sits' for subject-verb agreement"
No failure diagnosis            | Full trace analysis
Requires 1000s of evaluations   | Typically 100-500 evaluations
Cannot explain changes          | Human-interpretable edit rationale
Example from AIME math benchmark:
Iteration 5: Reflection identifies:
- Failing on base conversion problems (3/10 correct)
- Success on polynomial problems (8/10 correct)
- Common error: not enforcing digit constraints in base-b

Proposed improvement:
"For base conversion: enforce digit ranges strictly.
In base b, digits ∈ {0,...,b-1}. Translate positional
notation correctly: (d₂d₁d₀)_b = d₂·b² + d₁·b + d₀..."

Result: Base conversion accuracy improves from 30% → 80%

Minibatch Size Impact

The reflection minibatch size controls the focus-vs-breadth tradeoff.

Small batches (1-3 examples):
  • ✅ Focused, targeted improvements
  • ✅ LLM can deeply analyze each failure
  • ✅ Faster iterations (less evaluation time)
  • ❌ May overfit to shown examples
Large batches (10+ examples):
  • ✅ Broader pattern recognition
  • ✅ Less likely to overfit
  • ❌ LLM prompt gets very long
  • ❌ Harder to identify specific root causes
  • ❌ Slower iterations
Recommended: Start with reflection_minibatch_size=3 (default). Increase for noisy evaluations, decrease for expensive evaluations.
config = GEPAConfig(
    reflection=ReflectionConfig(
        reflection_minibatch_size=3  # 1-3 for most tasks
    )
)

Seedless Mode

GEPA can bootstrap the initial candidate from your objective:
result = optimize_anything(
    seed_candidate=None,  # No starting point
    evaluator=evaluate,
    objective="Generate a Python function that reverses a string",
    background="Use only built-in Python, no imports. Handle Unicode correctly.",
    dataset=test_cases
)
The reflection LM generates the first candidate, then GEPA iterates normally. Useful for:
  • Creative tasks with no obvious starting point
  • Exploratory optimization
  • Bootstrapping from high-level requirements

Next Steps

Pareto Optimization

Learn how GEPA maintains diverse candidates via Pareto frontiers

Actionable Side Information

Deep dive into ASI: the gradient of text optimization

Adapters

Build your own adapter to integrate GEPA with your system

Advanced Configuration

Fine-tune reflection parameters for your use case
