The Core Insight

Traditional optimizers receive only scalar feedback:
# Traditional optimizer
candidate = "def solve(x): return x + 1"
score = evaluate(candidate)  # Returns: 0.3
# Now what? We know it's bad, but not why.
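With only a scalar, the optimizer's best option is blind trial and error. A toy sketch of that search (the evaluator and mutation operator here are hypothetical stand-ins, not GEPA code):

```python
import random

def evaluate(candidate: str) -> float:
    # Toy scalar evaluator: full credit only for the right fix.
    return 1.0 if "x + 2" in candidate else 0.3

def mutate(candidate: str, rng: random.Random) -> str:
    # Blind mutation: swap one random character - no diagnosis of
    # why the score was low.
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice("x+-*/ 0123456789") + candidate[i + 1:]

rng = random.Random(0)
best = "def solve(x): return x + 1"
best_score = evaluate(best)
for _ in range(500):  # hill-climb on the scalar alone
    child = mutate(best, rng)
    if evaluate(child) > best_score:
        best, best_score = child, evaluate(child)
```

Most mutations are wasted even on this toy problem; reflective mutation replaces this blind search with diagnosis-driven edits.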
GEPA receives rich diagnostic feedback that explains failures:
# GEPA evaluator
def evaluate(candidate, example):
    result = run_code(candidate, example["input"])
    score = 1.0 if result == example["expected"] else 0.0
    
    return score, {
        "Input": example["input"],
        "Output": result,
        "Expected": example["expected"],
        "Error": result.error if hasattr(result, 'error') else None
    }
This diagnostic information — called Actionable Side Information (ASI) — enables an LLM to:
  • Diagnose why the candidate failed
  • Identify patterns across multiple failures
  • Propose targeted fixes rather than random mutations

The Reflective Mutation Process

Reflective mutation is GEPA’s primary candidate improvement strategy. It consists of four phases:

Phase 1: Candidate Selection

Select a candidate from the Pareto frontier to evolve from:
# From candidate_selector.py
class ParetoCandidateSelector:
    def select_candidate_idx(self, state: GEPAState) -> int:
        # Get all candidates on the Pareto front
        pareto_programs = state.get_pareto_front_programs()
        
        # Randomly select one
        return self.rng.choice(list(pareto_programs))
Strategy options:
  • pareto (default): Randomly select from Pareto front for diverse exploration
  • current_best: Always select the single best-scoring candidate (greedy)
  • epsilon_greedy: Select best with probability (1-ε), random from front otherwise
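As an illustration, the epsilon_greedy strategy amounts to roughly the following (a minimal sketch, not the exact class from candidate_selector.py; scores and the Pareto front are passed in directly here for simplicity):

```python
import random

class EpsilonGreedySelector:
    """Exploit the best-scoring candidate with probability 1 - epsilon,
    otherwise explore a random Pareto-front candidate (sketch)."""

    def __init__(self, epsilon: float = 0.2, seed: int = 0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select_candidate_idx(self, scores: list, pareto_front: list) -> int:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(pareto_front)  # explore
        return max(range(len(scores)), key=scores.__getitem__)  # exploit

selector = EpsilonGreedySelector(epsilon=0.2)
idx = selector.select_candidate_idx([0.4, 0.9, 0.7], pareto_front=[1, 2])
```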

Phase 2: Trace Capture & Reflection

Evaluate the selected candidate on a small minibatch (1-3 examples) with full trace capture:
# From reflective_mutation.py
subsample_ids = batch_sampler.next_minibatch_ids(trainset, state)
minibatch = trainset.fetch(subsample_ids)

# Evaluate with traces
eval_result = adapter.evaluate(
    batch=minibatch,
    candidate=current_candidate,
    capture_traces=True  # Crucial: captures execution details
)
The adapter’s evaluate() method returns:
  • outputs: Raw outputs for each example
  • scores: Numeric scores (higher is better)
  • trajectories: Rich execution traces (adapter-defined structure)
  • objective_scores (optional): Multi-objective metrics
Trace capture is more expensive than scoring alone. GEPA only captures traces for the minibatch during reflection, not during full validation evaluation.
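A minimal adapter sketch that honors the capture_traces flag might look like this (all names are illustrative rather than the exact gepa adapter API; run_candidate is a hypothetical executor):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    outputs: list
    scores: list
    trajectories: Optional[list]  # populated only when capture_traces=True

def run_candidate(candidate: str, example: dict):
    # Hypothetical executor: treats the candidate as an expression in x.
    return eval(candidate, {"x": example["input"]})

def evaluate(batch: list, candidate: str, capture_traces: bool = False) -> EvalResult:
    outputs, scores, traces = [], [], []
    for example in batch:
        out = run_candidate(candidate, example)
        outputs.append(out)
        scores.append(1.0 if out == example["expected"] else 0.0)
        if capture_traces:  # traces are expensive; gather only when asked
            traces.append({"input": example["input"], "output": out,
                           "expected": example["expected"]})
    return EvalResult(outputs, scores, traces if capture_traces else None)

result = evaluate([{"input": 2, "expected": 3}], "x + 1", capture_traces=True)
```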

Phase 3: Build Reflective Dataset

The adapter extracts meaningful feedback from traces:
# From adapter.py protocol
reflective_dataset = adapter.make_reflective_dataset(
    candidate=current_candidate,
    eval_batch=eval_result,
    components_to_update=["system_prompt"]  # Which params to evolve
)

# Returns a structured dataset, e.g.:
{
    "system_prompt": [
        {
            "Inputs": {
                "question": "What is 2+2?",
                "context": "Elementary arithmetic"
            },
            "Generated Outputs": "The answer is 5",
            "Feedback": "Incorrect. Expected: 4. Error: Basic arithmetic failure."
        },
        {
            "Inputs": {"question": "What is 10*10?"},
            "Generated Outputs": "100",
            "Feedback": "Correct!"
        },
        # ... more examples
    ]
}
Best practices for reflective datasets:
  1. Include both failures and successes: Show what works and what doesn’t
  2. Provide specific feedback: “Wrong answer: expected 42, got 17” beats “Incorrect”
  3. Include error messages: Compiler errors, exceptions, validation failures
  4. Add context: Inputs, intermediate steps, expected outputs
  5. Keep it concise: 3-5 examples usually suffice; quality over quantity
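Put together, a reflective-dataset builder following these practices might look like this (a sketch; the per-example record fields are assumptions, not the exact gepa protocol):

```python
def make_reflective_dataset(eval_records: list,
                            components_to_update: list,
                            max_examples: int = 5) -> dict:
    """Build concise, specific feedback records from evaluation results."""
    records = []
    for ex in eval_records:
        if ex["score"] == 1.0:
            feedback = "Correct!"  # include successes: show what works
        else:
            # Be specific: state what was expected and what came out
            feedback = f"Incorrect. Expected: {ex['expected']}. Got: {ex['output']}."
            if ex.get("error"):  # surface exceptions / validation failures
                feedback += f" Error: {ex['error']}"
        records.append({
            "Inputs": ex["input"],
            "Generated Outputs": ex["output"],
            "Feedback": feedback,
        })
    # Quality over quantity: cap the number of examples per component
    return {name: records[:max_examples] for name in components_to_update}

dataset = make_reflective_dataset(
    [{"input": "2+2?", "output": "5", "expected": "4", "score": 0.0,
      "error": "Basic arithmetic failure"},
     {"input": "10*10?", "output": "100", "expected": "100", "score": 1.0}],
    components_to_update=["system_prompt"],
)
```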

Phase 4: LLM-Based Proposal

The reflection LM generates an improved candidate:
# From instruction_proposal.py
def propose_new_texts(
    candidate: dict[str, str],
    reflective_dataset: dict[str, list[dict]],
    components_to_update: list[str]
) -> dict[str, str]:
    new_texts = {}
    
    for component_name in components_to_update:
        # Build prompt with current text + reflective dataset
        prompt = build_reflection_prompt(
            current_text=candidate[component_name],
            examples=reflective_dataset[component_name]
        )
        
        # LLM proposes improvement
        new_text = reflection_lm(prompt)
        new_texts[component_name] = new_text
    
    return new_texts
The default reflection prompt template (from optimize_anything.py):
I am optimizing a parameter in my system. The current parameter value is:
<curr_param>

Below is evaluation data showing how this parameter performed:
<side_info>

Your task is to propose a new, improved parameter value.

Carefully analyze all evaluation data. Look for patterns:
- Performance metrics and correlations
- Recurring issues or failure patterns
- Successful behaviors to preserve
- Domain-specific constraints

Based on your analysis, propose a new parameter value that addresses
the identified issues while maintaining what works well.

Provide the new parameter value within ``` blocks.
You can customize the reflection prompt template via ReflectionConfig.reflection_prompt_template. Use <curr_param> and <side_info> as placeholders.
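Placeholder substitution is plain string replacement, roughly as follows (an illustrative sketch, not the library's internal code):

```python
def render_reflection_prompt(template: str, curr_param: str, side_info: str) -> str:
    # Fill the two required placeholders in a custom template.
    return (template
            .replace("<curr_param>", curr_param)
            .replace("<side_info>", side_info))

prompt = render_reflection_prompt(
    "Current value:\n<curr_param>\n\nEvaluation data:\n<side_info>",
    curr_param="You are a helpful assistant.",
    side_info="3/5 correct; fails on multi-step arithmetic.",
)
```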

Component Selection Strategies

When optimizing multiple parameters simultaneously, GEPA selects which components to update each iteration:

Round Robin (Default)

# From component_selector.py
class RoundRobinReflectionComponentSelector:
    def select_components(
        self, candidate: dict[str, str], state: GEPAState
    ) -> list[str]:
        # Cycle through components in order.
        # candidate_idx (this candidate's index in state) is supplied
        # by the caller in the full source
        predictor_names = list(candidate.keys())
        next_id = state.named_predictor_id_to_update_next_for_program_candidate[
            candidate_idx
        ]
        selected = [predictor_names[next_id]]
        # Update for next iteration
        state.named_predictor_id_to_update_next_for_program_candidate[
            candidate_idx
        ] = (next_id + 1) % len(predictor_names)
        return selected
When to use: Multi-component systems where each parameter has independent impact.

All Components

class AllReflectionComponentSelector:
    def select_components(
        self, candidate: dict[str, str], state: GEPAState
    ) -> list[str]:
        # Update all components together
        return list(candidate.keys())
When to use: Tightly coupled parameters that should co-evolve (e.g., prompt + few-shot examples).

Custom Selector

Implement your own selection logic:
from gepa.proposer.reflective_mutation.base import ReflectionComponentSelector

class CustomSelector(ReflectionComponentSelector):
    def select_components(
        self, candidate: dict[str, str], state: GEPAState
    ) -> list[str]:
        # Your custom logic
        # Example: prioritize components with lower scores
        worst_performing = find_worst_component(state)
        return [worst_performing]

config = GEPAConfig(
    reflection=ReflectionConfig(
        module_selector=CustomSelector()
    )
)

Advanced Reflection Features

Skip Perfect Scores

Avoid wasting budget on candidates that already achieve perfect scores:
config = GEPAConfig(
    reflection=ReflectionConfig(
        skip_perfect_score=True,
        perfect_score=1.0  # What counts as "perfect"
    )
)
When enabled, if all minibatch examples score perfectly, GEPA skips reflection and samples a new minibatch.
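The check itself is conceptually simple (an illustrative sketch of the behavior, not the library's internal code):

```python
def should_skip_reflection(minibatch_scores: list,
                           perfect_score: float = 1.0,
                           skip_perfect_score: bool = True) -> bool:
    # Skip reflection only when every minibatch example already scores
    # perfectly; otherwise there is something for the LLM to diagnose.
    return skip_perfect_score and all(s >= perfect_score for s in minibatch_scores)
```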

Custom Reflection Prompts

Use different prompts for different parameters:
config = GEPAConfig(
    reflection=ReflectionConfig(
        reflection_prompt_template={
            "system_prompt": """
                You are optimizing a system prompt for math reasoning.
                Current prompt:
                <curr_param>
                
                Performance data:
                <side_info>
                
                Propose an improved prompt that fixes arithmetic errors
                while preserving chain-of-thought reasoning.
            """,
            "few_shot_examples": """
                You are curating few-shot examples.
                Current examples:
                <curr_param>
                
                Performance data:
                <side_info>
                
                Propose better examples that cover edge cases.
            """
        }
    )
)

Custom Candidate Proposer

Replace the default LLM-based proposer entirely:
from collections.abc import Mapping, Sequence

def my_proposer(
    candidate: dict[str, str],
    reflective_dataset: Mapping[str, Sequence[dict]],
    components_to_update: list[str]
) -> dict[str, str]:
    """
    Custom proposal logic.
    Could use: DSPy signatures, fine-tuned models,
    rule-based systems, genetic operators, etc.
    """
    new_texts = {}
    for component in components_to_update:
        # Your custom logic
        new_texts[component] = generate_improvement(
            candidate[component],
            reflective_dataset[component]
        )
    return new_texts

config = GEPAConfig(
    reflection=ReflectionConfig(
        custom_candidate_proposer=my_proposer
    )
)

Reflection vs Traditional Mutation

Traditional evolutionary algorithms use blind mutation:
Traditional EA                  | GEPA Reflective Mutation
--------------------------------|---------------------------------------------------------
Random character flips          | Targeted fixes based on error analysis
"the cat sat" → "tze cat sat"   | "Fix: 'sat' should be 'sits' for subject-verb agreement"
No failure diagnosis            | Full trace analysis
Requires 1000s of evaluations   | Typically 100-500 evaluations
Cannot explain changes          | Human-interpretable edit rationale
Example from AIME math benchmark:
Iteration 5: Reflection identifies:
- Failing on base conversion problems (3/10 correct)
- Success on polynomial problems (8/10 correct)
- Common error: not enforcing digit constraints in base-b

Proposed improvement:
"For base conversion: enforce digit ranges strictly.
In base b, digits ∈ {0,...,b-1}. Translate positional
notation correctly: (d₂d₁d₀)_b = d₂·b² + d₁·b + d₀..."

Result: Base conversion accuracy improves from 30% → 80%

Minibatch Size Impact

The reflection minibatch size controls the focus-vs-breadth tradeoff.

Small batches (1-3 examples):
  • ✅ Focused, targeted improvements
  • ✅ LLM can deeply analyze each failure
  • ✅ Faster iterations (less evaluation time)
  • ❌ May overfit to shown examples
Large batches (10+ examples):
  • ✅ Broader pattern recognition
  • ✅ Less likely to overfit
  • ❌ LLM prompt gets very long
  • ❌ Harder to identify specific root causes
  • ❌ Slower iterations
Recommended: Start with reflection_minibatch_size=3 (default). Increase for noisy evaluations, decrease for expensive evaluations.
config = GEPAConfig(
    reflection=ReflectionConfig(
        reflection_minibatch_size=3  # 1-3 for most tasks
    )
)

Seedless Mode

GEPA can bootstrap the initial candidate from your objective:
result = optimize_anything(
    seed_candidate=None,  # No starting point
    evaluator=evaluate,
    objective="Generate a Python function that reverses a string",
    background="Use only built-in Python, no imports. Handle Unicode correctly.",
    dataset=test_cases
)
The reflection LM generates the first candidate, then GEPA iterates normally. Useful for:
  • Creative tasks with no obvious starting point
  • Exploratory optimization
  • Bootstrapping from high-level requirements

Next Steps

Pareto Optimization

Learn how GEPA maintains diverse candidates via Pareto frontiers

Actionable Side Information

Deep dive into ASI: the gradient of text optimization

Adapters

Build your own adapter to integrate GEPA with your system

Advanced Configuration

Fine-tune reflection parameters for your use case
