
Overview

GEPA (Genetic-Pareto) is an evolutionary optimization framework that evolves text parameters using LLM-based reflection and Pareto-efficient search. Unlike traditional optimizers that only know that a candidate failed, GEPA uses LLMs to understand why it failed and proposes targeted improvements.

Core Architecture

GEPA’s architecture consists of several key components working together:

The Optimization Loop

GEPA’s optimization follows a five-step iterative process:

1. Select

Select a candidate from the Pareto frontier — the set of candidates where each excels on different subsets of the validation set.
# From state.py and candidate_selector.py
curr_prog_id = candidate_selector.select_candidate_idx(state)
curr_prog = state.program_candidates[curr_prog_id]
Why Pareto? Instead of maintaining only the single best candidate (which might average across different task types), GEPA keeps all candidates that are best at something. This enables:
  • Preservation of specialized improvements
  • Cross-pollination between candidates via merge
  • Diverse exploration without premature convergence
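The instance-level frontier described above can be sketched in a few lines. This is a minimal illustration with hypothetical names and data layout, not GEPA's actual structures:

```python
def pareto_front_ids(scores_per_candidate):
    """Return indices of candidates that achieve the best score on at
    least one validation example (instance-level Pareto frontier)."""
    n_examples = len(scores_per_candidate[0])
    front = set()
    for ex in range(n_examples):
        best = max(scores[ex] for scores in scores_per_candidate)
        for cand_id, scores in enumerate(scores_per_candidate):
            if scores[ex] == best:
                front.add(cand_id)
    return sorted(front)

# Candidate 0 is best on example 0, candidate 2 on example 1;
# candidate 1 is dominated everywhere and drops out of the frontier.
scores = [
    [0.9, 0.2],  # candidate 0
    [0.5, 0.4],  # candidate 1
    [0.3, 0.8],  # candidate 2
]
```

Note that a single "best average" candidate (here candidate 1) can be worse than every frontier member on its specialty, which is exactly the specialization the frontier preserves.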

2. Execute

Evaluate the selected candidate on a small minibatch of training examples (typically 1-3), capturing full execution traces:
# From reflective_mutation.py
subsample_ids = batch_sampler.next_minibatch_ids(trainset, state)
minibatch = trainset.fetch(subsample_ids)

# Evaluate with trace capture
eval_result = adapter.evaluate(minibatch, curr_prog, capture_traces=True)
The adapter captures:
  • Outputs and scores for each example
  • Trajectories: Execution traces showing intermediate steps
  • Error messages, reasoning logs, profiling data
The minibatch approach is key to efficiency: instead of showing the reflection LM hundreds of examples at once, GEPA focuses it on 1-3 examples for targeted improvements. Over many iterations, every example receives attention.

3. Reflect

The adapter builds a reflective dataset from the captured traces, then an LLM analyzes failures and identifies root causes:
# From adapter.py
reflective_dataset = adapter.make_reflective_dataset(
    candidate=curr_prog,
    eval_batch=eval_result,
    components_to_update=selected_components
)

# Typical reflective dataset structure:
{
    "system_prompt": [
        {
            "Inputs": {"question": "What is 2+2?"},
            "Generated Outputs": "The answer is 5",
            "Feedback": "Incorrect arithmetic. Expected: 4"
        },
        # ... more examples
    ]
}
This dataset is fed to a reflection LM (typically GPT-5 or Claude Opus) that:
  • Diagnoses why failures occurred
  • Identifies patterns across multiple examples
  • Understands domain constraints and requirements

4. Mutate

The reflection LM proposes an improved candidate, informed by:
  • Current component text being optimized
  • Reflective dataset with failure analysis
  • Accumulated lessons from ancestor candidates in the lineage
  • Objective and background (optional domain guidance)
# From instruction_proposal.py and reflective_mutation.py
new_texts = propose_new_texts(
    candidate=curr_prog,
    reflective_dataset=reflective_dataset,
    components_to_update=components_to_update
)

new_candidate = {**curr_prog, **new_texts}
The proposer uses a carefully designed prompt template that includes:
  • The current parameter value
  • Evaluation feedback structured as side information
  • Instructions to analyze failures and propose improvements
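A rough sketch of how such a prompt might be assembled from these pieces. The function and its argument names are illustrative, not the actual instruction_proposal.py API; the example-entry keys mirror the reflective dataset structure shown earlier:

```python
def build_proposal_prompt(current_text, reflective_examples, objective=None):
    """Assemble a proposal prompt: optional objective, the current
    parameter value, structured feedback per example, and a closing
    instruction to analyze failures and propose a revision."""
    sections = []
    if objective:
        sections.append(f"Objective: {objective}")
    sections.append("Current instruction:\n" + current_text)
    feedback_lines = ["Observed examples and feedback:"]
    for i, ex in enumerate(reflective_examples, 1):
        feedback_lines.append(
            f"[{i}] Inputs: {ex['Inputs']}\n"
            f"    Output: {ex['Generated Outputs']}\n"
            f"    Feedback: {ex['Feedback']}"
        )
    sections.append("\n".join(feedback_lines))
    sections.append(
        "Analyze why the failures occurred, then write an improved instruction."
    )
    return "\n\n".join(sections)
```

The key design point is that feedback arrives as structured side information next to each input/output pair, so the LM can connect a failure to its cause rather than guessing from scores alone.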

5. Accept

Test the new candidate on the same minibatch:
# From engine.py
eval_new = adapter.evaluate(minibatch, new_candidate, capture_traces=False)

old_sum = sum(eval_curr.scores)
new_sum = sum(eval_new.scores)

if new_sum > old_sum:
    # Accept: evaluate on full validation set
    valset_eval = evaluate_on_valset(new_candidate, state)
    state.update_state_with_new_program(
        new_program=new_candidate,
        valset_evaluation=valset_eval,
        parent_program_idx=[curr_prog_id]
    )
Acceptance criteria:
  • New candidate must strictly improve on the minibatch (sum of scores)
  • If accepted, full validation evaluation determines Pareto front membership
  • Candidates that improve on any validation example join the frontier
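The two acceptance rules above can be sketched as small predicates. These are illustrative stand-ins, not engine.py internals:

```python
def accept_on_minibatch(curr_scores, new_scores):
    """Gate 1: the new candidate must strictly improve the summed
    minibatch score; ties are rejected."""
    return sum(new_scores) > sum(curr_scores)

def joins_frontier(new_val_scores, frontier_best_per_example):
    """Gate 2: after full validation, the candidate joins the Pareto
    frontier if it beats the current best on any single example."""
    return any(
        s > best for s, best in zip(new_val_scores, frontier_best_per_example)
    )
```

The two-gate design keeps the cheap minibatch check in the inner loop and spends the expensive full-validation pass only on candidates that already look promising.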

Key Components

GEPAAdapter

The adapter is your integration point with GEPA. It implements three responsibilities:
  1. evaluate(): Execute your system with a candidate and return scores/traces
  2. make_reflective_dataset(): Extract meaningful feedback from execution traces
  3. propose_new_texts() (optional): Custom proposal logic (defaults to LLM-based)
See the Adapters guide for implementation details.
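A toy adapter illustrating the first two responsibilities. Only the method names come from the interface above; the class, its keyword-scoring rule, and the return shapes are invented for illustration:

```python
class KeywordAdapter:
    """Toy adapter: 'executes' a candidate by checking whether each
    example's expected keyword appears in its system prompt, and turns
    failures into reflective feedback entries."""

    def evaluate(self, batch, candidate, capture_traces=False):
        results = []
        for example in batch:
            score = 1.0 if example["keyword"] in candidate["system_prompt"] else 0.0
            results.append({"score": score, "example": example})
        return results

    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
        # One feedback list per component being optimized.
        dataset = {c: [] for c in components_to_update}
        for r in eval_batch:
            if r["score"] < 1.0:
                for c in components_to_update:
                    dataset[c].append({
                        "Inputs": r["example"],
                        "Generated Outputs": candidate[c],
                        "Feedback": f"Missing keyword '{r['example']['keyword']}'",
                    })
        return dataset
```

A real adapter would run your actual system (agent, pipeline, compiler) inside evaluate() and distill its traces into feedback, but the contract is the same: scores out, structured feedback out.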

GEPAEngine

The engine orchestrates the optimization loop:
# From engine.py
class GEPAEngine:
    def run(self) -> GEPAState:
        # Initialize state with seed candidate
        state = initialize_gepa_state(...)
        
        # Main optimization loop
        while not self._should_stop(state):
            # 1. Attempt merge (if scheduled)
            if merge_proposer and conditions_met:
                proposal = merge_proposer.propose(state)
                # ... evaluate and possibly accept
            
            # 2. Reflective mutation
            proposal = reflective_proposer.propose(state)
            if proposal and improved:
                # Accept and update Pareto front
                self._run_full_eval_and_add(proposal, state)
        
        return state

GEPAState

Maintains all optimization state:
  • program_candidates: All explored candidates
  • prog_candidate_val_subscores: Per-example validation scores
  • prog_candidate_objective_scores: Per-objective aggregate scores
  • Pareto frontiers: Instance-level, objective-level, hybrid, or cartesian
  • evaluation_cache: Optional cache for (candidate, example) pairs
  • Budget tracking: Total evaluations consumed
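A stripped-down stand-in for this bookkeeping, showing how candidates and their per-example scores stay aligned by index. This is illustrative only, not the real GEPAState:

```python
from dataclasses import dataclass, field

@dataclass
class StateSketch:
    """Minimal mirror of GEPAState's core bookkeeping: candidate i's
    validation scores live at index i of the parallel score list."""
    program_candidates: list = field(default_factory=list)
    prog_candidate_val_subscores: list = field(default_factory=list)
    total_num_evals: int = 0

    def add_candidate(self, program, val_subscores):
        self.program_candidates.append(program)
        self.prog_candidate_val_subscores.append(val_subscores)
        self.total_num_evals += len(val_subscores)  # budget tracking
        return len(self.program_candidates) - 1  # new candidate's id
```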

Three Optimization Modes

GEPA unifies three optimization paradigms under one API.

1. Single-Task Optimization

When: dataset=None, valset=None
Use case: Solve one hard problem. The candidate is the solution.
Example: Circle packing, SVG optimization, mathematical puzzle solving
def evaluate(candidate: str) -> float:
    result = run_code(candidate)
    oa.log(f"Score: {result.score}")
    return result.score

result = optimize_anything(
    seed_candidate="def pack_circles(): ...",
    evaluator=evaluate,
    objective="Maximize sum of radii for n circles in unit square"
)

2. Batch Optimization

When: dataset=<list>, valset=None
Use case: Solve a batch of related problems with cross-task transfer.
Example: CUDA kernel generation for multiple operations
def evaluate(candidate, example):
    kernel = compile_cuda(candidate["code"], example["operation"])
    return measure_throughput(kernel)

result = optimize_anything(
    seed_candidate={"code": "__global__ void kernel() { ... }"},
    evaluator=evaluate,
    dataset=kernel_problems,
    objective="Generate fast CUDA kernels"
)

3. Generalization

When: dataset=<list>, valset=<list>
Use case: Build a skill that transfers to unseen problems.
Example: Prompt optimization for competition math
def evaluate(candidate, example):
    pred = llm(candidate["system_prompt"], example["question"])
    return 1.0 if pred == example["answer"] else 0.0

result = optimize_anything(
    seed_candidate={"system_prompt": "You are a math tutor..."},
    evaluator=evaluate,
    dataset=train_problems,
    valset=val_problems,
    objective="Improve math reasoning accuracy"
)

Efficiency Features

Evaluation Caching

GEPA can cache (candidate, example) evaluation results:
config = GEPAConfig(
    engine=EngineConfig(
        cache_evaluation=True,
        cache_evaluation_storage="disk",  # or "memory"
        run_dir="./gepa_run"
    )
)
When useful:
  • Expensive evaluations (compilation, simulation, API calls)
  • Candidates may be re-evaluated across iterations
  • Deterministic evaluation functions
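The idea behind such a cache can be sketched as keying results on a stable hash of the candidate's content plus the example id. All names here are illustrative, not the library's cache implementation:

```python
import hashlib
import json

def make_cache_key(candidate, example_id):
    """Stable key: hash of the candidate's serialized content plus the
    example id, so identical (candidate, example) pairs hit the cache."""
    blob = json.dumps(candidate, sort_keys=True).encode()
    return (hashlib.sha256(blob).hexdigest(), example_id)

class EvalCache:
    def __init__(self):
        self._store = {}

    def get_or_compute(self, candidate, example_id, compute):
        key = make_cache_key(candidate, example_id)
        if key not in self._store:
            self._store[key] = compute()  # run the expensive evaluation once
        return self._store[key]
```

Hashing the content (rather than object identity) matters because the same candidate text can resurface in different objects across iterations; it also means the cache is only sound for deterministic evaluators.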

Parallel Evaluation

Parallelize validation set evaluation:
config = GEPAConfig(
    engine=EngineConfig(
        parallel=True,
        max_workers=16
    )
)

State Persistence

GEPA automatically saves/resumes state when run_dir is set:
config = GEPAConfig(
    engine=EngineConfig(
        run_dir="./gepa_run"
    )
)

# First run
result = optimize_anything(..., config=config)

# Resume from checkpoint (if interrupted)
result = optimize_anything(..., config=config)  # Continues from last state

Next Steps

Reflective Evolution

Learn how LLM-based reflection drives candidate improvement

Pareto Optimization

Understand multi-objective Pareto-efficient search

Actionable Side Information

Master the key concept that makes text optimization work

Quickstart

Try GEPA on a real example
