
Overview

GEPA (Genetic-Pareto) is an evolutionary optimization framework that evolves text parameters using LLM-based reflection and Pareto-efficient search. Unlike traditional optimizers that only know that a candidate failed, GEPA uses LLMs to understand why it failed and proposes targeted improvements.

Core Architecture

GEPA’s architecture consists of several key components working together:

The Optimization Loop

GEPA’s optimization follows a five-step iterative process:

1. Select

Select a candidate from the Pareto frontier — the set of candidates where each excels on different subsets of the validation set.
# From state.py and candidate_selector.py
curr_prog_id = candidate_selector.select_candidate_idx(state)
curr_prog = state.program_candidates[curr_prog_id]
Why Pareto? Instead of maintaining only the single best candidate (which might average across different task types), GEPA keeps all candidates that are best at something. This enables:
  • Preservation of specialized improvements
  • Cross-pollination between candidates via merge
  • Diverse exploration without premature convergence
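The instance-level frontier described above can be sketched in a few lines. This is a minimal illustration with hypothetical names and data layout, not GEPA's actual structures:

```python
def pareto_front_ids(scores_per_candidate):
    """Return indices of candidates that achieve the best score on at
    least one validation example (instance-level Pareto frontier)."""
    n_examples = len(scores_per_candidate[0])
    front = set()
    for ex in range(n_examples):
        best = max(scores[ex] for scores in scores_per_candidate)
        for cand_id, scores in enumerate(scores_per_candidate):
            if scores[ex] == best:
                front.add(cand_id)
    return sorted(front)

# Candidate 0 is best on example 0, candidate 2 on example 1;
# candidate 1 is dominated everywhere and drops out of the frontier.
scores = [
    [0.9, 0.2],  # candidate 0
    [0.5, 0.4],  # candidate 1
    [0.3, 0.8],  # candidate 2
]
```

Note that a single "best average" candidate (here candidate 1) can be worse than every frontier member on its specialty, which is exactly the specialization the frontier preserves.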

2. Execute

Evaluate the selected candidate on a small minibatch of training examples (typically 1-3), capturing full execution traces:
# From reflective_mutation.py
subsample_ids = batch_sampler.next_minibatch_ids(trainset, state)
minibatch = trainset.fetch(subsample_ids)

# Evaluate with trace capture
eval_result = adapter.evaluate(minibatch, curr_prog, capture_traces=True)
The adapter captures:
  • Outputs and scores for each example
  • Trajectories: Execution traces showing intermediate steps
  • Error messages, reasoning logs, profiling data
The minibatch approach is key to efficiency: instead of showing the reflection LM hundreds of examples at once, GEPA focuses it on 1-3 examples for targeted improvements. Over many iterations, every example receives attention.

3. Reflect

The adapter builds a reflective dataset from the captured traces, then an LLM analyzes failures and identifies root causes:
# From adapter.py
reflective_dataset = adapter.make_reflective_dataset(
    candidate=curr_prog,
    eval_batch=eval_result,
    components_to_update=selected_components
)

# Typical reflective dataset structure:
{
    "system_prompt": [
        {
            "Inputs": {"question": "What is 2+2?"},
            "Generated Outputs": "The answer is 5",
            "Feedback": "Incorrect arithmetic. Expected: 4"
        },
        # ... more examples
    ]
}
This dataset is fed to a reflection LM (typically GPT-5 or Claude Opus) that:
  • Diagnoses why failures occurred
  • Identifies patterns across multiple examples
  • Understands domain constraints and requirements

4. Mutate

The reflection LM proposes an improved candidate, informed by:
  • Current component text being optimized
  • Reflective dataset with failure analysis
  • Accumulated lessons from ancestor candidates in the lineage
  • Objective and background (optional domain guidance)
# From instruction_proposal.py and reflective_mutation.py
new_texts = propose_new_texts(
    candidate=curr_prog,
    reflective_dataset=reflective_dataset,
    components_to_update=components_to_update
)

new_candidate = {**curr_prog, **new_texts}
The proposer uses a carefully designed prompt template that includes:
  • The current parameter value
  • Evaluation feedback structured as side information
  • Instructions to analyze failures and propose improvements
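A rough sketch of how such a prompt might be assembled from these pieces. The function and its argument names are illustrative, not the actual instruction_proposal.py API; the example-entry keys mirror the reflective dataset structure shown earlier:

```python
def build_proposal_prompt(current_text, reflective_examples, objective=None):
    """Assemble a proposal prompt: optional objective, the current
    parameter value, structured feedback per example, and a closing
    instruction to analyze failures and propose a revision."""
    sections = []
    if objective:
        sections.append(f"Objective: {objective}")
    sections.append("Current instruction:\n" + current_text)
    feedback_lines = ["Observed examples and feedback:"]
    for i, ex in enumerate(reflective_examples, 1):
        feedback_lines.append(
            f"[{i}] Inputs: {ex['Inputs']}\n"
            f"    Output: {ex['Generated Outputs']}\n"
            f"    Feedback: {ex['Feedback']}"
        )
    sections.append("\n".join(feedback_lines))
    sections.append(
        "Analyze why the failures occurred, then write an improved instruction."
    )
    return "\n\n".join(sections)
```

The key design point is that feedback arrives as structured side information next to each input/output pair, so the LM can connect a failure to its cause rather than guessing from scores alone.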

5. Accept

Test the new candidate on the same minibatch:
# From engine.py
eval_new = adapter.evaluate(minibatch, new_candidate, capture_traces=False)

old_sum = sum(eval_curr.scores)
new_sum = sum(eval_new.scores)

if new_sum > old_sum:
    # Accept: evaluate on full validation set
    valset_eval = evaluate_on_valset(new_candidate, state)
    state.update_state_with_new_program(
        new_program=new_candidate,
        valset_evaluation=valset_eval,
        parent_program_idx=[curr_prog_id]
    )
Acceptance criteria:
  • New candidate must strictly improve on the minibatch (sum of scores)
  • If accepted, full validation evaluation determines Pareto front membership
  • Candidates that improve on any validation example join the frontier
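The two acceptance rules above can be sketched as small predicates. These are illustrative stand-ins, not engine.py internals:

```python
def accept_on_minibatch(curr_scores, new_scores):
    """Gate 1: the new candidate must strictly improve the summed
    minibatch score; ties are rejected."""
    return sum(new_scores) > sum(curr_scores)

def joins_frontier(new_val_scores, frontier_best_per_example):
    """Gate 2: after full validation, the candidate joins the Pareto
    frontier if it beats the current best on any single example."""
    return any(
        s > best for s, best in zip(new_val_scores, frontier_best_per_example)
    )
```

The two-gate design keeps the cheap minibatch check in the inner loop and spends the expensive full-validation pass only on candidates that already look promising.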

Key Components

GEPAAdapter

The adapter is your integration point with GEPA. It implements three responsibilities:
  1. evaluate(): Execute your system with a candidate and return scores/traces
  2. make_reflective_dataset(): Extract meaningful feedback from execution traces
  3. propose_new_texts() (optional): Custom proposal logic (defaults to LLM-based)
See the Adapters guide for implementation details.
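A toy adapter illustrating the first two responsibilities. Only the method names come from the interface above; the class, its keyword-scoring rule, and the return shapes are invented for illustration:

```python
class KeywordAdapter:
    """Toy adapter: 'executes' a candidate by checking whether each
    example's expected keyword appears in its system prompt, and turns
    failures into reflective feedback entries."""

    def evaluate(self, batch, candidate, capture_traces=False):
        results = []
        for example in batch:
            score = 1.0 if example["keyword"] in candidate["system_prompt"] else 0.0
            results.append({"score": score, "example": example})
        return results

    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
        # One feedback list per component being optimized.
        dataset = {c: [] for c in components_to_update}
        for r in eval_batch:
            if r["score"] < 1.0:
                for c in components_to_update:
                    dataset[c].append({
                        "Inputs": r["example"],
                        "Generated Outputs": candidate[c],
                        "Feedback": f"Missing keyword '{r['example']['keyword']}'",
                    })
        return dataset
```

A real adapter would run your actual system (agent, pipeline, compiler) inside evaluate() and distill its traces into feedback, but the contract is the same: scores out, structured feedback out.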

GEPAEngine

The engine orchestrates the optimization loop:
# From engine.py
class GEPAEngine:
    def run(self) -> GEPAState:
        # Initialize state with seed candidate
        state = initialize_gepa_state(...)
        
        # Main optimization loop
        while not self._should_stop(state):
            # 1. Attempt merge (if scheduled)
            if merge_proposer and conditions_met:
                proposal = merge_proposer.propose(state)
                # ... evaluate and possibly accept
            
            # 2. Reflective mutation
            proposal = reflective_proposer.propose(state)
            if proposal and improved:
                # Accept and update Pareto front
                self._run_full_eval_and_add(proposal, state)
        
        return state

GEPAState

Maintains all optimization state:
  • program_candidates: All explored candidates
  • prog_candidate_val_subscores: Per-example validation scores
  • prog_candidate_objective_scores: Per-objective aggregate scores
  • Pareto frontiers: Instance-level, objective-level, hybrid, or cartesian
  • evaluation_cache: Optional cache for (candidate, example) pairs
  • Budget tracking: Total evaluations consumed
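A stripped-down stand-in for this bookkeeping, showing how candidates and their per-example scores stay aligned by index. This is illustrative only, not the real GEPAState:

```python
from dataclasses import dataclass, field

@dataclass
class StateSketch:
    """Minimal mirror of GEPAState's core bookkeeping: candidate i's
    validation scores live at index i of the parallel score list."""
    program_candidates: list = field(default_factory=list)
    prog_candidate_val_subscores: list = field(default_factory=list)
    total_num_evals: int = 0

    def add_candidate(self, program, val_subscores):
        self.program_candidates.append(program)
        self.prog_candidate_val_subscores.append(val_subscores)
        self.total_num_evals += len(val_subscores)  # budget tracking
        return len(self.program_candidates) - 1  # new candidate's id
```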

Three Optimization Modes

GEPA unifies three optimization paradigms under one API.

1. Single-Task Optimization

When: dataset=None, valset=None
Use case: Solve one hard problem. The candidate is the solution.
Example: Circle packing, SVG optimization, mathematical puzzle solving
def evaluate(candidate: str) -> float:
    result = run_code(candidate)
    oa.log(f"Score: {result.score}")
    return result.score

result = optimize_anything(
    seed_candidate="def pack_circles(): ...",
    evaluator=evaluate,
    objective="Maximize sum of radii for n circles in unit square"
)

2. Batch Optimization

When: dataset=<list>, valset=None
Use case: Solve a batch of related problems with cross-task transfer.
Example: CUDA kernel generation for multiple operations
def evaluate(candidate, example):
    kernel = compile_cuda(candidate["code"], example["operation"])
    return measure_throughput(kernel)

result = optimize_anything(
    seed_candidate={"code": "__global__ void kernel() { ... }"},
    evaluator=evaluate,
    dataset=kernel_problems,
    objective="Generate fast CUDA kernels"
)

3. Generalization

When: dataset=<list>, valset=<list>
Use case: Build a skill that transfers to unseen problems.
Example: Prompt optimization for competition math
def evaluate(candidate, example):
    pred = llm(candidate["system_prompt"], example["question"])
    return 1.0 if pred == example["answer"] else 0.0

result = optimize_anything(
    seed_candidate={"system_prompt": "You are a math tutor..."},
    evaluator=evaluate,
    dataset=train_problems,
    valset=val_problems,
    objective="Improve math reasoning accuracy"
)

Efficiency Features

Evaluation Caching

GEPA can cache (candidate, example) evaluation results:
config = GEPAConfig(
    engine=EngineConfig(
        cache_evaluation=True,
        cache_evaluation_storage="disk",  # or "memory"
        run_dir="./gepa_run"
    )
)
When useful:
  • Expensive evaluations (compilation, simulation, API calls)
  • Candidates may be re-evaluated across iterations
  • Deterministic evaluation functions
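The idea behind such a cache can be sketched as keying results on a stable hash of the candidate's content plus the example id. All names here are illustrative, not the library's cache implementation:

```python
import hashlib
import json

def make_cache_key(candidate, example_id):
    """Stable key: hash of the candidate's serialized content plus the
    example id, so identical (candidate, example) pairs hit the cache."""
    blob = json.dumps(candidate, sort_keys=True).encode()
    return (hashlib.sha256(blob).hexdigest(), example_id)

class EvalCache:
    def __init__(self):
        self._store = {}

    def get_or_compute(self, candidate, example_id, compute):
        key = make_cache_key(candidate, example_id)
        if key not in self._store:
            self._store[key] = compute()  # run the expensive evaluation once
        return self._store[key]
```

Hashing the content (rather than object identity) matters because the same candidate text can resurface in different objects across iterations; it also means the cache is only sound for deterministic evaluators.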

Parallel Evaluation

Parallelize validation set evaluation:
config = GEPAConfig(
    engine=EngineConfig(
        parallel=True,
        max_workers=16
    )
)

State Persistence

GEPA automatically saves/resumes state when run_dir is set:
config = GEPAConfig(
    engine=EngineConfig(
        run_dir="./gepa_run"
    )
)

# First run
result = optimize_anything(..., config=config)

# Resume from checkpoint (if interrupted)
result = optimize_anything(..., config=config)  # Continues from last state

Next Steps

Reflective Evolution

Learn how LLM-based reflection drives candidate improvement

Pareto Optimization

Understand multi-objective Pareto-efficient search

Actionable Side Information

Master the key concept that makes text optimization work

Quickstart

Try GEPA on a real example
