Traditional optimizers maintain a single “best” candidate:
```python
# Traditional approach
results = [
    ("Candidate A", [0.9, 0.4, 0.6]),  # Good at task 1
    ("Candidate B", [0.6, 0.9, 0.5]),  # Good at task 2
    ("Candidate C", [0.7, 0.7, 0.7]),  # Average everywhere
]

# Keep only the highest average
best = max(results, key=lambda x: sum(x[1]) / len(x[1]))
# Result: Candidate C (avg 0.70)
# Lost: Candidate A and B (each a different specialist)
```
Problem: By averaging across all tasks, we lose candidates A and B that excel on specific subsets. If we later need to improve task 1 performance, we’ve discarded our best starting point (A).
Pareto optimality: A candidate is Pareto-optimal if no other candidate is at least as good on every objective and strictly better on at least one. GEPA keeps all Pareto-optimal candidates — the Pareto frontier:
```python
# GEPA approach
pareto_front = [
    ("Candidate A", [0.9, 0.4, 0.6]),   # Best at task 1
    ("Candidate B", [0.6, 0.9, 0.5]),   # Best at task 2
    ("Candidate D", [0.5, 0.6, 0.95]),  # Best at task 3
]
# Candidate C is dropped: it is not the best on any single task
```
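The distinction between strict dominance and GEPA's instance-level front is worth making concrete: no single candidate dominates C outright, yet C is never the best on any one task. The sketch below illustrates this with plain Python; the names `dominates` and `instance_front` are illustrative, not part of GEPA's API.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if a is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

candidates = {
    "A": [0.9, 0.4, 0.6],
    "B": [0.6, 0.9, 0.5],
    "C": [0.7, 0.7, 0.7],
    "D": [0.5, 0.6, 0.95],
}

# No candidate dominates C in the strict sense...
assert not any(dominates(s, candidates["C"])
               for name, s in candidates.items() if name != "C")

# ...but C is never the best on any single task, so it misses the front
best_per_task = [max(s[i] for s in candidates.values()) for i in range(3)]
instance_front = {name for name, s in candidates.items()
                  if any(s[i] == best_per_task[i] for i in range(3))}
# instance_front == {"A", "B", "D"}
```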
Benefits:
Specialization preservation: Candidates that excel on niche tasks survive
Diverse exploration: Select from multiple strong starting points
Cross-pollination: Merge candidates via system-aware combination
No premature convergence: far less likely to get stuck in a single local optimum
Default mode. Track which candidates are best on each validation example:
```python
# From state.py
pareto_front_valset: dict[DataId, float] = {
    "example_1": 0.95,  # Best score seen for example 1
    "example_2": 0.87,  # Best score seen for example 2
    "example_3": 0.92,  # Best score seen for example 3
}
program_at_pareto_front_valset: dict[DataId, set[int]] = {
    "example_1": {3, 7},  # Candidates 3 and 7 both scored 0.95
    "example_2": {5},     # Only candidate 5 achieved 0.87
    "example_3": {3, 8},  # Candidates 3 and 8 tied at 0.92
}
```
Use when:
Each validation example represents a distinct task/scenario
You want candidates specialized to different input types
Example: Multi-hop QA with different reasoning patterns
Combines instance-level AND objective-level frontiers. A candidate joins the Pareto front if it’s best on any example or any objective:
```python
# A candidate survives if it is:
#   1. Best on at least one validation example, OR
#   2. Best on at least one objective metric
pareto_programs = (
    state.get_instance_pareto_programs()
    | state.get_objective_pareto_programs()
)
```
Use when:
You care about both per-example performance AND aggregate metrics
When a new candidate is evaluated on the validation set, GEPA updates the frontier:
```python
# From state.py
def update_state_with_new_program(
    self,
    new_program: dict[str, str],
    valset_evaluation: ValsetEvaluation,
    parent_program_idx: list[int],
) -> int:
    new_idx = len(self.program_candidates)
    self.program_candidates.append(dict(new_program))

    # Update per-example scores
    new_subscores = dict(valset_evaluation.scores_by_val_id)
    self.prog_candidate_val_subscores.append(new_subscores)

    # Update the Pareto front for each example
    for val_id, new_score in new_subscores.items():
        current_best = self.pareto_front_valset[val_id]
        if new_score > current_best:
            # New champion for this example
            self.pareto_front_valset[val_id] = new_score
            self.program_at_pareto_front_valset[val_id] = {new_idx}
        elif new_score == current_best:
            # Tie: add to the front
            self.program_at_pareto_front_valset[val_id].add(new_idx)
        # else: new_score < current_best, not on the front for this example

    # Similarly for objective-level, cartesian, etc.
    # ...
    return new_idx
```
Key behavior:
Candidates that improve any example join the front
Ties are preserved (multiple candidates at the same score)
Dominated candidates are removed from the front but not deleted from program_candidates
History is preserved for analysis and potential restoration
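The tie and replacement behavior can be exercised in isolation. The toy version below (a standalone sketch, not the actual `GEPAState` class) shows a tie joining the front on one example while a new champion replaces the front on another:

```python
# Toy front state: best score and champion set per validation example
pareto_front_valset = {"ex1": 0.95, "ex2": 0.87}
program_at_front = {"ex1": {3, 7}, "ex2": {5}}

def update(new_idx: int, scores: dict[str, float]) -> None:
    for val_id, new_score in scores.items():
        best = pareto_front_valset[val_id]
        if new_score > best:        # New champion: replace the set
            pareto_front_valset[val_id] = new_score
            program_at_front[val_id] = {new_idx}
        elif new_score == best:     # Tie: join the existing champions
            program_at_front[val_id].add(new_idx)

update(9, {"ex1": 0.95, "ex2": 0.91})
# ex1: tie at 0.95 -> champions {3, 7, 9}
# ex2: new best 0.91 -> champions {9}
```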
```python
# From candidate_selector.py
class ParetoCandidateSelector:
    def select_candidate_idx(self, state: GEPAState) -> int:
        # Get all programs on the front
        pareto_programs = state.get_pareto_front_programs()
        # Uniform random selection
        return self.rng.choice(list(pareto_programs))
```
Benefits:
Every specialized candidate gets a chance to evolve
Current Best:

```python
config = GEPAConfig(
    engine=EngineConfig(
        candidate_selection_strategy="current_best",
    )
)
# Always select the single highest-scoring candidate.
# Faster convergence, but may miss specialized improvements.
```
Epsilon-Greedy:
```python
config = GEPAConfig(
    engine=EngineConfig(
        candidate_selection_strategy="epsilon_greedy",
    )
)
# Select the best with probability 0.9, random from the front with 0.1.
# Balances exploitation and exploration.
```
```python
config = GEPAConfig(
    engine=EngineConfig(
        val_evaluation_policy="full_eval",
    )
)
# Evaluate every validation example every time.
# Most accurate, but expensive for large validation sets.
```
GEPA can merge two Pareto-optimal candidates that excel on different validation subsets:
```python
# From merge.py
class MergeProposer:
    def propose(self, state: GEPAState) -> CandidateProposal:
        # Find two candidates on the front with complementary strengths
        parent1, parent2 = select_merge_parents(state)

        # LLM-based merge prompt:
        # "Combine the strengths of these two candidates..."
        merged = merge_lm(parent1, parent2, reflective_context)

        return CandidateProposal(
            candidate=merged,
            parent_program_ids=[parent1_idx, parent2_idx],
        )
```
Enable merge:
```python
config = GEPAConfig(
    merge=MergeConfig(
        max_merge_invocations=5,    # Attempt up to 5 merges
        merge_val_overlap_floor=5,  # Require 5+ shared validation examples
    )
)
```
How it works:
After a successful reflective mutation, GEPA schedules a merge attempt
Select two parents that:
Both on Pareto front
Excel on different validation subsets
Share at least merge_val_overlap_floor examples for comparison
LLM merges them by combining successful strategies
Evaluate on a subsample overlapping both parents’ strengths
Accept if merged candidate beats both parents on the subsample
Example: Candidate A solves algebraic problems well, B solves geometric problems well. Merge produces C that handles both.
Inspect the Pareto front in your optimization results:
```python
result = optimize_anything(...)

# Get all candidates on the front
front = result.state.get_pareto_front_programs()
print(f"Pareto front size: {len(front)}")

# Analyze each frontier candidate
for prog_idx in front:
    candidate = result.state.program_candidates[prog_idx]
    scores = result.state.prog_candidate_val_subscores[prog_idx]
    print(f"\nCandidate {prog_idx}:")
    print(f"  Average score: {sum(scores.values()) / len(scores):.3f}")
    print(f"  Best examples: {[k for k, v in scores.items() if v > 0.9]}")
    print(f"  Worst examples: {[k for k, v in scores.items() if v < 0.5]}")
```
Visualize the tradeoff space:
```python
import matplotlib.pyplot as plt

# For objective-level frontiers
objective_scores = result.state.prog_candidate_objective_scores
front = result.state.get_pareto_front_programs()

plt.figure(figsize=(10, 6))
for prog_idx in front:
    scores = objective_scores[prog_idx]
    plt.scatter(
        scores["accuracy"],
        scores["latency_inv"],
        s=100,
        label=f"Candidate {prog_idx}",
    )
plt.xlabel("Accuracy")
plt.ylabel("Speed (1/latency)")
plt.title("Pareto Front: Accuracy vs Speed Tradeoff")
plt.legend()
plt.show()
```
Instance-level: ~1-10 candidates for small valsets, 5-20 for large
Objective-level: ~2-5 candidates per objective
Hybrid: Sum of above
Cartesian: Can grow to 50+ candidates
Larger frontiers:
✅ More diversity, better exploration
✅ More opportunities for merge
❌ More candidates to select from (negligible cost)
❌ More state to serialize (marginal)
Recommendation: Start with instance (default). Switch to objective if you have explicit multi-objective metrics. Use hybrid for maximum diversity at minimal cost.