The Problem with Single-Best Selection

Traditional optimizers maintain a single “best” candidate:
# Traditional approach
results = [
    ("Candidate A", [0.9, 0.4, 0.6]),  # Good at task 1
    ("Candidate B", [0.6, 0.9, 0.5]),  # Good at task 2  
    ("Candidate C", [0.7, 0.7, 0.7]),  # Average everywhere
]

# Keep only highest average
best = max(results, key=lambda x: sum(x[1])/len(x[1]))
# Result: Candidate C (avg 0.70)
# Lost: Candidate A (specialist) and B (different specialist)
Problem: By averaging across all tasks, we lose candidates A and B that excel on specific subsets. If we later need to improve task 1 performance, we’ve discarded our best starting point (A).

The Pareto Solution

Pareto optimality: A candidate is Pareto-optimal if no other candidate is at least as good on every objective and strictly better on at least one. GEPA keeps all Pareto-optimal candidates, collectively called the Pareto frontier:
# GEPA approach
pareto_front = [
    ("Candidate A", [0.9, 0.4, 0.6]),  # Best at task 1
    ("Candidate B", [0.6, 0.9, 0.5]),  # Best at task 2
    ("Candidate D", [0.5, 0.6, 0.95]), # Best at task 3
]
# Candidate C is not on the front: it never achieves the best score
# on any task (A, B, and D each beat it on one)
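For contrast with GEPA's per-task frontier, classic Pareto filtering can be sketched in a few lines. This is an illustrative sketch, not GEPA's implementation, and it highlights a subtlety: under classic dominance the "average everywhere" candidate also survives.

```python
# Minimal sketch (not GEPA's code): classic Pareto filtering over
# per-task score vectors, assuming higher is always better.
def dominates(a: list[float], b: list[float]) -> bool:
    """True if `a` is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(results):
    return [
        (name, scores)
        for name, scores in results
        if not any(dominates(other, scores) for _, other in results if other is not scores)
    ]

results = [
    ("Candidate A", [0.9, 0.4, 0.6]),
    ("Candidate B", [0.6, 0.9, 0.5]),
    ("Candidate C", [0.7, 0.7, 0.7]),
    ("Candidate D", [0.5, 0.6, 0.95]),
]
# Under classic dominance all four survive: no single candidate beats C
# everywhere. GEPA's instance-level frontier is stricter and drops C,
# because C is never the best on any single task.
```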
Benefits:
  1. Specialization preservation: Candidates that excel on niche tasks survive
  2. Diverse exploration: Select from multiple strong starting points
  3. Cross-pollination: Merge candidates via system-aware combination
  4. Resistance to premature convergence: Far less likely to get stuck in a single local optimum

Frontier Types in GEPA

GEPA supports four frontier tracking strategies via FrontierType:

1. Instance-Level (frontier_type="instance")

Default mode. Track which candidates are best on each validation example:
# From state.py
pareto_front_valset: dict[DataId, float] = {
    "example_1": 0.95,  # Best score seen for example 1
    "example_2": 0.87,  # Best score seen for example 2
    "example_3": 0.92,  # Best score seen for example 3
}

program_at_pareto_front_valset: dict[DataId, set[int]] = {
    "example_1": {3, 7},    # Candidates 3 and 7 both scored 0.95
    "example_2": {5},       # Only candidate 5 achieved 0.87
    "example_3": {3, 8},    # Candidates 3 and 8 tied at 0.92
}
Use when:
  • Each validation example represents a distinct task/scenario
  • You want candidates specialized to different input types
  • Example: Multi-hop QA with different reasoning patterns
Diagram:
Validation Set:
┌─────────────┬─────────────┬─────────────┐
│  Example 1  │  Example 2  │  Example 3  │
├─────────────┼─────────────┼─────────────┤
│ Cand 3: 0.9 │ Cand 3: 0.5 │ Cand 3: 0.7 │  ← Candidate 3
│ Cand 5: 0.6 │ Cand 5: 0.8 │ Cand 5: 0.6 │  ← Candidate 5
│ Cand 7: 0.7 │ Cand 7: 0.7 │ Cand 7: 0.9 │  ← Candidate 7
└─────────────┴─────────────┴─────────────┘
     ↓               ↓               ↓
  Cand 3 wins    Cand 5 wins    Cand 7 wins

Pareto Front: {3, 5, 7}  ← All survive
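The diagram's bookkeeping can be reproduced with a small standalone sketch. The field names mirror GEPA's state, but the code and the score values are purely illustrative:

```python
# Per-example scores for three candidates (from the diagram above)
scores = {  # candidate_idx -> {example_id: score}
    3: {"example_1": 0.9, "example_2": 0.5, "example_3": 0.7},
    5: {"example_1": 0.6, "example_2": 0.8, "example_3": 0.6},
    7: {"example_1": 0.7, "example_2": 0.7, "example_3": 0.9},
}

pareto_front_valset: dict[str, float] = {}          # best score per example
program_at_pareto_front_valset: dict[str, set[int]] = {}  # champions per example
for idx, per_example in scores.items():
    for ex_id, s in per_example.items():
        best = pareto_front_valset.get(ex_id)
        if best is None or s > best:
            pareto_front_valset[ex_id] = s
            program_at_pareto_front_valset[ex_id] = {idx}
        elif s == best:
            program_at_pareto_front_valset[ex_id].add(idx)

# The frontier is the union of per-example champion sets
front = set().union(*program_at_pareto_front_valset.values())
# front == {3, 5, 7}: each candidate wins one example, so all survive
```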

2. Objective-Level (frontier_type="objective")

Track which candidates are best on each objective metric:
# From state.py  
objective_pareto_front: dict[str, float] = {
    "accuracy": 0.89,
    "latency_inv": 15.2,  # Higher is better (1/latency)
    "cost_inv": 8.3,
}

program_at_pareto_front_objectives: dict[str, set[int]] = {
    "accuracy": {4, 6},      # Candidates 4 and 6 tied at 0.89
    "latency_inv": {6, 8},   # Candidates 6 and 8 tied at 15.2
    "cost_inv": {4},         # Candidate 4 alone at 8.3
}
Use when:
  • You have explicit multi-objective metrics (accuracy, speed, cost, etc.)
  • Objectives trade off against each other (faster but less accurate)
  • Example: Compiler optimization (speed vs code size vs energy)
How to provide objective scores: Your evaluator must return per-objective scores under the "scores" key of its side info:
def evaluate(candidate, example):
    result = run_system(candidate, example)
    score = result.accuracy  # Primary score for acceptance
    
    return score, {
        "scores": {  # Multi-objective tracking
            "accuracy": result.accuracy,
            "latency_inv": 1.0 / result.latency_ms,  # Invert: higher = better
            "throughput": result.qps,
        },
        # ... other feedback
    }
All objective scores must follow the “higher is better” convention. Invert metrics such as latency, error rate, or cost.
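One way to enforce that convention is a small normalization helper applied before reporting scores. The helper name and the epsilon guard are illustrative, not part of GEPA's API:

```python
# Hedged sketch: turn raw metrics into "higher is better" objective scores.
def to_objective_scores(accuracy: float, latency_ms: float, cost_usd: float) -> dict[str, float]:
    eps = 1e-9  # guard against division by zero
    return {
        "accuracy": accuracy,                 # already higher-is-better
        "latency_inv": 1.0 / (latency_ms + eps),  # invert: faster -> higher
        "cost_inv": 1.0 / (cost_usd + eps),       # invert: cheaper -> higher
    }
```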

3. Hybrid (frontier_type="hybrid")

Combines instance-level AND objective-level frontiers. A candidate joins the Pareto front if it’s best on any example or any objective:
# Candidate survives if:
# 1. Best on at least one validation example, OR
# 2. Best on at least one objective metric

pareto_programs = (
    state.get_instance_pareto_programs() |
    state.get_objective_pareto_programs()
)
Use when:
  • You care about both per-example performance AND aggregate metrics
  • Want maximum diversity in the candidate pool
  • Example: Multi-task agent (some tasks need specialized handling, overall metrics matter too)

4. Cartesian (frontier_type="cartesian")

Track best candidates on every (example, objective) pair:
# From state.py
pareto_front_cartesian: dict[tuple[DataId, str], float] = {
    ("example_1", "accuracy"): 0.95,
    ("example_1", "latency_inv"): 12.3,
    ("example_2", "accuracy"): 0.88,
    ("example_2", "latency_inv"): 18.7,
    # ...
}

program_at_pareto_front_cartesian: dict[tuple[DataId, str], set[int]] = {
    ("example_1", "accuracy"): {5},
    ("example_1", "latency_inv"): {3, 5},
    # ...
}
Use when:
  • Extreme diversity is critical
  • Different examples have different objective priorities
  • Very large validation sets (100s-1000s of examples)
  • Example: Heterogeneous workload optimization (each query has different perf characteristics)
Cartesian frontiers can grow very large (|examples| × |objectives| keys). Use with validation set subsampling policies.
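A quick back-of-envelope check of the keyspace, with purely illustrative numbers:

```python
from itertools import product

# The cartesian frontier tracks one cell per (example, objective) pair,
# so the keyspace grows multiplicatively with both dimensions.
example_ids = [f"example_{i}" for i in range(500)]
objectives = ["accuracy", "latency_inv", "cost_inv"]

cartesian_keys = list(product(example_ids, objectives))
# 1500 tracked cells; with ties, the union of champion sets across cells
# can hold dozens of candidates -- hence the subsampling recommendation.
```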

Pareto Front Updates

When a new candidate is evaluated on the validation set, GEPA updates the frontier:
# From state.py
def update_state_with_new_program(
    self,
    new_program: dict[str, str],
    valset_evaluation: ValsetEvaluation,
    parent_program_idx: list[int],
) -> int:
    new_idx = len(self.program_candidates)
    self.program_candidates.append(dict(new_program))
    
    # Update per-example scores
    new_subscores = dict(valset_evaluation.scores_by_val_id)
    self.prog_candidate_val_subscores.append(new_subscores)
    
    # Update Pareto front for each example
    for val_id, new_score in new_subscores.items():
        current_best = self.pareto_front_valset[val_id]
        
        if new_score > current_best:
            # New champion for this example
            self.pareto_front_valset[val_id] = new_score
            self.program_at_pareto_front_valset[val_id] = {new_idx}
        
        elif new_score == current_best:
            # Tie: add to front
            self.program_at_pareto_front_valset[val_id].add(new_idx)
        
        # else: new_score < current_best, not on front for this example
    
    # Similarly for objective-level, cartesian, etc.
    # ...
    
    return new_idx
Key behavior:
  • Candidates that improve any example join the front
  • Ties are preserved (multiple candidates at the same score)
  • Dominated candidates are removed from the front but not deleted from program_candidates
  • History is preserved for analysis and potential restoration
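The tie and replacement rules can be exercised with a self-contained sketch. The names echo the state fields above, but this is an illustration, not the library's code:

```python
def update_front(front: dict[str, float], members: dict[str, set[int]],
                 new_idx: int, new_scores: dict[str, float]) -> None:
    # Strict improvement replaces the champion set, an exact tie joins it,
    # and a worse score changes nothing.
    for val_id, s in new_scores.items():
        best = front.get(val_id, float("-inf"))
        if s > best:
            front[val_id] = s
            members[val_id] = {new_idx}
        elif s == best:
            members[val_id].add(new_idx)

front = {"example_1": 0.8}
members = {"example_1": {0}}
update_front(front, members, 1, {"example_1": 0.8})  # tie: candidate 1 joins
update_front(front, members, 2, {"example_1": 0.9})  # improvement: fresh champion set
```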

Candidate Selection from the Front

During each iteration, GEPA selects a candidate from the Pareto front to evolve:

Pareto Selection Strategy (Default)

# From candidate_selector.py
class ParetoCandidateSelector:
    def select_candidate_idx(self, state: GEPAState) -> int:
        # Get all programs on the front
        pareto_programs = state.get_pareto_front_programs()
        
        # Uniform random selection
        return self.rng.choice(list(pareto_programs))
Benefits:
  • Every specialized candidate gets a chance to evolve
  • Exploration across different task subsets
  • Prevents premature convergence

Alternative Strategies

Current Best (Greedy):
config = GEPAConfig(
    engine=EngineConfig(
        candidate_selection_strategy="current_best"
    )
)
# Always select the single highest-scoring candidate
# Faster convergence, but may miss specialized improvements
Epsilon-Greedy:
config = GEPAConfig(
    engine=EngineConfig(
        candidate_selection_strategy="epsilon_greedy"
    )
)
# Select the best candidate with probability 0.9, a random frontier member with probability 0.1
# Balances exploitation and exploration

Validation Evaluation Policies

Control how many validation examples are scored each iteration:

Full Evaluation (Default)

config = GEPAConfig(
    engine=EngineConfig(
        val_evaluation_policy="full_eval"
    )
)
# Evaluate every validation example every time
# Most accurate, but expensive for large validation sets

Incremental Evaluation (Future)

GEPA will support subsampling policies:
# Not yet implemented
from gepa.strategies.eval_policy import IncrementalEvaluationPolicy

policy = IncrementalEvaluationPolicy(
    initial_batch_size=10,
    expansion_rate=1.5,
    convergence_threshold=0.01
)

config = GEPAConfig(
    engine=EngineConfig(
        val_evaluation_policy=policy
    )
)
# Evaluate 10 examples initially, expand if promising
# Reduces evaluations by 3-5x with minimal accuracy loss

Merge: Cross-Pollinating the Frontier

GEPA can merge two Pareto-optimal candidates that excel on different validation subsets:
# From merge.py
class MergeProposer:
    def propose(self, state: GEPAState) -> CandidateProposal:
        # Find two candidates on the front with complementary strengths
        parent1, parent2 = select_merge_parents(state)
        
        # LLM-based merge prompt:
        # "Combine the strengths of these two candidates..."
        merged = merge_lm(parent1, parent2, reflective_context)
        
        return CandidateProposal(
            candidate=merged,
            parent_program_ids=[parent1_idx, parent2_idx]
        )
Enable merge:
config = GEPAConfig(
    merge=MergeConfig(
        max_merge_invocations=5,  # Attempt up to 5 merges
        merge_val_overlap_floor=5  # Require 5+ shared validation examples
    )
)
How it works:
  1. After a successful reflective mutation, GEPA schedules a merge attempt
  2. Select two parents that:
    • Both on Pareto front
    • Excel on different validation subsets
    • Share at least merge_val_overlap_floor examples for comparison
  3. LLM merges them by combining successful strategies
  4. Evaluate on a subsample overlapping both parents’ strengths
  5. Accept if merged candidate beats both parents on the subsample
Example: Candidate A solves algebraic problems well, B solves geometric problems well. Merge produces C that handles both.
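The parent-selection step can be sketched as a complementarity search over shared validation examples. The function names and the scoring rule here are illustrative assumptions, not GEPA's merge.py:

```python
from itertools import combinations

def complementarity(a: dict[str, float], b: dict[str, float],
                    overlap_floor: int = 5):
    shared = set(a) & set(b)
    if len(shared) < overlap_floor:
        return None  # not enough shared evaluations to compare fairly
    # Count examples where each parent clearly beats the other
    a_wins = sum(a[ex] > b[ex] for ex in shared)
    b_wins = sum(b[ex] > a[ex] for ex in shared)
    return min(a_wins, b_wins)  # high only when BOTH sides bring strengths

def select_merge_parents(subscores: dict[int, dict[str, float]]):
    best_pair, best = None, -1
    for i, j in combinations(sorted(subscores), 2):
        c = complementarity(subscores[i], subscores[j])
        if c is not None and c > best:
            best_pair, best = (i, j), c
    return best_pair
```

With this rule, a pair where each parent wins on a different subset scores higher than a pair where one parent wins everywhere.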

Pareto Front Analysis

Inspect the Pareto front in your optimization results:
result = optimize_anything(...)

# Get all candidates on the front
front = result.state.get_pareto_front_programs()
print(f"Pareto front size: {len(front)}")

# Analyze each frontier candidate
for prog_idx in front:
    candidate = result.state.program_candidates[prog_idx]
    scores = result.state.prog_candidate_val_subscores[prog_idx]
    
    print(f"\nCandidate {prog_idx}:")
    print(f"  Average score: {sum(scores.values()) / len(scores):.3f}")
    print(f"  Best examples: {[k for k,v in scores.items() if v > 0.9]}")
    print(f"  Worst examples: {[k for k,v in scores.items() if v < 0.5]}")
Visualize the tradeoff space:
import matplotlib.pyplot as plt

# For objective-level frontiers
objective_scores = result.state.prog_candidate_objective_scores
front = result.state.get_pareto_front_programs()

plt.figure(figsize=(10, 6))
for prog_idx in front:
    scores = objective_scores[prog_idx]
    plt.scatter(
        scores["accuracy"],
        scores["latency_inv"],
        s=100,
        label=f"Candidate {prog_idx}"
    )

plt.xlabel("Accuracy")
plt.ylabel("Speed (1/latency)")
plt.title("Pareto Front: Accuracy vs Speed Tradeoff")
plt.legend()
plt.show()

Choosing the Right Frontier Type

| Use Case | Recommended Type | Why |
|----------|------------------|-----|
| Single metric, diverse tasks | instance | Per-example specialization |
| Multi-objective optimization | objective | Explicit metric tradeoffs |
| Agent with tool specialization | instance | Different tools excel on different queries |
| Compiler optimization | objective | Speed/size/energy tradeoffs |
| Maximum diversity | hybrid or cartesian | Preserve all forms of specialization |
| Small validation set (<20) | instance or objective | Simple and effective |
| Large validation set (100s+) | objective + subsampling | Avoid front explosion |

Performance Implications

Frontier size impact:
  • Instance-level: ~1-10 candidates for small valsets, 5-20 for large
  • Objective-level: ~2-5 candidates per objective
  • Hybrid: Union of the two above (at most their sum)
  • Cartesian: Can grow to 50+ candidates
Larger frontiers:
  • ✅ More diversity, better exploration
  • ✅ More opportunities for merge
  • ❌ More candidates to select from (negligible cost)
  • ❌ More state to serialize (marginal)
Recommendation: Start with instance (default). Switch to objective if you have explicit multi-objective metrics. Use hybrid for maximum diversity at minimal cost.

Next Steps

Actionable Side Information

Learn how rich feedback drives Pareto-efficient search

How GEPA Works

Understand the full optimization loop

Advanced Configuration

Configure frontier types and merge strategies

Evaluation Policies

Optimize validation evaluation for large datasets
