Evaluation metrics are the foundation of GEPA optimization. Your metric defines “better” and guides the search. This guide shows you how to design effective evaluators and provide actionable feedback.

Evaluator Basics

An evaluator is a function that scores a candidate. It can return:
  1. Just a score: float
  2. Score with side info: tuple[float, dict]
Higher scores are always better.
# Simple evaluator
def evaluate(candidate: str, example: dict) -> float:
    result = test_code(candidate, example)
    return 1.0 if result.passed else 0.0

# Evaluator with side info
def evaluate(candidate: str, example: dict) -> tuple[float, dict]:
    result = test_code(candidate, example)
    side_info = {
        "Input": example["input"],
        "Output": result.output,
        "Expected": example["expected"],
        "Error": result.error if result.error else None,
    }
    return result.score, side_info

Score Design Principles

1. Higher is Better

GEPA always maximizes scores. Transform metrics accordingly:
# ✓ Correct: Higher accuracy is better
score = accuracy  # 0.0 to 1.0

# ✓ Correct: Transform latency (lower is better → higher is better)
score = 1.0 / (latency + 1e-6)  # Inverse
score = max_latency - latency   # Subtraction from max
score = -latency                # Negation

# ✗ Wrong: Lower error is better (GEPA will maximize it!)
score = error_rate  # Don't do this

2. Meaningful Scale

Use scores in a consistent range:
# Binary: 0.0 or 1.0
score = 1.0 if correct else 0.0

# Continuous: 0.0 to 1.0
score = bleu_score(prediction, reference)

# Unbounded but normalized
score = min(performance_metric / target_performance, 1.0)

3. Granular Feedback

Provide scores that differentiate candidates:
# ✗ Too coarse: Only 0 or 1
def evaluate(candidate, example):
    result = run_test(candidate)
    return 1.0 if result.perfect else 0.0

# ✓ More granular: Partial credit
def evaluate(candidate, example):
    result = run_test(candidate)
    score = 0.0
    
    # Syntax correctness: 0.2
    if result.parses:
        score += 0.2
    
    # Some tests pass: 0.3
    if result.partial_tests > 0:
        score += 0.3 * (result.partial_tests / result.total_tests)
    
    # All tests pass: remaining 0.5
    if result.all_tests_pass:
        score += 0.5
    
    return score

Side Information (ASI)

Actionable Side Information (ASI) is the text-optimization analogue of the gradient: the more informative the ASI, the better the optimization.

Basic Structure

side_info = {
    # What went in
    "Input": example["input"],
    
    # What came out
    "Output": result.output,
    
    # What was expected
    "Expected": example["expected"],
    
    # What went wrong
    "Error": result.error,
    
    # How to fix it
    "Feedback": "The solution should handle edge case X",
}

Multi-Objective Metrics

Track multiple objectives in the "scores" field:
def evaluate(candidate, example) -> tuple[float, dict]:
    result = run_test(candidate, example)
    
    # Primary score
    score = result.correctness
    
    # Side info with multi-objective scores
    side_info = {
        "scores": {
            "correctness": result.correctness,
            "efficiency": 1.0 / (result.time_ms + 1),  # Inverse time
            "code_quality": result.quality_score,
        },
        "Output": result.output,
        "Feedback": generate_feedback(result),
    }
    
    return score, side_info
GEPA tracks each objective separately on the Pareto frontier, preserving specialized improvements.

Parameter-Specific Info

Provide feedback specific to each parameter:
side_info = {
    # Program-level feedback
    "Input": example["question"],
    "Output": result.answer,
    "Expected": example["expected"],
    
    # System prompt specific
    "system_prompt_specific_info": {
        "scores": {"tone_appropriateness": 0.8},
        "Feedback": "System prompt sets good tone but needs more task-specific guidance.",
    },
    
    # User template specific
    "user_template_specific_info": {
        "scores": {"clarity": 0.6},
        "Feedback": "Template should include more context from the input.",
    },
}
During reflection on parameter X, GEPA merges top-level fields with X_specific_info.
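
For intuition, the merge can be pictured as a simple dictionary combination. This is a conceptual sketch only (not GEPA's internal code), assuming side_info is the dictionary above and the parameter being reflected on is system_prompt:

# Conceptual sketch only -- not GEPA's actual merge implementation.
# What reflection on `system_prompt` effectively sees:
reflection_view = {
    "Input": side_info["Input"],
    "Output": side_info["Output"],
    "Expected": side_info["Expected"],
    **side_info["system_prompt_specific_info"],  # parameter-specific scores and feedback
}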

Common Evaluation Patterns

Exact Match

def exact_match(candidate, example) -> tuple[float, dict]:
    result = run(candidate, example["input"])
    correct = result == example["expected"]
    
    side_info = {
        "Input": example["input"],
        "Output": result,
        "Expected": example["expected"],
        "Feedback": (
            "Correct!" if correct 
            else f"Expected '{example['expected']}' but got '{result}'"
        ),
    }
    
    return 1.0 if correct else 0.0, side_info

Contains Match

def contains_match(candidate, example) -> tuple[float, dict]:
    result = run(candidate, example["input"])
    answer = example["answer"]
    contains = answer in result
    
    side_info = {
        "Input": example["input"],
        "Output": result,
        "Expected": f"Should contain: {answer}",
        "Feedback": (
            f"Correct! Found answer '{answer}'"
            if contains
            else f"Answer '{answer}' not found in output"
        ),
    }
    
    return 1.0 if contains else 0.0, side_info

Token Overlap (F1)

def token_f1(candidate, example) -> tuple[float, dict]:
    prediction = run(candidate, example["input"])
    reference = example["expected"]
    
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    
    if len(pred_tokens) == 0 or len(ref_tokens) == 0:
        return 0.0, {"Feedback": "Empty prediction or reference"}
    
    intersection = pred_tokens & ref_tokens
    precision = len(intersection) / len(pred_tokens)
    recall = len(intersection) / len(ref_tokens)
    
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    
    side_info = {
        "Input": example["input"],
        "Output": prediction,
        "Expected": reference,
        "scores": {
            "f1": f1,
            "precision": precision,
            "recall": recall,
        },
        "Feedback": f"F1: {f1:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}",
    }
    
    return f1, side_info

Unit Tests

def unit_test_evaluator(candidate, example) -> tuple[float, dict]:
    code = extract_code(run(candidate, example["problem"]))
    
    passed = 0
    failed = 0
    errors = []
    
    for test_case in example["test_cases"]:
        try:
            result = execute_code(code, test_case["input"])
            if result == test_case["expected"]:
                passed += 1
            else:
                failed += 1
                errors.append(
                    f"Input: {test_case['input']}, "
                    f"Expected: {test_case['expected']}, "
                    f"Got: {result}"
                )
        except Exception as e:
            failed += 1
            errors.append(f"Error: {e}")
    
    total = passed + failed
    score = passed / total if total > 0 else 0.0
    
    side_info = {
        "Input": example["problem"],
        "Output": code,
        "scores": {
            "pass_rate": score,
            "tests_passed": passed,
            "tests_failed": failed,
        },
        "Feedback": (
            f"Passed {passed}/{total} tests."
            + (f"\nFailures:\n" + "\n".join(errors) if errors else "")
        ),
    }
    
    return score, side_info

LLM-as-Judge

def llm_judge(candidate, example) -> tuple[float, dict]:
    prediction = run(candidate, example["input"])
    
    judge_prompt = f"""
    Evaluate this response on a scale of 0-10:
    
    Question: {example['input']}
    Response: {prediction}
    
    Criteria:
    - Accuracy
    - Completeness
    - Clarity
    
    Provide a score (0-10) and brief explanation.
    """
    
    judge_response = llm_call(judge_prompt)
    score_raw = extract_score(judge_response)  # Extract number
    score = score_raw / 10.0  # Normalize to 0-1
    
    side_info = {
        "Input": example["input"],
        "Output": prediction,
        "Feedback": judge_response,
        "scores": {"llm_judge_score": score},
    }
    
    return score, side_info
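
The extract_score helper is not defined above. A minimal sketch, assuming the judge's reply contains the score as its first number (e.g. "Score: 7 - mostly accurate"):

import re

def extract_score(judge_response: str) -> float:
    """Hypothetical helper: pull the first number out of the judge's reply."""
    match = re.search(r"\d+(?:\.\d+)?", judge_response)
    if match is None:
        return 0.0  # No score found; treat as worst case
    return min(float(match.group()), 10.0)  # Clamp to the 0-10 scale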

RAG Evaluation

For RAG systems, evaluate both retrieval and generation:
from gepa.adapters.generic_rag_adapter import RAGEvaluationMetrics

metrics = RAGEvaluationMetrics()

def rag_evaluator(candidate, example) -> tuple[float, dict]:
    # Run RAG pipeline
    result = run_rag(candidate, example["query"])
    
    # Evaluate retrieval
    retrieval_metrics = metrics.evaluate_retrieval(
        retrieved_docs=result.retrieved_docs,
        relevant_doc_ids=example["relevant_doc_ids"],
    )
    
    # Evaluate generation
    generation_metrics = metrics.evaluate_generation(
        generated_answer=result.answer,
        ground_truth=example["ground_truth_answer"],
        context=result.context,
    )
    
    # Combined score
    score = metrics.combined_rag_score(
        retrieval_metrics,
        generation_metrics,
        retrieval_weight=0.3,
        generation_weight=0.7,
    )
    
    side_info = {
        "Input": example["query"],
        "Output": result.answer,
        "Expected": example["ground_truth_answer"],
        "scores": {
            **retrieval_metrics,
            **generation_metrics,
        },
        "Feedback": generate_rag_feedback(
            retrieval_metrics,
            generation_metrics,
        ),
    }
    
    return score, side_info

def generate_rag_feedback(retrieval, generation):
    feedback = []
    
    if retrieval["retrieval_recall"] < 0.5:
        feedback.append("Low retrieval recall: relevant docs not retrieved")
    
    if retrieval["retrieval_precision"] < 0.5:
        feedback.append("Low retrieval precision: many irrelevant docs retrieved")
    
    if generation["faithfulness"] < 0.7:
        feedback.append("Answer not well-supported by retrieved context")
    
    if generation["answer_relevance"] < 0.7:
        feedback.append("Answer doesn't use retrieved context effectively")
    
    return " | ".join(feedback) if feedback else "Good performance"

Evaluation with State

Access historical evaluations for warm-starting:
from gepa.optimize_anything import OptimizationState

def evaluator_with_state(
    candidate: str,
    example: dict,
    opt_state: OptimizationState,  # Auto-injected by GEPA
) -> tuple[float, dict]:
    # Get previous best result for this example
    if opt_state.best_example_evals:
        prev_best = opt_state.best_example_evals[0]
        prev_score = prev_best["score"]
        prev_side_info = prev_best["side_info"]
        
        # Use previous result to warm-start
        starting_point = prev_side_info.get("solution_state")
    else:
        starting_point = None
    
    # Run evaluation with warm start
    result = run_with_warmstart(candidate, example, starting_point)
    
    side_info = {
        "Input": example["input"],
        "Output": result.output,
        "solution_state": result.state,  # Save for next iteration
    }
    
    return result.score, side_info

Multi-Stage Evaluation

Break complex evaluations into stages:
import gepa.optimize_anything as oa

def multi_stage_evaluator(candidate, example) -> tuple[float, dict]:
    # Stage 1: Parse
    oa.log("Stage 1: Parsing input...")
    try:
        parsed = parse_input(candidate, example["input"])
        parse_score = 0.2
    except Exception as e:
        oa.log(f"Parse failed: {e}")
        return 0.0, {"Error": f"Parse error: {e}"}
    
    # Stage 2: Plan
    oa.log("Stage 2: Planning solution...")
    plan = create_plan(parsed)
    if is_valid_plan(plan):
        plan_score = 0.3
        oa.log(f"Valid plan: {plan}")
    else:
        plan_score = 0.0
        oa.log("Invalid plan")
    
    # Stage 3: Execute
    oa.log("Stage 3: Executing plan...")
    result = execute_plan(plan)
    exec_score = 0.5 if result.success else 0.0
    oa.log(f"Execution: {'success' if result.success else 'failed'}")
    
    total_score = parse_score + plan_score + exec_score
    
    side_info = {
        "Input": example["input"],
        "Output": result.output,
        "scores": {
            "parse": parse_score / 0.2,  # Normalize
            "plan": plan_score / 0.3,
            "execute": exec_score / 0.5,
        },
    }
    
    return total_score, side_info

Handling Errors

Always return a score, never raise:
def robust_evaluator(candidate, example) -> tuple[float, dict]:
    try:
        result = run(candidate, example)
        score = compute_score(result)
        side_info = {"Output": result, "Feedback": "Success"}
    except TimeoutError:
        score = 0.0
        side_info = {
            "Error": "Execution timeout",
            "Feedback": "Code took too long. Optimize for efficiency.",
        }
    except SyntaxError as e:
        score = 0.0
        side_info = {
            "Error": f"Syntax error: {e}",
            "Feedback": "Fix the syntax error before proceeding.",
        }
    except Exception as e:
        score = 0.0
        side_info = {
            "Error": str(e),
            "Feedback": f"Runtime error: {e}",
        }
    
    return score, side_info

Composite Metrics

Combine multiple metrics:
def composite_evaluator(candidate, example) -> tuple[float, dict]:
    result = run(candidate, example["input"])
    
    # Individual metrics
    correctness = 1.0 if result.correct else 0.0
    efficiency = 1.0 / (result.time_ms + 1)
    readability = compute_readability(result.code)
    
    # Weighted combination
    score = (
        0.5 * correctness +
        0.3 * efficiency +
        0.2 * readability
    )
    
    side_info = {
        "Input": example["input"],
        "Output": result.code,
        "scores": {
            "correctness": correctness,
            "efficiency": efficiency,
            "readability": readability,
        },
        "Feedback": generate_composite_feedback(
            correctness, efficiency, readability
        ),
    }
    
    return score, side_info
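
The generate_composite_feedback helper is left undefined above; one possible sketch, mirroring generate_rag_feedback and assuming each metric is in the 0-1 range (thresholds are illustrative):

def generate_composite_feedback(correctness, efficiency, readability):
    """Hypothetical helper: turn per-metric scores into actionable text."""
    feedback = []
    if correctness < 1.0:
        feedback.append("Output is incorrect: fix the logic before tuning anything else")
    if efficiency < 0.5:
        feedback.append("Solution is slow: remove unnecessary work or use a better algorithm")
    if readability < 0.7:
        feedback.append("Code is hard to read: simplify structure and improve naming")
    return " | ".join(feedback) if feedback else "Good performance on all metrics"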

Best Practices

1. Always Return Higher-is-Better

Transform metrics like latency, error rate, etc.

2. Provide Granular Scores

Avoid binary 0/1. Give partial credit.

3. Include Rich Side Info

Explain failures with actionable feedback.

4. Handle All Errors

Never raise. Return 0.0 score with error info.

5. Log Intermediate Steps

Use oa.log() for detailed diagnostics.

6. Test Independently

Verify your metric works before optimization.
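
Applying practice 6: before launching an optimization run, call the evaluator by hand on a few examples and inspect the output. A minimal sanity check, assuming a small examples list, the exact_match evaluator defined earlier, and a hypothetical baseline_candidate:

# Quick manual check of the metric before any GEPA run.
for example in examples[:5]:
    score, side_info = exact_match(baseline_candidate, example)
    assert 0.0 <= score <= 1.0, "Score outside the expected range"
    print(f"score={score:.2f}  feedback={side_info['Feedback']}")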

Next Steps

optimize_anything

Use your evaluator with optimize_anything

Custom Adapters

Build adapters with custom evaluation logic

Configuration

Configure evaluation caching and parallelization

DSPy Integration

DSPy-specific evaluation patterns
