GEPA’s most ambitious application: optimizing entire agent systems. Rather than just tuning prompts, GEPA evolves the complete agent architecture including code, sub-agents, control flow, helper functions, and prompts.

Key Result: ARC-AGI

  • Test accuracy: 32.5% → 89.5% on the ARC-AGI v1 public test set (+57 pp)
  • Validation accuracy: 56.5% → 93.5% on the validation set
  • Model used: Gemini 3 Flash; improvements come from architecture, not model size
  • Cost efficiency: ~2x cost per task vs the naive agent, with nearly 3x the accuracy

What Gets Optimized?

Unlike prompt optimization, which tunes only instructions, agent architecture discovery optimizes:
  • System architecture: Multi-stage pipelines, sub-agent orchestration
  • Code implementations: Helper functions, validation logic, fallback strategies
  • Control flow: Retry mechanisms, iterative refinement, branching logic
  • Prompts: Instructions for each sub-agent
  • Error handling: Recovery strategies and graceful degradation
The entire agent is treated as a single text artifact that GEPA evolves through reflection.
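A minimal sketch of the "single text artifact" idea (the candidate schema here is an assumption for illustration): the whole program lives in one string, so the same text-rewriting step can mutate code, prompts, and control flow alike.

```python
# Hypothetical candidate: the entire agent is one mutable string.
candidate = {
    "agent_code": '''
def solve_arc_task(task):
    """Everything here -- prompts, helpers, control flow -- is fair game for mutation."""
    return task['test_input']
'''
}

# A harness can materialize and run the candidate with exec():
namespace = {}
exec(candidate["agent_code"], namespace)
result = namespace["solve_arc_task"]({'test_input': [[1, 2]]})
assert result == [[1, 2]]
```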

ARC-AGI Case Study

The Challenge

ARC-AGI tests abstract reasoning through visual grid transformation puzzles. It requires:
  • Pattern recognition
  • Rule induction from few examples
  • Generating executable code to transform grids
  • Validation and iterative refinement
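To make the task concrete, here is a hedged sketch of how an ARC-style task is commonly represented (the exact schema is an assumption, not the official format): grids are 2D lists of color indices 0-9, and the agent must induce the rule from the train pairs.

```python
# A toy ARC-style task whose hidden rule is "swap colors 0 <-> 1".
task = {
    'train': [
        {'input':  [[0, 1], [1, 0]],
         'output': [[1, 0], [0, 1]]},
        {'input':  [[1, 1], [0, 0]],
         'output': [[0, 0], [1, 1]]},
    ],
    'test_input': [[0, 0], [1, 1]],
}

# Applying the induced rule by hand yields the expected test output:
swapped = [[1 - cell for cell in row] for row in task['test_input']]
assert swapped == [[1, 1], [0, 0]]
```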

Initial Agent (Seed)

A naive 10-line agent:
def solve_arc_task(task):
    """Naive agent: direct prompt to LLM."""
    prompt = f"""
    Given these input-output examples:
    {format_examples(task['train'])}
    
    Apply the pattern to: {task['test_input']}
    """
    
    response = llm.generate(prompt)
    return parse_output(response)
Baseline accuracy: 32.5% on test set
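The helpers format_examples and parse_output are assumed in the snippet above; one plausible minimal implementation (an illustration, not GEPA's actual code) might be:

```python
import ast

def format_examples(train):
    """Render train input/output pairs as plain text for the prompt."""
    return "\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in train
    )

def parse_output(response: str):
    """Parse the first Python-style 2D list found in the LLM response."""
    start = response.find("[[")
    end = response.find("]]", start)
    if start == -1 or end == -1:
        raise ValueError("no grid found in response")
    return ast.literal_eval(response[start:end + 2])

grid = parse_output("The answer is [[1, 0], [0, 1]] as shown.")
assert grid == [[1, 0], [0, 1]]
```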

Evolved Agent (300+ lines)

After GEPA optimization:
import traceback

def solve_arc_task(task):
    """Sophisticated multi-stage agent with rule induction and verification."""
    
    # Stage 1: Rule Induction
    rules = induce_transformation_rules(task['train'])
    
    # Stage 2: Code Generation (Primary Attempt)
    code_solution = generate_transformation_code(
        examples=task['train'],
        rules=rules,
        test_input=task['test_input']
    )
    
    # Stage 3: Iterative Validation
    for iteration in range(3):
        validation_result = validate_on_training_examples(
            code=code_solution,
            examples=task['train']
        )
        
        if validation_result.all_correct:
            break
            
        # Refine code based on failures
        code_solution = refine_code(
            code=code_solution,
            failures=validation_result.failures,
            rules=rules
        )
    
    # Stage 4: Execute on Test Input
    try:
        prediction = execute_code(code_solution, task['test_input'])
    except Exception as e:
        # Fallback: Direct LLM prediction
        prediction = llm_direct_prediction(
            task=task,
            error_context=str(e)
        )
    
    return prediction


def induce_transformation_rules(examples):
    """Extract patterns from training examples."""
    prompt = f"""
    Analyze these input-output grid transformations:
    {format_examples(examples)}
    
    Extract the transformation rules:
    - What patterns appear?
    - What geometric operations are applied?
    - Are there color transformations?
    - What is the output size relative to input?
    
    List 3-5 specific, testable rules.
    """
    return llm.generate(prompt)


def generate_transformation_code(examples, rules, test_input):
    """Generate Python code implementing the transformation."""
    prompt = f"""
    Given transformation rules:
    {rules}
    
    And these examples:
    {format_examples(examples)}
    
    Write Python code to transform:
    {test_input}
    
    Requirements:
    - Function signature: transform(grid) -> grid
    - Use numpy for grid operations
    - Handle edge cases gracefully
    - Include comments explaining each step
    """
    return llm.generate(prompt)


def validate_on_training_examples(code, examples):
    """Execute code on training examples and check correctness."""
    failures = []
    
    for i, example in enumerate(examples):
        try:
            prediction = execute_code(code, example['input'])
            
            if not grids_match(prediction, example['output']):
                failures.append({
                    'example_id': i,
                    'expected': example['output'],
                    'got': prediction,
                    'diff': compute_diff(prediction, example['output'])
                })
        except Exception as e:
            failures.append({
                'example_id': i,
                'error': str(e),
                'traceback': traceback.format_exc()
            })
    
    return ValidationResult(
        all_correct=(len(failures) == 0),
        failures=failures
    )


def refine_code(code, failures, rules):
    """Improve code based on validation failures."""
    prompt = f"""
    Your code:
    {code}
    
    Failed on these cases:
    {format_failures(failures)}
    
    Original rules:
    {rules}
    
    Fix the code to handle these failures while maintaining correctness on 
    passing examples. Focus on:
    - Logic errors in transformation
    - Edge cases (boundary conditions, empty cells)
    - Off-by-one errors in grid indexing
    """
    return llm.generate(prompt)


def llm_direct_prediction(task, error_context=None):
    """Fallback: Direct LLM prediction without code execution."""
    prompt = f"""
    Task examples:
    {format_examples(task['train'])}
    
    {'Code generation failed: ' + error_context if error_context else ''}
    
    Predict the output grid for:
    {task['test_input']}
    
    Output the grid directly as a 2D array.
    """
    return llm.generate(prompt)
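The evolved agent also relies on execute_code and grids_match, which are not shown. A hedged sketch of one way to implement them (names and behavior are assumptions; the real pipeline would likely use numpy, per the code-generation prompt above):

```python
def execute_code(code: str, grid):
    """Exec generated code that is expected to define transform(grid) -> grid."""
    namespace = {}
    exec(code, namespace)
    return namespace["transform"](grid)

def grids_match(a, b) -> bool:
    """Row-by-row equality check for two grids."""
    return len(a) == len(b) and all(list(ra) == list(rb) for ra, rb in zip(a, b))

# Usage: run a generated transform that mirrors each row.
code = "def transform(grid):\n    return [row[::-1] for row in grid]"
out = execute_code(code, [[1, 2], [3, 4]])
assert grids_match(out, [[2, 1], [4, 3]])
```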

Architecture Evolution

  1. Iterations 0-20: Rule induction. GEPA discovers that explicitly extracting transformation rules improves code generation accuracy.
  2. Iterations 20-50: Validation loop. Adds iterative validation on training examples with targeted refinement of failing code.
  3. Iterations 50-80: Structured fallbacks. Introduces graceful degradation: when code execution fails, the agent falls back to direct LLM prediction.
  4. Iterations 80-100: Multi-attempt strategy. Evolves two-attempt prediction (code + direct LLM) and uses voting/confidence to select the final answer.

Optimization Trajectory

The graph shows validation accuracy improving from 56.5% to 93.5% over 100 metric calls, with test accuracy reaching 89.5%. Key inflection points:
  • Metric call 20: Validation jumps from 56% → 72% when rule induction is added
  • Metric call 50: Reaches 85% with validation loop
  • Metric call 80: Breaks 90% with structured fallbacks

How It Works

1. Evaluator

The evaluator runs the agent on ARC-AGI puzzles and returns detailed diagnostics:
import traceback

def evaluate_arc_agent(candidate: dict, example: dict) -> tuple[float, dict]:
    """
    Run agent code on an ARC-AGI task.
    
    Args:
        candidate: Dict with 'agent_code' key containing full agent implementation
        example: ARC-AGI task with 'train' examples and 'test' input/output
    
    Returns:
        (score, side_info) where score is 1.0 for correct, 0.0 for incorrect
    """
    # Execute the candidate's agent code; keep the exec inside the try so that
    # syntax or import errors in generated code are also scored as failures
    try:
        exec_globals = {}
        exec(candidate['agent_code'], exec_globals)
        solve_fn = exec_globals['solve_arc_task']
        prediction = solve_fn(example)
        correct = grids_match(prediction, example['test_output'])
        
        return float(correct), {
            "Correct": correct,
            "Prediction": format_grid(prediction),
            "Expected": format_grid(example['test_output']),
            "VisualDiff": render_diff(prediction, example['test_output']),
        }
        
    except Exception as e:
        return 0.0, {
            "Error": str(e),
            "Traceback": traceback.format_exc(),
            "Expected": format_grid(example['test_output']),
        }
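The exec-based candidate pattern above can be demonstrated end to end with a toy evaluator (the helpers here are simplified stand-ins for the document's grids_match, format_grid, and render_diff):

```python
import traceback

def grids_match(a, b):
    """Simplified stand-in: plain equality of nested lists."""
    return a == b

def evaluate_toy(candidate: dict, example: dict) -> tuple[float, dict]:
    """Run a candidate's agent code on one task; score 1.0 if correct."""
    exec_globals = {}
    exec(candidate['agent_code'], exec_globals)
    solve_fn = exec_globals['solve_arc_task']
    try:
        prediction = solve_fn(example)
        correct = grids_match(prediction, example['test_output'])
        return float(correct), {"Correct": correct}
    except Exception as e:
        return 0.0, {"Error": str(e), "Traceback": traceback.format_exc()}

# An identity agent scores 1.0 on an identity task.
identity_agent = {
    "agent_code": "def solve_arc_task(task):\n    return task['test_input']\n"
}
score, info = evaluate_toy(identity_agent, {
    'train': [], 'test_input': [[1, 0]], 'test_output': [[1, 0]],
})
assert score == 1.0
```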

2. Optimization Call

from gepa.optimize_anything import (
    optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig
)

# Load ARC-AGI dataset
train_tasks, val_tasks, test_tasks = load_arc_agi_dataset()

# Seed: naive agent
seed_agent = {
    "agent_code": open("naive_arc_agent.py").read()
}

result = optimize_anything(
    seed_candidate=seed_agent,
    evaluator=evaluate_arc_agent,
    dataset=train_tasks,
    valset=val_tasks,
    objective="""
    Optimize the ARC-AGI agent to maximize accuracy on abstract reasoning tasks.
    The agent should induce rules from examples and generate/validate code.
    """,
    background="""
    ARC-AGI tasks involve grid transformations with abstract patterns.
    Strong agents typically:
    - Extract transformation rules explicitly
    - Generate executable code rather than direct predictions
    - Validate code on training examples before applying to test
    - Have fallback strategies when code fails
    """,
    config=GEPAConfig(
        engine=EngineConfig(max_metric_calls=100),
        reflection=ReflectionConfig(
            reflection_lm="vertex_ai/gemini-3-flash-preview",
            reflection_minibatch_size=3,  # Reflect on 3 tasks at a time
        ),
    ),
)

print("Final agent code:")
print(result.best_candidate['agent_code'])

print(f"\nTest accuracy: {evaluate_on_testset(result.best_candidate, test_tasks)}")

3. Reflection Process

During reflection, GEPA shows the LLM:
  • Current agent code
  • Execution results on a minibatch (3 tasks)
  • Failures: incorrect predictions, errors, timeouts
  • Successes: what worked and why
The LLM proposes targeted improvements:
Analyzing agent performance on tasks [42, 73, 91]:

Task 42 - FAILED
- Agent generated code with IndexError on boundary cells
- The rule induction correctly identified "expand colored cells by 1 step"
- But code used grid[i+1, j] without checking bounds
- Fix: Add boundary checks or use np.pad

Task 73 - FAILED  
- Agent's direct LLM prediction was close but wrong
- The code validation caught an error and fell back to LLM
- Issue: Fallback didn't use the extracted rules
- Fix: Pass rules to fallback function

Task 91 - PASSED
- Code generation + validation loop succeeded
- Refined code twice before getting correct output
- This multi-stage approach is effective

Proposed mutation:
1. Add np.pad wrapper in code template for boundary safety
2. Pass rules to llm_direct_prediction fallback
3. Increase max_refinement_iterations from 2 to 3
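The np.pad wrapper proposed in mutation 1 could look something like this sketch (padding scheme and helper name are assumptions): pad the grid with a zero border before transforming, then crop back to the original frame so boundary reads like grid[i+1, j] never go out of bounds.

```python
import numpy as np

def with_padding(transform_fn, pad=1):
    """Wrap a grid transform so boundary-neighbor reads are always in bounds."""
    def safe_transform(grid):
        g = np.asarray(grid)
        padded = np.pad(g, pad, mode="constant", constant_values=0)
        result = transform_fn(padded)
        return result[pad:-pad, pad:-pad]  # crop back to the original frame
    return safe_transform

# Round-trip check with an identity transform:
identity = with_padding(lambda g: g)
out = identity([[5]])
assert out.tolist() == [[5]]
```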

Other Agent Architecture Examples

Multi-Agent RAG System

From the healthcare RAG case study:
seed_agent = {
    "architecture": """
    # Simple RAG
    def answer_query(query, documents):
        relevant_docs = retrieve(query, documents, top_k=5)
        context = '\n'.join(relevant_docs)
        return llm.generate(f"Context: {context}\nQuestion: {query}")
    """
}

# After optimization → Multi-agent system:
optimized_agent = {
    "architecture": """
    # Specialized disease-expert sub-agents
    def answer_query(query, documents):
        # Route to specialist
        disease = classify_query(query)  # diabetes vs COPD
        
        if disease == 'diabetes':
            expert = diabetes_expert
        elif disease == 'copd':
            expert = copd_expert
        else:
            expert = general_expert
        
        # Expert does specialized retrieval + reasoning
        relevant_docs = expert.retrieve(query, documents)
        answer = expert.reason(query, relevant_docs)
        
        # Lead agent synthesizes and validates
        return lead_agent.synthesize(query, answer, disease_context=disease)
    """
}
Result: Improved retrieval precision and answer quality through specialized sub-agents. Read the full case study →

Terminal-Use Agent (Terminus)

GEPA optimizes the system prompt for the Terminus terminal-use agent:
from gepa.adapters.terminal_bench_adapter import TerminalBenchAdapter

adapter = TerminalBenchAdapter(
    task_lm="openai/gpt-4.5",
    terminal_bench_path="/path/to/terminalbench",
)

result = gepa.optimize(
    adapter=adapter,
    trainset=train_tasks,
    valset=val_tasks,
    max_metric_calls=100,
)
Result: Improved command success rate through optimized agent instructions. View the adapter →

Production Incident Diagnosis

Arc.computer’s ATLAS system uses GEPA-optimized agents for production incident diagnosis:

  • Root cause analysis: automated RCA for production incidents
  • Dynamic data collection: collects logs, metrics, and traces on demand
  • RL augmentation: +142% student performance when an RL-tuned teacher is improved with GEPA
  • Reduced on-call burden: less manual work for on-call engineers
ATLAS demonstrates that GEPA works alongside RL, not just as an alternative:
  1. Start with RL-tuned teacher model
  2. Apply GEPA to optimize teacher’s prompts/architecture
  3. Train student model from improved teacher
  4. Result: +142% improvement over RL-tuned baseline
Read the ATLAS blog →

Advantages of Architecture Discovery

  • Automates design iteration: no manual architecture search; GEPA explores the design space for you
  • Discovers non-obvious patterns: finds strategies humans might miss (e.g., multi-stage validation)
  • Task-specific optimization: the architecture adapts to the domain (ARC-AGI vs. terminal use vs. RAG)
  • Interpretable: the full agent code is readable, so you can understand why it works

Best Practices

  • Start simple: even a naive 10-line baseline is enough; GEPA evolves complexity as needed.
  • Return rich side info: include error messages, intermediate results, timing info, and anything else that helps diagnose failures.
  • Use a validation set: essential for generalization mode; it prevents overfitting to training tasks.
  • Tune the reflection minibatch size: the default is 2-3 tasks per reflection; increase it for diverse task distributions, decrease it for similar tasks.
  • Provide domain context: use the background parameter to guide evolution toward good architectures.
  • Monitor validation scores: check them during optimization to catch overfitting early.

Seedless Architecture Discovery

Don’t have a starting agent? Use seedless mode:
result = optimize_anything(
    seed_candidate=None,  # No seed!
    evaluator=evaluate_agent,
    dataset=train_tasks,
    valset=val_tasks,
    objective="""
    Create an agent to solve ARC-AGI abstract reasoning tasks.
    The agent should analyze input-output examples and predict test outputs.
    """,
    background="""
    ARC-AGI involves 2D grid transformations with abstract patterns.
    Successful approaches typically:
    - Extract transformation rules from examples
    - Use executable code rather than direct prediction
    - Validate generated code on training examples
    - Include fallback strategies
    
    You have access to: numpy, standard library, LLM API via llm.generate()
    """,
)
GEPA’s reflection LM will bootstrap the first agent from scratch based on your objective and background.

Comparison: Prompt Optimization vs Architecture Discovery

| Aspect | Prompt Optimization | Architecture Discovery |
| --- | --- | --- |
| What's optimized | Instructions/prompts | Complete agent code |
| Typical size | 100-500 tokens | 100-500 lines of code |
| Structural changes | No | Yes: control flow, functions, sub-agents |
| Complexity growth | Prompt elaboration | Architectural evolution |
| Example | AIME math prompt | ARC-AGI agent system |
| Speedup over RL | 35x | 35x |
| Typical accuracy gains | 10-20 pp | 30-60 pp |

Next Steps

  • Try the ARC-AGI tutorial: step-by-step agent architecture optimization
  • RAG optimization: optimize retrieval pipelines
  • Code optimization: generate and optimize code
  • API reference: complete API documentation
