
Agent Architecture Evolution Tutorial

Learn how to use GEPA’s optimize_anything API to evolve entire agent architectures—not just prompts, but the complete system including code, control flow, sub-agents, and helper functions. This tutorial demonstrates how to nearly triple an agent’s accuracy through architectural evolution.

Overview

Unlike traditional prompt optimization, agent architecture evolution treats the entire agent system as a text artifact to be optimized. This includes:
  • Agent control flow and decision logic
  • Sub-agent architectures and coordination
  • Helper functions and utilities
  • Prompts and instructions
  • Error handling and validation
Real-world result: Gemini Flash on ARC-AGI improved from 32.5% to 89.5% test accuracy by evolving from a 10-line naive agent to a 300+ line sophisticated system.
Step 1: Install GEPA

pip install gepa
The optimize_anything API is available in GEPA’s main package.
Step 2: Understand the Three Optimization Modes

optimize_anything supports three distinct modes:

1. Single-Task Search: Solve one hard problem
oa.optimize_anything(seed_candidate=..., evaluator=...)
2. Multi-Task Search: Solve a batch of related problems with cross-transfer
oa.optimize_anything(seed_candidate=..., evaluator=..., dataset=tasks)
3. Generalization: Build a system that transfers to unseen problems
oa.optimize_anything(seed_candidate=..., evaluator=..., dataset=train, valset=val)
Agent architecture evolution uses Generalization mode—the agent must work on unseen test cases.
Step 3: Define Your Agent's Seed

Start with a minimal agent implementation:
seed_agent = """
import json
from typing import List, Dict, Any

def solve_puzzle(train_examples: List[Dict], test_input: Any) -> Any:
    '''
    Solve an ARC-AGI puzzle given training examples.

    Args:
        train_examples: List of {"input": grid, "output": grid} pairs
        test_input: Input grid to solve

    Returns:
        Predicted output grid
    '''
    # Naive baseline: return first training output
    if train_examples:
        return train_examples[0]["output"]
    return test_input
"""
This 10-line baseline achieves ~30% accuracy. GEPA will evolve it into a sophisticated system.
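Before handing a seed to GEPA, it helps to smoke-test it the same way the evaluator will run it: exec the string and call the entry point. A minimal sketch, using a trimmed copy of the seed (named `mini_seed` here to keep it distinct from the full variable above):

```python
# Smoke-test a seed candidate by exec-ing it and calling its entry point,
# mirroring how the evaluator will execute evolved candidates later.
mini_seed = """
def solve_puzzle(train_examples, test_input):
    # Naive baseline: return first training output
    if train_examples:
        return train_examples[0]["output"]
    return test_input
"""

namespace = {}
exec(mini_seed, namespace)
solve_fn = namespace["solve_puzzle"]

# Toy puzzle: the naive baseline just echoes the first training output.
prediction = solve_fn(
    train_examples=[{"input": [[0, 1]], "output": [[1, 0]]}],
    test_input=[[0, 0]],
)
print(prediction)  # [[1, 0]]
```

If this raises, the evaluator would score the seed 0.0 on every puzzle, so it is cheaper to catch here.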
Step 4: Create Your Evaluator

The evaluator runs the agent and returns a score plus diagnostic feedback:
import gepa.optimize_anything as oa
from typing import Dict

def evaluate_agent(candidate: str, example: Dict) -> tuple[float, dict]:
    """
    Evaluate agent on a single puzzle.
    
    Returns:
        (score, diagnostics): Score in [0, 1] and diagnostic information
    """
    try:
        # Execute the agent code
        namespace = {}
        exec(candidate, namespace)
        solve_fn = namespace["solve_puzzle"]
        
        # Run on test input
        prediction = solve_fn(
            train_examples=example["train"],
            test_input=example["test_input"]
        )
        
        # Score accuracy
        ground_truth = example["test_output"]
        correct = (prediction == ground_truth)
        score = 1.0 if correct else 0.0
        
        # Capture diagnostics as ASI (Actionable Side Information)
        diagnostics = {
            "Correct": correct,
            "Prediction": str(prediction)[:200],
            "GroundTruth": str(ground_truth)[:200],
            "TrainExamples": len(example["train"]),
        }
        
        # Log for reflection
        if not correct:
            oa.log(f"Failed on puzzle {example['id']}: predicted {prediction}, expected {ground_truth}")
        
        return score, diagnostics
        
    except Exception as e:
        # Execution errors are valuable feedback
        oa.log(f"Execution error: {type(e).__name__}: {e}")
        return 0.0, {"Error": str(e)}
Key insight: The evaluator returns both a score AND diagnostic feedback. This Actionable Side Information (ASI) helps the LLM understand failures and propose fixes.
Step 5: Prepare Training and Validation Data

Create datasets of puzzles:
# Load or create ARC-AGI-style puzzles
train_puzzles = [
    {
        "id": "train_001",
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]},
        ],
        "test_input": [[0, 0], [1, 1]],
        "test_output": [[1, 1], [0, 0]],
    },
    # Add 20-50 training puzzles for good results
]

val_puzzles = [
    {
        "id": "val_001",
        "train": [...],
        "test_input": ...,
        "test_output": ...,
    },
    # Add 10-20 validation puzzles
]

print(f"Training puzzles: {len(train_puzzles)}")
print(f"Validation puzzles: {len(val_puzzles)}")
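Malformed puzzle dicts surface later as confusing evaluator errors, so it can be worth validating them up front. A small sketch (the required keys follow the schema shown above; the helper name is ours):

```python
REQUIRED_KEYS = {"id", "train", "test_input", "test_output"}

def validate_puzzle(puzzle: dict) -> list[str]:
    """Return a list of problems with this puzzle dict (empty if it looks OK)."""
    problems = []
    missing = REQUIRED_KEYS - puzzle.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    if not puzzle["train"]:
        problems.append("no training examples")
    for i, pair in enumerate(puzzle["train"]):
        if not {"input", "output"} <= pair.keys():
            problems.append(f"train[{i}] missing input/output")
    return problems

sample = {
    "id": "train_001",
    "train": [{"input": [[0, 1]], "output": [[1, 0]]}],
    "test_input": [[0, 0]],
    "test_output": [[1, 1]],
}
print(validate_puzzle(sample))  # []
```

Run it over both train_puzzles and val_puzzles before starting an optimization run.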
Step 6: Run Agent Architecture Evolution

Use optimize_anything to evolve the agent:
from gepa.optimize_anything import (
    optimize_anything,
    GEPAConfig,
    EngineConfig,
    ReflectionConfig,
)

result = optimize_anything(
    seed_candidate=seed_agent,
    evaluator=evaluate_agent,
    dataset=train_puzzles,
    valset=val_puzzles,
    objective="""Optimize a Python agent that solves ARC-AGI puzzles.
    The agent should:
    - Analyze training examples to identify patterns
    - Apply learned rules to test inputs
    - Handle various transformation types (rotation, mirroring, color mapping, etc.)
    - Verify outputs when possible
    """,
    background="""ARC-AGI puzzles test abstract reasoning through grid transformations.
    Each puzzle has 2-4 training input/output pairs and one test input.
    Grids are 2D lists of integers (0-9 representing colors).
    Successful agents identify the transformation rule from examples.
    """,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=100,  # Budget: 100 evaluations
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-4o",  # Strong model for reflection
            reflection_minibatch_size=3,     # Focus on 3 puzzles per iteration
        ),
    ),
)

print("\nOptimization complete!")
print(f"Best validation score: {result.val_aggregate_scores[result.best_idx]:.2%}")
print(f"Total metric calls: {result.total_metric_calls}")
What happens during optimization:
  1. GEPA evaluates the seed agent on training puzzles
  2. Reflection LLM reads error messages and failed predictions
  3. LLM proposes architectural improvements (new functions, better logic, etc.)
  4. Improved agents are evaluated and selected via Pareto frontier
  5. Process repeats, evolving increasingly sophisticated agents
Step 7: Review the Evolved Architecture

Examine what GEPA discovered:
print("\nEvolved Agent Architecture:")
print("=" * 60)
print(result.best_candidate[:1000])  # First 1000 chars
print("...")
print(f"\nTotal code size: {len(result.best_candidate)} characters")

# Save the optimized agent
with open("optimized_agent.py", "w") as f:
    f.write(result.best_candidate)
    
print("\nSaved to optimized_agent.py")
Example evolved architecture (from the ARC-AGI experiment): the agent grew from 10 lines to 300+ lines, including:
  • Rule induction: Analyzes training examples to extract transformation rules
  • Code generation: Generates Python code to apply rules
  • Iterative verification: Tests generated code on training examples
  • Multiple strategies: Tries direct LLM prediction if code generation fails
  • Structured fallbacks: Graceful degradation when rules are ambiguous
Step 8: Test on Held-Out Data

Evaluate the optimized agent on test data:
test_puzzles = [...]  # Load unseen test puzzles

test_scores = []
for puzzle in test_puzzles:
    score, _ = evaluate_agent(result.best_candidate, puzzle)
    test_scores.append(score)

test_accuracy = sum(test_scores) / len(test_scores)
print(f"\nTest Accuracy: {test_accuracy:.2%}")

# Compare to baseline
baseline_scores = []
for puzzle in test_puzzles:
    score, _ = evaluate_agent(seed_agent, puzzle)
    baseline_scores.append(score)

baseline_accuracy = sum(baseline_scores) / len(baseline_scores)
print(f"Baseline Accuracy: {baseline_accuracy:.2%}")
print(f"Improvement: {test_accuracy - baseline_accuracy:+.2%}")

Real Results: ARC-AGI Evolution

GEPA achieved dramatic improvements on ARC-AGI puzzles:

Validation Accuracy

Improved from 56.5% to 93.5% on validation set during optimization

Test Accuracy

Improved from 32.5% (naive baseline) to 89.5% on held-out test set

Code Evolution

Evolved from 10-line simple agent to 300+ line sophisticated system

Cost Efficiency

Achieved near-triple accuracy at just 2x cost per task using Gemini Flash

Evolved Architecture Components

The optimized ARC-AGI agent includes:
# 1. Pattern Analysis Module
def analyze_transformation_patterns(train_examples):
    """Extract rules from training input/output pairs"""
    # Identifies: rotations, reflections, color mappings, shape operations
    ...

# 2. Code Generation Module  
def generate_transformation_code(patterns):
    """Generate Python code to apply discovered rules"""
    # Creates executable transformation functions
    ...

# 3. Verification Module
def verify_code_on_examples(code, train_examples):
    """Test generated code on training examples"""
    # Iteratively refines code until it passes training data
    ...

# 4. Multi-Strategy Solver
def solve_puzzle(train_examples, test_input):
    """Main solver with multiple fallback strategies"""
    # Try 1: Pattern-based code generation
    # Try 2: Direct LLM prediction
    # Try 3: Template matching
    ...
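The component stubs above are sketches of the evolved agent. As one concrete (and much simplified) illustration of the rule-induction idea, a pattern analyzer can test a fixed library of grid transforms against every training pair; the transform library here is hypothetical, and a real evolved agent covers far more operations:

```python
# Candidate transforms over grids (2D lists of ints).
TRANSFORMS = {
    "identity": lambda g: [row[:] for row in g],
    "rot90_cw": lambda g: [list(row) for row in zip(*g[::-1])],
    "rot180": lambda g: [row[::-1] for row in g[::-1]],
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(row) for row in zip(*g)],
}

def analyze_transformation_patterns(train_examples):
    """Return the name of a transform consistent with every training pair, else None."""
    for name, fn in TRANSFORMS.items():
        if all(fn(ex["input"]) == ex["output"] for ex in train_examples):
            return name
    return None  # No single-rule explanation found

pairs = [
    {"input": [[0, 1], [2, 3]], "output": [[1, 0], [3, 2]]},
    {"input": [[4, 5]], "output": [[5, 4]]},
]
print(analyze_transformation_patterns(pairs))  # flip_horizontal
```

Returning a rule *name* (rather than the transformed grid) is what lets a downstream code-generation step emit and verify an executable transformation.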

Key Concepts

Actionable Side Information (ASI)

ASI is the text-optimization analogue of gradients. It tells the LLM why a candidate failed:
# Bad: Only return a score
return 0.35

# Good: Return score + diagnostics
return 0.35, {
    "Error": "IndexError: list index out of range on line 47",
    "FailedTest": "puzzle_023",
    "Prediction": "[[1, 0], [0, 1]]",
    "Expected": "[[0, 1], [1, 0]]",
}

# Better: Include visual feedback for vision models
from gepa import Image
return 0.35, {
    "Error": "Predicted grid has wrong rotation",
    "PredictionImage": Image(base64_data=render(prediction)),
    "ExpectedImage": Image(base64_data=render(expected)),
}
Pareto Frontier Selection

GEPA maintains a frontier of candidates, preserving any that excel on specific examples:
  • Agent A: 95% on rotation puzzles, 60% on color mapping
  • Agent B: 70% on rotation puzzles, 90% on color mapping
Both survive in the frontier. Later iterations combine their strengths.
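The selection rule can be sketched as standard non-dominated filtering over per-example score vectors (an illustration of the idea, not GEPA's internal implementation):

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """Keep every candidate not dominated by another candidate."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for o, other in candidates.items() if o != name)
    }

candidates = {
    "agent_a": [0.95, 0.60],  # strong on rotation puzzles
    "agent_b": [0.70, 0.90],  # strong on color mapping
    "agent_c": [0.60, 0.50],  # dominated by both
}
print(sorted(pareto_frontier(candidates)))  # ['agent_a', 'agent_b']
```

agent_c is strictly worse than both others on every example, so it is discarded; agent_a and agent_b each win somewhere, so both survive as parents for later iterations.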

Seedless Mode

Don’t know where to start? Use seed_candidate=None:
result = optimize_anything(
    seed_candidate=None,  # GEPA bootstraps the first agent
    evaluator=evaluate_agent,
    dataset=train_puzzles,
    valset=val_puzzles,
    objective="Create a Python agent that solves ARC-AGI puzzles...",
    background="""Technical context:
    - Use numpy for grid operations
    - Available libraries: numpy, scipy, skimage
    - Grids are 2D lists of integers 0-9
    - Must define: def solve_puzzle(train_examples, test_input)
    """,
    config=...,
)
The reflection LM writes the initial agent based on your objective and background.

Advanced Examples

Multi-Task Agent Evolution

Optimize across multiple related tasks:
tasks = [
    {"task_type": "rotation", "puzzles": rotation_puzzles},
    {"task_type": "reflection", "puzzles": reflection_puzzles},
    {"task_type": "color_map", "puzzles": color_puzzles},
]

result = optimize_anything(
    seed_candidate=seed_agent,
    evaluator=multi_task_evaluator,
    dataset=tasks,  # Multi-task mode: cross-transfer learning
    objective="Create an agent that handles multiple transformation types",
    config=...,
)
Insights from solving rotation puzzles transfer to reflection puzzles automatically.
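multi_task_evaluator is not defined above; one plausible shape is a wrapper that scores the candidate on every puzzle in a task and averages, keeping per-puzzle diagnostics as ASI. A hypothetical sketch (the stub below stands in for evaluate_agent from Step 4):

```python
def make_multi_task_evaluator(puzzle_evaluator):
    """Wrap a per-puzzle evaluator into a per-task evaluator returning (score, ASI)."""
    def multi_task_evaluator(candidate, task):
        scores, per_puzzle = [], []
        for puzzle in task["puzzles"]:
            score, diag = puzzle_evaluator(candidate, puzzle)
            scores.append(score)
            per_puzzle.append(diag)
        mean = sum(scores) / len(scores) if scores else 0.0
        return mean, {
            "TaskType": task["task_type"],
            "MeanScore": mean,
            "PerPuzzle": per_puzzle,  # keep per-puzzle ASI for reflection
        }
    return multi_task_evaluator

# Stub per-puzzle evaluator, for illustration only.
def stub_eval(candidate, puzzle):
    return (1.0 if "rotate" in candidate else 0.0), {"id": puzzle["id"]}

evaluator = make_multi_task_evaluator(stub_eval)
score, asi = evaluator("agent that can rotate",
                       {"task_type": "rotation",
                        "puzzles": [{"id": "p1"}, {"id": "p2"}]})
print(score, asi["TaskType"])  # 1.0 rotation
```

Keeping the per-puzzle diagnostics in the task-level ASI is what lets the reflection LLM see which task types a candidate fails on.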

Coding Agent Skills

Optimize repository-specific instructions for coding agents:
seed_skills = """
# Codebase Guidelines

## Testing
- Run tests with `go test ./...`
- Write table-driven tests

## Code Style  
- Follow standard Go formatting
- Use meaningful variable names
"""

def evaluate_skills(candidate: str, example: dict) -> tuple[float, dict]:
    """Run coding agent with skills on repository task"""
    # Use candidate skills as instructions to coding agent
    result = run_coding_agent(
        task=example["task"],
        codebase=example["repo"],
        skills=candidate,  # Inject optimized skills
    )
    
    # Test if changes resolve the issue
    tests_pass = run_tests(result.modified_code)
    score = 1.0 if tests_pass else 0.0
    
    return score, {
        "TestsPassed": tests_pass,
        "FilesModified": result.files_modified,
        "Error": result.error if result.error else "None",
    }

result = optimize_anything(
    seed_candidate=seed_skills,
    evaluator=evaluate_skills,
    dataset=train_tasks,  # Repository tasks
    valset=val_tasks,
    objective="Optimize codebase-specific skills for a coding agent",
    background=f"""Repository: {repo_name}
Language: Go
Test command: go test ./...
""",
    config=...,
)
Result: Skills boost Claude Code from 79% to 100% pass rate while reducing time by 47%.

Cloud Scheduling Policies

Discover algorithms that generalize across infrastructure scenarios:
seed_policy = """
def schedule_task(task, spot_price, on_demand_price, deadline):
    # Simple heuristic: use SPOT if deadline allows
    if task.duration * 1.5 < deadline:
        return "SPOT"
    return "ON_DEMAND"
"""

def evaluate_policy(candidate: str, example: dict) -> tuple[float, dict]:
    """Simulate cloud scheduling with this policy"""
    # Run simulation
    result = simulate_workload(
        policy_code=candidate,
        workload=example["workload"],
        spot_trace=example["spot_availability"],
    )
    
    # Score: minimize cost while meeting deadlines
    deadline_misses = sum(1 for task in result.tasks if task.missed_deadline)
    if deadline_misses > 0:
        score = 0.0  # Must meet deadlines
    else:
        baseline_cost = example["on_demand_cost"]
        score = (baseline_cost - result.total_cost) / baseline_cost
    
    return score, {
        "Cost": result.total_cost,
        "Savings": f"{score * 100:.1f}%",
        "DeadlineMisses": deadline_misses,
        "SpotPreemptions": result.spot_preemptions,
    }

result = optimize_anything(
    seed_candidate=seed_policy,
    evaluator=evaluate_policy,
    dataset=train_scenarios,
    valset=val_scenarios,
    objective="Discover a cloud scheduling policy that minimizes cost",
    background="""Available instance types:
    - SPOT: Cheap but can be preempted
    - ON_DEMAND: Reliable but expensive
    
    Must meet task deadlines while minimizing cost.
    """,
    config=...,
)
Result: Discovered policy achieves 7.8% cost savings, beating expert heuristics.

Best Practices

Start from a seed

Even a naive 10-line agent is better than starting from scratch. It gives GEPA:
  • A valid code structure to modify
  • Baseline performance to beat
  • Syntax examples for the domain
If you truly have nothing, use seed_candidate=None for seedless mode.
Provide rich ASI

The quality of ASI directly impacts optimization effectiveness:
  • Good: Error messages, failed test cases, execution traces
  • Better: Structured diagnostics showing what went wrong
  • Best: Visual feedback (rendered outputs) for vision models
Use oa.log() liberally in your evaluator.
Use small reflection minibatches

Set reflection_minibatch_size to a small value (2-5) to focus each iteration:
  • LLM sees 2-5 examples per reflection
  • Makes targeted improvements for those cases
  • Pareto frontier preserves specialized gains
  • Over iterations, all examples get attention
This is more effective than showing all examples every time.
Budget enough iterations

Agent architecture evolution needs more iterations than prompt optimization:
  • Quick test: 20-50 iterations
  • Good results: 100-200 iterations
  • Publication quality: 300-500 iterations
Each iteration runs your evaluator on a minibatch (2-5 examples).
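A rough way to size the budget (illustrative arithmetic only; GEPA's exact accounting of metric calls, including when validation evaluations run, depends on its configuration):

```python
def rough_metric_calls(iterations, minibatch_size, val_size, val_every=1):
    """Back-of-envelope estimate: training minibatch evals plus periodic validation."""
    train_calls = iterations * minibatch_size
    val_calls = (iterations // val_every) * val_size
    return train_calls + val_calls

# e.g. 100 iterations, minibatch of 3, 15 validation puzzles checked every 5 iterations
print(rough_metric_calls(100, 3, 15, val_every=5))  # 600
```

Compare the estimate against max_metric_calls before launching a long run.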

Troubleshooting

Common issues:
  • Import errors: Include necessary imports in seed
  • Syntax errors: GEPA will fix these if you log them as ASI
  • Timeout: Add execution timeout in evaluator
import signal  # Unix-only: SIGALRM is not available on Windows

def evaluate_with_timeout(candidate, example, timeout=30):
    def handler(signum, frame):
        raise TimeoutError("Execution timeout")

    # signal.alarm only works in the main thread of the main interpreter
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout)

    try:
        score, diag = evaluate_agent(candidate, example)
    except TimeoutError:
        oa.log(f"Agent timed out after {timeout}s")
        return 0.0, {"Error": "Timeout"}
    finally:
        signal.alarm(0)

    return score, diag
If optimization isn't improving, possible causes:
  • Poor ASI: Make diagnostics more informative
  • Weak reflection model: Try GPT-4o or o1
  • Insufficient examples: Add more diverse training data
  • Wrong objective: Clarify what you want in objective parameter
Enable verbose logging to see what’s happening:
from gepa.optimize_anything import GEPAConfig, TrackingConfig

config = GEPAConfig(
    tracking=TrackingConfig(verbose=True)
)

Next Steps

optimize_anything API

Complete API reference with all parameters

Blog Post

Detailed blog post with 8 case studies

Coding Agent Skills

Learn how to optimize skills for coding agents

GEPA Paper

Research paper with methodology and results
