
Agent Architecture Evolution Tutorial

Learn how to use GEPA’s optimize_anything API to evolve entire agent architectures—not just prompts, but the complete system including code, control flow, sub-agents, and helper functions. This tutorial demonstrates how to nearly triple an agent’s accuracy through architectural evolution.

Overview

Unlike traditional prompt optimization, agent architecture evolution treats the entire agent system as a text artifact to be optimized. This includes:
  • Agent control flow and decision logic
  • Sub-agent architectures and coordination
  • Helper functions and utilities
  • Prompts and instructions
  • Error handling and validation
Real-world result: Gemini Flash on ARC-AGI improved from 32.5% to 89.5% test accuracy by evolving from a 10-line naive agent to a 300+ line sophisticated system.
Step 1: Install GEPA

pip install gepa
The optimize_anything API is available in GEPA’s main package.
Step 2: Understand the Three Optimization Modes

optimize_anything supports three distinct modes:

1. Single-Task Search: Solve one hard problem
oa.optimize_anything(seed_candidate=..., evaluator=...)
2. Multi-Task Search: Solve a batch of related problems with cross-transfer
oa.optimize_anything(seed_candidate=..., evaluator=..., dataset=tasks)
3. Generalization: Build a system that transfers to unseen problems
oa.optimize_anything(seed_candidate=..., evaluator=..., dataset=train, valset=val)
Agent architecture evolution uses Generalization mode—the agent must work on unseen test cases.
Step 3: Define Your Agent's Seed

Start with a minimal agent implementation:
seed_agent = """
import json
from typing import List, Dict, Any

def solve_puzzle(train_examples: List[Dict], test_input: Any) -> Any:
    '''
    Solve an ARC-AGI puzzle given training examples.

    Args:
        train_examples: List of {"input": grid, "output": grid} pairs
        test_input: Input grid to solve

    Returns:
        Predicted output grid
    '''
    # Naive baseline: return first training output
    if train_examples:
        return train_examples[0]["output"]
    return test_input
"""
This 10-line baseline achieves ~30% accuracy. GEPA will evolve it into a sophisticated system.
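Before handing a seed to GEPA, it helps to smoke-test it the same way the evaluator will run it: exec the string and call the entry point. A minimal sketch, using a trimmed copy of the seed (named `mini_seed` here to keep it distinct from the full variable above):

```python
# Smoke-test a seed candidate by exec-ing it and calling its entry point,
# mirroring how the evaluator will execute evolved candidates later.
mini_seed = """
def solve_puzzle(train_examples, test_input):
    # Naive baseline: return first training output
    if train_examples:
        return train_examples[0]["output"]
    return test_input
"""

namespace = {}
exec(mini_seed, namespace)
solve_fn = namespace["solve_puzzle"]

# Toy puzzle: the naive baseline just echoes the first training output.
prediction = solve_fn(
    train_examples=[{"input": [[0, 1]], "output": [[1, 0]]}],
    test_input=[[0, 0]],
)
print(prediction)  # [[1, 0]]
```

If this raises, the evaluator would score the seed 0.0 on every puzzle, so it is cheaper to catch here.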
Step 4: Create Your Evaluator

The evaluator runs the agent and returns a score plus diagnostic feedback:
import gepa.optimize_anything as oa
from typing import Dict

def evaluate_agent(candidate: str, example: Dict) -> tuple[float, dict]:
    """
    Evaluate agent on a single puzzle.
    
    Returns:
        (score, diagnostics): Score in [0, 1] and diagnostic information
    """
    try:
        # Execute the agent code
        namespace = {}
        exec(candidate, namespace)
        solve_fn = namespace["solve_puzzle"]
        
        # Run on test input
        prediction = solve_fn(
            train_examples=example["train"],
            test_input=example["test_input"]
        )
        
        # Score accuracy
        ground_truth = example["test_output"]
        correct = (prediction == ground_truth)
        score = 1.0 if correct else 0.0
        
        # Capture diagnostics as ASI (Actionable Side Information)
        diagnostics = {
            "Correct": correct,
            "Prediction": str(prediction)[:200],
            "GroundTruth": str(ground_truth)[:200],
            "TrainExamples": len(example["train"]),
        }
        
        # Log for reflection
        if not correct:
            oa.log(f"Failed on puzzle {example['id']}: predicted {prediction}, expected {ground_truth}")
        
        return score, diagnostics
        
    except Exception as e:
        # Execution errors are valuable feedback
        oa.log(f"Execution error: {type(e).__name__}: {e}")
        return 0.0, {"Error": str(e)}
Key insight: The evaluator returns both a score AND diagnostic feedback. This Actionable Side Information (ASI) helps the LLM understand failures and propose fixes.
Step 5: Prepare Training and Validation Data

Create datasets of puzzles:
# Load or create ARC-AGI-style puzzles
train_puzzles = [
    {
        "id": "train_001",
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]},
        ],
        "test_input": [[0, 0], [1, 1]],
        "test_output": [[1, 1], [0, 0]],
    },
    # Add 20-50 training puzzles for good results
]

val_puzzles = [
    {
        "id": "val_001",
        "train": [...],
        "test_input": ...,
        "test_output": ...,
    },
    # Add 10-20 validation puzzles
]

print(f"Training puzzles: {len(train_puzzles)}")
print(f"Validation puzzles: {len(val_puzzles)}")
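Malformed puzzle dicts surface later as confusing evaluator errors, so it can be worth validating them up front. A small sketch (the required keys follow the schema shown above; the helper name is ours):

```python
REQUIRED_KEYS = {"id", "train", "test_input", "test_output"}

def validate_puzzle(puzzle: dict) -> list[str]:
    """Return a list of problems with this puzzle dict (empty if it looks OK)."""
    problems = []
    missing = REQUIRED_KEYS - puzzle.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    if not puzzle["train"]:
        problems.append("no training examples")
    for i, pair in enumerate(puzzle["train"]):
        if not {"input", "output"} <= pair.keys():
            problems.append(f"train[{i}] missing input/output")
    return problems

sample = {
    "id": "train_001",
    "train": [{"input": [[0, 1]], "output": [[1, 0]]}],
    "test_input": [[0, 0]],
    "test_output": [[1, 1]],
}
print(validate_puzzle(sample))  # []
```

Run it over both train_puzzles and val_puzzles before starting an optimization run.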
Step 6: Run Agent Architecture Evolution

Use optimize_anything to evolve the agent:
from gepa.optimize_anything import (
    optimize_anything,
    GEPAConfig,
    EngineConfig,
    ReflectionConfig,
)

result = optimize_anything(
    seed_candidate=seed_agent,
    evaluator=evaluate_agent,
    dataset=train_puzzles,
    valset=val_puzzles,
    objective="""Optimize a Python agent that solves ARC-AGI puzzles.
    The agent should:
    - Analyze training examples to identify patterns
    - Apply learned rules to test inputs
    - Handle various transformation types (rotation, mirroring, color mapping, etc.)
    - Verify outputs when possible
    """,
    background="""ARC-AGI puzzles test abstract reasoning through grid transformations.
    Each puzzle has 2-4 training input/output pairs and one test input.
    Grids are 2D lists of integers (0-9 representing colors).
    Successful agents identify the transformation rule from examples.
    """,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=100,  # Budget: 100 evaluations
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-4o",  # Strong model for reflection
            reflection_minibatch_size=3,     # Focus on 3 puzzles per iteration
        ),
    ),
)

print("\nOptimization complete!")
print(f"Best validation score: {result.val_aggregate_scores[result.best_idx]:.2%}")
print(f"Total metric calls: {result.total_metric_calls}")
What happens during optimization:
  1. GEPA evaluates the seed agent on training puzzles
  2. Reflection LLM reads error messages and failed predictions
  3. LLM proposes architectural improvements (new functions, better logic, etc.)
  4. Improved agents are evaluated and selected via Pareto frontier
  5. Process repeats, evolving increasingly sophisticated agents
Step 7: Review the Evolved Architecture

Examine what GEPA discovered:
print("\nEvolved Agent Architecture:")
print("=" * 60)
print(result.best_candidate[:1000])  # First 1000 chars
print("...")
print(f"\nTotal code size: {len(result.best_candidate)} characters")

# Save the optimized agent
with open("optimized_agent.py", "w") as f:
    f.write(result.best_candidate)
    
print("\nSaved to optimized_agent.py")
Example evolved architecture (from the ARC-AGI experiment): the agent grew from 10 lines to 300+ lines, including:
  • Rule induction: Analyzes training examples to extract transformation rules
  • Code generation: Generates Python code to apply rules
  • Iterative verification: Tests generated code on training examples
  • Multiple strategies: Tries direct LLM prediction if code generation fails
  • Structured fallbacks: Graceful degradation when rules are ambiguous
Step 8: Test on Held-Out Data

Evaluate the optimized agent on test data:
test_puzzles = [...]  # Load unseen test puzzles

test_scores = []
for puzzle in test_puzzles:
    score, _ = evaluate_agent(result.best_candidate, puzzle)
    test_scores.append(score)

test_accuracy = sum(test_scores) / len(test_scores)
print(f"\nTest Accuracy: {test_accuracy:.2%}")

# Compare to baseline
baseline_scores = []
for puzzle in test_puzzles:
    score, _ = evaluate_agent(seed_agent, puzzle)
    baseline_scores.append(score)

baseline_accuracy = sum(baseline_scores) / len(baseline_scores)
print(f"Baseline Accuracy: {baseline_accuracy:.2%}")
print(f"Improvement: {test_accuracy - baseline_accuracy:+.2%}")

Real Results: ARC-AGI Evolution

GEPA achieved dramatic improvements on ARC-AGI puzzles:

Validation Accuracy

Improved from 56.5% to 93.5% on validation set during optimization

Test Accuracy

Improved from 32.5% (naive baseline) to 89.5% on held-out test set

Code Evolution

Evolved from 10-line simple agent to 300+ line sophisticated system

Cost Efficiency

Achieved near-triple accuracy at just 2x cost per task using Gemini Flash

Evolved Architecture Components

The optimized ARC-AGI agent includes:
# 1. Pattern Analysis Module
def analyze_transformation_patterns(train_examples):
    """Extract rules from training input/output pairs"""
    # Identifies: rotations, reflections, color mappings, shape operations
    ...

# 2. Code Generation Module  
def generate_transformation_code(patterns):
    """Generate Python code to apply discovered rules"""
    # Creates executable transformation functions
    ...

# 3. Verification Module
def verify_code_on_examples(code, train_examples):
    """Test generated code on training examples"""
    # Iteratively refines code until it passes training data
    ...

# 4. Multi-Strategy Solver
def solve_puzzle(train_examples, test_input):
    """Main solver with multiple fallback strategies"""
    # Try 1: Pattern-based code generation
    # Try 2: Direct LLM prediction
    # Try 3: Template matching
    ...
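The component stubs above are sketches of the evolved agent. As one concrete (and much simplified) illustration of the rule-induction idea, a pattern analyzer can test a fixed library of grid transforms against every training pair; the transform library here is hypothetical, and a real evolved agent covers far more operations:

```python
# Candidate transforms over grids (2D lists of ints).
TRANSFORMS = {
    "identity": lambda g: [row[:] for row in g],
    "rot90_cw": lambda g: [list(row) for row in zip(*g[::-1])],
    "rot180": lambda g: [row[::-1] for row in g[::-1]],
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(row) for row in zip(*g)],
}

def analyze_transformation_patterns(train_examples):
    """Return the name of a transform consistent with every training pair, else None."""
    for name, fn in TRANSFORMS.items():
        if all(fn(ex["input"]) == ex["output"] for ex in train_examples):
            return name
    return None  # No single-rule explanation found

pairs = [
    {"input": [[0, 1], [2, 3]], "output": [[1, 0], [3, 2]]},
    {"input": [[4, 5]], "output": [[5, 4]]},
]
print(analyze_transformation_patterns(pairs))  # flip_horizontal
```

Returning a rule *name* (rather than the transformed grid) is what lets a downstream code-generation step emit and verify an executable transformation.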

Key Concepts

Actionable Side Information (ASI)

ASI is the text-optimization analogue of gradients. It tells the LLM why a candidate failed:
# Bad: Only return a score
return 0.35

# Good: Return score + diagnostics
return 0.35, {
    "Error": "IndexError: list index out of range on line 47",
    "FailedTest": "puzzle_023",
    "Prediction": "[[1, 0], [0, 1]]",
    "Expected": "[[0, 1], [1, 0]]",
}

# Better: Include visual feedback for vision models
from gepa import Image
return 0.35, {
    "Error": "Predicted grid has wrong rotation",
    "PredictionImage": Image(base64_data=render(prediction)),
    "ExpectedImage": Image(base64_data=render(expected)),
}
Pareto Frontier Selection

GEPA maintains a frontier of candidates, preserving any that excel on specific examples:
  • Agent A: 95% on rotation puzzles, 60% on color mapping
  • Agent B: 70% on rotation puzzles, 90% on color mapping
Both survive in the frontier. Later iterations combine their strengths.
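The selection rule can be sketched as standard non-dominated filtering over per-example score vectors (an illustration of the idea, not GEPA's internal implementation):

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """Keep every candidate not dominated by another candidate."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for o, other in candidates.items() if o != name)
    }

candidates = {
    "agent_a": [0.95, 0.60],  # strong on rotation puzzles
    "agent_b": [0.70, 0.90],  # strong on color mapping
    "agent_c": [0.60, 0.50],  # dominated by both
}
print(sorted(pareto_frontier(candidates)))  # ['agent_a', 'agent_b']
```

agent_c is strictly worse than both others on every example, so it is discarded; agent_a and agent_b each win somewhere, so both survive as parents for later iterations.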

Seedless Mode

Don’t know where to start? Use seed_candidate=None:
result = optimize_anything(
    seed_candidate=None,  # GEPA bootstraps the first agent
    evaluator=evaluate_agent,
    dataset=train_puzzles,
    valset=val_puzzles,
    objective="Create a Python agent that solves ARC-AGI puzzles...",
    background="""Technical context:
    - Use numpy for grid operations
    - Available libraries: numpy, scipy, skimage
    - Grids are 2D lists of integers 0-9
    - Must define: def solve_puzzle(train_examples, test_input)
    """,
    config=...,
)
The reflection LM writes the initial agent based on your objective and background.

Advanced Examples

Multi-Task Agent Evolution

Optimize across multiple related tasks:
tasks = [
    {"task_type": "rotation", "puzzles": rotation_puzzles},
    {"task_type": "reflection", "puzzles": reflection_puzzles},
    {"task_type": "color_map", "puzzles": color_puzzles},
]

result = optimize_anything(
    seed_candidate=seed_agent,
    evaluator=multi_task_evaluator,
    dataset=tasks,  # Multi-task mode: cross-transfer learning
    objective="Create an agent that handles multiple transformation types",
    config=...,
)
Insights from solving rotation puzzles transfer to reflection puzzles automatically.
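multi_task_evaluator is not defined above; one plausible shape is a wrapper that scores the candidate on every puzzle in a task and averages, keeping per-puzzle diagnostics as ASI. A hypothetical sketch (the stub below stands in for evaluate_agent from Step 4):

```python
def make_multi_task_evaluator(puzzle_evaluator):
    """Wrap a per-puzzle evaluator into a per-task evaluator returning (score, ASI)."""
    def multi_task_evaluator(candidate, task):
        scores, per_puzzle = [], []
        for puzzle in task["puzzles"]:
            score, diag = puzzle_evaluator(candidate, puzzle)
            scores.append(score)
            per_puzzle.append(diag)
        mean = sum(scores) / len(scores) if scores else 0.0
        return mean, {
            "TaskType": task["task_type"],
            "MeanScore": mean,
            "PerPuzzle": per_puzzle,  # keep per-puzzle ASI for reflection
        }
    return multi_task_evaluator

# Stub per-puzzle evaluator, for illustration only.
def stub_eval(candidate, puzzle):
    return (1.0 if "rotate" in candidate else 0.0), {"id": puzzle["id"]}

evaluator = make_multi_task_evaluator(stub_eval)
score, asi = evaluator("agent that can rotate",
                       {"task_type": "rotation",
                        "puzzles": [{"id": "p1"}, {"id": "p2"}]})
print(score, asi["TaskType"])  # 1.0 rotation
```

Keeping the per-puzzle diagnostics in the task-level ASI is what lets the reflection LLM see which task types a candidate fails on.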

Coding Agent Skills

Optimize repository-specific instructions for coding agents:
seed_skills = """
# Codebase Guidelines

## Testing
- Run tests with `go test ./...`
- Write table-driven tests

## Code Style  
- Follow standard Go formatting
- Use meaningful variable names
"""

def evaluate_skills(candidate: str, example: dict) -> tuple[float, dict]:
    """Run coding agent with skills on repository task"""
    # Use candidate skills as instructions to coding agent
    result = run_coding_agent(
        task=example["task"],
        codebase=example["repo"],
        skills=candidate,  # Inject optimized skills
    )
    
    # Test if changes resolve the issue
    tests_pass = run_tests(result.modified_code)
    score = 1.0 if tests_pass else 0.0
    
    return score, {
        "TestsPassed": tests_pass,
        "FilesModified": result.files_modified,
        "Error": result.error if result.error else "None",
    }

result = optimize_anything(
    seed_candidate=seed_skills,
    evaluator=evaluate_skills,
    dataset=train_tasks,  # Repository tasks
    valset=val_tasks,
    objective="Optimize codebase-specific skills for a coding agent",
    background=f"""Repository: {repo_name}
Language: Go
Test command: go test ./...
""",
    config=...,
)
Result: Skills boost Claude Code from 79% to 100% pass rate while reducing time by 47%.

Cloud Scheduling Policies

Discover algorithms that generalize across infrastructure scenarios:
seed_policy = """
def schedule_task(task, spot_price, on_demand_price, deadline):
    # Simple heuristic: use SPOT if deadline allows
    if task.duration * 1.5 < deadline:
        return "SPOT"
    return "ON_DEMAND"
"""

def evaluate_policy(candidate: str, example: dict) -> tuple[float, dict]:
    """Simulate cloud scheduling with this policy"""
    # Run simulation
    result = simulate_workload(
        policy_code=candidate,
        workload=example["workload"],
        spot_trace=example["spot_availability"],
    )
    
    # Score: minimize cost while meeting deadlines
    deadline_misses = sum(1 for task in result.tasks if task.missed_deadline)
    if deadline_misses > 0:
        score = 0.0  # Must meet deadlines
    else:
        baseline_cost = example["on_demand_cost"]
        score = (baseline_cost - result.total_cost) / baseline_cost
    
    return score, {
        "Cost": result.total_cost,
        "Savings": f"{score * 100:.1f}%",
        "DeadlineMisses": deadline_misses,
        "SpotPreemptions": result.spot_preemptions,
    }

result = optimize_anything(
    seed_candidate=seed_policy,
    evaluator=evaluate_policy,
    dataset=train_scenarios,
    valset=val_scenarios,
    objective="Discover a cloud scheduling policy that minimizes cost",
    background="""Available instance types:
    - SPOT: Cheap but can be preempted
    - ON_DEMAND: Reliable but expensive
    
    Must meet task deadlines while minimizing cost.
    """,
    config=...,
)
Result: Discovered policy achieves 7.8% cost savings, beating expert heuristics.

Best Practices

Start from a seed

Even a naive 10-line agent is better than starting from scratch. It gives GEPA:
  • A valid code structure to modify
  • Baseline performance to beat
  • Syntax examples for the domain
If you truly have nothing, use seed_candidate=None for seedless mode.
Provide rich ASI

The quality of ASI directly impacts optimization effectiveness:
  • Good: Error messages, failed test cases, execution traces
  • Better: Structured diagnostics showing what went wrong
  • Best: Visual feedback (rendered outputs) for vision models
Use oa.log() liberally in your evaluator.
Use small reflection minibatches

Set reflection_minibatch_size to a small value (2-5) to focus each iteration:
  • LLM sees 2-5 examples per reflection
  • Makes targeted improvements for those cases
  • Pareto frontier preserves specialized gains
  • Over iterations, all examples get attention
This is more effective than showing all examples every time.
Budget enough iterations

Agent architecture evolution needs more iterations than prompt optimization:
  • Quick test: 20-50 iterations
  • Good results: 100-200 iterations
  • Publication quality: 300-500 iterations
Each iteration runs your evaluator on a minibatch (2-5 examples).
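A rough way to size the budget (illustrative arithmetic only; GEPA's exact accounting of metric calls, including when validation evaluations run, depends on its configuration):

```python
def rough_metric_calls(iterations, minibatch_size, val_size, val_every=1):
    """Back-of-envelope estimate: training minibatch evals plus periodic validation."""
    train_calls = iterations * minibatch_size
    val_calls = (iterations // val_every) * val_size
    return train_calls + val_calls

# e.g. 100 iterations, minibatch of 3, 15 validation puzzles checked every 5 iterations
print(rough_metric_calls(100, 3, 15, val_every=5))  # 600
```

Compare the estimate against max_metric_calls before launching a long run.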

Troubleshooting

Common issues:
  • Import errors: Include necessary imports in seed
  • Syntax errors: GEPA will fix these if you log them as ASI
  • Timeout: Add execution timeout in evaluator
import signal  # Unix-only: SIGALRM is not available on Windows

def evaluate_with_timeout(candidate, example, timeout=30):
    def handler(signum, frame):
        raise TimeoutError("Execution timeout")

    # signal.alarm only works in the main thread of the main interpreter
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout)

    try:
        score, diag = evaluate_agent(candidate, example)
    except TimeoutError:
        oa.log(f"Agent timed out after {timeout}s")
        return 0.0, {"Error": "Timeout"}
    finally:
        signal.alarm(0)

    return score, diag
If optimization isn't improving, possible causes:
  • Poor ASI: Make diagnostics more informative
  • Weak reflection model: Try GPT-4o or o1
  • Insufficient examples: Add more diverse training data
  • Wrong objective: Clarify what you want in objective parameter
Enable verbose logging to see what’s happening:
from gepa.optimize_anything import GEPAConfig, TrackingConfig

config = GEPAConfig(
    tracking=TrackingConfig(verbose=True)
)

Next Steps

optimize_anything API

Complete API reference with all parameters

Blog Post

Detailed blog post with 8 case studies

Coding Agent Skills

Learn how to optimize skills for coding agents

GEPA Paper

Research paper with methodology and results
