Agent Architecture Evolution Tutorial
Learn how to use GEPA’s `optimize_anything` API to evolve entire agent architectures: not just prompts, but the complete system, including code, control flow, sub-agents, and helper functions. This tutorial demonstrates how to nearly triple an agent’s accuracy through architectural evolution.
Overview
Unlike traditional prompt optimization, agent architecture evolution treats the entire agent system as a text artifact to be optimized. This includes:
- Agent control flow and decision logic
- Sub-agent architectures and coordination
- Helper functions and utilities
- Prompts and instructions
- Error handling and validation
Understand the Three Optimization Modes
`optimize_anything` supports three distinct modes:
1. Single-Task Search: Solve one hard problem
Define Your Agent's Seed
Start with a minimal agent implementation. The 10-line baseline used in this tutorial achieves ~30% accuracy; GEPA will evolve it into a sophisticated system.
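A minimal sketch of what such a seed might look like. Because `optimize_anything` treats the agent as a text artifact, the seed is the agent's source code as a string. The ARC-style prompt format and the `call_llm` helper are illustrative assumptions, not part of GEPA's actual API.

```python
# Hypothetical seed: the agent is a text artifact (Python source) that GEPA
# will evolve. `call_llm` is an assumed helper supplied by the eval harness.
SEED_AGENT = '''
def solve(train_examples, test_input):
    # Naive baseline: show the LLM the examples and ask for the output grid.
    prompt = "Infer the transformation from these examples:\\n"
    for ex in train_examples:
        prompt += f"input: {ex['input']} -> output: {ex['output']}\\n"
    prompt += f"Apply it to: {test_input}\\nAnswer with the output grid only."
    return call_llm(prompt)
'''

# Sanity check: the seed must at least be valid Python before optimization.
compile(SEED_AGENT, "<seed_agent>", "exec")
```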
Create Your Evaluator
The evaluator runs the agent and returns both a score and diagnostic feedback. Key insight: this Actionable Side Information (ASI) helps the reflection LLM understand failures and propose fixes.
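A sketch of such an evaluator, assuming the `(score, feedback)` return shape described above. The puzzle dictionary layout and the injected `call_llm` helper are illustrative assumptions.

```python
import traceback

def evaluate_agent(agent_code: str, puzzle: dict) -> tuple[float, str]:
    """Run one candidate agent on one puzzle; return (score, diagnostic feedback)."""
    # Assumed harness: the candidate's source defines solve(), and may call
    # an LLM helper that we inject into its namespace.
    ns = {"call_llm": puzzle.get("llm", lambda prompt: None)}
    try:
        exec(agent_code, ns)                               # load solve()
        pred = ns["solve"](puzzle["train"], puzzle["test_input"])
    except Exception:
        # Execution errors become ASI: the reflection LLM reads the traceback.
        return 0.0, f"Agent crashed:\n{traceback.format_exc()}"
    if pred == puzzle["expected"]:
        return 1.0, "Correct."
    # Wrong answers also become ASI: show what was predicted vs expected.
    return 0.0, f"Wrong answer.\nPredicted: {pred}\nExpected: {puzzle['expected']}"
```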
Run Agent Architecture Evolution
Use `optimize_anything` to evolve the agent. What happens during optimization:
- GEPA evaluates the seed agent on training puzzles
- Reflection LLM reads error messages and failed predictions
- LLM proposes architectural improvements (new functions, better logic, etc.)
- Improved agents are evaluated and selected via Pareto frontier
- Process repeats, evolving increasingly sophisticated agents
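The loop above can be sketched as a single call. The parameter names below follow the ones this tutorial mentions (`seed_candidate`, `objective`, `reflection_minibatch_size`); verify them against the `optimize_anything` API reference before running, as the actual signature may differ.

```python
# Placeholder stand-ins so this sketch is self-contained; in practice these
# come from the previous steps.
SEED_AGENT = "def solve(train_examples, test_input): ..."
def evaluate_agent(candidate, example):
    return 0.0, "stub feedback"
train_puzzles = []

# Hypothetical invocation of optimize_anything.
config = dict(
    seed_candidate=SEED_AGENT,            # the 10-line baseline agent source
    evaluator=evaluate_agent,             # returns (score, feedback) per example
    trainset=train_puzzles,               # training puzzles to evolve against
    objective="Maximize exact-match accuracy on ARC-style grid puzzles.",
    reflection_minibatch_size=3,          # 2-5 examples per reflection step
)
# result = optimize_anything(**config)    # returns the evolved agent source
```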
Review the Evolved Architecture
Examine what GEPA discovered. In the ARC-AGI experiment, the agent evolved from 10 lines to 300+ lines, including:
- Rule induction: Analyzes training examples to extract transformation rules
- Code generation: Generates Python code to apply rules
- Iterative verification: Tests generated code on training examples
- Multiple strategies: Tries direct LLM prediction if code generation fails
- Structured fallbacks: Graceful degradation when rules are ambiguous
Real Results: ARC-AGI Evolution
GEPA achieved dramatic improvements on ARC-AGI puzzles:
- Validation accuracy: improved from 56.5% to 93.5% on the validation set during optimization
- Test accuracy: improved from 32.5% (naive baseline) to 89.5% on the held-out test set
- Code evolution: from a 10-line simple agent to a 300+ line sophisticated system
- Cost efficiency: near-triple accuracy at just 2x cost per task using Gemini Flash
Evolved Architecture Components
The optimized ARC-AGI agent combines the components described above: rule induction, code generation, iterative verification, and structured fallbacks.
Key Concepts
Actionable Side Information (ASI)
ASI is the text-optimization analogue of gradients: it tells the LLM why a candidate failed, not just how badly.
Pareto-Efficient Search
GEPA maintains a frontier of candidates, preserving any that excel on specific examples:
- Agent A: 95% on rotation puzzles, 60% on color mapping
- Agent B: 70% on rotation puzzles, 90% on color mapping
Neither agent dominates the other, so both stay on the frontier.
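The per-example dominance test behind this idea can be sketched as follows. This is an illustrative reimplementation of the concept, not GEPA's internal code.

```python
def dominates(a, b):
    """a dominates b if a is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Per-puzzle-type scores (rotation, color mapping), as in the example above.
agent_a = (0.95, 0.60)
agent_b = (0.70, 0.90)

# Neither dominates the other, so the Pareto frontier keeps both:
# each preserves a specialized gain that later iterations can build on.
assert not dominates(agent_a, agent_b)
assert not dominates(agent_b, agent_a)
```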
Seedless Mode
Don’t know where to start? Use `seed_candidate=None` and GEPA will generate the first candidate from scratch.
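A hypothetical seedless invocation: with `seed_candidate=None`, the objective alone drives the first candidate. Parameter names are drawn from this tutorial and may differ from the actual `optimize_anything` signature.

```python
# Seedless mode sketch: no baseline agent is supplied; GEPA bootstraps the
# first candidate from the objective. Names here are assumptions.
config = dict(
    seed_candidate=None,
    objective="Write a Python agent that solves ARC-style grid puzzles.",
)
# result = optimize_anything(**config)
```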
Advanced Examples
Multi-Task Agent Evolution
Optimize across multiple related tasks.
Coding Agent Skills
Optimize repository-specific instructions for coding agents.
Cloud Scheduling Policies
Discover algorithms that generalize across infrastructure scenarios.
Best Practices
Start with a working baseline
Even a naive 10-line agent is better than starting from scratch. It gives GEPA:
- A valid code structure to modify
- Baseline performance to beat
- Syntax examples for the domain
If you have no working baseline at all, use `seed_candidate=None` for seedless mode.
Return rich diagnostic feedback
The quality of ASI directly impacts optimization effectiveness:
- Good: Error messages, failed test cases, execution traces
- Better: Structured diagnostics showing what went wrong
- Best: Visual feedback (rendered outputs) for vision models
Use `oa.log()` liberally in your evaluator.
Use Pareto-aware minibatching
Set `reflection_minibatch_size` to 2-5 to focus each iteration:
- LLM sees 2-5 examples per reflection
- Makes targeted improvements for those cases
- Pareto frontier preserves specialized gains
- Over iterations, all examples get attention
Allocate sufficient budget
Agent architecture evolution needs more iterations than prompt optimization:
- Quick test: 20-50 iterations
- Good results: 100-200 iterations
- Publication quality: 300-500 iterations
Troubleshooting
Code execution errors
Common issues:
- Import errors: Include necessary imports in seed
- Syntax errors: GEPA will fix these if you log them as ASI
- Timeout: Add execution timeout in evaluator
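One way to add such a timeout is to run each candidate in a fresh interpreter via `subprocess` and kill it if it hangs. This is a sketch of the idea, not GEPA's built-in mechanism; it assumes the candidate defines a `solve()` entry point.

```python
import subprocess
import sys

def run_with_timeout(agent_code: str, seconds: float = 5.0):
    """Execute candidate code in a fresh interpreter; kill it if it hangs.

    A hang surfaces as a timeout message that can be logged as ASI instead
    of stalling the whole optimization run.
    """
    harness = agent_code + "\nprint(solve())\n"  # assumed solve() entry point
    try:
        proc = subprocess.run(
            [sys.executable, "-c", harness],
            capture_output=True, text=True, timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        return "timeout", f"Agent exceeded {seconds}s limit"
    if proc.returncode != 0:
        return "error", proc.stderr              # traceback becomes ASI
    return "ok", proc.stdout.strip()
```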
No improvement after many iterations
Possible causes:
- Poor ASI: Make diagnostics more informative
- Weak reflection model: Try GPT-4o or o1
- Insufficient examples: Add more diverse training data
- Wrong objective: Clarify what you want in the `objective` parameter
Next Steps
- `optimize_anything` API: Complete API reference with all parameters
- Blog Post: Detailed blog post with 8 case studies
- Coding Agent Skills: Learn how to optimize skills for coding agents
- GEPA Paper: Research paper with methodology and results
Learn More
- ARC-AGI Example: Full code for the agent evolution demo
- CUDA Kernels: Multi-task optimization generating fast GPU code
- Cloud Policies: CloudCast and Can’t Be Late scheduling algorithms
- 3D Unicorn: Seedless mode example generating 3D models from scratch