Key Result: ARC-AGI
Test Accuracy
32.5% → 89.5% on ARC-AGI v1 public test (+57 pp)
Validation Accuracy
56.5% → 93.5% on validation set
Model Used
Gemini 3 Flash — improvements via architecture, not model size
Cost Efficiency
2x cost per task vs naive agent, 3x accuracy improvement
What Gets Optimized?
Unlike prompt optimization, which tunes only instructions, agent architecture discovery optimizes:
- System architecture: Multi-stage pipelines, sub-agent orchestration
- Code implementations: Helper functions, validation logic, fallback strategies
- Control flow: Retry mechanisms, iterative refinement, branching logic
- Prompts: Instructions for each sub-agent
- Error handling: Recovery strategies and graceful degradation
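The control-flow and error-handling items above can be pictured with a small sketch. Everything here is illustrative: the helper names (`with_retries`, `run_with_fallback`) are ours, not part of GEPA or of any evolved agent.

```python
# Illustrative sketch of control-flow and error-handling components an
# evolved agent might contain. All names are hypothetical, not a GEPA API.

def with_retries(fn, max_attempts=3):
    """Control flow: retry a flaky step, re-raising the last error."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn()
        except Exception as e:  # error handling: capture and retry
            last_error = e
    raise last_error

def run_with_fallback(primary, fallback):
    """Error handling: graceful degradation to a simpler strategy."""
    try:
        return primary()
    except Exception:
        return fallback()

# Usage: a brittle step wrapped in retries, with a safe default.
result = run_with_fallback(
    primary=lambda: with_retries(lambda: int("42")),
    fallback=lambda: 0,
)
# result == 42 here; it would be 0 if every retry attempt raised
```

GEPA does not hand-write such helpers; it discovers equivalent structures while evolving the agent's code.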
ARC-AGI Case Study
The Challenge
ARC-AGI tests abstract reasoning through visual grid-transformation puzzles. Solving it requires:
- Pattern recognition
- Rule induction from few examples
- Generating executable code to transform grids
- Validation and iterative refinement
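A toy version of such a puzzle makes these requirements concrete. The hidden rule below is a transpose, hard-coded for illustration; a real agent must induce the rule from the train pairs.

```python
# Toy ARC-style task: grids are lists of lists of ints (colors).
task = {
    "train": [
        {"input": [[1, 2], [3, 4]], "output": [[1, 3], [2, 4]]},
        {"input": [[5, 6], [7, 8]], "output": [[5, 7], [6, 8]]},
    ],
    "test": [{"input": [[0, 9], [9, 0]]}],
}

def transform(grid):
    """Candidate rule: transpose rows and columns."""
    return [list(row) for row in zip(*grid)]

# Validate the candidate rule on every train pair before trusting it.
assert all(transform(p["input"]) == p["output"] for p in task["train"])

prediction = transform(task["test"][0]["input"])
```

The validate-on-train step is exactly the kind of structure GEPA later discovers and builds into the evolved agent.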
Initial Agent (Seed)
A naive 10-line agent:

Evolved Agent (300+ lines)
After GEPA optimization:

Architecture Evolution
Iteration 0-20: Rule Induction
GEPA discovers that explicitly extracting transformation rules improves code generation accuracy.
Iteration 20-50: Validation Loop
Adds iterative validation on training examples with targeted refinement of failing code.
Iteration 50-80: Structured Fallbacks
Introduces graceful degradation: when code execution fails, fall back to direct LLM prediction.
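The three evolved behaviors combine into a loop like the following minimal sketch. The LLM calls (`induce_rule`, `generate_code`, `direct_predict`) are deterministic stubs here; nothing in this snippet is GEPA's actual generated code.

```python
# Minimal sketch of the evolved control flow: rule induction,
# train-set validation with refinement, and a structured fallback.

def induce_rule(train_pairs):
    return "transpose the grid"  # stub for an LLM rule-induction call

def generate_code(rule, feedback=None):
    # Stub for an LLM codegen call; returns a grid -> grid function.
    return lambda grid: [list(row) for row in zip(*grid)]

def direct_predict(grid):
    return grid  # fallback stub: direct LLM prediction of the output

def solve(task, max_refinements=3):
    rule = induce_rule(task["train"])
    feedback = None
    for _ in range(max_refinements):
        program = generate_code(rule, feedback)
        try:
            failures = [p for p in task["train"]
                        if program(p["input"]) != p["output"]]
        except Exception as e:
            feedback = f"execution error: {e}"
            continue
        if not failures:  # passes all train pairs: trust the program
            return program(task["test"][0]["input"])
        feedback = f"{len(failures)} train pairs failed"
    # Structured fallback: no program validated, predict directly.
    return direct_predict(task["test"][0]["input"])

task = {
    "train": [{"input": [[1, 2]], "output": [[1], [2]]}],
    "test": [{"input": [[3, 4]]}],
}
answer = solve(task)
```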
Optimization Trajectory
The graph shows validation accuracy improving from 56.5% to 93.5% over 100 metric calls, with test accuracy reaching 89.5%. Key inflection points:
- Metric call 20: Validation jumps from 56% → 72% when rule induction is added
- Metric call 50: Reaches 85% with validation loop
- Metric call 80: Breaks 90% with structured fallbacks
How It Works
1. Evaluator
The evaluator runs the agent on ARC-AGI puzzles and returns detailed diagnostics:

2. Optimization Call
3. Reflection Process
During reflection, GEPA shows the LLM:
- Current agent code
- Execution results on a minibatch (3 tasks)
- Failures: incorrect predictions, errors, timeouts
- Successes: what worked and why
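An evaluator that produces this kind of feedback could be shaped as follows. The field names (`status`, `elapsed_s`, `diagnostics`) are assumptions for illustration, not GEPA's actual contract.

```python
import time

def evaluate(agent, tasks):
    """Run `agent` on each task; return a score plus per-task diagnostics
    detailed enough for reflection (mismatches, errors, timing)."""
    diagnostics, solved = [], 0
    for task in tasks:
        record = {"task_id": task["id"]}
        start = time.monotonic()
        try:
            prediction = agent(task)
            record["elapsed_s"] = time.monotonic() - start
            if prediction == task["expected"]:
                solved += 1
                record["status"] = "correct"
            else:
                record["status"] = "incorrect"
                record["prediction"] = prediction
                record["expected"] = task["expected"]
        except Exception as e:  # surface the failure, don't crash the run
            record["status"] = "error"
            record["error"] = repr(e)
        diagnostics.append(record)
    return {"score": solved / len(tasks), "diagnostics": diagnostics}

# Usage with a trivial agent that echoes the input grid.
tasks = [
    {"id": "t1", "input": [[1]], "expected": [[1]]},
    {"id": "t2", "input": [[2]], "expected": [[3]]},
]
report = evaluate(lambda t: t["input"], tasks)
```

The richer these records are, the more the reflection step has to work with when diagnosing failures.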
Example Reflection Output
Other Agent Architecture Examples
Multi-Agent RAG System
From the healthcare RAG case study:

Terminal-Use Agent (Terminus)
GEPA optimizes the system prompt for the Terminus terminal-use agent:

Production Incident Diagnosis
Arc.computer’s ATLAS system uses GEPA-optimized agents for production incident diagnosis:

Root Cause Analysis
Automated RCA for production incidents
Dynamic Data Collection
Collects logs, metrics, and traces on-demand
RL Augmentation
+142% student performance when RL-tuned teacher is improved with GEPA
Reduced On-Call Burden
Less manual work for on-call engineers
The RL augmentation pipeline:
- Start with an RL-tuned teacher model
- Apply GEPA to optimize the teacher’s prompts/architecture
- Train a student model from the improved teacher
- Result: +142% improvement over the RL-tuned baseline
Advantages of Architecture Discovery
Automates Design Iteration
No manual architecture search — GEPA explores the design space
Discovers Non-Obvious Patterns
Finds strategies humans might miss (e.g., multi-stage validation)
Task-Specific Optimization
Architecture adapts to the specific domain (ARC-AGI vs terminal use vs RAG)
Interpretable
Full agent code is readable — understand why it works
Best Practices
Start with a working seed agent
Even a naive baseline (10 lines) is enough. GEPA will evolve complexity as needed.
Provide detailed execution feedback
Include error messages, intermediate results, timing info — anything that helps diagnose failures.
Use a validation set
Essential for generalization mode. Prevents overfitting to training tasks.
Set appropriate reflection minibatch size
Default is 2-3 tasks per reflection. Increase for diverse task distributions, decrease for similar tasks.
Leverage background knowledge
Provide domain context in the background parameter to guide evolution toward good architectures.

Monitor intermediate candidates
Check validation scores during optimization to catch overfitting early.
Seedless Architecture Discovery
Don’t have a starting agent? Use seedless mode:

Comparison: Prompt Optimization vs Architecture Discovery
| Aspect | Prompt Optimization | Architecture Discovery |
|---|---|---|
| What’s optimized | Instructions/prompts | Complete agent code |
| Typical size | 100-500 tokens | 100-500 lines of code |
| Structural changes | No | Yes — control flow, functions, sub-agents |
| Complexity growth | Prompt elaboration | Architectural evolution |
| Example | AIME math prompt | ARC-AGI agent system |
| Speedup over RL | 35x | 35x |
| Typical accuracy gains | 10-20 pp | 30-60 pp |
Next Steps
Try the ARC-AGI Tutorial
Step-by-step agent architecture optimization
RAG Optimization
Optimize retrieval pipelines
Code Optimization
Generate and optimize code
API Reference
Complete API documentation