Evolution is the mechanism; adaptiveness is the result
This is closer to how biological evolution works than how learning works. A species doesn’t “learn” to survive winter — individuals that happen to have thicker fur survive, and that trait gets selected for. Similarly, agent versions that handle more edge cases correctly survive in production, and the patterns that made them successful get carried forward.

The practical implication: don’t expect evolution to make an agent smarter about problems it’s never seen. Evolution improves reliability on the kinds of problems the agent has already encountered. For genuinely novel situations, that’s what human-in-the-loop is for — and every time a human steps in, that interaction becomes potential fuel for the next evolution cycle.

How it works
The evolution loop has four stages:

1. Execute
The worker agent runs against real inputs. Sessions produce outcomes, decisions, and metrics.
core/framework/graph/executor.py
2. Evaluate
The framework checks outcomes against the goal’s success criteria and constraints. Did the agent produce the desired result? Which criteria were satisfied and which weren’t? Were any constraints violated?
core/framework/graph/goal.py
3. Diagnose
Failure data is structured and specific. It’s not just “the agent failed” — it’s “the draft_message node failed to produce personalized content because the research node returned insufficient data about the prospect’s recent activity.”
The decision log, problem reports, and execution trace provide the full picture:
- Which node failed
- What it was trying to do (goal context)
- How many retries it attempted
- What the judge feedback said on each retry
- What data was in shared memory at the time
- What constraints were violated
- What success criteria weren’t met
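A diagnosis carrying those fields can be represented as structured data. Here is a minimal sketch; the class and field names are illustrative assumptions, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class FailureDiagnosis:
    """Structured failure data assembled from the decision log and trace.

    Field names are illustrative, not the framework's real schema.
    """
    node: str                   # which node failed
    goal_context: str           # what it was trying to do
    retries: int                # how many retries it attempted
    judge_feedback: list[str] = field(default_factory=list)       # feedback per retry
    memory_snapshot: dict = field(default_factory=dict)           # shared memory at the time
    violated_constraints: list[str] = field(default_factory=list)
    unmet_criteria: list[str] = field(default_factory=list)

diagnosis = FailureDiagnosis(
    node="draft_message",
    goal_context="Produce a personalized outreach draft",
    retries=3,
    judge_feedback=["Draft is generic", "No recent-activity details", "Same issue"],
    unmet_criteria=["mentions_recent_activity"],
)
```

Packaging the diagnosis this way is what lets the regeneration step receive a machine-readable root cause instead of a free-text bug report.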
4. Regenerate
A coding agent receives the diagnosis and the current agent code. It modifies the graph — adding nodes, adjusting prompts, changing edge conditions, adding tools — to address the specific failure. The new version is deployed and the cycle restarts.
core/framework/graph/goal.py
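Put together, the four stages form a loop. A hypothetical sketch with the framework components injected as callables (every name here is an assumption for illustration, not the framework's API):

```python
def evolution_cycle(agent_code, sessions, goal, *, run, evaluate, diagnose, regenerate):
    """One pass through execute -> evaluate -> diagnose -> regenerate.

    All callables are hypothetical stand-ins for framework components.
    Returns the (possibly regenerated) agent code and the failure diagnoses.
    """
    outcomes = [run(agent_code, s) for s in sessions]              # 1. Execute
    results = [evaluate(o, goal) for o in outcomes]                # 2. Evaluate
    failures = [diagnose(r) for r in results if not r["success"]]  # 3. Diagnose
    if failures:
        agent_code = regenerate(agent_code, failures)              # 4. Regenerate
    return agent_code, failures

# Toy demonstration with stub components:
new_code, fails = evolution_cycle(
    "v1", ["ok", "bad"], goal="demo",
    run=lambda code, s: s,
    evaluate=lambda o, goal: {"success": o == "ok", "outcome": o},
    diagnose=lambda r: f"failed on {r['outcome']}",
    regenerate=lambda code, failures: code + "+fix",
)
# new_code == "v1+fix", fails == ["failed on bad"]
```

The point of the structure: regeneration only fires when diagnosis produced concrete failures, so a fully passing generation is left untouched.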
What gets evolved
Evolution can change almost anything about an agent:

Prompts
The most common fix. A node’s system prompt gets refined based on the specific ways the LLM misunderstood its instructions.
examples/templates/deep_research_agent/nodes/__init__.py
Graph structure
Adding a validation node before a critical step, splitting a node that’s trying to do too much, adding a fallback path for a common failure mode.

Edge conditions
Adjusting routing logic based on observed patterns. If low-confidence research results consistently lead to bad drafts, add a conditional edge that routes them back for another research pass.
examples/templates/deep_research_agent/agent.py
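That low-confidence example can be sketched as a routing predicate. The memory keys, node names, and threshold are illustrative assumptions, not values from the framework:

```python
def route_after_research(memory: dict) -> str:
    """Edge condition: send low-confidence research back for another pass.

    Memory keys, node names, and the 0.6 threshold are illustrative.
    """
    MAX_PASSES = 2  # bound the loop so a hard topic can't retry forever
    confidence = memory.get("research_confidence", 0.0)
    passes = memory.get("research_passes", 0)
    if confidence < 0.6 and passes < MAX_PASSES:
        return "research"       # route back for another research pass
    return "draft_message"      # good enough (or out of retries): proceed
```

Note the pass counter: without it, an evolved conditional edge can reintroduce the very retry loops evolution was meant to fix.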
Tool selection
Swapping in a better tool, adding a new one, or removing one that causes more problems than it solves.

Constraints and criteria
Tightening or loosening based on what’s actually achievable and what matters in practice.

The role of decision logging
Evolution depends on good data. The runtime captures every decision an agent makes: what it was trying to do, what options it considered, what it chose, and what happened as a result. This isn’t overhead — it’s the signal that makes evolution possible. Without decision logging, failure analysis is guesswork. With it, the coding agent can trace a failure back to its root cause and make a targeted fix rather than a blind change.
core/framework/graph/event_loop_node.py

Each decision log entry captures:
- Iteration number: How many attempts before acceptance/failure
- Judge feedback: What was wrong on each retry
- Tool calls: What the agent tried
- Memory state: What data it had access to
- Success criteria status: Which criteria were/weren’t met
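One way this log becomes actionable is by mining it for recurring failure patterns. A sketch, assuming a simple dict shape per entry (the keys are illustrative, not the framework's schema):

```python
from collections import Counter

def recurring_failures(log_entries, min_count=3):
    """Group rejected decisions by (node, first judge complaint) and return
    the patterns seen at least min_count times. Entry keys are illustrative."""
    patterns = Counter(
        (e["node"], e["judge_feedback"][0])
        for e in log_entries
        if not e["accepted"] and e["judge_feedback"]
    )
    return [p for p, n in patterns.items() if n >= min_count]

log = [
    {"node": "research", "accepted": False, "judge_feedback": ["too shallow"]},
    {"node": "research", "accepted": False, "judge_feedback": ["too shallow"]},
    {"node": "research", "accepted": False, "judge_feedback": ["too shallow"]},
    {"node": "draft", "accepted": True, "judge_feedback": []},
]
# recurring_failures(log) -> [("research", "too shallow")]
```

A repeated (node, complaint) pair is exactly the kind of systematic signal that separates an evolution candidate from a one-off failure.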
Evolution in practice
Here’s what a typical evolution cycle looks like:

Generation 1: Happy path
The first version works when everything goes smoothly. In production, failures surface:
- Search results don’t contain enough detail
- URLs in search results are dead links
- Content needs deeper analysis
Generation 2: Add tools and specificity
The regenerated version addresses those failures, but new ones appear:
- Agent tries to scrape every URL, including broken ones
- Gets stuck in retry loops on 404 errors
Generation 3: Add error handling
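The kind of fix Generation 3 might introduce can be sketched as a fetch helper that treats dead links as permanent (skip, don't retry) while giving transient errors a bounded retry. All names and the error-classification choice are illustrative assumptions:

```python
def fetch_pages(urls, fetch, max_attempts=2):
    """Fetch each URL, skipping dead links instead of retrying them.

    `fetch` returns page text or raises. A LookupError stands in for a
    404-style permanent failure (skip immediately); any other exception is
    treated as transient and retried up to max_attempts. Illustrative only.
    """
    pages, skipped = {}, []
    for url in urls:
        for attempt in range(max_attempts):
            try:
                pages[url] = fetch(url)
                break
            except LookupError:              # permanent (dead link): don't retry
                skipped.append(url)
                break
            except Exception:                # transient: bounded retry
                if attempt == max_attempts - 1:
                    skipped.append(url)
    return pages, skipped

def fake_fetch(url):
    """Stub fetcher for demonstration."""
    if url == "https://dead.example":
        raise LookupError("404")
    return f"content of {url}"

pages, skipped = fetch_pages(["https://a.example", "https://dead.example"], fake_fetch)
# pages == {"https://a.example": "content of https://a.example"}
# skipped == ["https://dead.example"]
```

Classifying failures as permanent versus transient is what breaks the Generation 2 retry loop on 404s without giving up on recoverable timeouts.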
When to trigger evolution
Not every failure needs immediate evolution. Consider evolution when:

Systematic failures
The same failure pattern appears across multiple sessions (e.g., “research node always fails when topic is technical”).
High-frequency issues
Failures that block >20% of sessions in a meaningful way.
Constraint violations
Hard constraints being violated (safety, cost, compliance issues).
Performance degradation
Success rate dropping over time as input distribution shifts.
Skip evolution for:
- One-off failures from bad input data
- Transient issues (API downtime, rate limits)
- Failures that humans should handle (genuinely novel situations)
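The triggering criteria above can be sketched as a heuristic over session data. The dict keys, the 20% threshold, and the repeat count of 3 are assumptions drawn from the rules of thumb in this section:

```python
from collections import Counter

def should_evolve(sessions):
    """Decide whether accumulated failure data warrants an evolution cycle.

    `sessions` is a list of dicts with illustrative keys:
    constraint_violated, blocked, failure_pattern.
    """
    total = len(sessions)
    if total == 0:
        return False
    # Hard constraint violations (safety, cost, compliance) always trigger.
    if any(s.get("constraint_violated") for s in sessions):
        return True
    # High-frequency: failures blocking more than 20% of sessions.
    blocked = sum(1 for s in sessions if s.get("blocked"))
    if blocked / total > 0.2:
        return True
    # Systematic: the same failure pattern across multiple sessions.
    patterns = Counter(s["failure_pattern"] for s in sessions if s.get("failure_pattern"))
    return any(count >= 3 for count in patterns.values())
```

One-offs and transient issues fall through every branch and return False, which matches the "skip evolution" list above.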
Evolution vs learning
Evolution is different from in-session learning:

| Aspect | Evolution | In-Session Learning |
|---|---|---|
| Scope | Across sessions | Within a session |
| Mechanism | Code rewrite | Reflexion/retry |
| Speed | Slow (between sessions) | Fast (real-time) |
| Persistence | Permanent (new version) | Temporary (session only) |
| Target | Recurring patterns | One-off issues |