## Overview

The TerminusAdapter enables GEPA to optimize the Terminus terminal-use agent's system prompt on TerminalBench tasks. It integrates with the TerminalBench evaluation framework to:
- Run agent episodes on terminal tasks
- Capture execution traces and failures
- Provide feedback for instruction improvement
- Optimize the agent’s instruction prompt
## Installation

```bash
pip install gepa
pip install terminal-bench
```
## Quick Start

```python
import gepa
from gepa.adapters.terminal_bench_adapter import TerminusAdapter, TerminalBenchTask

# Prepare dataset
train_tasks = [
    TerminalBenchTask(
        task_id="core-001",
        model_name="openai/gpt-4o-mini",
    ),
    # ... more tasks
]

# Create adapter
adapter = TerminusAdapter(
    n_concurrent=6,
    instruction_prompt_path="prompt-templates/instruction_prompt.txt",
)

# Optimize
result = gepa.optimize(
    seed_candidate={
        "instruction_prompt": "You are a helpful terminal assistant. Complete the task."
    },
    trainset=train_tasks[:10],
    valset=train_tasks[10:],
    adapter=adapter,
    max_metric_calls=50,
    reflection_lm="openai/gpt-4",
)

print("Optimized instruction prompt:")
print(result.best_candidate["instruction_prompt"])
```
## Class Signature

Defined in src/gepa/adapters/terminal_bench_adapter/terminal_bench_adapter.py:140:

```python
class TerminusAdapter(GEPAAdapter):
    def __init__(
        self,
        n_concurrent: int = 6,
        instruction_prompt_path: str = "prompt-templates/instruction_prompt.txt",
    )
```
## Parameters

**n_concurrent** (`int`, default: `6`)
Number of concurrent tasks to run in parallel during evaluation.

**instruction_prompt_path** (`str`, default: `"prompt-templates/instruction_prompt.txt"`)
Path to the file where the instruction prompt is written for TerminalBench to read.
## Data Types

### TerminalBenchTask

Input data structure (src/gepa/adapters/terminal_bench_adapter/terminal_bench_adapter.py:13):

```python
class TerminalBenchTask(BaseModel):
    task_id: str      # TerminalBench task ID (e.g., "core-001")
    model_name: str   # Model to use for the agent
```
### Trajectory

Execution trace (returned in EvaluationBatch.trajectories):

```python
{
    "messages": list[dict],       # Message history from the episode
    "instruction_prompt": str,    # Instruction prompt used
    "failed_reason": str,         # Reason for failure (if any)
    "success": bool,              # Whether the task was solved
}
```
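Because each trajectory is a plain dict, failing runs can be pulled out of `EvaluationBatch.trajectories` with ordinary filtering. A minimal sketch (the `failed_trajectories` helper is illustrative, not part of the adapter):

```python
def failed_trajectories(trajectories: list[dict]) -> list[dict]:
    # Keep only trajectories whose task was not solved
    return [t for t in trajectories if not t["success"]]


trajectories = [
    {"messages": [], "instruction_prompt": "p", "failed_reason": "", "success": True},
    {"messages": [], "instruction_prompt": "p", "failed_reason": "tests timed out", "success": False},
]
print([t["failed_reason"] for t in failed_trajectories(trajectories)])  # ['tests timed out']
```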
## Methods

### evaluate()

Runs TerminalBench evaluation on a batch of tasks.

```python
def evaluate(
    self,
    batch: list[TerminalBenchTask],
    candidate: dict[str, str],
    capture_traces: bool = False,
) -> EvaluationBatch
```

Implementation: src/gepa/adapters/terminal_bench_adapter/terminal_bench_adapter.py:149
**Behavior:**

- Writes `candidate["instruction_prompt"]` to the prompt file
- Runs the `tb run` command with the batch's task IDs
- For each task:
  - Reads results from the run directory
  - Extracts the score (number of passed parser checks)
  - Extracts the success status
  - Captures the message history from the last episode
- Returns an `EvaluationBatch` with scores and trajectories
**Scoring:**

- Success: `is_resolved = True` in the results
- Score: number of parser checks passed (a subset of the task requirements)
- Failure: score is 0 if the task is not resolved
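One reading of these rules reduces to a small function. The sketch below is not the adapter's code, and the `parser_results` field name and `"passed"` marker are assumptions about the results.json layout:

```python
def score_from_results(results: dict) -> tuple[bool, int]:
    # is_resolved signals overall success; parser_results (assumed name) holds
    # per-check outcomes. A failed run with no results defaults to a score of 0.
    success = bool(results.get("is_resolved", False))
    checks = results.get("parser_results", {})
    score = sum(1 for outcome in checks.values() if outcome == "passed")
    return success, score


print(score_from_results({"is_resolved": True,
                          "parser_results": {"check_a": "passed", "check_b": "failed"}}))  # (True, 1)
```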
### make_reflective_dataset()

Generates a reflective dataset from evaluation results.

```python
def make_reflective_dataset(
    self,
    candidate: dict[str, str],
    eval_batch: EvaluationBatch,
    components_to_update: list[str],
) -> dict[str, list[dict[str, Any]]]
```

Implementation: src/gepa/adapters/terminal_bench_adapter/terminal_bench_adapter.py:198
**Returns:**

```python
{
    "instruction_prompt": [
        {
            "Message History": [
                {"role": "system", "content": "..."},
                {"role": "user", "content": "..."},
                {"role": "assistant", "content": "..."},
            ],
            "Instruction Prompt": "You are a helpful terminal assistant...",
            "Feedback": "Successfully solved the task!"  # or failure reason
        },
        # ... more examples
    ]
}
```
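The returned structure is a plain dict keyed by component name, so it is easy to inspect before reflection. A small illustrative helper (not part of the adapter) that tallies solved vs. failed examples:

```python
def summarize_feedback(reflective: dict) -> dict[str, int]:
    # Count examples by whether their Feedback string reports success
    counts = {"solved": 0, "failed": 0}
    for example in reflective["instruction_prompt"]:
        key = "solved" if example["Feedback"].startswith("Successfully") else "failed"
        counts[key] += 1
    return counts


reflective = {"instruction_prompt": [
    {"Message History": [], "Instruction Prompt": "...", "Feedback": "Successfully solved the task!"},
    {"Message History": [], "Instruction Prompt": "...", "Feedback": "Timed out waiting for tests"},
]}
print(summarize_feedback(reflective))  # {'solved': 1, 'failed': 1}
```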
## TerminalBench Integration

### Directory Structure

The adapter expects this structure:

```
.
├── prompt-templates/
│   └── instruction_prompt.txt    # Written by adapter
├── runs/
│   └── temp_gepa_run_*/
│       └── core-001_*/
│           ├── results.json      # Task results
│           └── agent-logs/
│               └── episode-*/
│                   ├── debug.json     # Message history
│                   └── response.json  # Agent response
└── train_terminus.py             # TerminusWrapper implementation
```
### TerminusWrapper

You need to provide a TerminusWrapper class that:

- Reads the instruction prompt from file
- Wraps the Terminus agent
- Uses the instruction prompt in the agent configuration

Example train_terminus.py:

```python
from pathlib import Path

from terminal_bench.agents.terminus_1 import Terminus1Agent


class TerminusWrapper:
    def __init__(self, model_name: str, **kwargs):
        # Read instruction prompt from file
        prompt_path = Path("prompt-templates/instruction_prompt.txt")
        if prompt_path.exists():
            instruction_prompt = prompt_path.read_text()
        else:
            instruction_prompt = "Default instruction prompt"

        # Create agent with custom instruction
        self.agent = Terminus1Agent(
            model_name=model_name,
            instruction_prompt=instruction_prompt,
            **kwargs,
        )

    def __call__(self, *args, **kwargs):
        return self.agent(*args, **kwargs)
```
### Running TerminalBench

The adapter runs this command internally:

```bash
tb run \
  --dataset-name terminal-bench-core \
  --dataset-version head \
  --agent-import-path train_terminus:TerminusWrapper \
  --model-name openai/gpt-4o-mini \
  --run-id temp_gepa_run_20240315120000 \
  --n-concurrent 6 \
  --output-path ./runs \
  --task-id core-001 \
  --task-id core-002
```
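The flags map directly onto the batch: fixed dataset and agent options, plus one `--task-id` per task. A sketch of how such an argument list could be assembled (illustrative only, not the adapter's actual implementation; it builds the list without running `tb`):

```python
def build_tb_command(task_ids: list[str], model_name: str, run_id: str,
                     n_concurrent: int = 6) -> list[str]:
    # Fixed options mirror the command shown above
    cmd = [
        "tb", "run",
        "--dataset-name", "terminal-bench-core",
        "--dataset-version", "head",
        "--agent-import-path", "train_terminus:TerminusWrapper",
        "--model-name", model_name,
        "--run-id", run_id,
        "--n-concurrent", str(n_concurrent),
        "--output-path", "./runs",
    ]
    for task_id in task_ids:  # one --task-id flag per task in the batch
        cmd += ["--task-id", task_id]
    return cmd


cmd = build_tb_command(["core-001", "core-002"], "openai/gpt-4o-mini",
                       "temp_gepa_run_20240315120000")
print(cmd.count("--task-id"))  # 2
```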
## Complete Example

```python
import gepa
from gepa.adapters.terminal_bench_adapter import TerminusAdapter, TerminalBenchTask
from pathlib import Path

# 1. Setup directory structure
Path("prompt-templates").mkdir(exist_ok=True)

# 2. Create seed prompt
seed_prompt = """
You are a helpful terminal assistant that helps users complete tasks.

Follow these guidelines:
- Read the task carefully
- Break down complex tasks into steps
- Execute commands thoughtfully
- Verify results before concluding

Complete the task successfully.
"""

# 3. Prepare dataset: use core tasks from TerminalBench
train_tasks = [
    TerminalBenchTask(task_id="core-001", model_name="openai/gpt-4o-mini"),
    TerminalBenchTask(task_id="core-002", model_name="openai/gpt-4o-mini"),
    TerminalBenchTask(task_id="core-003", model_name="openai/gpt-4o-mini"),
    # ... add more tasks
]
val_tasks = [
    TerminalBenchTask(task_id="core-010", model_name="openai/gpt-4o-mini"),
    TerminalBenchTask(task_id="core-011", model_name="openai/gpt-4o-mini"),
]

# 4. Create adapter
adapter = TerminusAdapter(
    n_concurrent=4,  # Run 4 tasks in parallel
    instruction_prompt_path="prompt-templates/instruction_prompt.txt",
)

# 5. Optimize
result = gepa.optimize(
    seed_candidate={"instruction_prompt": seed_prompt},
    trainset=train_tasks,
    valset=val_tasks,
    adapter=adapter,
    max_metric_calls=30,  # TerminalBench is slow, so use a smaller budget
    reflection_lm="openai/gpt-4",
)

# 6. Save optimized prompt
optimized_prompt = result.best_candidate["instruction_prompt"]
Path("prompt-templates/optimal_prompt.txt").write_text(optimized_prompt)

print("Optimized prompt saved to prompt-templates/optimal_prompt.txt")
print(f"Validation score: {result.best_score}")
```
## Best Practices

### 1. Task Selection

Start with simpler core tasks:

```python
# Easy tasks for initial optimization
train_tasks = [
    TerminalBenchTask(task_id="core-001", model_name="openai/gpt-4o-mini"),
    TerminalBenchTask(task_id="core-002", model_name="openai/gpt-4o-mini"),
]

# Harder tasks for validation
val_tasks = [
    TerminalBenchTask(task_id="core-050", model_name="openai/gpt-4o-mini"),
    TerminalBenchTask(task_id="core-051", model_name="openai/gpt-4o-mini"),
]
```
### 2. Concurrency

Balance parallelism with available resources:

```python
# High concurrency (requires good hardware)
adapter = TerminusAdapter(n_concurrent=12)

# Conservative (more stable)
adapter = TerminusAdapter(n_concurrent=4)
```
### 3. Budget

TerminalBench is slow (1-2 minutes per task):

```python
# For 10 tasks, 30 calls = ~5-10 hours
result = gepa.optimize(
    seed_candidate={"instruction_prompt": seed_prompt},
    trainset=train_tasks,
    valset=val_tasks,
    adapter=adapter,
    max_metric_calls=30,  # Keep the budget modest
)
```
### 4. Seed Prompt

Provide a structured seed prompt:

```python
seed_prompt = """
You are an expert terminal assistant.

## Task Understanding
- Read the task description carefully
- Identify the goal and constraints
- Plan your approach before acting

## Execution
- Use appropriate commands
- Check command outputs
- Handle errors gracefully

## Verification
- Verify task completion
- Check for edge cases
- Confirm success criteria are met
"""
```
### 5. Monitoring

Check the run logs for debugging:

```python
# After optimization, inspect the logs
import json
from pathlib import Path

# Find the latest run
run_dirs = sorted(Path("runs").glob("temp_gepa_run_*"))
latest_run = run_dirs[-1]

# Read results
for task_dir in latest_run.glob("core-*"):
    results_file = task_dir / "results.json"
    if results_file.exists():
        results = json.loads(results_file.read_text())
        print(f"{task_dir.name}:")
        print(f"  Resolved: {results.get('is_resolved')}")
        print(f"  Failure: {results.get('failure_mode')}")
```
## Execution Time
- Per task: 1-2 minutes (agent episodes + evaluation)
- Batch of 10 tasks (n_concurrent=6): ~2-3 minutes
- Full optimization (30 calls, 10 tasks): ~60-90 minutes
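Because tasks in a batch run concurrently, wall-clock time scales with the number of "waves" rather than the raw task count. A back-of-the-envelope estimate (illustrative helper, assuming the 1-2 minute per-task figure above):

```python
import math


def estimate_batch_minutes(n_tasks: int, n_concurrent: int = 6,
                           minutes_per_task: float = 1.5) -> float:
    # Tasks run in waves of n_concurrent; each wave takes roughly one task's time
    waves = math.ceil(n_tasks / n_concurrent)
    return waves * minutes_per_task


print(estimate_batch_minutes(10, n_concurrent=6))  # 3.0 (matches the ~2-3 minute figure)
```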
## Cost Optimization

Use cheaper models for task execution:

```python
train_tasks = [
    TerminalBenchTask(task_id="core-001", model_name="openai/gpt-4o-mini"),  # Cheaper
    # vs
    # TerminalBenchTask(task_id="core-001", model_name="openai/gpt-4"),  # Expensive
]

# Use a stronger model only for reflection
result = gepa.optimize(
    ...,
    reflection_lm="openai/gpt-4",  # Only for instruction proposal
)
```
## Disk Space

Each run creates logs on disk:

```bash
# Clean up old runs
rm -rf runs/temp_gepa_run_*
```
## Limitations

- Slow evaluation: TerminalBench tasks take 1-2 minutes each
- Single component: only the instruction prompt is optimized
- Task dependencies: requires the TerminalBench infrastructure
- File-based communication: the prompt is passed via the file system
- No streaming feedback: the agent cannot be observed during execution
## Troubleshooting

### Task Not Running

Check the TerminusWrapper implementation:

```python
# Test the wrapper directly
from train_terminus import TerminusWrapper

agent = TerminusWrapper(model_name="openai/gpt-4o-mini")
print("Agent created successfully")
```
### Results Not Found

Check the run directory structure:

```python
from pathlib import Path

# List run directories
for run_dir in Path("runs").iterdir():
    print(f"Run: {run_dir.name}")
    for task_dir in run_dir.iterdir():
        print(f"  Task: {task_dir.name}")
        print(f"  Has results.json: {(task_dir / 'results.json').exists()}")
```
### Low Scores

Check whether the tasks are too difficult:

```python
# Test baseline performance
from gepa.adapters.terminal_bench_adapter import run_agent_tb, get_results

task_id = "core-001"
run_id = "baseline_test"

run_agent_tb(
    task_ids=task_id,
    run_id=run_id,
    model_name="openai/gpt-4o-mini",
    instruction_prompt="You are a helpful assistant.",
    n_concurrent=1,
)

success, score, reason, messages = get_results(task_id, run_id)
print(f"Success: {success}")
print(f"Score: {score}")
print(f"Reason: {reason}")
```
## Advanced Usage

### Custom Dataset

Use custom TerminalBench tasks:

```python
# Requires the custom task to exist in the TerminalBench dataset
train_tasks = [
    TerminalBenchTask(
        task_id="custom-task-001",
        model_name="openai/gpt-4o-mini",
    ),
]
```
### Different Models

Optimize for different models:

```python
# Optimize for GPT-4o-mini
gpt4_tasks = [
    TerminalBenchTask(task_id=tid, model_name="openai/gpt-4o-mini")
    for tid in ["core-001", "core-002"]
]

# Optimize for Claude
claude_tasks = [
    TerminalBenchTask(task_id=tid, model_name="anthropic/claude-3-5-sonnet-20241022")
    for tid in ["core-001", "core-002"]
]
```
### Ensemble Prompts

Create separate prompts for different task types:

```python
# Optimize different prompts for different task categories
file_tasks = [t for t in train_tasks if "file" in t.task_id]
network_tasks = [t for t in train_tasks if "network" in t.task_id]

# Optimize the file prompt
file_result = gepa.optimize(
    seed_candidate={"instruction_prompt": file_seed_prompt},
    trainset=file_tasks,
    valset=val_tasks,
    adapter=adapter,
    max_metric_calls=20,
)

# Optimize the network prompt
network_result = gepa.optimize(
    seed_candidate={"instruction_prompt": network_seed_prompt},
    trainset=network_tasks,
    valset=val_tasks,
    adapter=adapter,
    max_metric_calls=20,
)
```
## See Also