The autoresearch loop uses an LLM (Claude Sonnet) to propose, evaluate, and accept or reject architectural improvements to a context pipeline, overnight and unattended. Each iteration either advances the best-known pipeline or reverts to it, building a checkpoint log you can resume at any time. SOTA target: 90% F1 (Hindsight + Gemini-3 Pro + TEMPR, Dec 2025).

How it works

1. Load the starting pipeline

The loop starts from a baseline pipeline file (`sota_pipeline.py`, or any file you point `--pipeline-path` at). This file must define a `ContextPipeline` class.
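This page does not spell out the `ContextPipeline` contract, so here is a minimal sketch of what a mutable baseline could look like. The method names (`ingest`, `retrieve`) and the word-overlap scoring are assumptions for illustration, not the actual `sota_pipeline.py` API:

```python
class ContextPipeline:
    """Hypothetical minimal pipeline: ingest turns, retrieve by word overlap."""

    def __init__(self):
        self.facts = []  # list of (turn_idx, text) pairs accumulated at ingest time

    def ingest(self, conversation):
        """Index one conversation's turns into memory."""
        for turn_idx, turn in enumerate(conversation):
            self.facts.append((turn_idx, turn))

    def retrieve(self, question, k=5):
        """Return up to k snippets ranked by word overlap with the question."""
        words = set(question.lower().split())
        scored = [
            (len(words & set(text.lower().split())), idx, text)
            for idx, text in self.facts
        ]
        scored.sort(reverse=True)
        return [text for score, idx, text in scored[:k] if score > 0]
```

Anything with this shape gives the mutation loop a concrete class body to rewrite.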
2. Evaluate the baseline

The baseline is evaluated on a fixed random sample of `--eval-n` conversations from the LoCoMo training split. This sample is reused for every iteration so accept/reject decisions are comparable.
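The key property is that the sample is a pure function of the seed, so every iteration scores the same conversations and an accept/reject decision reflects the mutation rather than sampling noise. A sketch of how that can be done (function name is illustrative):

```python
import random

def fixed_eval_sample(conversations, eval_n, seed=42):
    """Draw the same eval subset every time for a given (eval_n, seed).

    Using a local random.Random instead of the global RNG means other
    randomness in the process cannot perturb the sample.
    """
    rng = random.Random(seed)
    return rng.sample(conversations, min(eval_n, len(conversations)))
```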
3. Propose a mutation

The mutation LLM reads the current pipeline source and the full score history, then proposes one architectural improvement: coreference resolution, temporal ranking, embedding fallback, and so on.
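The page does not show the mutation prompt, but its two stated inputs are the pipeline source and the score history. A hedged sketch of that assembly (function and field names are hypothetical, not the loop's actual internals):

```python
def build_mutation_prompt(pipeline_source, history):
    """Assemble the context the mutation model sees: the full score
    history first (so it can avoid repeating rejected ideas), then the
    current pipeline source to mutate."""
    lines = [
        "You are improving a retrieval pipeline. Propose ONE architectural change.",
        "Score history (iteration: accepted, score):",
    ]
    lines += [
        f"  {h['iteration']}: accepted={h['accepted']} score={h['score']:.3f}"
        for h in history
    ]
    lines += ["Current pipeline source:", pipeline_source]
    return "\n".join(lines)
```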
4. Evaluate the candidate

The candidate pipeline is imported dynamically and scored on the same fixed sample. If it fails to import or raises an error, it is rejected automatically.
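Dynamic import with automatic rejection can be sketched with the standard `importlib` machinery. Returning `None` on any failure lets the loop treat a broken mutation exactly like a low-scoring one: revert to the best pipeline and move on (the function name and module name `"candidate"` are illustrative):

```python
import importlib.util
import traceback

def load_candidate(path):
    """Import a candidate pipeline file; any failure counts as a rejection."""
    try:
        spec = importlib.util.spec_from_file_location("candidate", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the candidate's top-level code
        return module.ContextPipeline()  # the class the loop requires
    except Exception:
        traceback.print_exc()  # keep the error visible in run.log
        return None
```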
5. Accept or reject

If the candidate’s score exceeds the current best, the candidate becomes the new baseline and is saved to disk. Otherwise the loop reverts to the previous best and tries again.
6. Checkpoint and repeat

After every iteration, the best pipeline and the full log are written to `--output-dir`. The loop runs for `--iterations` steps, then performs a final held-out evaluation comparing the best discovered pipeline against the naive baseline.
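Steps 5 and 6 together amount to a greedy hill climb with an append-only log. A sketch of one iteration's bookkeeping, under the assumption that the JSONL fields match those documented under "Output files" (the function name itself is hypothetical):

```python
import json

def record_iteration(log_path, best_path, iteration, candidate_src,
                     score, best_score, mutation="", elapsed_s=0.0):
    """Append one iteration to the JSONL log and, on strict improvement,
    persist the candidate as the new best pipeline."""
    delta = score - best_score
    accepted = delta > 0                  # ties revert to the incumbent
    if accepted:
        best_score = score
        with open(best_path, "w") as f:   # checkpoint the new best pipeline
            f.write(candidate_src)
    with open(log_path, "a") as f:        # one JSON object per iteration
        f.write(json.dumps({
            "iteration": iteration, "score": score, "best_score": best_score,
            "delta": delta, "mutation": mutation, "accepted": accepted,
            "elapsed_s": elapsed_s,
        }) + "\n")
    return best_score
```

Because the log is append-only and the best pipeline is rewritten atomically per acceptance, a crash at any point leaves a resumable state on disk.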

Running the loop

Prerequisites

You need an OpenAI-compatible relay (e.g. claude-relay) running before you start the loop.

```bash
# Start the relay
cd /path/to/claude-relay
uv run agent-relay serve --port 18082 --max-concurrent 16
```

Start the loop

```bash
uv run python3 loop.py \
  --relay http://localhost:18082 \
  --model claude-haiku-4-5-20251001 \
  --dataset locomo \
  --iterations 50 \
  --eval-n 2 \
  --max-qa-per-conv 5 \
  --seed 42 \
  --output-dir loop_results_v6/ \
  --pipeline-path src/context_bench/pipeline/sota_pipeline.py
```

Resume after a crash

```bash
uv run python3 loop.py \
  --relay http://localhost:18082 \
  --output-dir loop_results_v6/ \
  --pipeline-path src/context_bench/pipeline/sota_pipeline.py \
  --resume
```

With `--resume`, the loop loads `best_pipeline.py` and `loop_log.jsonl` from `--output-dir` and skips re-evaluating the baseline (it reads `baseline_score.json` instead).
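A sketch of what that restore step can look like, assuming the file names above and one JSONL line per completed iteration (the function name and the shape of `baseline_score.json` are assumptions):

```python
import json
import os

def load_checkpoint(output_dir):
    """Restore loop state from --output-dir: best pipeline source,
    cached baseline score, and the number of iterations already done."""
    with open(os.path.join(output_dir, "best_pipeline.py")) as f:
        best_src = f.read()
    with open(os.path.join(output_dir, "baseline_score.json")) as f:
        baseline = json.load(f)
    done = 0
    log_path = os.path.join(output_dir, "loop_log.jsonl")
    if os.path.exists(log_path):
        with open(log_path) as f:
            done = sum(1 for line in f if line.strip())  # one line per iteration
    return best_src, baseline, done
```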

CLI flags

| Flag | Default | Description |
| --- | --- | --- |
| `--relay` | `http://localhost:18082` | OpenAI-compatible relay URL for evaluation and answer calls. |
| `--model` | `claude-haiku-4-5-20251001` | Model for evaluation/answer calls. |
| `--mutation-model` | `claude-sonnet-4-6` | Stronger model used for proposing mutations. |
| `--dataset` | `locomo` | Dataset to optimize on. One of: `locomo`, `longmemeval_s`, `longmemeval_m`, `longmemeval_oracle`. |
| `--iterations` | `200` | Number of mutation/evaluation iterations. |
| `--eval-n` | `30` | Conversations sampled per evaluation iteration. |
| `--max-qa-per-conv` | `10` | QA pairs sampled per conversation (caps large LoCoMo evals). |
| `--seed` | `42` | RNG seed for the train/test split. |
| `--output-dir` | `./loop_results` | Directory for checkpoints, logs, and results. |
| `--pipeline-path` | repo copy | Path to the starting `context_pipeline.py`. |
| `--resume` | `false` | Load the last checkpoint from `--output-dir` and continue. |
| `--tool-relay` | same as `--relay` | Separate relay for the LLM judge and mutator. |
| `--judge-model` | `haiku` | Model for the LLM judge in the final evaluation. |
| `--api-key` | `$OPENAI_API_KEY` | Bearer token for the relay. |

Overnight watchdog

Use the bundled watchdog script to keep the loop running unattended. It checks every 10 minutes and restarts the loop with `--resume` if the loop crashes or stalls for more than 150 minutes.

```bash
bash watchdog6.sh
```

The watchdog starts both the relay and the loop, writing output to `watchdog.log`, `relay.log`, and `run.log`.
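The bundled watchdog is a shell script, but its stall heuristic is simple enough to sketch in Python: treat the loop as stalled when its log file has not been written within the threshold. The function name and the use of log mtime as the heartbeat are assumptions about how `watchdog6.sh` works:

```python
import os
import time

def loop_is_stalled(log_path, stall_minutes=150):
    """True when the log is missing or hasn't been touched recently."""
    if not os.path.exists(log_path):
        return True  # never started, or crashed before writing anything
    age_minutes = (time.time() - os.path.getmtime(log_path)) / 60
    return age_minutes > stall_minutes
```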

Output files

| File | Contents |
| --- | --- |
| `run.log` | Full loop stdout: progress lines, scores, mutation summaries. |
| `loop_log.jsonl` | One JSON object per iteration: `score`, `best_score`, `delta`, `mutation`, `accepted`, `elapsed_s`. |
| `baseline_score.json` | Cached baseline score. Read on `--resume` to skip re-evaluation. |
| `best_pipeline.py` | The current best pipeline. Updated after every accepted mutation. |
| `final_results.json` | Final head-to-head: best discovered pipeline vs. naive baseline on the held-out test set. |
| `watchdog.log` | Watchdog heartbeat and restart events. |
| `relay.log` | Relay server stdout. |
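Because `loop_log.jsonl` is one JSON object per line, a quick health check over a run is a few lines of Python (this helper is illustrative, not part of the tool):

```python
import json

def summarize_log(lines):
    """Count iterations and acceptances, and find the best score so far."""
    entries = [json.loads(l) for l in lines if l.strip()]
    accepted = sum(1 for e in entries if e.get("accepted"))
    best = max((e.get("best_score", 0.0) for e in entries), default=0.0)
    return {"iterations": len(entries), "accepted": accepted, "best_score": best}
```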

sota_pipeline.py: starting point techniques

The `sota_pipeline.py` file is the recommended starting baseline. It implements:

- **Coreference resolution.** Pronouns and aliases are resolved to canonical entity names during ingestion: “She” becomes “Alice”, “the company” becomes “Acme Corp”. This dramatically improves entity-level retrieval precision.
- **Structured fact extraction.** Facts are stored as `(entity, relation, value, turn_idx)` tuples, for example `("Alice", "works_at", "Acme Corp", 12)`. This allows precise lookups that raw text search cannot match.
- **Entity profiles.** Each entity accumulates all known facts, sorted by recency. A single lookup retrieves everything the system knows about a person or concept.
- **Question decomposition.** Multi-part questions are split into sub-questions before retrieval: “What did Alice study, and where does she work now?” becomes two separate retrievals.
- **Multi-hop reasoning.** Entities extracted from a question are used to retrieve their profiles, which are then reasoned over together. Supports chains like Alice → employer → location.
- **Temporal ranking.** Questions containing “current” or “now” prefer recent facts; questions containing “used to” or “before” prefer older facts. Turn index is used as the recency signal.
- **Embedding fallback.** When entity-lookup retrieval returns too few results, `all-MiniLM-L6-v2` semantic search fills the gap. This prevents silent failures on out-of-vocabulary questions.
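The fact store and temporal ranking can be sketched together: facts keyed by entity, with a lookup that flips sort order for "current" versus "used to" questions. Class and method names here are illustrative, not the actual `sota_pipeline.py` API:

```python
from collections import defaultdict

class FactStore:
    """Sketch of the structured fact memory described above."""

    def __init__(self):
        self.by_entity = defaultdict(list)  # entity -> [(relation, value, turn_idx)]

    def add(self, entity, relation, value, turn_idx):
        self.by_entity[entity].append((relation, value, turn_idx))

    def profile(self, entity, prefer_recent=True):
        """All facts for an entity, newest first by default.

        A question containing 'used to' or 'before' would pass
        prefer_recent=False to surface older facts first.
        """
        return sorted(self.by_entity[entity], key=lambda f: f[2],
                      reverse=prefer_recent)
```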
Tip: Start with a smaller `--eval-n` (2–5 conversations) and `--max-qa-per-conv` (5) to iterate quickly. Increase both once you have a promising direction, to get a more reliable signal.
