The context-bench memory subcommand evaluates systems that maintain state across a conversation: they ingest a conversation history, then answer questions about it. Unlike the standard proxy benchmark, memory systems are stateful — each conversation gets a fresh reset before ingestion.

What memory systems are

A memory system manages what the LLM remembers from a long conversation. Instead of stuffing the entire history into the context window, a memory system may compress, index, or selectively retrieve facts before answering each question. context-bench measures F1 against ground-truth answers and tracks how many tokens each system uses per query.
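
The F1 metric here is typically token-overlap F1 (SQuAD-style) between the predicted and ground-truth answers. A minimal sketch of such a scorer, which may differ in detail from context-bench's built-in one:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Count tokens shared between prediction and gold (multiset overlap).
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```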

The MemorySystem protocol

Implement the `name` property and three methods; no subclassing is needed:
```python
from typing import Any, Protocol

class MemorySystem(Protocol):
    @property
    def name(self) -> str: ...

    def reset(self) -> None:
        """Clear all stored memory. Called once before each conversation."""
        ...

    def ingest(self, turns: list[dict[str, Any]]) -> None:
        """Store a conversation history.

        Args:
            turns: List of {"role": "user"|"assistant", "content": str} dicts
                   in chronological order.
        """
        ...

    def query(self, question: str) -> str:
        """Answer a question using stored memory.

        Returns:
            The system's answer string.
        """
        ...
```
The runner calls reset() before each conversation, ingest() to load the history, and then query() for each QA pair in that conversation.
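
To make the contract concrete, here is a toy conforming implementation. It stores turns verbatim and answers by crude keyword overlap rather than an LLM call; `KeywordMemory` is illustrative only and not part of context-bench:

```python
from typing import Any

class KeywordMemory:
    """Toy MemorySystem: stores turns verbatim, answers by keyword overlap."""

    @property
    def name(self) -> str:
        return "keyword"

    def reset(self) -> None:
        self._turns: list[dict[str, Any]] = []

    def ingest(self, turns: list[dict[str, Any]]) -> None:
        self._turns.extend(turns)

    def query(self, question: str) -> str:
        # Return the stored turn sharing the most words with the question.
        q_words = set(question.lower().split())
        best = max(
            self._turns,
            key=lambda t: len(q_words & set(t["content"].lower().split())),
            default=None,
        )
        return best["content"] if best else ""
```

Because `MemorySystem` is a `Protocol`, conformance is structural: any object with these members type-checks, with no inheritance required.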

Built-in memory systems

| Name | Description |
|------|-------------|
| `naive` | Stuffs the full conversation history into the prompt. Baseline — every system claims to beat this. |
| `embedding` | Embeds turns with all-MiniLM-L6-v2 and retrieves the top-k most relevant chunks via cosine similarity. |
| `rlm` | DSPy-based retrieval with LanceDB vector storage and DuckDB for structured queries. |
| `mem0` | Uses the mem0ai package for memory management. Requires `pip install context-bench[mem0]`. |
| `zep` | Uses graphiti-core for temporal knowledge graph memory. Requires `pip install context-bench[zep]`. |
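
The retrieval step behind the `embedding` system is top-k cosine similarity over turn embeddings. A dependency-free sketch of that step (real use would embed turns with all-MiniLM-L6-v2, e.g. via sentence-transformers, rather than take vectors as input):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], turn_vecs: list[list[float]], k: int) -> list[int]:
    """Indices of the k turns most similar to the query, best first."""
    ranked = sorted(
        range(len(turn_vecs)),
        key=lambda i: cosine(query_vec, turn_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```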

Available datasets

| Name | Description |
|------|-------------|
| `locomo` | Long-conversation memory benchmark. Includes single_hop, multi_hop, temporal, and open_domain QA types. |
| `longmemeval` | Long-memory evaluation with variants s, m, and oracle. |

CLI examples

1. Single system

Evaluate the naive baseline on LoCoMo:

```shell
context-bench memory \
  --system naive \
  --relay http://localhost:7878 \
  --dataset locomo
```
2. Compare multiple systems

Run naive and embedding side by side:

```shell
context-bench memory \
  --system naive \
  --system embedding \
  --relay http://localhost:7878 \
  --dataset locomo -n 20
```
3. Filter QA types

Evaluate only temporal and multi-hop questions:

```shell
context-bench memory \
  --system naive \
  --relay http://localhost:7878 \
  --qa-types temporal,multi_hop
```
4. Limit conversations

Cap the number of conversations to evaluate:

```shell
context-bench memory \
  --system naive \
  --system embedding \
  --relay http://localhost:7878 \
  -n 20
```

Output

The table output includes both a top-level summary and a per-QA-type breakdown:
| System    | mean_score |
|-----------|------------|
| naive     | 0.4120     |
| embedding | 0.5340     |

## Per-QA-Type Breakdown

| System    | multi_hop | open_domain | single_hop | temporal |
|-----------|-----------|-------------|------------|----------|
| naive     | 0.3210    | 0.4800      | 0.5100     | 0.3670   |
| embedding | 0.4450    | 0.6020      | 0.6310     | 0.4120   |
The per-QA-type breakdown is automatically enabled when LoCoMo data is loaded.
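
The breakdown itself is a simple group-by over per-question scores. A sketch of that aggregation, with illustrative row fields (`qa_type`, `f1`) rather than context-bench's exact internal schema:

```python
from collections import defaultdict

def per_qa_type_means(rows: list[dict]) -> dict[str, float]:
    """Mean score per QA type, given rows like {"qa_type": ..., "f1": ...}."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        buckets[row["qa_type"]].append(row["f1"])
    return {t: sum(scores) / len(scores) for t, scores in buckets.items()}
```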

CLI reference

| Flag | Default | Description |
|------|---------|-------------|
| `--system` | (required) | Memory system to evaluate. Repeatable. One of: `naive`, `embedding`, `rlm`, `mem0`, `zep`. |
| `--relay` | (required) | OpenAI-compatible relay URL for LLM calls. |
| `--dataset` | `locomo` | Dataset to evaluate on: `locomo` or `longmemeval`. Repeatable. |
| `--model` | `claude-haiku-4-5-20251001` | Model name. |
| `-n` | all | Max conversations to evaluate. |
| `--qa-types` | all | Comma-separated QA types to filter, e.g. `temporal,multi_hop`. |
| `--output` | `table` | Output format: `table`, `json`, or `html`. |
| `--score-field` | `f1` | Score field to report. |
| `--seed` | `42` | Random seed for dataset splits. |

Optional dependencies

mem0 and zep require extra packages. Install them before use:
```shell
pip install "context-bench[mem0]"   # for --system mem0
pip install "context-bench[zep]"    # for --system zep
```

Python API

Use evaluate_memory() directly for programmatic access:
```python
from context_bench.memory_runner import evaluate_memory
from context_bench.systems.naive import NaiveSystem
from context_bench.systems.embedding import EmbeddingSystem
from context_bench.evaluators import AnswerQuality
from context_bench.metrics.per_qa_type import PerQATypeMetric
from context_bench.datasets.memory import locomo

systems = [
    NaiveSystem(base_url="http://localhost:7878", model="claude-haiku-4-5-20251001"),
    EmbeddingSystem(base_url="http://localhost:7878", model="claude-haiku-4-5-20251001"),
]

dataset = locomo(n=20, qa_types=["temporal", "multi_hop"])

result = evaluate_memory(
    systems=systems,
    dataset=dataset,
    evaluators=[AnswerQuality()],
    metrics=[PerQATypeMetric(score_field="f1")],
    max_examples=20,
)

print(result.summary)
```
