The `context-bench memory` subcommand evaluates systems that maintain state across a conversation: they ingest a conversation history, then answer questions about it. Unlike the standard proxy benchmark, memory systems are stateful; each conversation gets a fresh reset before ingestion.
## What memory systems are
A memory system manages what the LLM remembers from a long conversation. Instead of stuffing the entire history into the context window, a memory system may compress, index, or selectively retrieve facts before answering each question. context-bench measures F1 against ground-truth answers and tracks how many tokens each system uses per query.

## The MemorySystem protocol
Implement these three methods (no subclassing needed): `reset()` before each conversation, `ingest()` to load the history, and then `query()` for each QA pair in that conversation.
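The three-method contract can be sketched as a `typing.Protocol`. The signatures below are illustrative assumptions (the actual library may pass richer conversation and QA objects), and `KeywordMemory` is a hypothetical toy system, not one of the built-ins:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class MemorySystem(Protocol):
    """Structural interface: any class with these three methods
    qualifies. Signatures are assumptions for illustration."""

    def reset(self) -> None: ...
    def ingest(self, conversation: list[dict]) -> None: ...
    def query(self, question: str) -> str: ...


class KeywordMemory:
    """Toy conforming system: stores raw turns and answers by
    returning the first turn that shares a word with the question."""

    def __init__(self) -> None:
        self.turns: list[dict] = []

    def reset(self) -> None:
        # Fresh state before each conversation.
        self.turns = []

    def ingest(self, conversation: list[dict]) -> None:
        # Load the full conversation history.
        self.turns.extend(conversation)

    def query(self, question: str) -> str:
        # Naive keyword overlap between question and stored turns.
        words = set(question.lower().split())
        for turn in self.turns:
            if words & set(turn["text"].lower().split()):
                return turn["text"]
        return ""
```

Because the protocol is structural, `KeywordMemory` satisfies it without inheriting from anything.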
## Built-in memory systems
| Name | Description |
|---|---|
| `naive` | Stuffs the full conversation history into the prompt. Baseline; every system claims to beat this. |
| `embedding` | Embeds turns with all-MiniLM-L6-v2 and retrieves the top-k most relevant chunks via cosine similarity. |
| `rlm` | DSPy-based retrieval with LanceDB vector storage and DuckDB for structured queries. |
| `mem0` | Uses the mem0ai package for memory management. Requires `pip install context-bench[mem0]`. |
| `zep` | Uses graphiti-core for temporal knowledge graph memory. Requires `pip install context-bench[zep]`. |
## Available datasets
| Name | Description |
|---|---|
| `locomo` | Long-conversation memory benchmark. Includes `single_hop`, `multi_hop`, `temporal`, and `open_domain` QA types. |
| `longmemeval` | Long-memory evaluation with variants `s`, `m`, and `oracle`. |
## CLI examples
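Illustrative invocations built from the flags in the CLI reference; the relay URL is a placeholder for your own endpoint:

```shell
# Evaluate the naive baseline on locomo (the default dataset)
context-bench memory --system naive --relay http://localhost:8000/v1

# Compare two systems on 10 conversations, temporal questions only, as JSON
context-bench memory --system naive --system embedding \
  --relay http://localhost:8000/v1 \
  --dataset locomo --qa-types temporal -n 10 --output json
```

`--system` and `--dataset` are repeatable, so a single run can sweep several systems across several datasets.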
## Output
The table output includes both a top-level summary and a per-QA-type breakdown.

## CLI reference
| Flag | Default | Description |
|---|---|---|
| `--system` | (required) | Memory system to evaluate. Repeatable. One of: `naive`, `embedding`, `rlm`, `mem0`, `zep`. |
| `--relay` | (required) | OpenAI-compatible relay URL for LLM calls. |
| `--dataset` | `locomo` | Dataset to evaluate on: `locomo` or `longmemeval`. Repeatable. |
| `--model` | `claude-haiku-4-5-20251001` | Model name. |
| `-n` | all | Max conversations to evaluate. |
| `--qa-types` | all | Comma-separated QA types to filter, e.g. `temporal,multi_hop`. |
| `--output` | `table` | Output format: `table`, `json`, or `html`. |
| `--score-field` | `f1` | Score field to report. |
| `--seed` | `42` | Random seed for dataset splits. |
## Optional dependencies
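The `mem0` and `zep` backends pull in extra packages via pip extras, as noted in the built-in systems table; only those two extras are documented here:

```shell
pip install context-bench           # base install
pip install "context-bench[mem0]"   # adds the mem0ai backend
pip install "context-bench[zep]"    # adds the graphiti-core backend
```

Quoting the extras spec keeps shells with bracket globbing (such as zsh) from mangling it.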
## Python API
Use `evaluate_memory()` directly for programmatic access:
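A hypothetical sketch of such a call: the exact signature of `evaluate_memory()` is not documented in this section, so the parameter names below are assumptions that mirror the CLI flags, not the library's confirmed API.

```python
from context_bench import evaluate_memory  # import path assumed

# Parameter names mirror the CLI flags and are illustrative only.
results = evaluate_memory(
    systems=["naive", "embedding"],
    dataset="locomo",
    relay="http://localhost:8000/v1",
    model="claude-haiku-4-5-20251001",
    qa_types=["temporal", "multi_hop"],
    n=10,
    seed=42,
)
```

Check the package's own docstrings for the actual signature and return type before relying on this shape.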
