The memory subcommand evaluates stateful memory systems on long-conversation QA datasets. It measures how well each system retains and retrieves information across turns, scoring answers against ground truth with F1, exact match, and per-QA-type breakdowns.
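Token-level F1 and exact match are standard QA metrics. A minimal sketch of how they are typically computed — an illustration of the metric definitions, not context-bench's exact implementation:

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == truth.strip().lower())

def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "the cat sat" against ground truth "the cat" scores 0.0 on exact match but 0.8 on F1, which is why the per-QA-type breakdown reports both.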

Usage

context-bench memory --system NAME --relay URL [options]

Flags

--system
string
required
Memory system to evaluate. Repeatable: pass multiple --system flags to compare systems in one run. At least one is required. Available systems: naive, embedding, rlm, mem0, zep.
context-bench memory --system naive --system mem0 --relay http://localhost:7878
--relay
string
required
OpenAI-compatible relay URL. All memory systems use this endpoint to call the language model.
--dataset
string
default:"locomo"
Memory dataset to evaluate on. Repeatable. Accepts locomo or longmemeval. When omitted, defaults to locomo. Pass both to merge the datasets:
context-bench memory --system naive --relay http://localhost:7878 \
  --dataset locomo --dataset longmemeval
--model
string
default:"claude-haiku-4-5-20251001"
Model name used for answering questions and running the MemoryJudge evaluator.
--api-key
string
Bearer token for the relay. When omitted, falls back to the OPENAI_API_KEY environment variable.
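To use the environment-variable fallback instead of the flag, export the token before invoking the CLI (the token value below is a placeholder):

```shell
# Read when --api-key is omitted; replace the placeholder with your real token.
export OPENAI_API_KEY="sk-placeholder"
context-bench memory --system naive --relay http://localhost:7878
```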
-n
integer
default:"all"
Maximum number of conversations to evaluate. Useful for quick runs.
--qa-types
string
Comma-separated list of QA types to filter. When omitted, all QA types are included. Example types: temporal, multi_hop, single_hop, adversarial.
--qa-types temporal,multi_hop
--output
string
default:"table"
Output format. One of table, json, or html.
--score-field
string
default:"f1"
Score field to report as the primary metric. Also controls which field is used in the per-QA-type breakdown table.
--seed
integer
default:"42"
Random seed for dataset splits. Set this to reproduce results across runs.

Memory systems

| System | Description | Extra dependency |
| --- | --- | --- |
| naive | Concatenates all conversation turns into the context window | None |
| embedding | Embeds turns and retrieves the top-k most relevant at query time | None |
| rlm | Retrieval-augmented language model using LanceDB and sentence-transformers | pip install dspy lancedb duckdb sentence-transformers |
| mem0 | Integrates the mem0ai memory layer | pip install context-bench[mem0] |
| zep | Integrates Zep's graphiti-core knowledge graph memory | pip install context-bench[zep] |
The mem0 and zep systems require optional dependencies. Install them before use, or the CLI will exit with an installation hint.
pip install context-bench[mem0]
pip install context-bench[zep]

Available datasets

| Name | Description |
| --- | --- |
| locomo | LoCoMo: long-conversation QA with temporal and multi-hop questions |
| longmemeval | LongMemEval: long-form memory evaluation benchmark |

Examples

context-bench memory --system naive --relay http://localhost:7878
The memory subcommand automatically adds a MemoryJudge evaluator alongside AnswerQuality, and computes per-QA-type score breakdowns in the output table.
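A fuller run combining the flags documented above (values are illustrative):

```shell
# Compare two systems on both datasets, limited to 5 conversations,
# filtered to temporal and multi-hop questions, with JSON output.
context-bench memory \
  --system naive --system mem0 \
  --relay http://localhost:7878 \
  --dataset locomo --dataset longmemeval \
  --qa-types temporal,multi_hop \
  -n 5 \
  --output json
```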
