The `memory` subcommand evaluates stateful memory systems on long-conversation QA datasets. It measures how well each system retains and retrieves information across turns by scoring answers against ground truth with F1 and exact match, and it reports a per-QA-type breakdown of those scores.
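To make the two metrics concrete, here is a minimal sketch of token-level F1 and exact match as they are commonly defined for QA evaluation. The normalization steps (lowercasing, dropping punctuation and English articles) and the function names are assumptions for illustration; the tool's actual scorer may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 when the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall against the ground truth."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

An answer such as "in Paris" against the ground truth "Paris" fails exact match but still earns partial F1 credit, which is why the two metrics are usually reported together.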
Usage
Flags
Memory system to evaluate. Repeatable: pass multiple `--system` flags to compare systems in one run. At least one is required. Available systems: `naive`, `embedding`, `rlm`, `mem0`, `zep`.

OpenAI-compatible relay URL. All memory systems use this endpoint to call the language model.

Memory dataset to evaluate on. Repeatable. Accepts `locomo` or `longmemeval`. When omitted, defaults to `locomo`. Pass both to merge the datasets.

Model name used for answering questions and running the `MemoryJudge` evaluator.

Bearer token for the relay. When omitted, falls back to the `OPENAI_API_KEY` environment variable.

Maximum number of conversations to evaluate. Useful for quick runs.

Comma-separated list of QA types to filter. When omitted, all QA types are included. Example types: `temporal`, `multi_hop`, `single_hop`, `adversarial`.

Output format. One of `table`, `json`, or `html`.

Score field to report as the primary metric. Also controls which field is used in the per-QA-type breakdown table.

Random seed for dataset splits. Set this to reproduce results across runs.
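The fallback and filtering behavior described for the bearer token and the QA-type list can be sketched as follows. This is an illustration only: the helper names and the `qa_type` record field are hypothetical and are not taken from the tool's source.

```python
import os
from typing import Optional


def resolve_api_key(flag_value: Optional[str]) -> Optional[str]:
    """Prefer the explicit token; otherwise fall back to OPENAI_API_KEY."""
    return flag_value or os.environ.get("OPENAI_API_KEY")


def parse_qa_types(raw: Optional[str]) -> Optional[set]:
    """Split a comma-separated list; None means every QA type is kept."""
    if not raw:
        return None
    return {part.strip() for part in raw.split(",") if part.strip()}


def filter_by_qa_type(examples: list, qa_types: Optional[set]) -> list:
    """Keep only examples whose (hypothetical) qa_type field is selected."""
    if qa_types is None:
        return list(examples)
    return [ex for ex in examples if ex["qa_type"] in qa_types]
```

Treating an omitted filter as `None` rather than an empty set keeps "no filter" distinct from "filter that matches nothing".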
Memory systems
| System | Description | Extra dependency |
|---|---|---|
| `naive` | Concatenates all conversation turns into the context window | None |
| `embedding` | Embeds turns and retrieves the top-k most relevant at query time | None |
| `rlm` | Retrieval-augmented language model using LanceDB and sentence-transformers | `pip install dspy lancedb duckdb sentence-transformers` |
| `mem0` | Integrates the mem0ai memory layer | `pip install context-bench[mem0]` |
| `zep` | Integrates Zep’s graphiti-core knowledge graph memory | `pip install context-bench[zep]` |
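The difference between the `naive` and `embedding` strategies can be illustrated with a small sketch. It uses a bag-of-words count vector as a toy stand-in for a real embedding model, and every function name here is hypothetical; it is not the implementation shipped with the tool.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naive_context(turns: list) -> str:
    """Naive strategy: concatenate every turn into the context."""
    return "\n".join(turns)


def top_k_context(turns: list, question: str, k: int = 2) -> str:
    """Embedding strategy: keep the k turns most similar to the question."""
    q = embed(question)
    ranked = sorted(
        range(len(turns)),
        key=lambda i: cosine(embed(turns[i]), q),
        reverse=True,
    )[:k]
    # Restore conversation order so the retrieved turns read chronologically.
    return "\n".join(turns[i] for i in sorted(ranked))
```

The naive approach never loses information but grows with the conversation, while top-k retrieval keeps the context bounded at the cost of possibly missing a relevant turn.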
Available datasets
| Name | Description |
|---|---|
| `locomo` | LoCoMo: long-conversation QA with temporal and multi-hop questions |
| `longmemeval` | LongMemEval: long-form memory evaluation benchmark |
Examples
The `memory` subcommand automatically adds a `MemoryJudge` evaluator alongside `AnswerQuality`, and computes per-QA-type score breakdowns in the output table.
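A per-QA-type breakdown amounts to grouping per-question scores by type and averaging the chosen score field. The sketch below assumes each result row is a dictionary with a `qa_type` key and a numeric score key such as `f1`; those field names and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def breakdown_by_qa_type(rows: list, score_field: str = "f1") -> dict:
    """Average the chosen score field within each QA type."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["qa_type"]].append(row[score_field])
    # Sort by QA type so the table rows come out in a stable order.
    return {qa_type: mean(scores) for qa_type, scores in sorted(buckets.items())}
```

Passing a different `score_field` changes which metric the breakdown averages, mirroring how the primary score field selects the column used in the breakdown table.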