Memory Subcommand

The memory subcommand evaluates stateful memory systems on long-conversation QA datasets. It measures how well each system retains and retrieves information across turns, scoring answers against ground truth with F1, exact match, and per-QA-type breakdowns.
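As a rough illustration of the two headline metrics, the sketch below computes exact match and token-level F1 in the common SQuAD style. The normalization steps (lowercasing, stripping punctuation and English articles) and the function names are assumptions made for illustration; the page does not state how context-bench normalizes answers.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 when the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall against the ground truth."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when an answer shares some tokens with the ground truth, while exact match only rewards a full match after normalization.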

Usage

context-bench memory --system NAME --relay URL [options]

Flags

--system

string

required

Memory system to evaluate. Repeatable: pass multiple --system flags to compare systems in one run. At least one is required. Available systems: naive, embedding, rlm, mem0, zep.

context-bench memory --system naive --system mem0 --relay http://localhost:7878

--relay

string

required

OpenAI-compatible relay URL. All memory systems use this endpoint to call the language model.

--dataset

string

default:"locomo"

Memory dataset to evaluate on. Repeatable. Accepts locomo or longmemeval. When omitted, defaults to locomo. Pass both to merge the datasets:

context-bench memory --system naive --relay http://localhost:7878 \
  --dataset locomo --dataset longmemeval

--model

string

default:"claude-haiku-4-5-20251001"

Model name used for answering questions and running the MemoryJudge evaluator.

--api-key

string

Bearer token for the relay. When omitted, falls back to the OPENAI_API_KEY environment variable.
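The fallback described for --api-key can be sketched as follows. The helper names resolve_api_key and auth_headers are hypothetical, and treating an empty string the same as an omitted flag is an assumption. Only the precedence (flag first, then OPENAI_API_KEY) and the Bearer scheme come from the description above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the explicit flag value; otherwise fall back to OPENAI_API_KEY."""
    if cli_value:
        return cli_value
    return env.get("OPENAI_API_KEY") or None


def auth_headers(api_key: Optional[str]) -> dict[str, str]:
    """Build the Bearer authorization header, or no header when no key is set."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}
```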

-n

integer

default:"all"

Maximum number of conversations to evaluate. Useful for quick runs.

--qa-types

string

Comma-separated list of QA types to filter. When omitted, all QA types are included. Example types: temporal, multi_hop, single_hop, adversarial.

--qa-types temporal,multi_hop
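One way to picture the filter: split the comma-separated value into a set and keep only questions whose type is in it. This is a minimal sketch, not the tool's implementation. The helper names and the "qa_type" key are hypothetical, and ignoring stray whitespace around the commas is an assumption.

```python
from typing import Iterable, Optional


def parse_qa_types(raw: Optional[str]) -> Optional[set[str]]:
    """Turn 'temporal,multi_hop' into {'temporal', 'multi_hop'}; None means no filter."""
    if raw is None:
        return None
    types = {part.strip() for part in raw.split(",") if part.strip()}
    return types or None


def filter_questions(questions: Iterable[dict], qa_types: Optional[set[str]]) -> list[dict]:
    """Keep every question when no filter is set, otherwise only matching types."""
    if qa_types is None:
        return list(questions)
    # "qa_type" is a hypothetical field name used only for this illustration.
    return [q for q in questions if q["qa_type"] in qa_types]
```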

--output

string

default:"table"

Output format. One of table, json, or html.

--score-field

string

default:"f1"

Score field to report as the primary metric. Also controls which field is used in the per-QA-type breakdown table.

--seed

integer

default:"42"

Random seed for dataset splits. Set this to reproduce results across runs.

Memory systems

System	Description	Extra dependency
`naive`	Concatenates all conversation turns into the context window	None
`embedding`	Embeds turns and retrieves the top-k most relevant at query time	None
`rlm`	Retrieval-augmented language model using LanceDB and sentence-transformers	`pip install dspy lancedb duckdb sentence-transformers`
`mem0`	Integrates the mem0ai memory layer	`pip install context-bench[mem0]`
`zep`	Integrates Zep’s graphiti-core knowledge graph memory	`pip install context-bench[zep]`
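The contrast between the naive and embedding rows in the table above can be sketched in a few lines. This is a toy illustration only: a bag-of-words count vector stands in for a real embedding model, and the function names, the default k, and the choice to restore conversation order are all assumptions rather than details of context-bench.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def naive_context(turns: list[str]) -> str:
    """Naive strategy: put every turn into the context."""
    return "\n".join(turns)


def top_k_context(turns: list[str], question: str, k: int = 2) -> str:
    """Embedding strategy: keep only the k turns most similar to the question."""
    q = embed(question)
    ranked = sorted(
        range(len(turns)),
        key=lambda i: cosine(embed(turns[i]), q),
        reverse=True,
    )
    keep = sorted(ranked[:k])  # restore conversation order
    return "\n".join(turns[i] for i in keep)
```

The naive approach grows with conversation length, whereas top-k retrieval keeps the context bounded at the cost of possibly missing a relevant turn.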

The mem0 and zep systems require optional dependencies. Install them before use or the CLI will exit with an installation hint.

pip install context-bench[mem0]
pip install context-bench[zep]

Available datasets

Name	Description
`locomo`	LoCoMo — long-conversation QA with temporal and multi-hop questions
`longmemeval`	LongMemEval — long-form memory evaluation benchmark

Examples

context-bench memory --system naive --relay http://localhost:7878

The memory subcommand automatically adds a MemoryJudge evaluator alongside AnswerQuality, and computes per-QA-type score breakdowns in the output table.
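A per-QA-type breakdown amounts to grouping per-question scores by type and averaging the chosen score field. The sketch below assumes each result is a dict with a hypothetical "qa_type" key and a numeric score under the selected field. Using the mean as the aggregate is also an assumption, since the page does not say how scores are combined.

```python
from collections import defaultdict
from statistics import mean


def breakdown(rows: list[dict], score_field: str = "f1") -> dict[str, float]:
    """Average the selected score field within each QA type."""
    by_type: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        # "qa_type" is a hypothetical field name used only for this illustration.
        by_type[row["qa_type"]].append(row[score_field])
    return {qa_type: mean(scores) for qa_type, scores in sorted(by_type.items())}
```

Passing a different score_field mirrors what --score-field is described as doing: it changes which column feeds the breakdown.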
