MemoryJudge uses an external LLM to judge whether a memory system’s answer correctly addresses a question, given the ground-truth answer. It returns a binary score and is designed for short-answer memory QA tasks where token-overlap metrics like F1 can be unreliable.
from context_bench.evaluators import MemoryJudge

Constructor

MemoryJudge(
    base_url: str,
    model: str = "claude-haiku-4-5-20251001",
    api_key: str | None = None,
    timeout: float = 60.0,
    max_retries: int = 3,
    retry_base_delay: float = 1.0,
)

Parameters

base_url
str
required
Root URL of the relay/proxy, e.g. "http://localhost:8080". All requests are routed through this endpoint.
model
str
default:"claude-haiku-4-5-20251001"
Model name sent to the relay.
api_key
str | None
default:"None"
Bearer token. Falls back to the OPENAI_API_KEY environment variable when None.
timeout
float
default:"60.0"
HTTP request timeout in seconds per attempt.
max_retries
int
default:"3"
Retries on transient failures (HTTP 429/5xx, connection errors).
retry_base_delay
float
default:"1.0"
Base delay in seconds for exponential backoff between retries.
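Taken together, max_retries and retry_base_delay define the retry schedule. The exact backoff formula used internally is not documented here; the sketch below assumes plain doubling from the base delay, which is the most common convention.

```python
def retry_delays(max_retries: int = 3, retry_base_delay: float = 1.0) -> list[float]:
    """Delay in seconds before each retry attempt, doubling from the base delay.

    Illustrative only: assumes exponential backoff with factor 2 and no jitter,
    which may differ from the actual implementation.
    """
    return [retry_base_delay * (2 ** attempt) for attempt in range(max_retries)]

print(retry_delays())  # [1.0, 2.0, 4.0]
```

With the defaults, a request that keeps hitting HTTP 429 would wait roughly 1, 2, and 4 seconds before its three retries, each attempt bounded by the timeout parameter.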

score()

def score(self, original: dict, processed: dict) -> dict[str, float]
original
dict
required
The unmodified example dict. Uses "question" and "answer" (ground truth).
processed
dict
required
The output dict returned by the system. Uses "response" (the system’s answer).

Return values

memory_judge
float
Binary score: 1.0 if the judge considers the answer correct, 0.0 otherwise. On judge failure, returns 0.0 rather than propagating the exception.
memory_judge_raw
float
Currently identical to memory_judge. Reserved for future partial-credit scoring.
When the gold answer is empty (no ground truth available), MemoryJudge returns 0.5 for both fields as a neutral score — neither inflating nor deflating the system’s mean.
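The empty-gold fallback can be sketched as follows. This is a minimal illustration of the rule stated above; the helper name and signature are hypothetical, not part of the library's API.

```python
def judge_scores(gold: str, verdict: float) -> dict[str, float]:
    """Return both score fields, falling back to a neutral 0.5 when no
    ground-truth answer is available (hypothetical helper for illustration)."""
    score = 0.5 if not gold.strip() else verdict
    return {"memory_judge": score, "memory_judge_raw": score}

print(judge_scores("", 1.0))       # {'memory_judge': 0.5, 'memory_judge_raw': 0.5}
print(judge_scores("Paris", 1.0))  # {'memory_judge': 1.0, 'memory_judge_raw': 1.0}
```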

How it works

MemoryJudge prompts the LLM with the question, the gold answer, and the system’s answer. The judge is instructed to return YES if the system answer conveys the same essential information as the gold answer (minor wording differences and extra context are acceptable), and NO otherwise.
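The judging flow above can be sketched as a prompt builder plus a verdict parser. The exact prompt wording and response parsing in context_bench are assumptions; only the YES/NO contract is taken from the description above.

```python
def build_judge_prompt(question: str, gold: str, answer: str) -> str:
    """Assemble the judging prompt (illustrative wording, not the library's)."""
    return (
        "You are grading a memory system's answer.\n"
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"System answer: {answer}\n"
        "Reply YES if the system answer conveys the same essential information "
        "as the gold answer (minor wording differences and extra context are "
        "acceptable). Otherwise reply NO."
    )

def parse_verdict(reply: str) -> float:
    """Map the judge's reply to a binary score: YES -> 1.0, anything else -> 0.0."""
    return 1.0 if reply.strip().upper().startswith("YES") else 0.0

print(parse_verdict("YES"))  # 1.0
print(parse_verdict("NO"))   # 0.0
```

Treating any non-YES reply as 0.0 keeps the score conservative: a malformed or hedged judge response counts as incorrect rather than correct.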

Usage

from context_bench.evaluators import MemoryJudge

judge = MemoryJudge(base_url="http://localhost:8080", model="claude-haiku-4-5-20251001")

judge.score(
    {"question": "Where did Alice grow up?", "answer": "Paris"},
    {"response": "Alice grew up in Paris, France."},
)
# {"memory_judge": 1.0, "memory_judge_raw": 1.0}

judge.score(
    {"question": "Where did Alice grow up?", "answer": "Paris"},
    {"response": "Alice grew up in London."},
)
# {"memory_judge": 0.0, "memory_judge_raw": 0.0}

When it is used

MemoryJudge is used automatically by the context-bench memory CLI subcommand alongside AnswerQuality. For full control, use it via the Python API:
from context_bench import evaluate_memory
from context_bench.evaluators import AnswerQuality, MemoryJudge

result = evaluate_memory(
    systems=[my_memory_system],
    dataset=locomo_data,
    evaluators=[
        AnswerQuality(),
        MemoryJudge(base_url="http://localhost:8080"),
    ],
)

# Access judge scores
for row in result.rows:
    print(row.scores["memory_judge"], row.scores["f1"])
