MemoryJudge uses an external LLM to judge whether a memory system’s answer correctly addresses a question, given the ground-truth answer. It returns a binary score and is designed for short-answer memory QA tasks where token-overlap metrics like F1 can be unreliable.
Constructor
Parameters
- Relay base URL: Root URL of the relay/proxy, e.g. "http://localhost:8080". All requests are routed through this endpoint.
- Model: Model name sent to the relay.
- API key: Bearer token. Falls back to the OPENAI_API_KEY environment variable when None.
- Timeout: HTTP request timeout in seconds per attempt.
- Max retries: Number of retries on transient failures (HTTP 429/5xx, connection errors).
- Retry delay: Base delay in seconds for exponential backoff between retries.
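The retry timing can be sketched as follows. This is a minimal illustration that assumes the base delay doubles after each failed attempt; the function name and the exact doubling schedule are not part of the documented API.

```python
def backoff_delays(base_delay: float, max_retries: int) -> list[float]:
    """Exponential backoff: the base delay doubles after each failed attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]

# e.g. a 1.0 s base delay with 3 retries waits 1.0 s, 2.0 s, then 4.0 s
print(backoff_delays(1.0, 3))  # → [1.0, 2.0, 4.0]
```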
score()
Parameters
- Example: The unmodified example dict. Uses "question" and "answer" (ground truth).
- Prediction: The output dict returned by the system. Uses "response" (the system's answer).
Return values
- memory_judge: Binary score: 1.0 if the judge considers the answer correct, 0.0 otherwise. On judge failure, score() returns 0.0 rather than propagating the exception.
- A second field, currently identical to memory_judge, is reserved for future partial-credit scoring.

When the gold answer is empty (no ground truth available), MemoryJudge returns 0.5 for both fields as a neutral score, neither inflating nor deflating the system's mean.

How it works
MemoryJudge prompts the LLM with the question, the gold answer, and the system’s answer. The judge is instructed to return YES if the system answer conveys the same essential information as the gold answer (minor wording differences and extra context are acceptable), and NO otherwise.
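The judging flow above can be sketched as below. The exact prompt wording and reply parsing are assumptions; only the YES/NO contract and the three inputs (question, gold answer, system answer) come from the description.

```python
def build_judge_prompt(question: str, gold_answer: str, system_answer: str) -> str:
    """Assemble a judge prompt from the question, gold answer, and system answer.
    (Illustrative wording; the real prompt text is internal to MemoryJudge.)"""
    return (
        f"Question: {question}\n"
        f"Gold answer: {gold_answer}\n"
        f"System answer: {system_answer}\n"
        "Reply YES if the system answer conveys the same essential information "
        "as the gold answer (minor wording differences and extra context are "
        "acceptable). Otherwise reply NO."
    )

def parse_verdict(judge_reply: str) -> float:
    """Map the judge's YES/NO reply to the binary score."""
    return 1.0 if judge_reply.strip().upper().startswith("YES") else 0.0
```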
Usage
When it is used
MemoryJudge is automatically used by the context-bench memory CLI subcommand alongside AnswerQuality. For full control, use it directly via the Python API.
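A hedged sketch of the score() contract, with a trivial stub standing in for the real LLM judge. The field names "question", "answer", "response", and memory_judge come from the reference above; the helper name, the judge callable, and the substring stub are hypothetical, and the second (identically valued) return field is omitted.

```python
def score_with_judge(example: dict, prediction: dict, judge) -> dict:
    """Sketch of MemoryJudge.score() semantics: binary credit from an LLM
    judge, 0.5 neutral when no ground truth exists, 0.0 on judge failure."""
    gold = example.get("answer", "")
    if not gold:
        return {"memory_judge": 0.5}  # neutral: no ground truth to judge against
    try:
        verdict = judge(example["question"], gold, prediction["response"])
    except Exception:
        return {"memory_judge": 0.0}  # fail closed instead of propagating
    return {"memory_judge": 1.0 if verdict else 0.0}

# Usage with an exact-substring stub in place of the real LLM call:
stub = lambda question, gold, answer: gold.lower() in answer.lower()
example = {"question": "Where does Alice work?", "answer": "Acme"}
print(score_with_judge(example, {"response": "Alice works at Acme."}, stub))
# → {'memory_judge': 1.0}
```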
