MemoryJudge uses an external LLM to judge whether a memory system’s answer correctly addresses a question, given the ground-truth answer. It returns a binary score and is designed for short-answer memory QA tasks where token-overlap metrics like F1 can be unreliable.
from context_bench.evaluators import MemoryJudge

Constructor

MemoryJudge(
    base_url: str,
    model: str = "claude-haiku-4-5-20251001",
    api_key: str | None = None,
    timeout: float = 60.0,
    max_retries: int = 3,
    retry_base_delay: float = 1.0,
)

Parameters

base_url
str
required
Root URL of the relay/proxy, e.g. "http://localhost:8080". All requests are routed through this endpoint.
model
str
default:"claude-haiku-4-5-20251001"
Model name sent to the relay.
api_key
str | None
default:"None"
Bearer token. Falls back to the OPENAI_API_KEY environment variable when None.
timeout
float
default:"60.0"
HTTP request timeout in seconds per attempt.
max_retries
int
default:"3"
Retries on transient failures (HTTP 429/5xx, connection errors).
retry_base_delay
float
default:"1.0"
Base delay in seconds for exponential backoff between retries.
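Taken together, max_retries and retry_base_delay define the retry schedule. The exact backoff formula used internally is not documented here; the sketch below assumes plain doubling from the base delay, which is the most common convention.

```python
def retry_delays(max_retries: int = 3, retry_base_delay: float = 1.0) -> list[float]:
    """Delay in seconds before each retry attempt, doubling from the base delay.

    Illustrative only: assumes exponential backoff with factor 2 and no jitter,
    which may differ from the actual implementation.
    """
    return [retry_base_delay * (2 ** attempt) for attempt in range(max_retries)]

print(retry_delays())  # [1.0, 2.0, 4.0]
```

With the defaults, a request that keeps hitting HTTP 429 would wait roughly 1, 2, and 4 seconds before its three retries, each attempt bounded by the timeout parameter.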

score()

def score(self, original: dict, processed: dict) -> dict[str, float]
original
dict
required
The unmodified example dict. Uses "question" and "answer" (ground truth).
processed
dict
required
The output dict returned by the system. Uses "response" (the system’s answer).

Return values

memory_judge
float
Binary score: 1.0 if the judge considers the answer correct, 0.0 otherwise. On judge failure, returns 0.0 rather than propagating the exception.
memory_judge_raw
float
Currently identical to memory_judge. Reserved for future partial-credit scoring.
When the gold answer is empty (no ground truth available), MemoryJudge returns 0.5 for both fields as a neutral score — neither inflating nor deflating the system’s mean.
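The empty-gold fallback can be sketched as follows. This is a minimal illustration of the rule stated above; the helper name and signature are hypothetical, not part of the library's API.

```python
def judge_scores(gold: str, verdict: float) -> dict[str, float]:
    """Return both score fields, falling back to a neutral 0.5 when no
    ground-truth answer is available (hypothetical helper for illustration)."""
    score = 0.5 if not gold.strip() else verdict
    return {"memory_judge": score, "memory_judge_raw": score}

print(judge_scores("", 1.0))       # {'memory_judge': 0.5, 'memory_judge_raw': 0.5}
print(judge_scores("Paris", 1.0))  # {'memory_judge': 1.0, 'memory_judge_raw': 1.0}
```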

How it works

MemoryJudge prompts the LLM with the question, the gold answer, and the system’s answer. The judge is instructed to return YES if the system answer conveys the same essential information as the gold answer (minor wording differences and extra context are acceptable), and NO otherwise.
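The judging flow above can be sketched as a prompt builder plus a verdict parser. The exact prompt wording and response parsing in context_bench are assumptions; only the YES/NO contract is taken from the description above.

```python
def build_judge_prompt(question: str, gold: str, answer: str) -> str:
    """Assemble the judging prompt (illustrative wording, not the library's)."""
    return (
        "You are grading a memory system's answer.\n"
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"System answer: {answer}\n"
        "Reply YES if the system answer conveys the same essential information "
        "as the gold answer (minor wording differences and extra context are "
        "acceptable). Otherwise reply NO."
    )

def parse_verdict(reply: str) -> float:
    """Map the judge's reply to a binary score: YES -> 1.0, anything else -> 0.0."""
    return 1.0 if reply.strip().upper().startswith("YES") else 0.0

print(parse_verdict("YES"))  # 1.0
print(parse_verdict("NO"))   # 0.0
```

Treating any non-YES reply as 0.0 keeps the score conservative: a malformed or hedged judge response counts as incorrect rather than correct.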

Usage

from context_bench.evaluators import MemoryJudge

judge = MemoryJudge(base_url="http://localhost:8080", model="claude-haiku-4-5-20251001")

judge.score(
    {"question": "Where did Alice grow up?", "answer": "Paris"},
    {"response": "Alice grew up in Paris, France."},
)
# {"memory_judge": 1.0, "memory_judge_raw": 1.0}

judge.score(
    {"question": "Where did Alice grow up?", "answer": "Paris"},
    {"response": "Alice grew up in London."},
)
# {"memory_judge": 0.0, "memory_judge_raw": 0.0}

When it is used

MemoryJudge is used automatically by the context-bench memory CLI subcommand alongside AnswerQuality. For full control, use it via the Python API:
from context_bench import evaluate_memory
from context_bench.evaluators import AnswerQuality, MemoryJudge

result = evaluate_memory(
    systems=[my_memory_system],
    dataset=locomo_data,
    evaluators=[
        AnswerQuality(),
        MemoryJudge(base_url="http://localhost:8080"),
    ],
)

# Access judge scores
for row in result.rows:
    print(row.scores["memory_judge"], row.scores["f1"])
