FalseMemoryRate uses an LLM judge to detect whether a memory system’s answer contains hallucinated facts — specific claims (dates, names, numbers, events) that could not have been derived from the conversation history.
Constructor
Parameters
- Root URL of the relay/proxy, e.g. "http://localhost:8080". All requests are routed through this endpoint.
- Model name sent to the relay.
- Bearer token. Falls back to the OPENAI_API_KEY environment variable when None.
- HTTP request timeout in seconds per attempt.
- Retries on transient failures (HTTP 429/5xx, connection errors).
- Base delay in seconds for exponential backoff between retries.
score()
Parameters

- The unmodified example dict. Uses "context" (conversation history) and "question".
- The output dict returned by the system. Uses "response" (the system's answer).

Return values

1.0 if the judge detects hallucination (the answer contains specific facts absent from the conversation), 0.0 otherwise. On judge failure, returns 0.0 (conservative) rather than propagating the exception.

How it works
FalseMemoryRate prompts the LLM judge with the conversation context, the question, and the system’s answer. The judge is instructed to detect whether the answer contains specific claims (dates, names, numbers, places, events) that are absent from the conversation and therefore potentially fabricated.
Facts that appear in the conversation are not flagged as hallucinations — the memory system is expected to recall them.
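The flow above can be sketched end to end. The prompt wording and the `call_judge` stand-in are assumptions (the real metric sends the prompt through the relay); only the input keys, the YES/NO-style verdict, the 1.0/0.0 return values, and the conservative 0.0 on judge failure come from the docs:

```python
def build_judge_prompt(context: str, question: str, answer: str) -> str:
    # The judge sees the conversation, the question, and the answer, and is
    # asked to flag specific claims that are absent from the conversation.
    return (
        "You are checking a memory system's answer for fabricated facts.\n"
        f"Conversation:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Does the answer contain specific facts (dates, names, numbers, "
        "places, events) that do not appear in the conversation? "
        "Reply YES or NO."
    )


def score(example: dict, prediction: dict, call_judge) -> float:
    prompt = build_judge_prompt(
        example["context"], example["question"], prediction["response"]
    )
    try:
        verdict = call_judge(prompt)
    except Exception:
        # Judge failure: return the conservative 0.0 instead of raising.
        return 0.0
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```

Note the orientation: 1.0 means a hallucination *was* detected, so lower scores are better for this metric.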
Usage
FalseMemoryRate requires a live relay endpoint. It complements AnswerQuality and MemoryJudge in memory system evaluations.

When it is used

FalseMemoryRate is designed for memory system evaluations (evaluate_memory()). The context-bench memory subcommand uses AnswerQuality and MemoryJudge by default; add FalseMemoryRate via the Python API for hallucination tracking:
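Since per-example scores are 1.0 (hallucination detected) or 0.0, averaging them over a run yields the false-memory rate itself. The aggregation can be sketched with a stub scorer standing in for the judge-backed metric (`false_memory_score` here is hypothetical, not the library's API):

```python
def false_memory_rate(examples, predictions, false_memory_score) -> float:
    """Mean per-example score = fraction of answers flagged as hallucinated."""
    scores = [
        false_memory_score(ex, pred)
        for ex, pred in zip(examples, predictions)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

In a real run the scorer would be the metric's score() method, called with each example and the memory system's output dict.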
