FalseMemoryRate uses an LLM judge to detect whether a memory system’s answer contains hallucinated facts — specific claims (dates, names, numbers, events) that could not have been derived from the conversation history.
from context_bench.evaluators import FalseMemoryRate

Constructor

FalseMemoryRate(
    base_url: str,
    model: str = "claude-haiku-4-5-20251001",
    api_key: str | None = None,
    timeout: float = 60.0,
    max_retries: int = 3,
    retry_base_delay: float = 1.0,
)

Parameters

base_url
str
required
Root URL of the relay/proxy, e.g. "http://localhost:8080". All requests are routed through this endpoint.
model
str
default:"claude-haiku-4-5-20251001"
Model name sent to the relay.
api_key
str | None
default:"None"
Bearer token. Falls back to the OPENAI_API_KEY environment variable when None.
timeout
float
default:"60.0"
HTTP request timeout in seconds per attempt.
max_retries
int
default:"3"
Retries on transient failures (HTTP 429/5xx, connection errors).
retry_base_delay
float
default:"1.0"
Base delay in seconds for exponential backoff between retries.
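Taken together, max_retries and retry_base_delay describe a standard exponential-backoff loop. A minimal sketch (the exact delay formula and exception types are assumptions, not documented behavior):

```python
import time


class TransientError(Exception):
    """Stand-in for retryable failures (HTTP 429/5xx, connection errors)."""


def with_retries(request_fn, max_retries=3, retry_base_delay=1.0):
    # Assumed schedule: sleep retry_base_delay * 2**attempt seconds before
    # each retry, i.e. 1s, 2s, 4s with the defaults shown above.
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            time.sleep(retry_base_delay * 2 ** attempt)
```

With the defaults, a request that keeps failing is attempted four times in total (one initial try plus three retries) before the error propagates.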

score()

def score(self, original: dict, processed: dict) -> dict[str, float]
original
dict
required
The unmodified example dict. Uses "context" (conversation history) and "question".
processed
dict
required
The output dict returned by the system. Uses "response" (the system’s answer).

Return values

false_memory
float
1.0 if the judge detects hallucination (the answer contains specific facts absent from the conversation), 0.0 otherwise. On judge failure, returns 0.0 (conservative) rather than propagating the exception.
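The documented failure behavior amounts to a guard around the judge call. A hypothetical sketch (safe_score and judge_call are illustrative names, not part of the API):

```python
def safe_score(judge_call) -> dict[str, float]:
    # On any judge failure, report 0.0 rather than raising, so a flaky
    # relay lowers rather than inflates the measured hallucination rate.
    try:
        return {"false_memory": judge_call()}
    except Exception:
        return {"false_memory": 0.0}
```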

How it works

FalseMemoryRate prompts the LLM judge with the conversation context, the question, and the system’s answer. The judge is instructed to detect whether the answer contains specific claims (dates, names, numbers, places, events) that are absent from the conversation and therefore potentially fabricated. Facts that appear in the conversation are not flagged as hallucinations — the memory system is expected to recall them.
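The shape of that judge call can be sketched as follows. This is illustrative only: the actual prompt wording and verdict format are internal to the library and the strings below are assumptions.

```python
def build_judge_prompt(context: str, question: str, response: str) -> str:
    # Hypothetical prompt: gives the judge the conversation, the question,
    # and the answer, and asks for a binary verdict.
    return (
        "You are checking a memory system's answer for fabricated facts.\n"
        f"Conversation:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer: {response}\n\n"
        "Does the answer contain specific claims (dates, names, numbers, "
        "places, events) absent from the conversation? "
        "Reply HALLUCINATION or GROUNDED."
    )


def parse_verdict(judge_reply: str) -> float:
    # Conservative parse: only an explicit HALLUCINATION verdict scores 1.0.
    return 1.0 if "HALLUCINATION" in judge_reply.upper() else 0.0
```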

Usage

from context_bench.evaluators import FalseMemoryRate

fmr = FalseMemoryRate(base_url="http://localhost:8080")

fmr.score(
    {
        "context": "USER: I visited Tokyo in March 2022.\nASSISTANT: That's great!",
        "question": "When did the user visit Tokyo?",
    },
    {"response": "The user visited Tokyo in March 2022."},
)
# {"false_memory": 0.0}  — answer is grounded in the conversation

fmr.score(
    {
        "context": "USER: I visited Tokyo last spring.\nASSISTANT: That's great!",
        "question": "When did the user visit Tokyo?",
    },
    {"response": "The user visited Tokyo in March 2022."},
)
# {"false_memory": 1.0}  — specific date not in conversation

FalseMemoryRate requires a live relay endpoint. Add it to memory evaluations through the Python API alongside AnswerQuality and MemoryJudge.

When it is used

FalseMemoryRate is designed for memory system evaluations (evaluate_memory()). The context-bench memory subcommand uses AnswerQuality and MemoryJudge by default; add FalseMemoryRate via the Python API for hallucination tracking:

from context_bench import evaluate_memory
from context_bench.evaluators import AnswerQuality, FalseMemoryRate, MemoryJudge

result = evaluate_memory(
    systems=[my_memory_system],
    dataset=locomo_data,
    evaluators=[
        AnswerQuality(),
        MemoryJudge(base_url="http://localhost:8080"),
        FalseMemoryRate(base_url="http://localhost:8080"),
    ],
)
