NLILabelMatch evaluates natural language inference and fact-verification tasks by extracting the predicted classification label from free-form model output and comparing it to the ground-truth label. A built-in alias table maps common variants (e.g. "yes" → "entailment") to canonical forms before comparison.
from context_bench.evaluators import NLILabelMatch

Constructor

NLILabelMatch takes no constructor parameters.
ev = NLILabelMatch()

score()

def score(self, original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]
original (dict, required): The unmodified example dict. Must contain an "answer" key with the reference label string.
processed (dict, required): The output dict returned by the system under test. Must contain a "response" key with the model's output string.

Return values

nli_accuracy (float, required): 1.0 if the extracted label matches the reference label (after alias normalization), otherwise 0.0.
If answer is empty, returns nli_accuracy: 1.0. If response is empty, returns nli_accuracy: 0.0.
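
The two empty-input rules can be illustrated with a small stand-in function. This is a sketch of the documented behavior, not the library's source: score_sketch and extract_label are hypothetical names, the default extractor only lowercases and strips, and checking the empty answer before the empty response is an assumption (the text above does not say which rule wins when both are empty).

```python
from typing import Any, Callable


def score_sketch(
    original: dict[str, Any],
    processed: dict[str, Any],
    extract_label: Callable[[str], str] = lambda text: text.strip().lower(),
) -> dict[str, float]:
    """Hypothetical stand-in for score(), covering only the edge cases."""
    answer = str(original.get("answer", "")).strip()
    response = str(processed.get("response", "")).strip()

    # No reference label: nothing to grade against, so the example counts as correct.
    # (Checking this before the empty response is an assumption.)
    if not answer:
        return {"nli_accuracy": 1.0}

    # Reference present but no model output: counted as wrong.
    if not response:
        return {"nli_accuracy": 0.0}

    matched = extract_label(response) == extract_label(answer)
    return {"nli_accuracy": 1.0 if matched else 0.0}
```

Because an empty answer scores 1.0, examples with missing reference labels can inflate the mean accuracy, so it may be worth filtering them out before aggregating.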

Example

from context_bench.evaluators import NLILabelMatch
ev = NLILabelMatch()
ev.score({"answer": "Entailment"}, {"response": "Yes, this is true."})
# {'nli_accuracy': 1.0}  ("yes" maps to "entailment")

Auto-wired datasets

NLILabelMatch is automatically applied when any of the following datasets are selected:
| CLI name | Dataset |
| --- | --- |
| contract-nli | ContractNLI (legal NLI) |
| scifact | SciFact (scientific claim verification) |

Label alias mapping

Both the reference and response are normalized through the same alias table before comparison:
| Input | Canonical label |
| --- | --- |
| entailment, entail, yes, true | entailment |
| contradiction, contradict, no, false | contradiction |
| not mentioned, not_mentioned, neutral, unknown, neither | not mentioned |
| supports, support | supports |
| refutes, refute | refutes |
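
As a rough illustration, the alias table above could be expressed as a lookup dictionary. The entries are transcribed from the table; the names ALIAS_TO_CANONICAL, normalize_label, and labels_match are hypothetical and are not part of the context_bench API. Lowercasing and stripping before lookup is an assumption.

```python
from typing import Optional

# Canonical label -> accepted aliases, transcribed from the alias table.
_GROUPS = {
    "entailment": ("entailment", "entail", "yes", "true"),
    "contradiction": ("contradiction", "contradict", "no", "false"),
    "not mentioned": ("not mentioned", "not_mentioned", "neutral", "unknown", "neither"),
    "supports": ("supports", "support"),
    "refutes": ("refutes", "refute"),
}

# Flatten into alias -> canonical for O(1) lookup.
ALIAS_TO_CANONICAL = {
    alias: canonical for canonical, aliases in _GROUPS.items() for alias in aliases
}


def normalize_label(label: str) -> Optional[str]:
    """Return the canonical label for a known alias, or None if unrecognized."""
    return ALIAS_TO_CANONICAL.get(label.strip().lower())


def labels_match(reference: str, predicted: str) -> bool:
    """Compare two labels after passing both through the same alias table."""
    ref = normalize_label(reference)
    return ref is not None and ref == normalize_label(predicted)
```

Note that the five canonical labels stay distinct: under this table "supports" and "entailment" do not match each other, so a reference label and a response must normalize to the same canonical form to match.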

Implementation notes

Label extraction from the response uses a three-stage cascade:
  1. Exact match — if the entire (lowercased) response is a known alias, return its canonical form immediately.
  2. Structured pattern — searches for phrases like "answer is: entailment", "label: supports", "therefore, the verdict is contradiction", and normalizes the captured word.
  3. Last-occurrence scan — scans the entire response for any known alias keyword and returns the canonical form of the last occurrence, since conclusions typically appear at the end of a response.
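
The cascade above can be sketched as follows. This is a minimal approximation under stated assumptions, not the library's implementation: extract_label is a hypothetical name, the cue words in the structured pattern (answer, label, verdict) are inferred from the example phrases, stage 2 takes the first structured match, and aliases are matched on word boundaries.

```python
import re
from typing import Optional

# Alias -> canonical label, transcribed from the alias table.
ALIASES = {
    "entailment": "entailment", "entail": "entailment",
    "yes": "entailment", "true": "entailment",
    "contradiction": "contradiction", "contradict": "contradiction",
    "no": "contradiction", "false": "contradiction",
    "not mentioned": "not mentioned", "not_mentioned": "not mentioned",
    "neutral": "not mentioned", "unknown": "not mentioned",
    "neither": "not mentioned",
    "supports": "supports", "support": "supports",
    "refutes": "refutes", "refute": "refutes",
}

# Longest aliases first so "not mentioned" and "supports" are tried before shorter ones.
_ALIAS_ALT = "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))

# Stage 2 (assumed pattern): cue word, optional "is", optional colon, then an alias.
_STRUCTURED = re.compile(
    rf"\b(?:answer|label|verdict)\b(?:\s+is)?\s*:?\s*({_ALIAS_ALT})\b"
)

# Stage 3: any alias appearing as a whole word.
_ANY_ALIAS = re.compile(rf"\b({_ALIAS_ALT})\b")


def extract_label(response: str) -> Optional[str]:
    """Return the canonical label found in a free-form response, or None."""
    text = response.strip().lower()

    # Stage 1: the whole response is itself a known alias.
    if text in ALIASES:
        return ALIASES[text]

    # Stage 2: an explicit "answer is: X" / "label: X" / "verdict is X" phrase.
    structured = _STRUCTURED.search(text)
    if structured:
        return ALIASES[structured.group(1)]

    # Stage 3: fall back to the last alias mentioned anywhere in the text.
    hits = _ANY_ALIAS.findall(text)
    return ALIASES[hits[-1]] if hits else None
```

With this sketch, extract_label("The answer is: contradiction. I hope that is true.") returns "contradiction" because stage 2 fires before the trailing "true" is considered, while extract_label("At first it looks like yes, but on reflection: no.") falls through to stage 3 and also returns "contradiction". The last-occurrence fallback is a heuristic: short aliases such as "no" or "true" can appear incidentally in reasoning text, which is why the stricter stages run first.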
