IFEvalChecker evaluates instruction-following by running a set of programmatic checks against the model’s response. Each check tests one verifiable constraint from the IFEval benchmark.
from context_bench.evaluators import IFEvalChecker

Constructor

IFEvalChecker takes no constructor parameters.
ev = IFEvalChecker()

score()

def score(self, original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]
original (dict, required)
The unmodified example dict. Must contain:
  • "instruction_id_list" — a list of instruction ID strings (e.g. ["punctuation:no_comma", "keywords:existence"]).
  • "kwargs" — a list of parameter dicts, one per instruction ID (e.g. [{}, {"keywords": ["hello"]}]).
processed (dict, required)
The output dict returned by the system under test. Must contain a "response" key with the model’s output string.

Return values

ifeval_strict (float, required)
1.0 if all instructions in instruction_id_list pass, otherwise 0.0.
ifeval_loose (float, required)
Fraction of instructions that pass: passed / total. Ranges from 0.0 to 1.0.
If instruction_id_list is empty, both scores are 1.0. Unknown instruction IDs are counted as passing (fail-open).
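
The two scores can be read as a simple reduction over per-instruction pass/fail flags. The sketch below is an illustration of the rules stated above, not the library's code; the helper name aggregate is hypothetical.

```python
def aggregate(results: list[bool]) -> dict[str, float]:
    """Combine per-instruction pass/fail flags into the two documented scores.

    Hypothetical helper: it mirrors the documented rules, not the library source.
    """
    if not results:
        # Documented behavior: an empty instruction list scores 1.0 on both.
        return {"ifeval_strict": 1.0, "ifeval_loose": 1.0}
    passed = sum(results)
    return {
        # Strict is all-or-nothing.
        "ifeval_strict": 1.0 if passed == len(results) else 0.0,
        # Loose is the fraction of instructions that passed.
        "ifeval_loose": passed / len(results),
    }


print(aggregate([True, False]))
# {'ifeval_strict': 0.0, 'ifeval_loose': 0.5}
```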

Example

from context_bench.evaluators import IFEvalChecker

ev = IFEvalChecker()
# "kwargs" is aligned by position with "instruction_id_list".
ev.score(
    {"instruction_id_list": ["punctuation:no_comma", "keywords:existence"],
     "kwargs": [{}, {"keywords": ["hello"]}]},
    {"response": "hello world"},  # no commas, and contains "hello"
)
# {'ifeval_strict': 1.0, 'ifeval_loose': 1.0}

Auto-wired datasets

| CLI name | Dataset |
| --- | --- |
| ifeval | IFEval |

Supported instruction types

The evaluator dispatches each instruction ID to a dedicated checker method. All 19 supported types are listed below.
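
One plausible shape for this dispatch is a lookup table from instruction ID to checker function, with unknown IDs treated as passing. The sketch below is a guess at the pattern, not the evaluator's code: REGISTRY, run_checks, and the two checker functions are hypothetical names, and case-insensitive keyword matching is an assumption.

```python
from typing import Any, Callable

# A checker takes the response text and the kwargs for one instruction.
Checker = Callable[[str, dict[str, Any]], bool]


def check_no_comma(response: str, kwargs: dict[str, Any]) -> bool:
    return "," not in response


def check_keywords_existence(response: str, kwargs: dict[str, Any]) -> bool:
    # Assumption: matching ignores case.
    lowered = response.lower()
    return all(k.lower() in lowered for k in kwargs.get("keywords", []))


# Hypothetical registry; the real evaluator covers all 19 instruction types.
REGISTRY: dict[str, Checker] = {
    "punctuation:no_comma": check_no_comma,
    "keywords:existence": check_keywords_existence,
}


def run_checks(
    instruction_ids: list[str],
    kwargs_list: list[dict[str, Any]],
    response: str,
) -> list[bool]:
    """Return one pass/fail flag per instruction, pairing IDs with kwargs by position."""
    results = []
    for instruction_id, kwargs in zip(instruction_ids, kwargs_list):
        checker = REGISTRY.get(instruction_id)
        # Fail-open: an unknown instruction ID counts as passing.
        results.append(True if checker is None else checker(response, kwargs))
    return results


print(run_checks(
    ["punctuation:no_comma", "keywords:existence", "unknown:id"],
    [{}, {"keywords": ["hello"]}, {}],
    "hello, world",
))
# [False, True, True]
```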

Keyword constraints

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| keywords:existence | All specified keywords must appear in the response | keywords: list[str] |
| keywords:forbidden_words | None of the specified words may appear | forbidden_words: list[str] |
| keywords:frequency | A keyword must appear at least / at most / exactly N times | keyword: str, frequency: int, relation: str |
| keywords:letter_frequency | A specific letter must appear at least / at most / exactly N times | letter: str, let_frequency: int, let_relation: str |
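
The two frequency checks share a count-then-compare pattern. The sketch below illustrates it under stated assumptions: the relation strings "at least", "at most", and "exactly" are guessed from the descriptions above (the accepted values are not documented here), matching is assumed to be case-insensitive, and all function names are hypothetical.

```python
import operator
import re
from typing import Any

# Assumed relation strings; the table only says "at least / at most / exactly".
RELATIONS = {
    "at least": operator.ge,
    "at most": operator.le,
    "exactly": operator.eq,
}


def compare(count: int, target: int, relation: str) -> bool:
    """Apply the named relation to an observed count and its target."""
    return RELATIONS[relation](count, target)


def check_keyword_frequency(response: str, kwargs: dict[str, Any]) -> bool:
    # Assumption: case-insensitive, whole-word matching.
    pattern = r"\b" + re.escape(kwargs["keyword"]) + r"\b"
    count = len(re.findall(pattern, response, flags=re.IGNORECASE))
    return compare(count, kwargs["frequency"], kwargs["relation"])


def check_letter_frequency(response: str, kwargs: dict[str, Any]) -> bool:
    # Assumption: upper- and lowercase forms of the letter both count.
    count = response.lower().count(kwargs["letter"].lower())
    return compare(count, kwargs["let_frequency"], kwargs["let_relation"])
```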

Length constraints

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| length_constraints:number_words | Response must have at least / at most / exactly N words | num_words: int, relation: str |
| length_constraints:number_sentences | Response must have at least / at most / exactly N sentences | num_sentences: int, relation: str |
| length_constraints:number_paragraphs | Response must have at least N paragraphs (double-newline separated) | num_paragraphs: int |
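
How words, sentences, and paragraphs are counted determines what these checks accept. The sketch below shows one simple set of counting rules; only the double-newline paragraph separator comes from the table above. Whitespace word splitting and splitting sentences on ., !, and ? are assumptions, and the function names are hypothetical.

```python
import re
from typing import Any


def count_words(text: str) -> int:
    # Assumption: words are whitespace-separated tokens.
    return len(text.split())


def count_sentences(text: str) -> int:
    # Assumption: a sentence ends at one or more of . ! ?
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def count_paragraphs(text: str) -> int:
    # Paragraphs are separated by a blank line (double newline); empty chunks are ignored.
    return len([p for p in re.split(r"\n\s*\n", text) if p.strip()])


def check_number_paragraphs(response: str, kwargs: dict[str, Any]) -> bool:
    # The table lists this check as "at least N" only.
    return count_paragraphs(response) >= kwargs["num_paragraphs"]


text = "One two. Three!\n\nFour five six?"
print(count_words(text), count_sentences(text), count_paragraphs(text))
# 6 3 2
```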

Format constraints

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| detectable_format:json_format | Response must be valid JSON | none |
| detectable_format:number_bullet_points | Response must have at least N bullet points (*, -, or ) | num_bullets: int |
| detectable_format:title | Response must contain a title in <<double angle brackets>> | none |
| detectable_format:constrained_response | Response (stripped) must exactly equal one of the allowed strings | constrained_response: list[str] |
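
The format checks lend themselves to a parser call or a short regular expression. The sketch below is illustrative rather than the evaluator's own logic: the function names are hypothetical, the bullet check recognizes only lines starting with * or - (an assumption), and the title pattern is one reasonable reading of "double angle brackets".

```python
import json
import re
from typing import Any


def check_json_format(response: str, kwargs: dict[str, Any]) -> bool:
    try:
        json.loads(response.strip())
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def check_number_bullet_points(response: str, kwargs: dict[str, Any]) -> bool:
    # Assumption: a bullet is a line starting with "* " or "- " after optional indentation.
    bullets = re.findall(r"^[ \t]*[*-][ \t]+\S", response, flags=re.MULTILINE)
    return len(bullets) >= kwargs["num_bullets"]


def check_title(response: str, kwargs: dict[str, Any]) -> bool:
    # Assumption: a title is non-empty text on one line wrapped in << and >>.
    return re.search(r"<<[^\n<>]+>>", response) is not None


def check_constrained_response(response: str, kwargs: dict[str, Any]) -> bool:
    # The stripped response must equal one of the allowed strings exactly.
    return response.strip() in kwargs["constrained_response"]
```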

Content constraints

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| detectable_content:number_placeholders | Response must contain at least N [placeholder] patterns | num_placeholders: int |
| detectable_content:postscript | Response must contain the postscript marker (default P.S.) | postscript_marker: str |
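
Both content checks reduce to a pattern search. In the sketch below the function names are hypothetical and the exact placeholder pattern (non-empty text in square brackets, no nesting) is an assumption; the P.S. default comes from the table above.

```python
import re
from typing import Any


def check_number_placeholders(response: str, kwargs: dict[str, Any]) -> bool:
    # Assumption: a placeholder is non-empty text in square brackets, e.g. [name].
    placeholders = re.findall(r"\[[^\[\]]+\]", response)
    return len(placeholders) >= kwargs["num_placeholders"]


def check_postscript(response: str, kwargs: dict[str, Any]) -> bool:
    # Falls back to the documented default marker when none is supplied.
    marker = kwargs.get("postscript_marker", "P.S.")
    return marker in response
```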

Punctuation

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| punctuation:no_comma | Response must contain no commas | none |

Case constraints

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| change_case:english_capital | Every word must start with an uppercase letter | none |
| change_case:english_lowercase | Entire response must be lowercase | none |
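
The case checks follow directly from the descriptions above. The sketch below uses hypothetical function names; how tokens that start with a digit or punctuation mark are treated is not documented, so skipping them is an assumption.

```python
def check_english_capital(response: str) -> bool:
    # Assumption: tokens that begin with a non-letter (digits, quotes) are ignored.
    return all(word[0].isupper() for word in response.split() if word[0].isalpha())


def check_english_lowercase(response: str) -> bool:
    # True when lowercasing the response changes nothing.
    return response == response.lower()


print(check_english_capital("Hello World"), check_english_capital("Hello world"))
# True False
```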

Start/end constraints

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| startend:end_checker | Response must end with the specified phrase | end_phrase: str |

Combination constraints

| Instruction ID | Description | kwargs |
| --- | --- | --- |
| combination:repeat_prompt | Response must contain the original prompt verbatim | prompt_to_repeat: str |
| combination:two_responses | Response must contain ****** as a separator between two responses | none |
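
The end-phrase and combination checks are plain string tests. The sketch below uses hypothetical function names; ignoring trailing whitespace before the end-phrase comparison is an assumption.

```python
from typing import Any


def check_end(response: str, kwargs: dict[str, Any]) -> bool:
    # Assumption: trailing whitespace is ignored before comparing.
    return response.rstrip().endswith(kwargs["end_phrase"])


def check_repeat_prompt(response: str, kwargs: dict[str, Any]) -> bool:
    # The prompt must appear verbatim somewhere in the response.
    return kwargs["prompt_to_repeat"] in response


def check_two_responses(response: str, kwargs: dict[str, Any]) -> bool:
    # Six asterisks separate the two responses.
    return "******" in response
```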
