LLMJudge scores open-ended responses by calling an external OpenAI-compatible LLM endpoint with an MT-Bench / AlpacaEval style evaluation prompt. The judge rates the response on a 1–5 scale, which is then normalized to a 0–1 float. It requires no additional Python dependencies beyond the standard library.
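The 1–5 → 0–1 mapping can be sketched in a few lines (the function name `normalize_rating` is illustrative, not part of the library's API):

```python
def normalize_rating(rating: int) -> float:
    """Map a 1-5 judge rating onto the 0-1 range: (rating - 1) / 4."""
    return (rating - 1) / 4.0

# A rating of 5 maps to 1.0, 3 to 0.5, and 1 to 0.0.
```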
Constructor

Parameters:

- Base URL of an OpenAI-compatible API endpoint (e.g. "http://localhost:9090"). The evaluator will POST to {base_url}/v1/chat/completions.
- Model name to pass in the judge request (e.g. "gpt-4", "claude-3-5-sonnet-20241022").
- Bearer token for the judge endpoint. Falls back to the OPENAI_API_KEY environment variable if not provided.
- HTTP request timeout in seconds. If the judge call exceeds this limit, judge_score: 0.0 is returned.

score()
Parameters:

- The unmodified example dict. Uses "question" (or falls back to "context") as the question and "answer" as the reference answer for the judge prompt.
- The output dict returned by the system under test. Must contain a "response" key with the model's output string.

Return values
- The judge's rating normalized from the 1–5 scale to the 0–1 range using the formula (rating - 1) / 4.0. A rating of 5 maps to 1.0; a rating of 1 maps to 0.0.
- If response is empty or the judge call fails for any reason, judge_score: 0.0 is returned rather than raising an exception, so the benchmark continues running.

Examples
CLI — add --judge-url to enable LLMJudge for any dataset:
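For example (the command name is a placeholder; only the --judge-url and --judge-model flags are part of this interface):

```
$ <benchmark-cli> ... --judge-url http://localhost:9090 --judge-model gpt-4
```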
When it is used
LLMJudge is not auto-wired to specific datasets the way other evaluators are. It is activated in two ways:
- CLI: Pass --judge-url <url> (and optionally --judge-model <name>). The judge is added on top of whichever other evaluators are selected.
- Python API: Construct LLMJudge explicitly and include it in the evaluators list passed to evaluate().
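A sketch of the Python route. The import path and the constructor keyword names below are assumptions for illustration; adjust them to wherever LLMJudge and evaluate actually live in the package:

```python
from mybench import LLMJudge, evaluate  # import path is an assumption

judge = LLMJudge(
    base_url="http://localhost:9090",  # OpenAI-compatible endpoint
    model="gpt-4",                     # judge model name
)

# The judge is just another entry in the evaluators list.
results = evaluate(dataset, evaluators=[exact_match, judge])
```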
It is the natural fit for open-ended datasets such as alpaca-eval and mt-bench, where there is no single ground-truth answer.
Implementation notes
The judge prompt follows the MT-Bench / AlpacaEval format, presenting the question, reference answer, and the model's response, then asking for a single integer rating from 1 to 5:

- 1 — Completely wrong or irrelevant
- 2 — Partially addresses the question but has major errors
- 3 — Addresses the question but misses key details
- 4 — Good response with minor issues
- 5 — Excellent, fully addresses the question
The request is sent with temperature=0 and max_tokens=16 to ensure a deterministic single-integer output. The response is parsed with a regex that extracts the first digit in the range 1–5; if no digit is found, a rating of 1 is assumed.
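The parsing step can be sketched as follows (the helper name is illustrative, and the exact pattern in the library may differ):

```python
import re

def parse_rating(judge_text: str) -> int:
    """Extract the first digit in the range 1-5 from the judge's reply.

    Falls back to a rating of 1 when no such digit is found.
    """
    match = re.search(r"[1-5]", judge_text)
    return int(match.group()) if match else 1

# "Rating: 4" -> 4; "I'd say 3 out of 5." -> 3; "no digits here" -> 1
```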
The HTTP call uses urllib.request from the Python standard library — no httpx or requests dependency is required.
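Putting the pieces together, the request construction can be sketched with the standard library alone. The payload fields follow the OpenAI chat-completions convention described above; the helper name and prompt text are illustrative:

```python
import json
import os
import urllib.request

def build_judge_request(base_url, model, prompt, api_key=None):
    """Build the POST to {base_url}/v1/chat/completions for the judge call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,   # deterministic single-integer output
        "max_tokens": 16,
    }
    headers = {
        "Content-Type": "application/json",
        # Bearer token, falling back to OPENAI_API_KEY if not provided.
        "Authorization": f"Bearer {api_key or os.environ.get('OPENAI_API_KEY', '')}",
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

# The caller passes this to urllib.request.urlopen(req, timeout=...) and
# returns judge_score: 0.0 on any exception instead of raising, so the
# benchmark keeps running.
```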