Overview
JudgeRubric extends Rubric to provide LLM-as-judge scoring. It uses a language model to evaluate whether responses are correct by comparing them against ground truth answers.
Constructor
JudgeRubric(
    parser: Parser | None = None,
    parallelize_scoring: bool = False,
    judge_client: AsyncOpenAI | None = None,
    judge_model: str = "gpt-4.1-nano",
    judge_sampling_args: dict[str, Any] | None = None,
    judge_prompt: str = DEFAULT_JUDGE_PROMPT,
)
parser
Parser | None
default:"None"
Parser for extracting answers from completions. Defaults to vf.Parser().
parallelize_scoring
bool
default:"False"
Whether to parallelize judge API calls across multiple rollouts.
judge_client
AsyncOpenAI | None
default:"None"
OpenAI client for judge model calls. Defaults to AsyncOpenAI() with environment API key.
judge_model
str
default:"gpt-4.1-nano"
Model identifier for the judge. Can be any OpenAI-compatible model.
judge_sampling_args
dict[str, Any] | None
default:"None"
Additional sampling parameters for judge completions (e.g., temperature, max_tokens).
judge_prompt
str
default:"DEFAULT_JUDGE_PROMPT"
Template for judge prompts. Must include {question}, {answer}, and {response} placeholders.
Default Judge Prompt
The default prompt template is:
DEFAULT_JUDGE_PROMPT = """Given a ground truth answer \
and a response, determine if the response is correct.

Question:
{question}

Ground truth answer:
{answer}

Response:
{response}

Respond either "yes" or "no" only."""
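The placeholders are substituted with Python's str.format() before the judge is called. A minimal sketch (the template text here is illustrative, not the library's default):

```python
# Hypothetical minimal template containing all three required placeholders.
template = "Question:\n{question}\n\nExpected:\n{answer}\n\nGot:\n{response}\n\nAnswer yes or no."

# Each placeholder is replaced with the rollout's values at judge time.
filled = template.format(
    question="What is the capital of France?",
    answer="Paris",
    response="The capital is Paris.",
)
```

A template missing any of the three placeholders would raise a KeyError (if it references an unknown name) or silently drop context, so all three must be present.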
Methods
judge
async def judge(
    self,
    prompt: Messages,
    completion: Messages,
    answer: str,
    state: State | None = None,
) -> str
Call the judge model to evaluate a response. Caches results in state["judge_response"] if state is provided.
prompt
Messages
The input prompt (either a string or a list of message dicts).
completion
Messages
The model's completion to evaluate.
answer
str
Ground truth answer for comparison.
state
State | None
default:"None"
Optional state dict for caching judge responses.
Returns: str - The judge model’s response (typically “yes” or “no”).
Judge responses are cached by prompt to avoid redundant API calls for the same evaluation.
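The caching behavior can be sketched as follows. Here fake_model_call is a hypothetical stand-in for the real judge API call (the library's actual implementation differs); the point is that a populated state short-circuits the second call:

```python
import asyncio

# Hypothetical stand-in for the judge model API call, with a call counter.
async def fake_model_call(question, answer, response):
    fake_model_call.calls += 1
    return "yes" if answer.lower() in response.lower() else "no"
fake_model_call.calls = 0

async def judge(prompt, completion, answer, state=None):
    # Cache hit: this rollout was already judged, skip the model call.
    if state is not None and "judge_response" in state:
        return state["judge_response"]
    verdict = await fake_model_call(prompt, answer, completion)
    if state is not None:
        state["judge_response"] = verdict  # cache for later reward functions
    return verdict

state = {}
first = asyncio.run(judge("Capital of France?", "Paris", "Paris", state))
second = asyncio.run(judge("Capital of France?", "Paris", "Paris", state))
# Only one model call was made; the second read from the cache.
```

This matters when several reward functions all call judge() on the same rollout: only the first pays for an API call.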
Inherited Methods
All methods from Rubric are available:
add_reward_func(func, weight=1.0)
add_metric(func, weight=0.0)
score_rollout(state)
score_group(states)
See the Rubric documentation for details.
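The weighting scheme works as in the base Rubric: each reward function's score is multiplied by its weight and summed, while weight-0.0 metrics are recorded but do not affect the total. A hand-computed sketch (the function names are made up):

```python
# Assumed combination rule: reward = sum(weight_i * score_i) over reward funcs.
scores = {"judge_correctness": 1.0, "length_penalty": 0.8}
weights = {"judge_correctness": 1.0, "length_penalty": 0.5}

reward = sum(weights[name] * scores[name] for name in scores)
# 1.0 * 1.0 + 0.5 * 0.8 = 1.4
```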
Class Objects
The following objects are automatically available to reward functions:
judge
The judge() method, callable as judge(prompt, completion, answer, state=None).
judge_client
The OpenAI client instance.
judge_model
The judge model identifier.
judge_prompt
The judge prompt template.
judge_sampling_args
Sampling arguments for judge calls.
Example Usage
Basic Judge Scoring
import verifiers as vf
from openai import AsyncOpenAI
# Create judge rubric with custom model
rubric = vf.JudgeRubric(
    judge_client=AsyncOpenAI(api_key="sk-..."),
    judge_model="gpt-4o-mini",
    judge_sampling_args={"temperature": 0.0},
)
# Add custom reward function using the judge
async def judge_correctness(prompt, completion, answer, judge, state, **kwargs):
    """Use judge to determine correctness."""
    response = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in response.lower() else 0.0
rubric.add_reward_func(judge_correctness)
# Score a state
state = {
    "prompt": "What is the capital of France?",
    "completion": [{"role": "assistant", "content": "Paris"}],
    "answer": "Paris",
    "task": "qa",
    "timing": {"scoring_ms": 0, "total_ms": 0},
}
await rubric.score_rollout(state)
print(f"Reward: {state['reward']}") # 1.0 if judge says "yes"
Custom Judge Prompt
custom_prompt = """Evaluate if the response correctly answers the question.
Question: {question}
Expected: {answer}
Got: {response}
Reply with CORRECT or INCORRECT."""
rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_prompt=custom_prompt,
    judge_sampling_args={
        "temperature": 0.0,
        "max_tokens": 10,
    },
)
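A custom prompt like this needs a matching reward function, since the default "yes"/"no" check no longer applies. A sketch (stub_judge is a stand-in for the real judge call, used only to exercise the function):

```python
import asyncio

async def correct_incorrect_reward(judge, prompt, completion, answer, state, **kwargs):
    """Map a CORRECT/INCORRECT verdict to a 1.0/0.0 reward."""
    verdict = await judge(prompt, completion, answer, state)
    v = verdict.strip().upper()
    # "INCORRECT" contains "CORRECT" as a substring, so check the
    # negative label first.
    if v.startswith("INCORRECT"):
        return 0.0
    return 1.0 if "CORRECT" in v else 0.0

# Quick check with a stubbed judge standing in for the real API call:
async def stub_judge(prompt, completion, answer, state=None):
    return "CORRECT"

score = asyncio.run(correct_incorrect_reward(stub_judge, "q", "c", "a", None))
```

The substring pitfall is worth noting: a naive `"CORRECT" in verdict` check would score "INCORRECT" as 1.0.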
Using Judge in Reward Functions
rubric = vf.JudgeRubric(judge_model="gpt-4o-mini")
# Access judge as a class object
async def strict_correctness(judge, prompt, completion, answer, state, **kwargs):
    """Strict yes/no scoring."""
    result = await judge(prompt, completion, answer, state)
    return 1.0 if result.strip().lower() == "yes" else 0.0

async def partial_credit(judge, prompt, completion, answer, state, **kwargs):
    """Partial credit based on judge confidence."""
    result = await judge(prompt, completion, answer, state)
    if "yes" in result.lower():
        return 1.0
    elif "partially" in result.lower():
        return 0.5
    return 0.0
rubric.add_reward_func(strict_correctness)
rubric.add_metric(partial_credit, weight=0.0) # Track but don't use for reward
Error Handling
from openai import RateLimitError, APITimeoutError
rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_sampling_args={
        "timeout": 30.0,  # 30-second timeout
    },
)

try:
    await rubric.score_rollout(state)
except RuntimeError as e:
    if "rate limit" in str(e).lower():
        print("Reduce concurrency or wait before retrying")
    elif "timeout" in str(e).lower():
        print("Increase timeout in judge_sampling_args")
    raise
Notes
Judge API calls can be slow and expensive. Consider:
- Using cheaper/faster models like gpt-4.1-nano for high-throughput evaluations
- Caching judge responses by passing the state parameter
- Setting appropriate timeouts in judge_sampling_args
The max_tokens parameter is automatically converted to max_completion_tokens for compatibility with OpenAI’s chat API.
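That conversion can be sketched as follows (assumed behavior, not the library's exact code):

```python
# Rename max_tokens to max_completion_tokens, leaving other args untouched.
def normalize_sampling_args(args):
    args = dict(args)  # copy so the caller's dict is not mutated
    if "max_tokens" in args:
        args["max_completion_tokens"] = args.pop("max_tokens")
    return args

normalized = normalize_sampling_args({"temperature": 0.0, "max_tokens": 10})
```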
See Also