Overview

JudgeRubric extends Rubric to provide LLM-as-judge scoring. It uses a language model to evaluate whether responses are correct by comparing them against ground truth answers.

Constructor

JudgeRubric(
    parser: Parser | None = None,
    parallelize_scoring: bool = False,
    judge_client: AsyncOpenAI | None = None,
    judge_model: str = "gpt-4.1-nano",
    judge_sampling_args: dict[str, Any] | None = None,
    judge_prompt: str = DEFAULT_JUDGE_PROMPT,
)
parser
Parser | None
default:"None"
Parser for extracting answers from completions. Defaults to vf.Parser().
parallelize_scoring
bool
default:"False"
Whether to parallelize judge API calls across multiple rollouts.
judge_client
AsyncOpenAI | None
default:"None"
OpenAI client for judge model calls. Defaults to AsyncOpenAI() with environment API key.
judge_model
str
default:"gpt-4.1-nano"
Model identifier for the judge. Can be any OpenAI-compatible model.
judge_sampling_args
dict[str, Any] | None
default:"None"
Additional sampling parameters for judge completions (e.g., temperature, max_tokens).
judge_prompt
str
default:"DEFAULT_JUDGE_PROMPT"
Template for judge prompts. Must include {question}, {answer}, and {response} placeholders.

Default Judge Prompt

The default prompt template is:
DEFAULT_JUDGE_PROMPT = """Given a ground truth answer \
and a response, determine if the response is correct.

Question:
{question}

Ground truth answer:
{answer}

Response:
{response}

Respond either "yes" or "no" only."""
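Before calling the judge model, the placeholders are filled with the question, the ground truth answer, and the parsed response. A minimal sketch of that formatting step (the template is reproduced inline here so the snippet is self-contained):

```python
DEFAULT_JUDGE_PROMPT = """Given a ground truth answer \
and a response, determine if the response is correct.

Question:
{question}

Ground truth answer:
{answer}

Response:
{response}

Respond either "yes" or "no" only."""

# Fill the placeholders as JudgeRubric does before querying the judge model
judge_input = DEFAULT_JUDGE_PROMPT.format(
    question="What is the capital of France?",
    answer="Paris",
    response="Paris",
)
```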

Methods

judge

async def judge(
    self,
    prompt: Messages,
    completion: Messages,
    answer: str,
    state: State | None = None,
) -> str
Call the judge model to evaluate a response. Caches results in state["judge_response"] if state is provided.
prompt
Messages
The input prompt (either string or list of message dicts).
completion
Messages
The model’s completion to evaluate.
answer
str
Ground truth answer for comparison.
state
State | None
default:"None"
Optional state dict for caching judge responses.
Returns: str - The judge model’s response (typically “yes” or “no”).
When a state is provided, the judge response is cached in state["judge_response"], so multiple reward functions evaluating the same rollout reuse one verdict instead of issuing redundant API calls.
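The caching pattern can be sketched as follows. This is an illustration of the behavior described above, not the library's implementation; judge_call is a hypothetical stand-in for the actual API request:

```python
import asyncio


async def judge_with_cache(judge_call, prompt, completion, answer, state=None):
    """Sketch of the state-based caching pattern: one judge call per rollout."""
    if state is not None and "judge_response" in state:
        return state["judge_response"]  # cache hit: skip the API call
    response = await judge_call(prompt, completion, answer)
    if state is not None:
        state["judge_response"] = response  # cache for later reward functions
    return response
```

With this pattern, several reward functions scoring the same rollout share a single judge verdict.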

Inherited Methods

All methods from Rubric are available:
  • add_reward_func(func, weight=1.0)
  • add_metric(func, weight=0.0)
  • score_rollout(state)
  • score_group(states)
See the Rubric documentation for details.

Class Objects

The following objects are automatically available to reward functions:
judge
callable
The judge() method, callable as judge(prompt, completion, answer, state=None).
judge_client
AsyncOpenAI
The OpenAI client instance.
judge_model
str
The judge model identifier.
judge_prompt
str
The judge prompt template.
judge_sampling_args
dict
Sampling arguments for judge calls.
parser
Parser
The parser instance.

Example Usage

Basic Judge Scoring

import verifiers as vf
from openai import AsyncOpenAI

# Create judge rubric with custom model
rubric = vf.JudgeRubric(
    judge_client=AsyncOpenAI(api_key="sk-..."),
    judge_model="gpt-4o-mini",
    judge_sampling_args={"temperature": 0.0}
)

# Add custom reward function using the judge
async def judge_correctness(prompt, completion, answer, judge, state, **kwargs):
    """Use judge to determine correctness."""
    response = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in response.lower() else 0.0

rubric.add_reward_func(judge_correctness)

# Score a state
state = {
    "prompt": "What is the capital of France?",
    "completion": [{"role": "assistant", "content": "Paris"}],
    "answer": "Paris",
    "task": "qa",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 if judge says "yes"

Custom Judge Prompt

custom_prompt = """Evaluate if the response correctly answers the question.

Question: {question}
Expected: {answer}
Got: {response}

Reply with CORRECT or INCORRECT."""

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_prompt=custom_prompt,
    judge_sampling_args={
        "temperature": 0.0,
        "max_tokens": 10
    }
)

Using Judge in Reward Functions

rubric = vf.JudgeRubric(judge_model="gpt-4o-mini")

# Access judge as a class object
async def strict_correctness(judge, prompt, completion, answer, state, **kwargs):
    """Strict yes/no scoring."""
    result = await judge(prompt, completion, answer, state)
    return 1.0 if result.strip().lower() == "yes" else 0.0

async def partial_credit(judge, prompt, completion, answer, state, **kwargs):
    """Partial credit based on judge confidence."""
    result = await judge(prompt, completion, answer, state)
    if "yes" in result.lower():
        return 1.0
    elif "partially" in result.lower():
        return 0.5
    return 0.0

rubric.add_reward_func(strict_correctness)
rubric.add_metric(partial_credit, weight=0.0)  # Track but don't use for reward

Error Handling

from openai import APITimeoutError, RateLimitError

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_sampling_args={
        "timeout": 30.0,  # 30 second per-request timeout
    }
)

try:
    await rubric.score_rollout(state)
except RateLimitError:
    print("Reduce concurrency or wait before retrying")
    raise
except APITimeoutError:
    print("Increase timeout in judge_sampling_args")
    raise

Notes

Judge API calls can be slow and expensive. Consider:
  • Using cheaper/faster models like gpt-4.1-nano for high-throughput evaluations
  • Caching judge responses by passing state parameter
  • Setting appropriate timeouts in judge_sampling_args
The max_tokens parameter is automatically converted to max_completion_tokens for compatibility with OpenAI’s chat API.
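A sketch of what that conversion amounts to. The function name here is illustrative only, not part of the library's API:

```python
def normalize_sampling_args(args: dict) -> dict:
    """Rename max_tokens to max_completion_tokens for the chat completions API."""
    args = dict(args)  # copy so the caller's dict is not mutated
    if "max_tokens" in args:
        args["max_completion_tokens"] = args.pop("max_tokens")
    return args
```

Callers can keep writing the familiar max_tokens key and still get arguments the chat API accepts.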
