Overview

JudgeRubric extends Rubric to provide LLM-as-judge scoring. It uses a language model to evaluate whether responses are correct by comparing them against ground truth answers.

Constructor

JudgeRubric(
    parser: Parser | None = None,
    parallelize_scoring: bool = False,
    judge_client: AsyncOpenAI | None = None,
    judge_model: str = "gpt-4.1-nano",
    judge_sampling_args: dict[str, Any] | None = None,
    judge_prompt: str = DEFAULT_JUDGE_PROMPT,
)
parser
Parser | None
default:"None"
Parser for extracting answers from completions. Defaults to vf.Parser().
parallelize_scoring
bool
default:"False"
Whether to parallelize judge API calls across multiple rollouts.
judge_client
AsyncOpenAI | None
default:"None"
OpenAI client for judge model calls. Defaults to AsyncOpenAI() with environment API key.
judge_model
str
default:"gpt-4.1-nano"
Model identifier for the judge. Can be any OpenAI-compatible model.
judge_sampling_args
dict[str, Any] | None
default:"None"
Additional sampling parameters for judge completions (e.g., temperature, max_tokens).
judge_prompt
str
default:"DEFAULT_JUDGE_PROMPT"
Template for judge prompts. Must include {question}, {answer}, and {response} placeholders.

Default Judge Prompt

The default prompt template is:
DEFAULT_JUDGE_PROMPT = """Given a ground truth answer \
and a response, determine if the response is correct.

Question:
{question}

Ground truth answer:
{answer}

Response:
{response}

Respond either "yes" or "no" only."""
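Before calling the judge model, the placeholders are filled with the question, the ground truth answer, and the parsed response. A minimal sketch of that formatting step (the template is reproduced inline here so the snippet is self-contained):

```python
DEFAULT_JUDGE_PROMPT = """Given a ground truth answer \
and a response, determine if the response is correct.

Question:
{question}

Ground truth answer:
{answer}

Response:
{response}

Respond either "yes" or "no" only."""

# Fill the placeholders as JudgeRubric does before querying the judge model
judge_input = DEFAULT_JUDGE_PROMPT.format(
    question="What is the capital of France?",
    answer="Paris",
    response="Paris",
)
```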

Methods

judge

async def judge(
    self,
    prompt: Messages,
    completion: Messages,
    answer: str,
    state: State | None = None,
) -> str
Call the judge model to evaluate a response. Caches results in state["judge_response"] if state is provided.
prompt
Messages
The input prompt (either string or list of message dicts).
completion
Messages
The model’s completion to evaluate.
answer
str
Ground truth answer for comparison.
state
State | None
default:"None"
Optional state dict for caching judge responses.
Returns: str - The judge model’s response (typically “yes” or “no”).
When a state is provided, the judge response is cached in state["judge_response"], so multiple reward functions evaluating the same rollout reuse one verdict instead of issuing redundant API calls.
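The caching pattern can be sketched as follows. This is an illustration of the behavior described above, not the library's implementation; judge_call is a hypothetical stand-in for the actual API request:

```python
import asyncio


async def judge_with_cache(judge_call, prompt, completion, answer, state=None):
    """Sketch of the state-based caching pattern: one judge call per rollout."""
    if state is not None and "judge_response" in state:
        return state["judge_response"]  # cache hit: skip the API call
    response = await judge_call(prompt, completion, answer)
    if state is not None:
        state["judge_response"] = response  # cache for later reward functions
    return response
```

With this pattern, several reward functions scoring the same rollout share a single judge verdict.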

Inherited Methods

All methods from Rubric are available:
  • add_reward_func(func, weight=1.0)
  • add_metric(func, weight=0.0)
  • score_rollout(state)
  • score_group(states)
See the Rubric documentation for details.

Class Objects

The following objects are automatically available to reward functions:
judge
callable
The judge() method, callable as judge(prompt, completion, answer, state=None).
judge_client
AsyncOpenAI
The OpenAI client instance.
judge_model
str
The judge model identifier.
judge_prompt
str
The judge prompt template.
judge_sampling_args
dict
Sampling arguments for judge calls.
parser
Parser
The parser instance.

Example Usage

Basic Judge Scoring

import verifiers as vf
from openai import AsyncOpenAI

# Create judge rubric with custom model
rubric = vf.JudgeRubric(
    judge_client=AsyncOpenAI(api_key="sk-..."),
    judge_model="gpt-4o-mini",
    judge_sampling_args={"temperature": 0.0}
)

# Add custom reward function using the judge
async def judge_correctness(prompt, completion, answer, judge, state, **kwargs):
    """Use judge to determine correctness."""
    response = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in response.lower() else 0.0

rubric.add_reward_func(judge_correctness)

# Score a state
state = {
    "prompt": "What is the capital of France?",
    "completion": [{"role": "assistant", "content": "Paris"}],
    "answer": "Paris",
    "task": "qa",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 if judge says "yes"

Custom Judge Prompt

custom_prompt = """Evaluate if the response correctly answers the question.

Question: {question}
Expected: {answer}
Got: {response}

Reply with CORRECT or INCORRECT."""

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_prompt=custom_prompt,
    judge_sampling_args={
        "temperature": 0.0,
        "max_tokens": 10
    }
)

Using Judge in Reward Functions

rubric = vf.JudgeRubric(judge_model="gpt-4o-mini")

# Access judge as a class object
async def strict_correctness(judge, prompt, completion, answer, state, **kwargs):
    """Strict yes/no scoring."""
    result = await judge(prompt, completion, answer, state)
    return 1.0 if result.strip().lower() == "yes" else 0.0

async def partial_credit(judge, prompt, completion, answer, state, **kwargs):
    """Partial credit based on judge confidence."""
    result = await judge(prompt, completion, answer, state)
    if "yes" in result.lower():
        return 1.0
    elif "partially" in result.lower():
        return 0.5
    return 0.0

rubric.add_reward_func(strict_correctness)
rubric.add_metric(partial_credit, weight=0.0)  # Track but don't use for reward

Error Handling

from openai import APITimeoutError, RateLimitError

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_sampling_args={
        "timeout": 30.0,  # 30 second per-request timeout
    }
)

try:
    await rubric.score_rollout(state)
except RateLimitError:
    print("Reduce concurrency or wait before retrying")
    raise
except APITimeoutError:
    print("Increase timeout in judge_sampling_args")
    raise

Notes

Judge API calls can be slow and expensive. Consider:
  • Using cheaper/faster models like gpt-4.1-nano for high-throughput evaluations
  • Caching judge responses by passing state parameter
  • Setting appropriate timeouts in judge_sampling_args
The max_tokens parameter is automatically converted to max_completion_tokens for compatibility with OpenAI’s chat API.
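A sketch of what that conversion amounts to. The function name here is illustrative only, not part of the library's API:

```python
def normalize_sampling_args(args: dict) -> dict:
    """Rename max_tokens to max_completion_tokens for the chat completions API."""
    args = dict(args)  # copy so the caller's dict is not mutated
    if "max_tokens" in args:
        args["max_completion_tokens"] = args.pop("max_tokens")
    return args
```

Callers can keep writing the familiar max_tokens key and still get arguments the chat API accepts.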
