Available Metrics
All pre-built metrics except Exact Match use LLM-as-a-judge with carefully crafted prompts; these LLM-based metrics require an LLM instance and use tool calling for structured outputs.
- Faithfulness: Detect hallucinations in grounded responses
- Correctness: Evaluate factual accuracy
- Conciseness: Check if responses are appropriately brief
- Document Relevance: Assess retrieval quality in RAG
- Tool Selection: Validate agent tool choices
- Tool Invocation: Check tool call correctness
- Refusal: Detect when models decline to answer
- Exact Match: Simple string equality check
Faithfulness
Detects hallucinations by checking whether a response is faithful to the provided context.
Use Cases
- RAG applications where responses must be grounded in retrieved documents
- Question-answering systems with knowledge bases
- Summarization tasks that must preserve source material accuracy
Usage
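The library's actual usage code is not reproduced here. As an illustration of the LLM-as-a-judge pattern these metrics follow, a faithfulness check might look like the sketch below; the function name, prompt, and stub judge are all hypothetical stand-ins for a real LLM call.

```python
import json

def judge_faithfulness(llm, input: str, output: str, context: str) -> dict:
    """Illustrative LLM-as-a-judge faithfulness check.

    `llm` is any callable taking a prompt string and returning the judge
    model's reply as a JSON string with `label` and `reasoning` fields.
    """
    prompt = (
        "You are a strict evaluator. Given a question, a response, and the "
        "source context, decide whether every claim in the response is "
        "supported by the context.\n"
        f"Question: {input}\nResponse: {output}\nContext: {context}\n"
        'Reply as JSON: {"label": "faithful" | "unfaithful", "reasoning": "..."}'
    )
    verdict = json.loads(llm(prompt))
    return {
        "label": verdict["label"],
        "score": 1.0 if verdict["label"] == "faithful" else 0.0,
        "reasoning": verdict["reasoning"],
    }

# A stub standing in for a real LLM call, so the sketch is runnable:
stub = lambda prompt: (
    '{"label": "faithful", "reasoning": "All claims appear in the context."}'
)
result = judge_faithfulness(
    stub,
    input="Who wrote Hamlet?",
    output="Shakespeare wrote Hamlet.",
    context="Hamlet is a tragedy written by William Shakespeare.",
)
```

The returned dict mirrors the output fields documented below: a label, a binary score, and the judge's reasoning.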
Input Schema
- input: The input query or question
- output: The model’s response to evaluate
- context: The reference context or source material
Output
- Label: either "faithful" or "unfaithful"
- Score: 1.0 if faithful, 0.0 if unfaithful
- Reasoning: the LLM’s reasoning for the classification
Correctness
Evaluates whether a response is factually accurate and complete.
Use Cases
- Validating answers to factual questions
- Checking knowledge retention in educational apps
- Verifying accuracy of generated content
Usage
Input Schema
- input: The input query or question
- output: The model’s response to evaluate
Output
- Label: either "correct" or "incorrect"
- Score: 1.0 if correct, 0.0 if incorrect
Conciseness
Checks whether a response is appropriately brief without unnecessary verbosity.
Use Cases
- Ensuring chatbots provide succinct answers
- Optimizing token usage in production systems
- Evaluating summary quality
Usage
Input Schema
- input: The input query or question
- output: The model’s response to evaluate
Output
- Label: either "concise" or "verbose"
- Score: 1.0 if concise, 0.0 if verbose
Document Relevance
Determines whether a retrieved document contains information relevant to answering a question.
Use Cases
- Evaluating retriever quality in RAG systems
- Measuring search result relevance
- Optimizing document ranking algorithms
Usage
Input Schema
- input: The query or question
- document_text: The retrieved document to evaluate
Output
- Label: either "relevant" or "unrelated"
- Score: 1.0 if relevant, 0.0 if unrelated
Tool Selection
Evaluates whether an AI agent selected the correct tool(s) for a given task.
Use Cases
- Validating agent decision-making
- Optimizing tool descriptions and schemas
- Measuring agent routing accuracy
Usage
Input Schema
- input: The input query or conversation context
- available_tools: Description of available tools (plain text or JSON)
- tool_selection: The tool(s) selected by the agent
Output
- Label: either "correct" or "incorrect"
- Score: 1.0 if the correct tool was selected, 0.0 if not
Tool Invocation
Validates that a tool was invoked correctly, with proper arguments and formatting.
Use Cases
- Ensuring agents use tools properly
- Detecting hallucinated parameters
- Validating argument safety (no PII leakage)
Evaluation Criteria
A correct invocation requires:
- Properly structured JSON (if applicable)
- All required parameters present
- No hallucinated or nonexistent fields
- Argument values that match the query and schema
- No unsafe content (PII, sensitive data)
An invocation is marked incorrect for:
- Missing required parameters
- Hallucinated fields not in the schema
- Malformed JSON
- Wrong argument values
- Unsafe content in arguments
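The structural portion of these criteria can be approximated in plain code. The sketch below is illustrative only: the schema format and function name are invented, and the value-correctness and safety checks, which require the LLM judge, are omitted.

```python
import json

def check_tool_invocation(invocation_json: str, schema: dict) -> tuple[float, str]:
    """Structural checks on a tool call: well-formed JSON, required
    parameters present, no fields outside the schema. (Hypothetical
    helper; the real metric delegates these judgments to an LLM.)"""
    try:
        args = json.loads(invocation_json)        # malformed JSON -> incorrect
    except json.JSONDecodeError:
        return 0.0, "malformed JSON"
    if not isinstance(args, dict):
        return 0.0, "not a JSON object"
    missing = [p for p in schema["required"] if p not in args]
    if missing:                                   # missing required parameters
        return 0.0, f"missing required parameters: {missing}"
    extra = [k for k in args if k not in schema["properties"]]
    if extra:                                     # hallucinated fields
        return 0.0, f"fields not in schema: {extra}"
    return 1.0, "structurally correct"

# Invented example schema for a weather tool:
weather_schema = {"properties": {"city": str, "unit": str}, "required": ["city"]}
```

For example, `check_tool_invocation('{"city": "Paris"}', weather_schema)` passes, while an argument object containing a field the schema does not define is rejected as hallucinated.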
Usage
Input Schema
- input: The conversation context or user query
- available_tools: Tool schemas (JSON Schema or human-readable format)
- tool_selection: The tool invocation(s) made by the agent
Output
- Label: either "correct" or "incorrect"
- Score: 1.0 if the invocation is correct, 0.0 if not
Refusal
Detects when an LLM refuses or declines to answer a query.
Use Cases
- Monitoring over-refusal in production systems
- Detecting safety filter triggers
- Measuring assistant compliance rates
This metric is use-case agnostic: it only detects whether a refusal occurred, not whether the refusal was appropriate.
Usage
Detected Refusal Types
- Direct refusals: “I cannot help with that”
- Deflections: “Let me help you with something else”
- Scope disclaimers: “That’s outside my capabilities”
- Non-answers: Response doesn’t address the question
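The first three categories can be illustrated with a toy keyword check. This is only a heuristic sketch; the actual metric uses an LLM judge, which also catches paraphrased refusals and non-answers that fixed patterns would miss.

```python
import re

# Toy patterns for the refusal categories above (illustrative only):
REFUSAL_PATTERNS = {
    "direct": re.compile(r"\bI (cannot|can't|won't) help", re.I),
    "deflection": re.compile(r"\blet me help you with something else\b", re.I),
    "scope": re.compile(r"\boutside (of )?my capabilities\b", re.I),
}

def looks_like_refusal(response: str) -> bool:
    """True if the response matches any known refusal pattern."""
    return any(p.search(response) for p in REFUSAL_PATTERNS.values())
```

This shows why an LLM judge is used instead: `looks_like_refusal("I'd rather not discuss that.")` returns False even though the response is a refusal.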
Input Schema
- input: The user’s query or question
- output: The model’s response to evaluate
Output
- Label: either "refused" or "answered"
- Score: 1.0 if refused, 0.0 if answered
- Direction: "neutral" (a refusal can be good or bad depending on context)
Exact Match
Simple code-based evaluator that checks whether two strings are exactly equal.
Use Cases
- Validating structured outputs (IDs, codes, formats)
- Testing deterministic responses
- Baseline evaluation metric
Usage
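Since this evaluator is pure code, its behavior is easy to sketch; the function name and return shape below are illustrative, not the library's actual API.

```python
def exact_match(output: str, expected: str) -> dict:
    """Code-based exact string equality: no LLM involved.
    Comparison is case- and whitespace-sensitive."""
    score = 1.0 if output == expected else 0.0
    return {"score": score, "kind": "code"}
```

Note the strictness: `exact_match("abc-123", "ABC-123")` scores 0.0, which is why this metric suits structured outputs like IDs rather than free-form text.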
Input Schema
- output: The output to check
- expected: The expected output
Output
- Score: 1.0 if exact match, 0.0 otherwise
- Kind: "code" (this is a code-based evaluator, no LLM required)
Metrics Comparison Table
| Metric | Kind | Inputs | Use Case |
|---|---|---|---|
| Faithfulness | LLM | input, output, context | Detect hallucinations in RAG |
| Correctness | LLM | input, output | Validate factual accuracy |
| Conciseness | LLM | input, output | Check response brevity |
| Document Relevance | LLM | input, document_text | Evaluate retrieval quality |
| Tool Selection | LLM | input, available_tools, tool_selection | Validate agent decisions |
| Tool Invocation | LLM | input, available_tools, tool_selection | Check tool call correctness |
| Refusal | LLM | input, output | Detect model refusals |
| Exact Match | Code | output, expected | String equality |
Using Multiple Metrics
Combine multiple metrics for comprehensive evaluation.
Customizing Pre-built Metrics
You can customize pre-built metrics by accessing their prompts.
Next Steps
Custom Evaluators
Build your own evaluation logic
Batch Evaluation
Evaluate datasets at scale