LLM-Based Scorers
Scorers that use a judge model to evaluate responses.

Faithfulness Scorer
Evaluates whether the response is supported by the provided context.

Answer Relevancy Scorer
Measures how relevant the answer is to the question.

Context Relevance Scorer
Evaluates whether the provided context is relevant to the question.

Context Precision Scorer
Checks whether relevant context appears before irrelevant context.

Hallucination Scorer
Detects whether the response contains hallucinated information.

Bias Scorer
Identifies biased language in responses.

Toxicity Scorer
Detects toxic, harmful, or inappropriate content.

Prompt Alignment Scorer
Checks whether the response follows the given instructions.

Tool Call Accuracy Scorer (LLM)
Evaluates whether the agent used the correct tools.

Code-Based Scorers
Deterministic scorers that don't require a judge model.

Content Similarity Scorer
Measures text similarity using the Jaccard index.

Textual Difference Scorer
Calculates the Levenshtein distance between strings.

Keyword Coverage Scorer
Checks whether specific keywords are present.

Completeness Scorer
Ensures all required elements are included.

Tone Scorer
Analyzes the sentiment and tone of the response.

Tool Call Accuracy Scorer (Code)
Validates tool usage deterministically.

Scoring Agent Runs
All scorers can evaluate runs of Mastra agents.

Combining Scorers
Use multiple scorers together to cover different evaluation dimensions of the same response.

Scorer Configuration
Model Selection
Choose a judge model appropriate to the scorer: nuanced checks such as hallucination or bias detection benefit from stronger models, while simpler checks can use smaller, cheaper ones.

Scale Options
Adjust score ranges to match how you want to report results.

Best Practices
Choose the Right Scorer
- Use LLM scorers for nuanced evaluation (relevancy, hallucination)
- Use code scorers for deterministic checks (keywords, format)
- Combine both for comprehensive evaluation
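As a sketch of combining both kinds of scorer, here is a framework-agnostic example: a deterministic Jaccard-based similarity scorer (as the Content Similarity Scorer uses) alongside a judge-based relevancy scorer, with their scores averaged. The `Scorer` interface and names here are assumptions for illustration, not Mastra's actual API, and the judge call is stubbed so the sketch stays self-contained:

```typescript
// Illustrative sketch only; the interface and names are NOT Mastra's API.
interface ScoreInput {
  question: string;
  response: string;
  expected: string;
}

type Scorer = (input: ScoreInput) => Promise<number>; // returns a score in 0..1

// Code-based scorer: Jaccard index over word sets.
const contentSimilarity: Scorer = async ({ response, expected }) => {
  const a = new Set(response.toLowerCase().split(/\s+/));
  const b = new Set(expected.toLowerCase().split(/\s+/));
  const intersection = [...a].filter((w) => b.has(w)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : intersection / union;
};

// LLM-based scorer: a real implementation would prompt a judge model with the
// question and response; a stub stands in here to keep the sketch runnable.
const answerRelevancy: Scorer = async ({ response }) => {
  return response.length > 0 ? 1 : 0;
};

// Combine scorers by averaging their scores.
async function combinedScore(
  input: ScoreInput,
  scorers: Scorer[],
): Promise<number> {
  const scores = await Promise.all(scorers.map((s) => s(input)));
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```

Averaging is the simplest combination; a weighted sum or a "minimum score" gate are equally valid policies depending on how strict the evaluation should be.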
Optimize Costs
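One common way to optimize costs is to run expensive judge-based scorers on only a sample of runs, while cheap deterministic scorers run on everything. A minimal sketch of a sampling wrapper, with hypothetical names (not a Mastra API):

```typescript
// Hypothetical sampling wrapper: runs the wrapped scorer on roughly `rate`
// of inputs and skips the rest (returning null). Illustrative only.
type RunScorer<T> = (input: T) => Promise<number>;

function sampled<T>(
  scorer: RunScorer<T>,
  rate: number, // fraction of runs to score, in 0..1
  random: () => number = Math.random, // injectable for deterministic tests
): (input: T) => Promise<number | null> {
  return async (input) => (random() < rate ? scorer(input) : null);
}
```

Skipped runs report `null` rather than a score, so sampled results are not silently mixed into averages.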
Batch Processing
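Scoring runs in fixed-size batches bounds concurrency, which helps stay within judge-model rate limits. A hypothetical helper (not a Mastra API) that scores one batch at a time, with calls inside each batch running concurrently:

```typescript
// Hypothetical batch helper: at most `batchSize` scorer calls are in flight
// at once. Illustrative only.
async function scoreInBatches<T>(
  items: T[],
  scorer: (item: T) => Promise<number>,
  batchSize = 10,
): Promise<number[]> {
  const results: number[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Score one batch concurrently, then move on to the next.
    results.push(...(await Promise.all(batch.map(scorer))));
  }
  return results;
}
```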
Next Steps
- Creating Evals: build custom scorers
- Observability: track scores with observability