What are Evals?
Evals (evaluations) help you measure and improve AI agent quality by:
- Scoring Responses - Rate outputs on metrics like accuracy, relevance, and toxicity
- Comparing Outputs - Evaluate different models or prompts
- Catching Issues - Detect hallucinations, bias, and other problems
- Improving Quality - Iterate on prompts and model configurations
Scorer Types
Mastra provides two categories of scorers:
LLM Scorers
Use a judge model to evaluate responses:
- createFaithfulnessScorer - Checks if the response is supported by context
- createAnswerRelevancyScorer - Measures relevance to the question
- createContextRelevanceScorer - Evaluates context quality
- createContextPrecisionScorer - Checks context precision
- createHallucinationScorer - Detects hallucinated information
- createBiasScorer - Identifies biased language
- createToxicityScorer - Detects toxic content
- createPromptAlignmentScorer - Checks instruction following
- createToolCallAccuracyScorer - Evaluates tool usage
Code Scorers
Deterministic heuristics that don't require external models:
- createContentSimilarityScorer - Text similarity using the Jaccard index
- createTextualDifferenceScorer - Levenshtein distance
- createKeywordCoverageScorer - Keyword presence check
- createCompletenessScorer - Required elements coverage
- createToneScorer - Sentiment analysis
- createToolCallAccuracyScorerCode - Deterministic tool accuracy
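The content-similarity scorer is described above as using the Jaccard index. As a rough, self-contained sketch of that metric (not the Mastra implementation, which may tokenize and normalize differently):

```typescript
// Jaccard index over word sets: |A ∩ B| / |A ∪ B|.
// A self-contained sketch of the metric behind a content-similarity scorer;
// the real createContentSimilarityScorer may tokenize differently.
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

console.log(jaccardSimilarity("the cat sat", "the cat sat")); // 1
```

Identical texts score 1 and fully disjoint texts score 0, which is what makes this family of scorers deterministic and cheap to run.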
Quick Start
Install
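Assuming the evals package is published under the `@mastra/evals` name (check the package name against your Mastra version):

```shell
# Add the Mastra evals package to an existing project
npm install @mastra/evals
```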
Basic Example
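Mastra's exact scorer API is not reproduced here. The following self-contained sketch only mimics the general shape of an eval: a scorer object with a `run` method that returns a score in `[0, 1]` plus a reason, using keyword coverage as the metric. All names are illustrative, not the Mastra API.

```typescript
// Illustrative stand-in for a scorer: an object with a run() method that
// returns a score in [0, 1] and a human-readable reason. This mirrors the
// general shape of an eval scorer; it is NOT the actual Mastra API.
interface ScorerResult {
  score: number;
  reason: string;
}

function createKeywordScorer(keywords: string[]) {
  return {
    run({ output }: { output: string }): ScorerResult {
      const text = output.toLowerCase();
      const hits = keywords.filter((k) => text.includes(k.toLowerCase()));
      return {
        score: keywords.length === 0 ? 1 : hits.length / keywords.length,
        reason: `matched ${hits.length}/${keywords.length} keywords`,
      };
    },
  };
}

const scorer = createKeywordScorer(["refund", "policy"]);
const result = scorer.run({
  output: "Our refund policy allows returns within 30 days.",
});
console.log(result.score); // 1
```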
Scoring Agent Runs
Score outputs from Mastra agent executions with any scorer.
Evaluation Workflows
Batch Evaluation
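Running a scorer over a list of test cases and averaging the results can be sketched as follows; the exact-match scorer here is a toy stand-in, not a Mastra scorer.

```typescript
// Batch evaluation sketch: run a (stand-in) scorer over a list of test
// cases and average the scores. Real Mastra scorers would replace the
// trivial exact-match check below.
type TestCase = { output: string; expected: string };

const exactMatchScorer = {
  run: (c: TestCase) => ({ score: c.output.trim() === c.expected.trim() ? 1 : 0 }),
};

function evaluateBatch(cases: TestCase[]): number {
  const scores = cases.map((c) => exactMatchScorer.run(c).score);
  return scores.reduce((a, b) => a + b, 0) / cases.length;
}

const avg = evaluateBatch([
  { output: "4", expected: "4" },
  { output: "five", expected: "5" },
]);
console.log(avg); // 0.5
```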
Evaluate multiple test cases to surface regressions across a suite.
A/B Testing
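One way to compare two model or prompt variants is to score their outputs on the same inputs and compare average scores. The length-based scorer below is a toy stand-in for a real quality scorer.

```typescript
// A/B comparison sketch: score outputs from two prompt/model variants on
// the same inputs and compare average scores. The conciseness check is a
// toy stand-in for a real scorer.
const conciseness = (output: string) => (output.length <= 40 ? 1 : 0);

function compareVariants(outputsA: string[], outputsB: string[]) {
  const avg = (xs: string[]) =>
    xs.map(conciseness).reduce((a, b) => a + b, 0) / xs.length;
  const a = avg(outputsA);
  const b = avg(outputsB);
  return { a, b, winner: a === b ? "tie" : a > b ? "A" : "B" };
}

const verdict = compareVariants(
  ["Short answer.", "Also short."],
  ["A much longer and more rambling answer that goes on and on."],
);
console.log(verdict.winner); // "A"
```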
Compare different models or prompts on identical inputs to see which performs better.
Scorer Pipeline
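Combining several scorers into one weighted result can be sketched as follows; the two toy scorers stand in for real Mastra scorers.

```typescript
// Pipeline sketch: run several stand-in scorers on one output and combine
// their scores with weights. Real Mastra scorers would replace the toy
// functions below.
type Scorer = { name: string; weight: number; score: (output: string) => number };

const scorers: Scorer[] = [
  { name: "nonEmpty", weight: 0.5, score: (o) => (o.trim().length > 0 ? 1 : 0) },
  { name: "mentionsTopic", weight: 0.5, score: (o) => (o.includes("eval") ? 1 : 0) },
];

function runPipeline(output: string) {
  const results = scorers.map((s) => ({ name: s.name, score: s.score(output) }));
  const total = scorers.reduce((sum, s, i) => sum + s.weight * results[i].score, 0);
  return { results, total };
}

console.log(runPipeline("evals measure quality").total); // 1
```

Weighting lets one pipeline balance several quality dimensions (for example accuracy against tone) in a single number.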
Combine multiple scorers to cover several quality dimensions at once.
Benefits
Quality Assurance
Catch quality issues before production
Continuous Improvement
Track metrics over time to improve agents
Model Comparison
Compare different models and configurations
Cost Optimization
Find the right balance of quality and cost
Integration with Observability
Scorers work seamlessly with Mastra's observability system.
Next Steps
Creating Evals
Build custom evaluation workflows
Using Scorers
Learn about prebuilt scorers