Mastra Evals provides a collection of scoring utilities to evaluate AI agent responses. Use prebuilt scorers for common quality metrics or create custom scorers tailored to your needs.

What are Evals?

Evals (evaluations) help you measure and improve AI agent quality by:
  • Scoring Responses - Rate outputs on metrics like accuracy, relevance, and toxicity
  • Comparing Outputs - Evaluate different models or prompts
  • Catching Issues - Detect hallucinations, bias, and other problems
  • Improving Quality - Iterate on prompts and model configurations

Scorer Types

Mastra provides two categories of scorers:

LLM Scorers

Use a judge model to evaluate responses:
import { createFaithfulnessScorer } from '@mastra/evals/scorers/prebuilt';

const faithfulness = createFaithfulnessScorer({
  model: 'gpt-4o-mini',
});

const score = await faithfulness.score({
  answer: 'Paris is the capital of France.',
  context: ['France is a country in Europe', 'Paris is the capital of France'],
});

console.log(score);
// {
//   value: 1.0,
//   reason: 'The answer is fully supported by the context'
// }
Available LLM Scorers:
  • createFaithfulnessScorer - Checks if response is supported by context
  • createAnswerRelevancyScorer - Measures relevance to question
  • createContextRelevanceScorer - Evaluates context quality
  • createContextPrecisionScorer - Checks context precision
  • createHallucinationScorer - Detects hallucinated information
  • createBiasScorer - Identifies biased language
  • createToxicityScorer - Detects toxic content
  • createPromptAlignmentScorer - Checks instruction following
  • createToolCallAccuracyScorer - Evaluates tool usage

Code Scorers

Deterministic heuristics that don’t require external models:
import { createContentSimilarityScorer } from '@mastra/evals/scorers/prebuilt';

const similarity = createContentSimilarityScorer({ 
  ignoreCase: true 
});

const score = similarity.score({
  input: 'Hello world',
  output: 'hello world',
});

console.log(score.value); // 1.0 (100% similar)
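For intuition, the Jaccard-style comparison described for the content-similarity scorer can be sketched in a few lines of plain TypeScript. This is a hypothetical illustration of the metric, not the library's implementation:

```typescript
// Hypothetical sketch of token-level Jaccard similarity:
// |intersection| / |union| over lowercased word sets.
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

console.log(jaccardSimilarity('Hello world', 'hello world')); // 1
console.log(jaccardSimilarity('Paris is the capital', 'London is the capital')); // 0.6
```

Because the comparison is over word sets, case and word order do not affect the score, which matches the `ignoreCase` behavior shown above.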
Available Code Scorers:
  • createContentSimilarityScorer - Text similarity using Jaccard index
  • createTextualDifferenceScorer - Levenshtein distance
  • createKeywordCoverageScorer - Keyword presence check
  • createCompletenessScorer - Required elements coverage
  • createToneScorer - Sentiment analysis
  • createToolCallAccuracyScorerCode - Deterministic tool accuracy
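Similarly, the textual-difference scorer is described as using Levenshtein distance. A minimal sketch of that idea, assuming the raw edit distance is normalized by the longer string's length (an assumption for illustration, not the library's exact formula):

```typescript
// Classic single-row Levenshtein (edit) distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Normalize to a similarity in [0, 1]: 1 means identical strings.
function textualSimilarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

console.log(levenshtein('kitten', 'sitting')); // 3
```

Unlike the set-based Jaccard metric, edit distance is order- and case-sensitive, which makes it better suited to catching small character-level drift between outputs.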

Quick Start

Install

npm install @mastra/evals

Basic Example

import { createFaithfulnessScorer, createContentSimilarityScorer } from '@mastra/evals/scorers/prebuilt';

// LLM-based scorer
const faithfulness = createFaithfulnessScorer({
  model: 'gpt-4o-mini',
});

// Code-based scorer
const similarity = createContentSimilarityScorer({ 
  ignoreCase: true 
});

const answer = 'Paris is the capital of France.';
const context = ['France is in Europe', 'Paris is the capital of France'];

// Score with LLM
const faithfulnessScore = await faithfulness.score({ answer, context });

// Score with code
const similarityScore = similarity.score({
  input: context[0],
  output: answer,
});

console.log({ faithfulnessScore, similarityScore });
// {
//   faithfulnessScore: { value: 1.0, reason: '...' },
//   similarityScore: { value: 0.45 }
// }

Scoring Agent Runs

Score outputs from Mastra agent executions:
import { Mastra } from '@mastra/core';
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt';

const mastra = new Mastra({
  agents: {
    myAgent: {
      name: 'My Agent',
      instructions: 'You are helpful',
      model: 'gpt-4',
    },
  },
});

// Create scorer
const scorer = createAnswerRelevancyScorer({
  model: 'gpt-4o-mini',
});

// Run agent
const agent = mastra.getAgent('myAgent');
const result = await agent.generate({
  messages: [{ role: 'user', content: 'What is the capital of France?' }],
});

// Score the run (the scorer reads both the question and the answer
// from the run's message history)
const score = await scorer.run({
  input: result.messages,
  output: result.messages,
});

console.log(score);
// {
//   value: 0.95,
//   reason: 'The response directly answers the question'
// }
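When wiring run output into a scorer, you often need just the last user and assistant messages from the run's transcript. A hypothetical helper along those lines (the `Message` shape below is an assumption for illustration, not the library's type):

```typescript
// Assumed minimal message shape for illustration.
interface Message {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

// Pull the most recent message for a given role out of a run transcript.
function lastMessageByRole(
  messages: Message[],
  role: Message['role'],
): string | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === role) return messages[i].content;
  }
  return undefined;
}

const transcript: Message[] = [
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
];

console.log(lastMessageByRole(transcript, 'assistant'));
// The capital of France is Paris.
```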

Evaluation Workflows

Batch Evaluation

Evaluate multiple test cases:
import { createFaithfulnessScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createFaithfulnessScorer({ model: 'gpt-4o-mini' });

const testCases = [
  {
    answer: 'Paris is the capital of France',
    context: ['Paris is the capital of France'],
  },
  {
    answer: 'London is the capital of France',
    context: ['Paris is the capital of France'],
  },
];

const results = await Promise.all(
  testCases.map(testCase => scorer.score(testCase))
);

results.forEach((result, i) => {
  console.log(`Test ${i + 1}: Score ${result.value}`);
});
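Once batch results are in, simple aggregates make the run easier to interpret than per-case logs. A minimal sketch, assuming each result exposes a numeric `value` as shown above and that you pick a pass threshold yourself:

```typescript
interface ScoreResult {
  value: number;
}

// Summarize a batch of scorer results: mean score and the fraction
// of cases at or above a pass threshold.
function summarize(results: ScoreResult[], threshold = 0.8) {
  const values = results.map((r) => r.value);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const passRate = values.filter((v) => v >= threshold).length / values.length;
  return { mean, passRate };
}

console.log(summarize([{ value: 1.0 }, { value: 0.2 }]));
// { mean: 0.6, passRate: 0.5 }
```

Tracking these aggregates across runs is a lightweight way to spot regressions after a prompt or model change.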

A/B Testing

Compare different models or prompts:
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createAnswerRelevancyScorer({ model: 'gpt-4o-mini' });

// Assume agentA and agentB are two agents configured with different models
// Test model A
const resultA = await agentA.generate({
  messages: [{ role: 'user', content: 'What is AI?' }],
});

// Test model B
const resultB = await agentB.generate({
  messages: [{ role: 'user', content: 'What is AI?' }],
});

// Compare scores
const scoreA = await scorer.run({ input: resultA.messages, output: resultA.messages });
const scoreB = await scorer.run({ input: resultB.messages, output: resultB.messages });

console.log('Model A:', scoreA.value);
console.log('Model B:', scoreB.value);

Scorer Pipeline

Combine multiple scorers:
import {
  createFaithfulnessScorer,
  createAnswerRelevancyScorer,
  createToxicityScorer,
} from '@mastra/evals/scorers/prebuilt';

const scorers = [
  createFaithfulnessScorer({ model: 'gpt-4o-mini' }),
  createAnswerRelevancyScorer({ model: 'gpt-4o-mini' }),
  createToxicityScorer({ model: 'gpt-4o-mini' }),
];

const answer = 'Paris is the capital of France.';
const context = ['Paris is the capital of France'];
const question = 'What is the capital of France?';

const scores = await Promise.all([
  scorers[0].score({ answer, context }),
  scorers[1].score({ answer, question }),
  scorers[2].score({ answer }),
]);

console.log('Faithfulness:', scores[0].value);
console.log('Relevancy:', scores[1].value);
console.log('Toxicity:', scores[2].value);
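When running several scorers like this, it is often useful to fold their values into a single weighted composite, inverting "lower is better" metrics such as toxicity. The helper below is a hypothetical pattern, not part of the library, and assumes every score lies in [0, 1]:

```typescript
interface NamedScore {
  name: string;
  value: number; // assumed to lie in [0, 1]
  weight: number;
  lowerIsBetter?: boolean; // e.g. toxicity, where 0 is ideal
}

// Weighted average of scorer values, flipping metrics where lower is better.
function compositeScore(scores: NamedScore[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  const weighted = scores.reduce((sum, s) => {
    const v = s.lowerIsBetter ? 1 - s.value : s.value;
    return sum + v * s.weight;
  }, 0);
  return weighted / totalWeight;
}

const overall = compositeScore([
  { name: 'faithfulness', value: 1.0, weight: 2 },
  { name: 'relevancy', value: 0.9, weight: 1 },
  { name: 'toxicity', value: 0.0, weight: 1, lowerIsBetter: true },
]);

console.log(overall); // 0.975
```

Weighting lets you emphasize the metrics that matter most for your use case, for example weighting faithfulness heavily for retrieval-augmented agents.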

Benefits

Quality Assurance

Catch quality issues before production

Continuous Improvement

Track metrics over time to improve agents

Model Comparison

Compare different models and configurations

Cost Optimization

Find the right balance of quality and cost

Integration with Observability

Scorers work seamlessly with Mastra’s observability system:
import { Mastra } from '@mastra/core';

const mastra = new Mastra({
  observability: { enabled: true },
  agents: { /* ... */ },
});

// Generate with tracing (assume `agent` was obtained via mastra.getAgent)
const result = await agent.generate({
  messages: [{ role: 'user', content: 'Hello' }],
});

// Get trace
const trace = await mastra.getTrace(result.traceId!);

// Add score to trace
trace.addScore({
  name: 'answer_relevancy',
  value: 0.95,
  comment: 'Highly relevant response',
});

Next Steps

Creating Evals

Build custom evaluation workflows

Using Scorers

Learn about prebuilt scorers
