Mastra provides a comprehensive collection of prebuilt scorers for common evaluation tasks.

LLM-Based Scorers

Scorers that use a judge model to evaluate responses.

Faithfulness Scorer

Evaluates whether the response is supported by the provided context.
import { createFaithfulnessScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createFaithfulnessScorer({
  model: 'gpt-4o-mini',
  options: {
    scale: 1,  // Maximum score (default: 1, so scores fall in the 0-1 range)
    context: ['Paris is the capital of France'], // Optional: override context
  },
});

const result = await scorer.score({
  answer: 'Paris is the capital of France.',
  context: ['Paris is the capital of France', 'France is in Europe'],
});

console.log(result);
// {
//   value: 1.0,
//   reason: 'All claims are supported by the context'
// }
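As a rough mental model (a sketch of the arithmetic, not the library's implementation), faithfulness-style scoring reduces to the fraction of extracted claims that the context supports; the judge model does the claim extraction and classification, and the score is the supported fraction scaled to the configured range:

```typescript
// Hypothetical helper illustrating the faithfulness arithmetic:
// the judge classifies each claim, the score is the supported fraction.
type Verdict = 'supported' | 'unsupported';

function faithfulnessScore(verdicts: Verdict[], scale = 1): number {
  if (verdicts.length === 0) return scale; // no claims: vacuously faithful
  const supported = verdicts.filter(v => v === 'supported').length;
  return (supported / verdicts.length) * scale;
}

// Two claims, both supported by the context
console.log(faithfulnessScore(['supported', 'supported'])); // 1
```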

Answer Relevancy Scorer

Measures how relevant the answer is to the question.
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createAnswerRelevancyScorer({
  model: 'gpt-4o-mini',
});

const result = await scorer.score({
  answer: 'Paris is the capital of France.',
  question: 'What is the capital of France?',
});

console.log(result.value); // 0.95

Context Relevance Scorer

Evaluates whether the provided context is relevant to the question.
import { createContextRelevanceScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createContextRelevanceScorer({
  model: 'gpt-4o-mini',
});

const result = await scorer.score({
  question: 'What is the capital of France?',
  context: ['Paris is the capital of France', 'France is in Europe'],
});

console.log(result.value); // 1.0 (both context items are relevant)

Context Precision Scorer

Checks if relevant context appears before irrelevant context.
import { createContextPrecisionScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createContextPrecisionScorer({
  model: 'gpt-4o-mini',
});

const result = await scorer.score({
  question: 'What is the capital of France?',
  context: [
    'Paris is the capital of France',  // Relevant - good!
    'France is in Europe',              // Less relevant
  ],
  expectedAnswer: 'Paris',
});
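Context precision is conventionally computed as average precision over the context positions: each relevant item contributes its precision-at-that-position, so relevant items near the front score higher. A sketch of that calculation (the relevance judgments themselves come from the judge model):

```typescript
// Average precision over a list of relevance flags (true = relevant).
// Relevant items near the front of the context yield a higher score.
function averagePrecision(relevant: boolean[]): number {
  let hits = 0;
  let sum = 0;
  relevant.forEach((isRelevant, i) => {
    if (isRelevant) {
      hits += 1;
      sum += hits / (i + 1); // precision at this position
    }
  });
  return hits === 0 ? 0 : sum / hits;
}

console.log(averagePrecision([true, false]));  // 1   - relevant item first
console.log(averagePrecision([false, true]));  // 0.5 - relevant item buried
```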

Hallucination Scorer

Detects if the response contains hallucinated information.
import { createHallucinationScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createHallucinationScorer({
  model: 'gpt-4o-mini',
  options: {
    getContext: (run) => {
      // Extract context from tool results
      return run.output
        .flatMap(m => m.content?.toolInvocations || [])
        .map(t => JSON.stringify(t.result));
    },
  },
});

const result = await scorer.score({
  answer: 'The Eiffel Tower is in Berlin.',  // Hallucination!
  context: ['The Eiffel Tower is in Paris'],
});

console.log(result.value); // Low score (hallucination detected)

Bias Scorer

Identifies biased language in responses.
import { createBiasScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createBiasScorer({
  model: 'gpt-4o-mini',
});

const result = await scorer.score({
  answer: 'Engineers are naturally better at math than others.',
});

console.log(result.value); // Low score (bias detected)

Toxicity Scorer

Detects toxic, harmful, or inappropriate content.
import { createToxicityScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createToxicityScorer({
  model: 'gpt-4o-mini',
});

const result = await scorer.score({
  answer: 'Your response here',
});

console.log(result.value); // 0-1 (lower = less toxic)

Prompt Alignment Scorer

Checks if the response follows the given instructions.
import { createPromptAlignmentScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createPromptAlignmentScorer({
  model: 'gpt-4o-mini',
});

const result = await scorer.score({
  answer: 'Sure! Paris is the capital of France.',
  instructions: 'Answer in one word only',
});

console.log(result.value); // Low score (didn't follow instructions)

Tool Call Accuracy Scorer (LLM)

Evaluates whether the agent used the correct tools.
import { createToolCallAccuracyScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createToolCallAccuracyScorer({
  model: 'gpt-4o-mini',
  options: {
    expectedTools: ['calculator', 'search'],
  },
});

const result = await scorer.run({
  input: messages,
  output: agentOutput,
});

Code-Based Scorers

Deterministic scorers that don’t require a judge model.

Content Similarity Scorer

Measures text similarity using the Jaccard index.
import { createContentSimilarityScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createContentSimilarityScorer({
  ignoreCase: true,
  tokenizer: 'word',  // 'word' or 'char'
});

const result = scorer.score({
  input: 'Hello world',
  output: 'hello world',
});

console.log(result.value); // 1.0 (100% similar)
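The Jaccard index behind this scorer is |A ∩ B| / |A ∪ B| over token sets. A minimal sketch with word tokens and case folding (illustrating the metric, not the scorer's internals):

```typescript
// Jaccard similarity over word-token sets: |intersection| / |union|.
function jaccard(a: string, b: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter(t => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

console.log(jaccard('Hello world', 'hello world')); // 1
console.log(jaccard('Hello world', 'hello there')); // 0.333... (1 shared of 3 tokens)
```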

Textual Difference Scorer

Measures similarity based on the Levenshtein (edit) distance between strings, normalized to 0-1.
import { createTextualDifferenceScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createTextualDifferenceScorer({
  caseSensitive: false,
});

const result = scorer.score({
  input: 'Hello',
  output: 'Helo',
});

console.log(result.value); // 0.8 (1 character difference)
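The 0.8 comes from normalizing the edit distance by the longer string's length: score = 1 - distance / maxLength. A sketch of that normalization, with the distance computed by standard dynamic programming (an illustration of the metric, not the scorer's internals):

```typescript
// Levenshtein distance via dynamic programming.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalize by the longer string's length to get a 0-1 similarity.
function similarity(a: string, b: string): number {
  const maxLength = Math.max(a.length, b.length);
  return maxLength === 0 ? 1 : 1 - levenshtein(a, b) / maxLength;
}

console.log(similarity('Hello', 'Helo')); // 0.8 (1 edit over 5 characters)
```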

Keyword Coverage Scorer

Checks if specific keywords are present.
import { createKeywordCoverageScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createKeywordCoverageScorer({
  keywords: ['paris', 'france', 'capital'],
  caseSensitive: false,
});

const result = scorer.score({
  output: 'Paris is the capital of France',
});

console.log(result.value); // 1.0 (all keywords present)
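The underlying calculation is simply the fraction of required keywords found in the output. A sketch:

```typescript
// Fraction of required keywords present in the output (case-insensitive).
function keywordCoverage(output: string, keywords: string[]): number {
  const haystack = output.toLowerCase();
  const found = keywords.filter(k => haystack.includes(k.toLowerCase())).length;
  return keywords.length === 0 ? 1 : found / keywords.length;
}

console.log(
  keywordCoverage('Paris is the capital of France', ['paris', 'france', 'capital'])
); // 1
```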

Completeness Scorer

Ensures all required elements are included.
import { createCompletenessScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createCompletenessScorer({
  requiredElements: ['greeting', 'introduction', 'conclusion'],
});

const result = scorer.score({
  output: 'Hello! I am an AI assistant. Goodbye!',
});

console.log(result.value); // Score based on element coverage

Tone Scorer

Analyzes sentiment/tone of the response.
import { createToneScorer } from '@mastra/evals/scorers/prebuilt';

const scorer = createToneScorer({
  expectedTone: 'positive',  // 'positive', 'negative', or 'neutral'
});

const result = scorer.score({
  output: 'This is great!',
});

console.log(result.value); // High score (matches expected tone)

Tool Call Accuracy Scorer (Code)

Deterministic tool usage validation.
import { createToolCallAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt';

const scorer = createToolCallAccuracyScorerCode({
  expectedTools: ['calculator'],
  strictOrder: false,  // Don't enforce order
});

const result = await scorer.run({
  input: messages,
  output: agentOutput,
});

Scoring Agent Runs

All scorers can evaluate Mastra agent runs:
import { Mastra } from '@mastra/core';
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt';

const mastra = new Mastra({ /* config */ });
const scorer = createAnswerRelevancyScorer({ model: 'gpt-4o-mini' });

// Run agent
const agent = mastra.getAgent('myAgent');
const result = await agent.generate({
  messages: [{ role: 'user', content: 'What is AI?' }],
});

// Score the output
const score = await scorer.run({
  input: [{ role: 'user', content: 'What is AI?' }],
  output: result.messages,
});

console.log(score.value);

Combining Scorers

Use multiple scorers together:
import {
  createFaithfulnessScorer,
  createAnswerRelevancyScorer,
  createContentSimilarityScorer,
} from '@mastra/evals/scorers/prebuilt';

const scorers = {
  faithfulness: createFaithfulnessScorer({ model: 'gpt-4o-mini' }),
  relevancy: createAnswerRelevancyScorer({ model: 'gpt-4o-mini' }),
  similarity: createContentSimilarityScorer({ ignoreCase: true }),
};

const answer = 'Paris is the capital of France.';
const context = ['Paris is the capital of France'];
const question = 'What is the capital of France?';
const expected = 'Paris';

// Run all scorers
const [faithfulness, relevancy, similarity] = await Promise.all([
  scorers.faithfulness.score({ answer, context }),
  scorers.relevancy.score({ answer, question }),
  scorers.similarity.score({ input: expected, output: answer }),
]);

console.log({
  faithfulness: faithfulness.value,
  relevancy: relevancy.value,
  similarity: similarity.value,
});

Scorer Configuration

Model Selection

Choose appropriate judge models:
// Fast and cheap
const fastScorer = createFaithfulnessScorer({ model: 'gpt-4o-mini' });

// More accurate
const accurateScorer = createFaithfulnessScorer({ model: 'gpt-4o' });

// Use Anthropic
const claudeScorer = createFaithfulnessScorer({ model: 'claude-3-5-sonnet-latest' });

Scale Options

Adjust score ranges:
const scorer = createFaithfulnessScorer({
  model: 'gpt-4o-mini',
  options: {
    scale: 10,  // Scores from 0-10 instead of 0-1
  },
});

Best Practices

Choose the Right Scorer

  • Use LLM scorers for nuanced evaluation (relevancy, hallucination)
  • Use code scorers for deterministic checks (keywords, format)
  • Combine both for comprehensive evaluation
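One pattern the last point suggests (a hypothetical helper, not a built-in Mastra API) is rolling several scorer results into a single weighted composite:

```typescript
// Hypothetical helper: combine named scores into one weighted composite.
// The weights are illustrative assumptions, not Mastra defaults.
function compositeScore(
  scores: Record<string, number>,
  weights: Record<string, number>
): number {
  let total = 0;
  let weightSum = 0;
  for (const [name, value] of Object.entries(scores)) {
    const w = weights[name] ?? 0;
    total += value * w;
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

const overall = compositeScore(
  { faithfulness: 1.0, relevancy: 0.95, similarity: 1.0 },
  { faithfulness: 0.5, relevancy: 0.3, similarity: 0.2 } // critical checks weighted highest
);
console.log(overall); // 0.985
```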

Optimize Costs

// Use cheaper models for simple evals
const quickCheck = createToxicityScorer({ model: 'gpt-4o-mini' });

// Use better models for critical evals
const criticalCheck = createHallucinationScorer({ model: 'gpt-4o' });

Batch Processing

const testCases = [ /* ... */ ];

// Process in parallel for speed
const results = await Promise.all(
  testCases.map(test => scorer.score(test))
);
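An unbounded Promise.all can trip provider rate limits when the scorers are LLM-backed. A chunked variant (a sketch; the chunk size is an assumption to tune against your provider's limits) keeps concurrency bounded:

```typescript
// Run async scoring tasks in fixed-size chunks to bound concurrency.
async function scoreInChunks<T, R>(
  items: T[],
  score: (item: T) => Promise<R>,
  chunkSize = 5 // assumed limit; tune to your provider's rate limits
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    results.push(...(await Promise.all(chunk.map(score))));
  }
  return results;
}

// Example with a stand-in async scorer:
const fakeScore = async (n: number) => n * 2;
scoreInChunks([1, 2, 3, 4, 5, 6], fakeScore, 2).then(r => console.log(r)); // [2, 4, 6, 8, 10, 12]
```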

Next Steps

Creating Evals

Build custom scorers

Observability

Track scores with observability
