
Evaluation

Evaluation helps you measure and improve the quality of your AI applications. Genkit provides tools for scoring outputs, running test datasets, and tracking performance over time.

Why Evaluate?

AI outputs can be unpredictable. Evaluation helps you:
  • Measure quality - Quantify how well your AI performs
  • Catch regressions - Detect when changes make outputs worse
  • Compare approaches - Test different models, prompts, or parameters
  • Track improvements - Monitor quality over time

Built-in Evaluators

Genkit includes several built-in evaluators:
import "github.com/firebase/genkit/go/plugins/evaluators"

metrics := []evaluators.MetricConfig{
    {
        MetricType: evaluators.EvaluatorDeepEqual,
    },
    {
        MetricType: evaluators.EvaluatorRegex,
    },
    {
        MetricType: evaluators.EvaluatorJsonata,
    },
}

ctx := context.Background()

g := genkit.Init(ctx, genkit.WithPlugins(
    &googlegenai.GoogleAI{},
    &evaluators.GenkitEval{Metrics: metrics},
))

DeepEqual

Checks if the output exactly matches an expected value:
{
  "expected": "Paris is the capital of France",
  "actual": "Paris is the capital of France",
  "score": 1.0
}

Regex

Matches output against a regular expression:
{
  "pattern": "capital.*France",
  "actual": "The capital of France is Paris",
  "score": 1.0
}

JSONata

Queries structured output using JSONata:
{
  "query": "$.ingredients[0].name",
  "expected": "flour",
  "score": 1.0
}
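
These built-in metrics compare a flow's output against each test case's reference value. As a sketch, a dataset entry for the regex metric might look like this, assuming the pattern travels in the reference field:
{
  "testCaseId": "regex_check",
  "input": "What is the capital of France?",
  "reference": "capital.*France"
}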

Custom Evaluators

Create custom evaluators for your specific needs:
import (
    "context"
    "fmt"
    "strings"

    "github.com/firebase/genkit/go/ai"
    "github.com/firebase/genkit/go/core/api"
    "github.com/firebase/genkit/go/genkit"
)

evalOptions := ai.EvaluatorOptions{
    DisplayName: "Simple Evaluator",
    Definition:  "Checks if output contains specific keywords",
    IsBilled:    false,
}

genkit.DefineEvaluator(g, api.NewName("custom", "keywordChecker"), &evalOptions,
    func(ctx context.Context, req *ai.EvaluatorCallbackRequest) (*ai.EvaluatorCallbackResponse, error) {
        // Check whether the output contains all required keywords
        output, ok := req.Input.Output.(string)
        if !ok {
            return nil, fmt.Errorf("expected string output, got %T", req.Input.Output)
        }
        keywords := []string{"Paris", "France"}

        foundAll := true
        for _, keyword := range keywords {
            if !strings.Contains(output, keyword) {
                foundAll = false
                break
            }
        }

        // Report pass or fail based on the keyword check
        status := ai.ScoreStatusPass
        if !foundAll {
            status = ai.ScoreStatusFail
        }

        score := ai.Score{
            Id:     "keyword_match",
            Score:  foundAll,
            Status: status.String(),
            Details: map[string]any{
                "reasoning": fmt.Sprintf("Found all keywords: %v", foundAll),
            },
        }

        return &ai.EvaluatorCallbackResponse{
            TestCaseId: req.Input.TestCaseId,
            Evaluation: []ai.Score{score},
        }, nil
    })

Batch Evaluators

Process multiple test cases efficiently:
genkit.DefineBatchEvaluator(g, api.NewName("custom", "batchChecker"), &evalOptions,
    func(ctx context.Context, req *ai.EvaluatorRequest) (*ai.EvaluatorResponse, error) {
        // EvaluatorResponse is a slice of per-test-case results
        var evalResponses ai.EvaluatorResponse

        for _, datapoint := range req.Dataset {
            score := ai.Score{
                Id:     "testScore",
                Score:  evaluateDatapoint(datapoint), // your scoring logic
                Status: ai.ScoreStatusPass.String(),
                Details: map[string]any{
                    "reasoning": fmt.Sprintf("Evaluated: %v", datapoint.Input),
                },
            }

            evalResponses = append(evalResponses, ai.EvaluationResult{
                TestCaseId: datapoint.TestCaseId,
                Evaluation: []ai.Score{score},
            })
        }

        return &evalResponses, nil
    })

Using the Developer UI

The Genkit Developer UI provides visual evaluation tools:
  1. Run the Dev UI:
    genkit start -- go run main.go
    
  2. Navigate to Evaluate tab
  3. Create a test dataset:
    [
      {
        "testCaseId": "test1",
        "input": "What is the capital of France?",
        "reference": "Paris"
      },
      {
        "testCaseId": "test2",
        "input": "What is the capital of Japan?",
        "reference": "Tokyo"
      }
    ]
    
  4. Run evaluation and view results with detailed traces
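
You can also run an evaluation against a flow from the command line. A sketch, assuming a flow named myFlow and a JSON file of test inputs (exact flags may vary by CLI version):
genkit eval:flow myFlow --input testInputs.json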

Programmatic Evaluation

Evaluate flows programmatically in your tests:
// myFlow is a flow defined elsewhere in your application
import { myFlow } from './flows';

const testCases = [
  { input: 'What is 2+2?', expected: '4' },
  { input: 'What is the capital of France?', expected: 'Paris' },
];

for (const testCase of testCases) {
  const result = await myFlow(testCase.input);
  const passed = result.includes(testCase.expected);
  
  console.log(`Test: ${testCase.input}`);
  console.log(`Expected: ${testCase.expected}`);
  console.log(`Result: ${result}`);
  console.log(`Passed: ${passed}\n`);
}

Evaluation Metrics

Accuracy

Measure exact match rate:
function calculateAccuracy(results: Array<{ expected: string, actual: string }>) {
  const correct = results.filter(r => r.actual === r.expected).length;
  return correct / results.length;
}
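
For example, scoring two hypothetical results, one exact match and one miss:
const accuracy = calculateAccuracy([
  { expected: 'Paris', actual: 'Paris' },
  { expected: 'Tokyo', actual: 'Kyoto' },
]);
console.log(accuracy); // 0.5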

Semantic Similarity

Use embeddings to measure semantic similarity:
import { genkit } from 'genkit';
import { googleAI } from '@genkit-ai/google-genai';

const ai = genkit({ plugins: [googleAI()] });

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

async function semanticSimilarity(
  expected: string,
  actual: string
): Promise<number> {
  const embedder = googleAI.embedder('text-embedding-004');
  const [expectedEmbed, actualEmbed] = await Promise.all([
    ai.embed({ embedder, content: expected }),
    ai.embed({ embedder, content: actual }),
  ]);

  return cosineSimilarity(expectedEmbed[0].embedding, actualEmbed[0].embedding);
}
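
Scores near 1.0 indicate near-identical meaning. In practice you compare against a task-specific threshold; the 0.8 below is an assumption you should tune:
const score = await semanticSimilarity(
  'Paris is the capital of France',
  'The French capital is Paris'
);
console.log(score >= 0.8 ? 'pass' : 'fail');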

Retrieval Metrics (RAG)

For RAG applications, measure retrieval quality:
interface RetrievalMetrics {
  precision: number;  // Relevant docs / Retrieved docs
  recall: number;     // Relevant docs / Total relevant docs
  mrr: number;        // Mean Reciprocal Rank
}

function calculateRetrievalMetrics(
  retrieved: string[],
  relevant: string[]
): RetrievalMetrics {
  const relevantSet = new Set(relevant);
  const relevantRetrieved = retrieved.filter(doc => relevantSet.has(doc));
  
  const precision = relevantRetrieved.length / retrieved.length;
  const recall = relevantRetrieved.length / relevant.length;
  
  // Mean Reciprocal Rank
  let mrr = 0;
  for (let i = 0; i < retrieved.length; i++) {
    if (relevantSet.has(retrieved[i])) {
      mrr = 1 / (i + 1);
      break;
    }
  }
  
  return { precision, recall, mrr };
}
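
For example, with three retrieved documents of which two are relevant:
const metrics = calculateRetrievalMetrics(
  ['doc-a', 'doc-b', 'doc-c'], // retrieved, in rank order
  ['doc-a', 'doc-c', 'doc-d']  // all relevant documents
);
// precision = 2/3, recall = 2/3, mrr = 1 (first result is relevant)
console.log(metrics);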

A/B Testing

Compare different approaches:
// Assumes `ai` is your initialized Genkit instance (as above)
async function compareModels(
  testCases: Array<{ input: string; expected: string }>
) {
  const results: Record<string, number[]> = {
    'gemini-2.5-flash': [],
    'gemini-2.5-pro': [],
  };
  
  for (const testCase of testCases) {
    for (const model of Object.keys(results)) {
      const { text } = await ai.generate({
        model: googleAI.model(model),
        prompt: testCase.input,
      });
      
      // evaluateOutput is your scoring function (e.g., semanticSimilarity above)
      const score = await evaluateOutput(text, testCase.expected);
      results[model].push(score);
    }
  }
  
  // Calculate averages
  const summary: Record<string, number> = {};
  for (const [model, scores] of Object.entries(results)) {
    summary[model] = scores.reduce((a, b) => a + b, 0) / scores.length;
  }
  
  return summary;
}
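
Calling it yields one average score per model (the values below are purely illustrative):
const comparison = await compareModels(testCases);
console.log(comparison);
// e.g. { 'gemini-2.5-flash': 0.82, 'gemini-2.5-pro': 0.91 }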

Best Practices

Create Diverse Test Sets

Cover various scenarios:
const testCases = [
  // Happy path
  { input: 'What is the capital of France?', expected: 'Paris' },
  
  // Edge cases
  { input: 'What is the capital of a country that doesn\'t exist?', expected: 'unknown' },
  
  // Ambiguous inputs
  { input: 'capital', expected: 'clarification' },
  
  // Different phrasings
  { input: 'France\'s capital city?', expected: 'Paris' },
];

Track Metrics Over Time

Store evaluation results:
interface EvaluationResult {
  timestamp: Date;
  modelVersion: string;
  averageScore: number;
  testCases: number;
}

const results: EvaluationResult[] = [];

async function runAndTrackEvaluation() {
  const scores = await runEvaluation(); // your evaluation harness returning number[]
  
  results.push({
    timestamp: new Date(),
    modelVersion: 'gemini-2.5-flash',
    averageScore: scores.reduce((a, b) => a + b, 0) / scores.length,
    testCases: scores.length,
  });
  
  // Save to database
  await saveResults(results);
}
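
With history in place, you can flag regressions by comparing the latest run to the previous one; the 5% tolerance below is an assumption to tune:
function detectRegression(history: EvaluationResult[]): boolean {
  if (history.length < 2) return false;
  const [previous, latest] = history.slice(-2);
  // Flag runs whose average dropped more than 5% below the previous run
  return latest.averageScore < previous.averageScore * 0.95;
}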

Automate Evaluation in CI/CD

Run evaluations automatically:
# .github/workflows/evaluate.yml
name: Evaluate AI Quality

on:
  pull_request:
  schedule:
    - cron: '0 0 * * *'  # Daily

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm run evaluate
      - name: Check quality threshold
        run: |
          if ! jq -e '.averageScore >= 0.8' results.json > /dev/null; then
            echo "Quality below threshold"
            exit 1
          fi
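
This workflow assumes an evaluate npm script that writes results.json. A minimal sketch, where runEvaluation is your own harness (as in the tracking example):
// scripts/evaluate.ts -- wired up as `npm run evaluate`
import { writeFileSync } from 'node:fs';
import { runEvaluation } from './harness'; // hypothetical module

const scores = await runEvaluation();
const averageScore = scores.reduce((a, b) => a + b, 0) / scores.length;
writeFileSync('results.json', JSON.stringify({ averageScore }, null, 2));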

Use Human Evaluation

For subjective qualities, involve humans:
interface HumanEvaluation {
  testCaseId: string;
  output: string;
  ratings: {
    accuracy: number;      // 1-5
    helpfulness: number;   // 1-5
    tone: number;          // 1-5
  };
  feedback: string;
}

// Present outputs to human reviewers
// Collect ratings and feedback
// Use to improve prompts and fine-tune models
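
A simple way to act on collected reviews is to average each rating dimension across test cases:
function averageRatings(reviews: HumanEvaluation[]) {
  // Average one 1-5 rating dimension across all reviews
  const avg = (dim: 'accuracy' | 'helpfulness' | 'tone') =>
    reviews.reduce((sum, r) => sum + r.ratings[dim], 0) / reviews.length;
  return {
    accuracy: avg('accuracy'),
    helpfulness: avg('helpfulness'),
    tone: avg('tone'),
  };
}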
