The ADK evaluation system helps you measure and improve agent performance through structured testing, prebuilt metrics, and custom evaluators.
## Overview
The evaluation system provides:
- Evaluation datasets - Structured test cases with expected outputs
- Prebuilt metrics - Ready-to-use evaluators for common tasks
- Custom evaluators - Build domain-specific evaluation logic
- Batch evaluation - Run tests across multiple cases
- Detailed reports - Per-invocation and aggregate results
## Quick Start

```typescript
import { AgentBuilder, AgentEvaluator, PrebuiltMetrics } from "@iqai/adk";

const agent = new AgentBuilder()
  .withModel("gpt-4")
  .withTools([searchTool]) // searchTool defined elsewhere
  .buildLlm();

// Define evaluation criteria
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8,
};

// Run evaluation
await AgentEvaluator.evaluate(
  agent,
  "./test-data.json", // Path to test file or directory
  2 // Number of runs per test case
);
```
## EvalSet

An `EvalSet` is a collection of test cases:

```typescript
interface EvalSet {
  evalSetId: string;
  name?: string;
  description?: string;
  evalCases: EvalCase[];
  creationTimestamp: number;
}
```
## EvalCase

Each test case contains a conversation with expected behavior:

```typescript
interface EvalCase {
  evalId: string;
  conversation: Invocation[];
  sessionInput?: SessionInput; // Initial session state
}
```
## Invocation

An invocation represents a single user message and its expected response:

```typescript
interface Invocation {
  invocationId?: string;
  userContent: Content; // User message
  finalResponse?: Content; // Expected response
  intermediateData?: IntermediateData; // Expected tool calls
  creationTimestamp: number;
}
```
## Example Dataset

```json
{
  "evalSetId": "search-agent-v1",
  "name": "Search Agent Tests",
  "evalCases": [
    {
      "evalId": "test-1",
      "conversation": [
        {
          "invocationId": "inv-1",
          "userContent": {
            "role": "user",
            "parts": [{ "text": "Search for the latest AI news" }]
          },
          "finalResponse": {
            "role": "model",
            "parts": [{ "text": "Here are the latest AI developments..." }]
          },
          "intermediateData": {
            "toolUses": [
              {
                "name": "search",
                "args": { "query": "latest AI news" }
              }
            ],
            "intermediateResponses": []
          },
          "creationTimestamp": 1234567890
        }
      ]
    }
  ],
  "creationTimestamp": 1234567890
}
```
## Prebuilt Metrics

ADK includes several ready-to-use evaluation metrics:

### TOOL_TRAJECTORY_AVG_SCORE

Evaluates whether the agent used the expected tools:

```typescript
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0, // Perfect match required
};
```

Measures: Precision and recall of tool calls (function names only)
### RESPONSE_MATCH_SCORE

Compares the agent response to a reference using ROUGE metrics:

```typescript
const criteria = {
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8, // 80% similarity
};
```

Measures: Text similarity using ROUGE-1, ROUGE-2, and ROUGE-L
### FINAL_RESPONSE_MATCH_V2

Advanced response matching with configurable scoring:

```typescript
const criteria = {
  [PrebuiltMetrics.FINAL_RESPONSE_MATCH_V2]: 0.75,
};
```

Measures: Combines multiple similarity metrics with customizable weights
### SAFETY_V1

Evaluates response safety using LLM-as-judge:

```typescript
const criteria = {
  [PrebuiltMetrics.SAFETY_V1]: 1.0, // All responses must be safe
};
```

Measures: Content safety across multiple dimensions (harm, bias, etc.)
## AgentEvaluator

The main interface for running evaluations.

### evaluate()

Runs evaluation on test files:

```typescript
static async evaluate(
  agent: BaseAgent,
  evalDatasetFilePathOrDir: string,
  numRuns: number = 2,
  initialSessionFile?: string
): Promise<void>
```

Example:

```typescript
await AgentEvaluator.evaluate(
  agent,
  "./tests", // Finds all *.test.json files
  3, // Run each test 3 times
  "./initial-session.json" // Optional initial state
);
```
### evaluateEvalSet()

Evaluates a specific eval set:

```typescript
static async evaluateEvalSet(
  agent: BaseAgent,
  evalSet: EvalSet,
  criteria: Record<string, number>,
  numRuns: number = 2,
  printDetailedResults: boolean = false
): Promise<void>
```

Example:

```typescript
const evalSet: EvalSet = {
  evalSetId: "my-tests",
  evalCases: [...],
  creationTimestamp: Date.now(),
};

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  {
    [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
    [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.75,
  },
  2, // Number of runs
  true // Print detailed results
);
```
## Test Configuration

Create a `test_config.json` file alongside your test data:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8,
    "safety_v1": 1.0
  }
}
```

The evaluator automatically finds and uses this config.
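For example, a test directory might be laid out like this (file names are illustrative; `evaluate()` picks up `*.test.json` files and the adjacent `test_config.json`):

```
tests/
├── test_config.json     # Shared criteria for all eval sets in this directory
├── search.test.json     # One EvalSet per file
└── booking.test.json
```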
## Custom Evaluators

Build custom evaluation logic by extending the `Evaluator` class.

### Define your metric

```typescript
import { Evaluator, EvalStatus } from "@iqai/adk";
import type { EvaluationResult, Invocation, MetricInfo } from "@iqai/adk";

export class CustomEvaluator extends Evaluator {
  constructor(threshold: number = 0.8) {
    super({
      metricName: "custom_metric",
      threshold,
    });
  }
}
```
### Implement evaluation logic

```typescript
async evaluateInvocations(
  actualInvocations: Invocation[],
  expectedInvocations: Invocation[]
): Promise<EvaluationResult> {
  const perInvocationResults = [];

  for (let i = 0; i < actualInvocations.length; i++) {
    const actual = actualInvocations[i];
    const expected = expectedInvocations[i];

    // Your evaluation logic here
    const score = this.calculateScore(actual, expected);
    const passed = score >= this.metric.threshold;

    perInvocationResults.push({
      actualInvocation: actual,
      expectedInvocation: expected,
      score,
      evalStatus: passed ? EvalStatus.PASSED : EvalStatus.FAILED,
    });
  }

  const overallScore =
    perInvocationResults.reduce((sum, r) => sum + (r.score || 0), 0) /
    perInvocationResults.length;

  return {
    overallScore,
    overallEvalStatus:
      overallScore >= this.metric.threshold ? EvalStatus.PASSED : EvalStatus.FAILED,
    perInvocationResults,
  };
}
```
### Add metric metadata

```typescript
static getMetricInfo(): MetricInfo {
  return {
    metricName: "custom_metric",
    description: "Evaluates custom business logic",
    defaultThreshold: 0.8,
    experimental: false,
    metricValueInfo: {
      interval: {
        minValue: 0,
        openAtMin: false,
        maxValue: 1,
        openAtMax: false,
      },
    },
  };
}
```
### Example: Intent Matching Evaluator

```typescript
import { Evaluator, EvalStatus } from "@iqai/adk";
import type { Invocation, EvaluationResult } from "@iqai/adk";

export class IntentMatchEvaluator extends Evaluator {
  constructor(threshold: number = 1.0) {
    super({ metricName: "intent_match", threshold });
  }

  async evaluateInvocations(
    actualInvocations: Invocation[],
    expectedInvocations: Invocation[]
  ): Promise<EvaluationResult> {
    const results = [];

    for (let i = 0; i < actualInvocations.length; i++) {
      const actual = actualInvocations[i];
      const expected = expectedInvocations[i];

      // Extract intents (simplified)
      const actualIntent = this.extractIntent(actual);
      const expectedIntent = this.extractIntent(expected);
      const score = actualIntent === expectedIntent ? 1.0 : 0.0;

      results.push({
        actualInvocation: actual,
        expectedInvocation: expected,
        score,
        evalStatus: score >= this.metric.threshold
          ? EvalStatus.PASSED
          : EvalStatus.FAILED,
      });
    }

    const avgScore = results.reduce((sum, r) => sum + r.score!, 0) / results.length;

    return {
      overallScore: avgScore,
      overallEvalStatus: avgScore >= this.metric.threshold
        ? EvalStatus.PASSED
        : EvalStatus.FAILED,
      perInvocationResults: results,
    };
  }

  private extractIntent(invocation: Invocation): string {
    // Implement your intent extraction logic
    const text = invocation.userContent.parts?.[0]?.text || "";

    // Simple keyword-based intent detection
    if (text.includes("search")) return "search";
    if (text.includes("book")) return "booking";
    return "unknown";
  }

  static getMetricInfo() {
    return {
      metricName: "intent_match",
      description: "Matches user intent against expected intent",
      defaultThreshold: 1.0,
    };
  }
}
```
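Stripped of the ADK types, the scoring at the heart of this evaluator is plain string matching plus an average. A minimal standalone sketch (the keyword rules are illustrative, not part of ADK):

```typescript
// Simple keyword-based intent detection, mirroring extractIntent above
function extractIntent(text: string): string {
  if (text.includes("search")) return "search";
  if (text.includes("book")) return "booking";
  return "unknown";
}

// Score each actual/expected pair 1.0 on an exact intent match,
// 0.0 otherwise, then average across all pairs
function intentMatchScore(actual: string[], expected: string[]): number {
  const scores = actual.map((text, i) =>
    extractIntent(text) === extractIntent(expected[i]) ? 1.0 : 0.0
  );
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

const score = intentMatchScore(
  ["search for AI news", "book a flight"],
  ["please search the web", "reserve a hotel"]
);
console.log(score); // → 0.5 (first pair matches, second does not)
```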
## Prebuilt Evaluators

ADK provides several evaluator implementations.

### TrajectoryEvaluator

Compares tool usage between actual and expected invocations:

```typescript
import { TrajectoryEvaluator } from "@iqai/adk";

const evaluator = new TrajectoryEvaluator(1.0); // Threshold

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
### RougeEvaluator

Compares text using ROUGE metrics:

```typescript
import { RougeEvaluator } from "@iqai/adk";

const evaluator = new RougeEvaluator(0.8); // 80% similarity

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
### FinalResponseMatchV2Evaluator

Advanced response matching:

```typescript
import { FinalResponseMatchV2Evaluator } from "@iqai/adk";

const evaluator = new FinalResponseMatchV2Evaluator(0.75);

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
### SafetyEvaluatorV1

LLM-based safety evaluation:

```typescript
import { SafetyEvaluatorV1 } from "@iqai/adk";

const evaluator = new SafetyEvaluatorV1(1.0, {
  judgeModel: "gpt-4",
  numSamples: 1,
});

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
## Evaluation Results

### EvalResult

Results include per-case and aggregate metrics:

```typescript
interface EvalCaseResult {
  evalId: string;
  evalMetricResultPerInvocation: EvalMetricResultPerInvocation[];
}

interface EvalMetricResultPerInvocation {
  actualInvocation: Invocation;
  expectedInvocation: Invocation;
  evalMetricResults: EvalMetricResult[];
}

interface EvalMetricResult {
  metricName: string;
  threshold: number;
  score?: number;
  evalStatus: EvalStatus;
}
```
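These result types can be walked to summarize failures. A standalone sketch over simplified local mirrors of the types (the data is mock, not real evaluator output):

```typescript
// Simplified local mirrors of the result types above
type EvalStatus = "PASSED" | "FAILED";

interface MetricResult {
  metricName: string;
  threshold: number;
  score?: number;
  evalStatus: EvalStatus;
}

// Collect a "metric (score < threshold)" string for every failed metric
// across all per-invocation results
function listFailures(perInvocation: MetricResult[][]): string[] {
  return perInvocation.flatMap((metrics) =>
    metrics
      .filter((m) => m.evalStatus === "FAILED")
      .map((m) => `${m.metricName} (${m.score ?? "n/a"} < ${m.threshold})`)
  );
}

const failures = listFailures([
  [
    { metricName: "response_match_score", threshold: 0.8, score: 0.55, evalStatus: "FAILED" },
    { metricName: "tool_trajectory_avg_score", threshold: 1.0, score: 1.0, evalStatus: "PASSED" },
  ],
]);
console.log(failures); // → ["response_match_score (0.55 < 0.8)"]
```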
### Reading Results

`evaluate()` throws when any criterion fails, so failures surface as exceptions:

```typescript
try {
  await AgentEvaluator.evaluate(agent, "./tests", 2);
  console.log("All tests passed!");
} catch (error) {
  console.error("Test failures:", (error as Error).message);

  // Re-run with detailed output (evalSet and criteria defined as above)
  await AgentEvaluator.evaluateEvalSet(
    agent,
    evalSet,
    criteria,
    1,
    true // Print detailed results
  );
}
```
## Advanced Usage

### Custom Judge Models

Use different models for LLM-as-judge metrics:

```typescript
// Instead of a plain criteria map...
const criteria = {
  [PrebuiltMetrics.SAFETY_V1]: 1.0,
};

// ...pass full metric configs with judge model options
const evalMetrics = [
  {
    metricName: PrebuiltMetrics.SAFETY_V1,
    threshold: 1.0,
    judgeModelOptions: {
      judgeModel: "claude-3-opus",
      judgeModelConfig: {
        temperature: 0.0,
      },
      numSamples: 3, // Sample multiple times for consistency
    },
  },
];
```
### Programmatic Evaluation

Run evaluation without the AgentEvaluator helper:

```typescript
import { LocalEvalService } from "@iqai/adk";

const evalService = new LocalEvalService(agent);

// Run inference
const inferenceResults = [];
for await (const result of evalService.performInference({
  evalSetId: evalSet.evalSetId,
  evalCases: [evalSet],
})) {
  inferenceResults.push(result);
}

// Evaluate results
for await (const evalResult of evalService.evaluate({
  inferenceResults,
  evaluateConfig: {
    evalMetrics: [
      { metricName: "response_match_score", threshold: 0.8 },
    ],
  },
})) {
  console.log("Eval result:", evalResult);
}
```
## Migrating Old Test Data

Convert old-format test data to the new EvalSet schema:

```typescript
await AgentEvaluator.migrateEvalDataToNewSchema(
  "./old-tests.json",
  "./new-tests.json",
  "./initial-session.json" // Optional
);
```
## Best Practices

- Run multiple iterations - Run each test 2-3 times to account for LLM non-determinism.
- Use version control - Track test datasets and results in version control.
- Start with prebuilt metrics - Use the provided evaluators before building custom ones.
- Set realistic thresholds - Don't expect 100% scores on fuzzy metrics like response matching.
- Combine metrics - Use multiple metrics together for comprehensive evaluation.

Large test suites can consume significant API quota. Consider costs when running frequent evaluations.
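As a rough back-of-envelope, total agent invocations scale with conversation turns × runs. A tiny helper to sanity-check a suite before running it (this approximation is mine, not an ADK API, and it ignores extra judge-model calls made by LLM-as-judge metrics):

```typescript
// Estimate how many agent invocations an evaluation run will make:
// each run replays every turn of every EvalCase conversation.
function estimateInvocations(
  turnsPerCase: number[], // conversation length of each EvalCase
  numRuns: number
): number {
  const perRun = turnsPerCase.reduce((sum, n) => sum + n, 0);
  return perRun * numRuns;
}

// 10 single-turn cases + 5 three-turn cases, each run twice
const calls = estimateInvocations([...Array(10).fill(1), ...Array(5).fill(3)], 2);
console.log(calls); // → 50
```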
## See Also