The ADK evaluation system helps you measure and improve agent performance through structured testing, prebuilt metrics, and custom evaluators.

Overview

The evaluation system provides:
  • Evaluation datasets - Structured test cases with expected outputs
  • Prebuilt metrics - Ready-to-use evaluators for common tasks
  • Custom evaluators - Build domain-specific evaluation logic
  • Batch evaluation - Run tests across multiple cases
  • Detailed reports - Per-invocation and aggregate results

Quick Start

import { AgentEvaluator, PrebuiltMetrics } from "@iqai/adk";

const agent = new AgentBuilder()
  .withModel("gpt-4")
  .withTools([searchTool])
  .buildLlm();

// Define evaluation criteria
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8,
};

// Run evaluation
await AgentEvaluator.evaluate(
  agent,
  "./test-data.json", // Path to test file or directory
  2 // Number of runs per test case
);

Evaluation Data Format

EvalSet

An EvalSet is a collection of test cases:
interface EvalSet {
  evalSetId: string;
  name?: string;
  description?: string;
  evalCases: EvalCase[];
  creationTimestamp: number;
}

EvalCase

Each test case contains a conversation with expected behavior:
interface EvalCase {
  evalId: string;
  conversation: Invocation[];
  sessionInput?: SessionInput; // Initial session state
}

Invocation

An invocation represents a single user message and expected response:
interface Invocation {
  invocationId?: string;
  userContent: Content; // User message
  finalResponse?: Content; // Expected response
  intermediateData?: IntermediateData; // Expected tool calls
  creationTimestamp: number;
}

Example Dataset

{
  "evalSetId": "search-agent-v1",
  "name": "Search Agent Tests",
  "evalCases": [
    {
      "evalId": "test-1",
      "conversation": [
        {
          "invocationId": "inv-1",
          "userContent": {
            "role": "user",
            "parts": [{ "text": "Search for the latest AI news" }]
          },
          "finalResponse": {
            "role": "model",
            "parts": [{ "text": "Here are the latest AI developments..." }]
          },
          "intermediateData": {
            "toolUses": [
              {
                "name": "search",
                "args": { "query": "latest AI news" }
              }
            ],
            "intermediateResponses": []
          },
          "creationTimestamp": 1234567890
        }
      ]
    }
  ],
  "creationTimestamp": 1234567890
}

Prebuilt Metrics

ADK includes several ready-to-use evaluation metrics:

TOOL_TRAJECTORY_AVG_SCORE

Evaluates if the agent used the expected tools:
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0, // Perfect match required
};
Measures: Precision and recall of tool calls (function names only)
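To make this concrete, here is an illustrative sketch of name-level precision and recall over tool calls. This is not the library's internal implementation, just a minimal model of what the metric measures:

```typescript
// Illustrative sketch only: name-level precision/recall over tool calls.
// The actual TOOL_TRAJECTORY_AVG_SCORE implementation may differ.
function toolNamePrecisionRecall(
  actual: string[],
  expected: string[]
): { precision: number; recall: number } {
  const expectedSet = new Set(expected);
  const actualSet = new Set(actual);
  const matched = actual.filter((name) => expectedSet.has(name)).length;
  const recalled = expected.filter((name) => actualSet.has(name)).length;
  return {
    precision: actual.length ? matched / actual.length : 1,
    recall: expected.length ? recalled / expected.length : 1,
  };
}

// Agent called an extra tool: precision drops, recall stays perfect.
const { precision, recall } = toolNamePrecisionRecall(
  ["search", "summarize"],
  ["search"]
);
// precision === 0.5, recall === 1
```

A threshold of 1.0 on this metric therefore means the agent must call exactly the expected tools, no more and no fewer.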

RESPONSE_MATCH_SCORE

Compares agent response to reference using ROUGE metrics:
const criteria = {
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8, // 80% similarity
};
Measures: Text similarity using ROUGE-1, ROUGE-2, and ROUGE-L
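As a rough illustration of what ROUGE-1 captures (the library's scorer combines ROUGE-1, ROUGE-2, and ROUGE-L and is more complete), unigram recall against the reference can be sketched as:

```typescript
// Illustrative only: ROUGE-1 recall (unigram overlap / reference length).
// RESPONSE_MATCH_SCORE uses a fuller ROUGE implementation internally.
function rouge1Recall(candidate: string, reference: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const refTokens = tokenize(reference);
  const candCounts = new Map<string, number>();
  for (const t of tokenize(candidate)) {
    candCounts.set(t, (candCounts.get(t) ?? 0) + 1);
  }
  let overlap = 0;
  for (const t of refTokens) {
    const n = candCounts.get(t) ?? 0;
    if (n > 0) {
      overlap++;
      candCounts.set(t, n - 1);
    }
  }
  return refTokens.length ? overlap / refTokens.length : 0;
}

// 3 of the 4 reference words appear in the candidate.
rouge1Recall("here are the news", "here are latest news"); // → 0.75
```

This is why a threshold of 0.8 tolerates some wording differences while still rejecting responses that diverge substantially from the reference.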

FINAL_RESPONSE_MATCH_V2

Advanced response matching with configurable scoring:
const criteria = {
  [PrebuiltMetrics.FINAL_RESPONSE_MATCH_V2]: 0.75,
};
Measures: Combines multiple similarity metrics with customizable weights

SAFETY_V1

Evaluates response safety using LLM-as-judge:
const criteria = {
  [PrebuiltMetrics.SAFETY_V1]: 1.0, // All responses must be safe
};
Measures: Content safety across multiple dimensions (harm, bias, etc.)
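Conceptually, an LLM-as-judge metric maps the judge model's verdict onto a numeric score, optionally averaging over several samples. The sketch below is illustrative only; SAFETY_V1's actual prompt and verdict parsing are internal to the library:

```typescript
// Illustrative only: mapping a judge verdict to a binary safety score.
// SAFETY_V1's real prompting and parsing may differ.
function verdictToScore(verdict: string): number {
  return /\bunsafe\b/i.test(verdict) ? 0 : 1;
}

// With numSamples > 1, individual verdicts can be averaged for consistency.
function averageSafetyScore(verdicts: string[]): number {
  const sum = verdicts.reduce((acc, v) => acc + verdictToScore(v), 0);
  return verdicts.length ? sum / verdicts.length : 0;
}
```

A threshold of 1.0 then means every sampled judgment must deem the response safe.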

AgentEvaluator

The main interface for running evaluations:

evaluate()

Run evaluation on test files:
static async evaluate(
  agent: BaseAgent,
  evalDatasetFilePathOrDir: string,
  numRuns: number = 2,
  initialSessionFile?: string
): Promise<void>
Example:
await AgentEvaluator.evaluate(
  agent,
  "./tests", // Finds all *.test.json files
  3, // Run each test 3 times
  "./initial-session.json" // Optional initial state
);

evaluateEvalSet()

Evaluate a specific eval set:
static async evaluateEvalSet(
  agent: BaseAgent,
  evalSet: EvalSet,
  criteria: Record<string, number>,
  numRuns: number = 2,
  printDetailedResults: boolean = false
): Promise<void>
Example:
const evalSet: EvalSet = {
  evalSetId: "my-tests",
  evalCases: [...],
  creationTimestamp: Date.now(),
};

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  {
    [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
    [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.75,
  },
  2, // Number of runs
  true // Print detailed results
);

Test Configuration

Create a test_config.json file alongside your test data:
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8,
    "safety_v1": 1.0
  }
}
The evaluator automatically discovers and applies this config to test files in the same directory.
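For example, a test directory might be laid out like this (filenames are hypothetical, assuming the *.test.json discovery described under evaluate()):

```
tests/
├── test_config.json      # criteria applied to tests in this directory
├── search.test.json      # EvalSet file
└── booking.test.json     # EvalSet file
```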

Custom Evaluators

Build custom evaluation logic by extending the Evaluator class:

Step 1: Define your metric

import { Evaluator, EvalStatus } from "@iqai/adk";
import type { Invocation, EvaluationResult, MetricInfo } from "@iqai/adk";

export class CustomEvaluator extends Evaluator {
  constructor(threshold: number = 0.8) {
    super({
      metricName: "custom_metric",
      threshold,
    });
  }
}

Step 2: Implement evaluation logic

async evaluateInvocations(
  actualInvocations: Invocation[],
  expectedInvocations: Invocation[]
): Promise<EvaluationResult> {
  const perInvocationResults = [];
  
  for (let i = 0; i < actualInvocations.length; i++) {
    const actual = actualInvocations[i];
    const expected = expectedInvocations[i];
    
    // Your evaluation logic here
    const score = this.calculateScore(actual, expected);
    const passed = score >= this.metric.threshold;
    
    perInvocationResults.push({
      actualInvocation: actual,
      expectedInvocation: expected,
      score,
      evalStatus: passed ? EvalStatus.PASSED : EvalStatus.FAILED,
    });
  }
  
  const overallScore = perInvocationResults.reduce(
    (sum, r) => sum + (r.score || 0), 0
  ) / perInvocationResults.length;
  
  return {
    overallScore,
    overallEvalStatus: overallScore >= this.metric.threshold 
      ? EvalStatus.PASSED 
      : EvalStatus.FAILED,
    perInvocationResults,
  };
}

Step 3: Add metric metadata

static getMetricInfo(): MetricInfo {
  return {
    metricName: "custom_metric",
    description: "Evaluates custom business logic",
    defaultThreshold: 0.8,
    experimental: false,
    metricValueInfo: {
      interval: {
        minValue: 0,
        openAtMin: false,
        maxValue: 1,
        openAtMax: false,
      },
    },
  };
}

Example: Intent Matching Evaluator

import { Evaluator, EvalStatus } from "@iqai/adk";
import type { Invocation, EvaluationResult } from "@iqai/adk";

export class IntentMatchEvaluator extends Evaluator {
  constructor(threshold: number = 1.0) {
    super({ metricName: "intent_match", threshold });
  }

  async evaluateInvocations(
    actualInvocations: Invocation[],
    expectedInvocations: Invocation[]
  ): Promise<EvaluationResult> {
    const results = [];

    for (let i = 0; i < actualInvocations.length; i++) {
      const actual = actualInvocations[i];
      const expected = expectedInvocations[i];

      // Extract intents (simplified)
      const actualIntent = this.extractIntent(actual);
      const expectedIntent = this.extractIntent(expected);

      const score = actualIntent === expectedIntent ? 1.0 : 0.0;

      results.push({
        actualInvocation: actual,
        expectedInvocation: expected,
        score,
        evalStatus: score >= this.metric.threshold 
          ? EvalStatus.PASSED 
          : EvalStatus.FAILED,
      });
    }

    const avgScore = results.reduce((sum, r) => sum + r.score!, 0) / results.length;

    return {
      overallScore: avgScore,
      overallEvalStatus: avgScore >= this.metric.threshold
        ? EvalStatus.PASSED
        : EvalStatus.FAILED,
      perInvocationResults: results,
    };
  }

  private extractIntent(invocation: Invocation): string {
    // Implement your intent extraction logic
    const text = invocation.userContent.parts?.[0]?.text || "";
    // Simple keyword-based intent detection
    if (text.includes("search")) return "search";
    if (text.includes("book")) return "booking";
    return "unknown";
  }

  static getMetricInfo() {
    return {
      metricName: "intent_match",
      description: "Matches user intent against expected intent",
      defaultThreshold: 1.0,
    };
  }
}

Prebuilt Evaluators

ADK provides several evaluator implementations:

TrajectoryEvaluator

Compares tool usage between actual and expected:
import { TrajectoryEvaluator } from "@iqai/adk";

const evaluator = new TrajectoryEvaluator(1.0); // Threshold
const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);

RougeEvaluator

Compares text using ROUGE metrics:
import { RougeEvaluator } from "@iqai/adk";

const evaluator = new RougeEvaluator(0.8); // 80% similarity
const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);

FinalResponseMatchV2Evaluator

Advanced response matching:
import { FinalResponseMatchV2Evaluator } from "@iqai/adk";

const evaluator = new FinalResponseMatchV2Evaluator(0.75);
const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);

SafetyEvaluatorV1

LLM-based safety evaluation:
import { SafetyEvaluatorV1 } from "@iqai/adk";

const evaluator = new SafetyEvaluatorV1(1.0, {
  judgeModel: "gpt-4",
  numSamples: 1,
});
const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);

Evaluation Results

EvalResult

Results include per-case and aggregate metrics:
interface EvalCaseResult {
  evalId: string;
  evalMetricResultPerInvocation: EvalMetricResultPerInvocation[];
}

interface EvalMetricResultPerInvocation {
  actualInvocation: Invocation;
  expectedInvocation: Invocation;
  evalMetricResults: EvalMetricResult[];
}

interface EvalMetricResult {
  metricName: string;
  threshold: number;
  score?: number;
  evalStatus: EvalStatus;
}
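Given these shapes, a pass rate per metric can be summarized from the flattened metric results. This is a hedged sketch using minimal stand-in types; in real code, import them from "@iqai/adk":

```typescript
// Sketch: summarize pass rate per metric from evaluation results.
// Stand-in types mirror the interfaces above.
enum EvalStatus {
  PASSED = 1,
  FAILED = 2,
}

interface EvalMetricResult {
  metricName: string;
  threshold: number;
  score?: number;
  evalStatus: EvalStatus;
}

function passRateByMetric(results: EvalMetricResult[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = totals.get(r.metricName) ?? { passed: 0, total: 0 };
    entry.total++;
    if (r.evalStatus === EvalStatus.PASSED) entry.passed++;
    totals.set(r.metricName, entry);
  }
  const rates = new Map<string, number>();
  for (const [name, { passed, total }] of totals) {
    rates.set(name, passed / total);
  }
  return rates;
}

const rates = passRateByMetric([
  { metricName: "response_match_score", threshold: 0.8, score: 0.9, evalStatus: EvalStatus.PASSED },
  { metricName: "response_match_score", threshold: 0.8, score: 0.6, evalStatus: EvalStatus.FAILED },
]);
// rates.get("response_match_score") === 0.5
```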

Reading Results

try {
  await AgentEvaluator.evaluate(agent, "./tests", 2);
  console.log("All tests passed!");
} catch (error) {
  console.error("Test failures:", error.message);
  // Re-run with detailed output
  await AgentEvaluator.evaluateEvalSet(
    agent,
    evalSet,
    criteria,
    1,
    true // Print detailed results
  );
}

Advanced Usage

Custom Judge Models

Use different models for LLM-as-judge metrics:
const criteria = {
  [PrebuiltMetrics.SAFETY_V1]: 1.0,
};

const evalMetrics = [
  {
    metricName: PrebuiltMetrics.SAFETY_V1,
    threshold: 1.0,
    judgeModelOptions: {
      judgeModel: "claude-3-opus",
      judgeModelConfig: {
        temperature: 0.0,
      },
      numSamples: 3, // Sample multiple times for consistency
    },
  },
];

Programmatic Evaluation

Run evaluation without the AgentEvaluator helper:
import { LocalEvalService } from "@iqai/adk";

const evalService = new LocalEvalService(agent);

// Run inference
const inferenceResults = [];
for await (const result of evalService.performInference({
  evalSetId: evalSet.evalSetId,
  evalCases: evalSet.evalCases,
})) {
  inferenceResults.push(result);
}

// Evaluate results
for await (const evalResult of evalService.evaluate({
  inferenceResults,
  evaluateConfig: {
    evalMetrics: [
      { metricName: "response_match_score", threshold: 0.8 },
    ],
  },
})) {
  console.log("Eval result:", evalResult);
}

Migrating Old Test Data

Convert old format test data to the new EvalSet schema:
await AgentEvaluator.migrateEvalDataToNewSchema(
  "./old-tests.json",
  "./new-tests.json",
  "./initial-session.json" // Optional
);

Best Practices

  • Run multiple iterations - Run each test 2-3 times to account for LLM non-determinism.
  • Use version control - Track test datasets and results in version control.
  • Start with prebuilt metrics - Use provided evaluators before building custom ones.
  • Set realistic thresholds - Don’t expect 100% scores on fuzzy metrics like response matching.
  • Combine metrics - Use multiple metrics together for comprehensive evaluation.

Note: Large test suites can consume significant API quota. Consider costs when running frequent evaluations.
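A quick back-of-envelope for quota, with purely illustrative numbers:

```typescript
// Back-of-envelope quota estimate (illustrative numbers only).
// Each run replays every invocation; LLM-as-judge metrics add calls on top.
const numCases = 50;
const numRuns = 3;
const invocationsPerCase = 2;
const judgeSamplesPerInvocation = 1; // e.g. SAFETY_V1 with numSamples: 1

const agentCalls = numCases * numRuns * invocationsPerCase; // 300
const judgeCalls = agentCalls * judgeSamplesPerInvocation;  // 300
console.log(`~${agentCalls + judgeCalls} model calls per evaluation run`);
```

Doubling numRuns or adding a second judge-based metric scales the total linearly, so trim runs and metrics on large suites where cost matters.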
