The ADK evaluation system helps you measure and improve agent performance through structured testing, prebuilt metrics, and custom evaluators.
## Overview
The evaluation system provides:
- Evaluation datasets - Structured test cases with expected outputs
- Prebuilt metrics - Ready-to-use evaluators for common tasks
- Custom evaluators - Build domain-specific evaluation logic
- Batch evaluation - Run tests across multiple cases
- Detailed reports - Per-invocation and aggregate results
## Quick Start

```typescript
import { AgentBuilder, AgentEvaluator, PrebuiltMetrics } from "@iqai/adk";

const agent = new AgentBuilder()
  .withModel("gpt-4")
  .withTools([searchTool]) // searchTool defined elsewhere
  .buildLlm();

// Define evaluation criteria
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8,
};

// Run evaluation
await AgentEvaluator.evaluate(
  agent,
  "./test-data.json", // Path to test file or directory
  2 // Number of runs per test case
);
```
## EvalSet

An `EvalSet` is a collection of test cases:

```typescript
interface EvalSet {
  evalSetId: string;
  name?: string;
  description?: string;
  evalCases: EvalCase[];
  creationTimestamp: number;
}
```
## EvalCase

Each test case contains a conversation with expected behavior:

```typescript
interface EvalCase {
  evalId: string;
  conversation: Invocation[];
  sessionInput?: SessionInput; // Initial session state
}
```
## Invocation

An invocation represents a single user message and its expected response:

```typescript
interface Invocation {
  invocationId?: string;
  userContent: Content; // User message
  finalResponse?: Content; // Expected response
  intermediateData?: IntermediateData; // Expected tool calls
  creationTimestamp: number;
}
```
## Example Dataset

```json
{
  "evalSetId": "search-agent-v1",
  "name": "Search Agent Tests",
  "evalCases": [
    {
      "evalId": "test-1",
      "conversation": [
        {
          "invocationId": "inv-1",
          "userContent": {
            "role": "user",
            "parts": [{ "text": "Search for the latest AI news" }]
          },
          "finalResponse": {
            "role": "model",
            "parts": [{ "text": "Here are the latest AI developments..." }]
          },
          "intermediateData": {
            "toolUses": [
              {
                "name": "search",
                "args": { "query": "latest AI news" }
              }
            ],
            "intermediateResponses": []
          },
          "creationTimestamp": 1234567890
        }
      ]
    }
  ],
  "creationTimestamp": 1234567890
}
```
## Prebuilt Metrics

ADK includes several ready-to-use evaluation metrics:

### TOOL_TRAJECTORY_AVG_SCORE

Evaluates whether the agent used the expected tools:

```typescript
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0, // Perfect match required
};
```

Measures: Precision and recall of tool calls (function names only)
### RESPONSE_MATCH_SCORE

Compares the agent response to a reference using ROUGE metrics:

```typescript
const criteria = {
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8, // 80% similarity
};
```

Measures: Text similarity using ROUGE-1, ROUGE-2, and ROUGE-L
### FINAL_RESPONSE_MATCH_V2

Advanced response matching with configurable scoring:

```typescript
const criteria = {
  [PrebuiltMetrics.FINAL_RESPONSE_MATCH_V2]: 0.75,
};
```

Measures: Combines multiple similarity metrics with customizable weights
### SAFETY_V1

Evaluates response safety using LLM-as-judge:

```typescript
const criteria = {
  [PrebuiltMetrics.SAFETY_V1]: 1.0, // All responses must be safe
};
```

Measures: Content safety across multiple dimensions (harm, bias, etc.)
## AgentEvaluator

The main interface for running evaluations.

### evaluate()

Runs evaluation on test files:

```typescript
static async evaluate(
  agent: BaseAgent,
  evalDatasetFilePathOrDir: string,
  numRuns: number = 2,
  initialSessionFile?: string
): Promise<void>
```

Example:

```typescript
await AgentEvaluator.evaluate(
  agent,
  "./tests", // Finds all *.test.json files
  3, // Run each test 3 times
  "./initial-session.json" // Optional initial state
);
```
### evaluateEvalSet()

Evaluates a specific eval set:

```typescript
static async evaluateEvalSet(
  agent: BaseAgent,
  evalSet: EvalSet,
  criteria: Record<string, number>,
  numRuns: number = 2,
  printDetailedResults: boolean = false
): Promise<void>
```

Example:

```typescript
const evalSet: EvalSet = {
  evalSetId: "my-tests",
  evalCases: [...],
  creationTimestamp: Date.now(),
};

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  {
    [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
    [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.75,
  },
  2, // Number of runs
  true // Print detailed results
);
```
## Test Configuration

Create a `test_config.json` file alongside your test data:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8,
    "safety_v1": 1.0
  }
}
```

The evaluator automatically finds and uses this config.
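For example, a test directory might be laid out like this (file names are illustrative; `evaluate()` picks up `*.test.json` files and the adjacent `test_config.json`):

```
tests/
├── test_config.json     # Shared criteria for all eval sets in this directory
├── search.test.json     # One EvalSet per file
└── booking.test.json
```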
## Custom Evaluators

Build custom evaluation logic by extending the `Evaluator` class.

### Define your metric

```typescript
import { Evaluator, EvalStatus } from "@iqai/adk";
import type { EvaluationResult, Invocation, MetricInfo } from "@iqai/adk";

export class CustomEvaluator extends Evaluator {
  constructor(threshold: number = 0.8) {
    super({
      metricName: "custom_metric",
      threshold,
    });
  }
}
```
### Implement evaluation logic

```typescript
async evaluateInvocations(
  actualInvocations: Invocation[],
  expectedInvocations: Invocation[]
): Promise<EvaluationResult> {
  const perInvocationResults = [];

  for (let i = 0; i < actualInvocations.length; i++) {
    const actual = actualInvocations[i];
    const expected = expectedInvocations[i];

    // Your evaluation logic here
    const score = this.calculateScore(actual, expected);
    const passed = score >= this.metric.threshold;

    perInvocationResults.push({
      actualInvocation: actual,
      expectedInvocation: expected,
      score,
      evalStatus: passed ? EvalStatus.PASSED : EvalStatus.FAILED,
    });
  }

  const overallScore =
    perInvocationResults.reduce((sum, r) => sum + (r.score || 0), 0) /
    perInvocationResults.length;

  return {
    overallScore,
    overallEvalStatus:
      overallScore >= this.metric.threshold ? EvalStatus.PASSED : EvalStatus.FAILED,
    perInvocationResults,
  };
}
```
### Add metric metadata

```typescript
static getMetricInfo(): MetricInfo {
  return {
    metricName: "custom_metric",
    description: "Evaluates custom business logic",
    defaultThreshold: 0.8,
    experimental: false,
    metricValueInfo: {
      interval: {
        minValue: 0,
        openAtMin: false,
        maxValue: 1,
        openAtMax: false,
      },
    },
  };
}
```
### Example: Intent Matching Evaluator

```typescript
import { Evaluator, EvalStatus } from "@iqai/adk";
import type { Invocation, EvaluationResult } from "@iqai/adk";

export class IntentMatchEvaluator extends Evaluator {
  constructor(threshold: number = 1.0) {
    super({ metricName: "intent_match", threshold });
  }

  async evaluateInvocations(
    actualInvocations: Invocation[],
    expectedInvocations: Invocation[]
  ): Promise<EvaluationResult> {
    const results = [];

    for (let i = 0; i < actualInvocations.length; i++) {
      const actual = actualInvocations[i];
      const expected = expectedInvocations[i];

      // Extract intents (simplified)
      const actualIntent = this.extractIntent(actual);
      const expectedIntent = this.extractIntent(expected);
      const score = actualIntent === expectedIntent ? 1.0 : 0.0;

      results.push({
        actualInvocation: actual,
        expectedInvocation: expected,
        score,
        evalStatus: score >= this.metric.threshold
          ? EvalStatus.PASSED
          : EvalStatus.FAILED,
      });
    }

    const avgScore = results.reduce((sum, r) => sum + r.score!, 0) / results.length;

    return {
      overallScore: avgScore,
      overallEvalStatus: avgScore >= this.metric.threshold
        ? EvalStatus.PASSED
        : EvalStatus.FAILED,
      perInvocationResults: results,
    };
  }

  private extractIntent(invocation: Invocation): string {
    // Implement your intent extraction logic
    const text = invocation.userContent.parts?.[0]?.text || "";

    // Simple keyword-based intent detection
    if (text.includes("search")) return "search";
    if (text.includes("book")) return "booking";
    return "unknown";
  }

  static getMetricInfo() {
    return {
      metricName: "intent_match",
      description: "Matches user intent against expected intent",
      defaultThreshold: 1.0,
    };
  }
}
```
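Stripped of the ADK types, the scoring at the heart of this evaluator is plain string matching plus an average. A minimal standalone sketch (the keyword rules are illustrative, not part of ADK):

```typescript
// Simple keyword-based intent detection, mirroring extractIntent above
function extractIntent(text: string): string {
  if (text.includes("search")) return "search";
  if (text.includes("book")) return "booking";
  return "unknown";
}

// Score each actual/expected pair 1.0 on an exact intent match,
// 0.0 otherwise, then average across all pairs
function intentMatchScore(actual: string[], expected: string[]): number {
  const scores = actual.map((text, i) =>
    extractIntent(text) === extractIntent(expected[i]) ? 1.0 : 0.0
  );
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

const score = intentMatchScore(
  ["search for AI news", "book a flight"],
  ["please search the web", "reserve a hotel"]
);
console.log(score); // → 0.5 (first pair matches, second does not)
```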
## Prebuilt Evaluators

ADK provides several evaluator implementations.

### TrajectoryEvaluator

Compares tool usage between actual and expected invocations:

```typescript
import { TrajectoryEvaluator } from "@iqai/adk";

const evaluator = new TrajectoryEvaluator(1.0); // Threshold

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
### RougeEvaluator

Compares text using ROUGE metrics:

```typescript
import { RougeEvaluator } from "@iqai/adk";

const evaluator = new RougeEvaluator(0.8); // 80% similarity

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
### FinalResponseMatchV2Evaluator

Advanced response matching:

```typescript
import { FinalResponseMatchV2Evaluator } from "@iqai/adk";

const evaluator = new FinalResponseMatchV2Evaluator(0.75);

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
### SafetyEvaluatorV1

LLM-based safety evaluation:

```typescript
import { SafetyEvaluatorV1 } from "@iqai/adk";

const evaluator = new SafetyEvaluatorV1(1.0, {
  judgeModel: "gpt-4",
  numSamples: 1,
});

const result = await evaluator.evaluateInvocations(
  actualInvocations,
  expectedInvocations
);
```
## Evaluation Results

### EvalResult

Results include per-case and aggregate metrics:

```typescript
interface EvalCaseResult {
  evalId: string;
  evalMetricResultPerInvocation: EvalMetricResultPerInvocation[];
}

interface EvalMetricResultPerInvocation {
  actualInvocation: Invocation;
  expectedInvocation: Invocation;
  evalMetricResults: EvalMetricResult[];
}

interface EvalMetricResult {
  metricName: string;
  threshold: number;
  score?: number;
  evalStatus: EvalStatus;
}
```
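These result types can be walked to summarize failures. A standalone sketch over simplified local mirrors of the types (the data is mock, not real evaluator output):

```typescript
// Simplified local mirrors of the result types above
type EvalStatus = "PASSED" | "FAILED";

interface MetricResult {
  metricName: string;
  threshold: number;
  score?: number;
  evalStatus: EvalStatus;
}

// Collect a "metric (score < threshold)" string for every failed metric
// across all per-invocation results
function listFailures(perInvocation: MetricResult[][]): string[] {
  return perInvocation.flatMap((metrics) =>
    metrics
      .filter((m) => m.evalStatus === "FAILED")
      .map((m) => `${m.metricName} (${m.score ?? "n/a"} < ${m.threshold})`)
  );
}

const failures = listFailures([
  [
    { metricName: "response_match_score", threshold: 0.8, score: 0.55, evalStatus: "FAILED" },
    { metricName: "tool_trajectory_avg_score", threshold: 1.0, score: 1.0, evalStatus: "PASSED" },
  ],
]);
console.log(failures); // → ["response_match_score (0.55 < 0.8)"]
```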
### Reading Results

`evaluate()` throws when any criterion fails, so failures surface as exceptions:

```typescript
try {
  await AgentEvaluator.evaluate(agent, "./tests", 2);
  console.log("All tests passed!");
} catch (error) {
  console.error("Test failures:", (error as Error).message);

  // Re-run with detailed output (evalSet and criteria defined as above)
  await AgentEvaluator.evaluateEvalSet(
    agent,
    evalSet,
    criteria,
    1,
    true // Print detailed results
  );
}
```
## Advanced Usage

### Custom Judge Models

Use different models for LLM-as-judge metrics:

```typescript
// Instead of a plain criteria map...
const criteria = {
  [PrebuiltMetrics.SAFETY_V1]: 1.0,
};

// ...pass full metric configs with judge model options
const evalMetrics = [
  {
    metricName: PrebuiltMetrics.SAFETY_V1,
    threshold: 1.0,
    judgeModelOptions: {
      judgeModel: "claude-3-opus",
      judgeModelConfig: {
        temperature: 0.0,
      },
      numSamples: 3, // Sample multiple times for consistency
    },
  },
];
```
### Programmatic Evaluation

Run evaluation without the AgentEvaluator helper:

```typescript
import { LocalEvalService } from "@iqai/adk";

const evalService = new LocalEvalService(agent);

// Run inference
const inferenceResults = [];
for await (const result of evalService.performInference({
  evalSetId: evalSet.evalSetId,
  evalCases: [evalSet],
})) {
  inferenceResults.push(result);
}

// Evaluate results
for await (const evalResult of evalService.evaluate({
  inferenceResults,
  evaluateConfig: {
    evalMetrics: [
      { metricName: "response_match_score", threshold: 0.8 },
    ],
  },
})) {
  console.log("Eval result:", evalResult);
}
```
## Migrating Old Test Data

Convert old-format test data to the new EvalSet schema:

```typescript
await AgentEvaluator.migrateEvalDataToNewSchema(
  "./old-tests.json",
  "./new-tests.json",
  "./initial-session.json" // Optional
);
```
## Best Practices

- Run multiple iterations - Run each test 2-3 times to account for LLM non-determinism.
- Use version control - Track test datasets and results in version control.
- Start with prebuilt metrics - Use the provided evaluators before building custom ones.
- Set realistic thresholds - Don't expect 100% scores on fuzzy metrics like response matching.
- Combine metrics - Use multiple metrics together for comprehensive evaluation.

Large test suites can consume significant API quota. Consider costs when running frequent evaluations.
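As a rough back-of-envelope, total agent invocations scale with conversation turns × runs. A tiny helper to sanity-check a suite before running it (this approximation is mine, not an ADK API, and it ignores extra judge-model calls made by LLM-as-judge metrics):

```typescript
// Estimate how many agent invocations an evaluation run will make:
// each run replays every turn of every EvalCase conversation.
function estimateInvocations(
  turnsPerCase: number[], // conversation length of each EvalCase
  numRuns: number
): number {
  const perRun = turnsPerCase.reduce((sum, n) => sum + n, 0);
  return perRun * numRuns;
}

// 10 single-turn cases + 5 three-turn cases, each run twice
const calls = estimateInvocations([...Array(10).fill(1), ...Array(5).fill(3)], 2);
console.log(calls); // → 50
```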
## See Also