Stagehand provides the V3Evaluator class for testing and evaluating agent task completion. It uses LLMs to analyze screenshots and agent reasoning to determine whether a task completed successfully.

Overview

The V3Evaluator enables:
  • Automated testing of agent workflows
  • Screenshot-based validation of visual outcomes
  • Reasoning analysis of agent decision-making
  • Batch evaluations for multiple criteria
  • Structured YES/NO responses with detailed reasoning
Location: packages/core/lib/v3Evaluator.ts

Creating an Evaluator

import { Stagehand } from "@browserbasehq/stagehand";
import { V3Evaluator } from "@browserbasehq/stagehand/lib/v3Evaluator";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();

// Create evaluator with default model (Gemini 2.5 Flash)
const evaluator = new V3Evaluator(stagehand);

// Or specify a custom model
const evaluator = new V3Evaluator(
  stagehand,
  "anthropic/claude-3-7-sonnet-latest",
  { apiKey: process.env.ANTHROPIC_API_KEY }
);

Basic Evaluation

Screenshot-Based Evaluation

const page = stagehand.context.pages()[0];
await page.goto("https://example.com");

// Perform some actions
await stagehand.act("click the contact button");
await stagehand.act("type John into the name field");

// Evaluate the result
const result = await evaluator.ask({
  question: "Was the contact form filled with the name 'John'?",
  screenshot: true,
});

console.log(result.evaluation); // "YES" or "NO"
console.log(result.reasoning);  // Detailed explanation

Text-Based Evaluation

const result = await evaluator.ask({
  question: "Did the agent successfully log in?",
  answer: "The agent clicked the login button and saw a welcome message",
  screenshot: false,
});

Combined Evaluation

const result = await evaluator.ask({
  question: "Was the order successfully placed?",
  answer: "Order confirmation page displayed",
  screenshot: true, // Both screenshot and text
});

Evaluation Schema

Location: packages/core/lib/v3Evaluator.ts:21-24
import { z } from "zod";

const EvaluationSchema = z.object({
  evaluation: z.enum(["YES", "NO"]),
  reasoning: z.string(),
});

type EvaluationResult = z.infer<typeof EvaluationSchema>;
// {
//   evaluation: "YES" | "NO";
//   reasoning: string;
// }
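Because the evaluation ultimately comes back from an LLM, it can be useful to defensively narrow the payload before trusting it in test assertions. The helper below is not part of Stagehand; it is a minimal sketch of a runtime type guard matching the `EvaluationResult` shape above (the Zod schema's `safeParse` would serve the same purpose if you already depend on Zod).

```typescript
// Mirrors the EvaluationResult shape from the schema above.
type EvaluationResult = { evaluation: "YES" | "NO"; reasoning: string };

// Hypothetical helper (not part of Stagehand): narrow an unknown payload
// to EvaluationResult before using it in assertions.
function isEvaluationResult(value: unknown): value is EvaluationResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    (v.evaluation === "YES" || v.evaluation === "NO") &&
    typeof v.reasoning === "string"
  );
}
```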

Advanced Features

Custom System Prompt

const result = await evaluator.ask({
  question: "Is the product price correctly displayed?",
  screenshot: true,
  systemPrompt: `You are an expert e-commerce QA tester.
    Verify that prices are displayed in USD format with 2 decimal places.
    Today's date is ${new Date().toLocaleDateString()}`,
});

Agent Reasoning Evaluation

Evaluate based on the agent’s internal reasoning and actions:
const agent = stagehand.agent({ /* ... */ });

const result = await agent.execute({
  instruction: "Find the cheapest product",
});

// Evaluate using agent's reasoning
const evaluation = await evaluator.ask({
  question: "Did the agent correctly identify the cheapest product?",
  agentReasoning: JSON.stringify(result.actions),
  screenshot: true,
});

Screenshot Delay

const result = await evaluator.ask({
  question: "Did the animation complete?",
  screenshot: true,
  screenshotDelayMs: 1000, // Wait 1s before screenshot
});

Multiple Screenshots

Evaluate task progression across multiple screenshots:
const screenshots: Buffer[] = [];

// Capture screenshots at different stages
await page.goto("https://example.com");
screenshots.push(await page.screenshot());

await stagehand.act("click the next button");
screenshots.push(await page.screenshot());

await stagehand.act("fill out the form");
screenshots.push(await page.screenshot());

// Evaluate all screenshots together
const result = await evaluator.ask({
  question: "Did the user successfully complete the multi-step form?",
  screenshot: screenshots,
});
Implementation:
private async _evaluateWithMultipleScreenshots(options: {
  question: string;
  screenshots: Buffer[];
  systemPrompt?: string;
  agentReasoning?: string;
}): Promise<EvaluationResult> {
  const { question, screenshots } = options;
  const systemPrompt = options.systemPrompt ||
    `You are an expert evaluator that returns YES or NO given multiple screenshots.
     Analyze ALL screenshots to understand the complete journey.
     Success criteria may appear at different points in the sequence.`;

  const imageContents = screenshots.map((s) => ({
    type: "image_url" as const,
    image_url: { url: `data:image/jpeg;base64,${s.toString("base64")}` },
  }));

  const llmClient = this.getClient();
  const response = await llmClient.createChatCompletion({
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `${question}\n\nI'm providing ${screenshots.length} screenshots showing the progression.`,
          },
          ...imageContents,
        ],
      },
    ],
    response_model: { name: "EvaluationResult", schema: EvaluationSchema },
  });
  // ...
}

Batch Evaluations

Evaluate multiple criteria in a single request:
const results = await evaluator.batchAsk({
  questions: [
    { question: "Is the product title visible?" },
    { question: "Is the price displayed correctly?" },
    { question: "Is the add to cart button enabled?" },
    { question: "Are product reviews shown?" },
  ],
  screenshot: true,
});

results.forEach((result, i) => {
  console.log(`Question ${i + 1}: ${result.evaluation}`);
  console.log(`Reasoning: ${result.reasoning}`);
});
With Custom Answers:
const results = await evaluator.batchAsk({
  questions: [
    {
      question: "Is the user logged in?",
      answer: "The header shows 'Welcome, John'",
    },
    {
      question: "Is the cart empty?",
      answer: "Cart shows 0 items",
    },
  ],
  screenshot: true,
});
Batch Schema:
const BatchEvaluationSchema = z.array(EvaluationSchema);

type BatchEvaluationResult = z.infer<typeof BatchEvaluationSchema>;
// Array<{ evaluation: "YES" | "NO"; reasoning: string }>
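When a batch run gates a test, it helps to reduce the array to a single pass/fail verdict while keeping the reasoning of every failed criterion for the test report. The helper below is a sketch, not a Stagehand API; it assumes only the `EvaluationResult` shape shown above.

```typescript
type EvaluationResult = { evaluation: "YES" | "NO"; reasoning: string };

// Hypothetical helper: collapse a batch result into one verdict,
// collecting the index and reasoning of every failed criterion.
function summarizeBatch(results: EvaluationResult[]) {
  const failures = results
    .map((r, index) => ({ index, ...r }))
    .filter((r) => r.evaluation === "NO");
  return { passed: failures.length === 0, failures };
}
```

A test can then assert `summary.passed` and log `summary.failures` on failure instead of looping over all questions by hand.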

LLM Client Integration

The evaluator uses Stagehand’s LLM provider system:
private getClient(): LLMClient {
  const provider = new LLMProvider(this.v3.logger);
  return provider.getClient(this.modelName, this.modelClientOptions);
}
Supported Models:
// Default
const evaluator = new V3Evaluator(stagehand);
// Uses: "google/gemini-2.5-flash"

// Anthropic
const evaluator = new V3Evaluator(
  stagehand,
  "anthropic/claude-3-7-sonnet-latest",
  { apiKey: process.env.ANTHROPIC_API_KEY }
);

// OpenAI
const evaluator = new V3Evaluator(
  stagehand,
  "openai/gpt-4o",
  { apiKey: process.env.OPENAI_API_KEY }
);

// Other providers
const evaluator = new V3Evaluator(
  stagehand,
  "groq/llama-3.3-70b-versatile",
  { apiKey: process.env.GROQ_API_KEY }
);

Evaluation Implementation

Core evaluation method:
async ask(options: EvaluateOptions): Promise<EvaluationResult> {
  const {
    question,
    answer,
    screenshot = true,
    systemPrompt,
    screenshotDelayMs = 250,
    agentReasoning,
  } = options;

  // Validation
  if (!question) throw new StagehandInvalidArgumentError("Question required");
  if (!answer && !screenshot) throw new StagehandInvalidArgumentError("Need answer or screenshot");

  // Handle multiple screenshots
  if (Array.isArray(screenshot)) {
    return this._evaluateWithMultipleScreenshots({
      question,
      screenshots: screenshot,
      systemPrompt,
      agentReasoning,
    });
  }

  const defaultSystemPrompt = `You are an expert evaluator that returns YES or NO.
    You have access to ${screenshot ? "a screenshot" : "the agent's reasoning"}.
    Today's date is ${new Date().toLocaleDateString()}`;

  // Capture screenshot if needed
  await new Promise((r) => setTimeout(r, screenshotDelayMs));
  let imageBuffer: Buffer | undefined;
  if (screenshot) {
    const page = await this.v3.context.awaitActivePage();
    imageBuffer = await page.screenshot({ fullPage: false });
  }

  // Get LLM client and make request
  const llmClient = this.getClient();
  const response = await llmClient.createChatCompletion({
    logger: this.silentLogger,
    options: {
      messages: [
        { role: "system", content: systemPrompt || defaultSystemPrompt },
        {
          role: "user",
          content: [
            {
              type: "text",
              text: agentReasoning
                ? `Question: ${question}\n\nAgent's reasoning:\n${agentReasoning}`
                : question,
            },
            ...(screenshot && imageBuffer ? [{
              type: "image_url" as const,
              image_url: {
                url: `data:image/jpeg;base64,${imageBuffer.toString("base64")}`,
              },
            }] : []),
            ...(answer ? [{ type: "text" as const, text: `the answer is ${answer}` }] : []),
          ],
        },
      ],
      response_model: { name: "EvaluationResult", schema: EvaluationSchema },
    },
  });

  const result = response.data as EvaluationResult;
  return { evaluation: result.evaluation, reasoning: result.reasoning };
}

Testing Workflows

E2E Test Example

import { test, expect } from "@playwright/test";
import { Stagehand } from "@browserbasehq/stagehand";
import { V3Evaluator } from "@browserbasehq/stagehand/lib/v3Evaluator";

test("checkout flow", async () => {
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  
  const evaluator = new V3Evaluator(stagehand);
  const page = stagehand.context.pages()[0];

  // Navigate and perform actions
  await page.goto("https://store.example.com");
  await stagehand.act("add the first product to cart");
  await stagehand.act("click the checkout button");
  await stagehand.act("fill the shipping form with test data");

  // Evaluate result
  const result = await evaluator.ask({
    question: "Is the order summary page displayed with correct items?",
    screenshot: true,
  });

  expect(result.evaluation).toBe("YES");
  console.log("Success reason:", result.reasoning);

  await stagehand.close();
});

Agent Task Evaluation

const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "anthropic/claude-sonnet-4-5",
    apiKey: process.env.ANTHROPIC_API_KEY,
  },
});

const page = stagehand.context.pages()[0];
await page.goto("https://example.com");

const result = await agent.execute({
  instruction: "Find the contact page and fill out the form",
  maxSteps: 15,
});

// Evaluate the agent's work
const evaluator = new V3Evaluator(stagehand);
const evaluation = await evaluator.ask({
  question: "Did the agent successfully submit the contact form?",
  screenshot: true,
  agentReasoning: JSON.stringify(result.actions, null, 2),
});

if (evaluation.evaluation === "YES") {
  console.log("✓ Task completed successfully");
  console.log("Reasoning:", evaluation.reasoning);
} else {
  console.error("✗ Task failed");
  console.error("Reason:", evaluation.reasoning);
}

Best Practices

  1. Specific questions: Ask clear, verifiable questions
  2. Use screenshots: Visual validation is more reliable than text-only
  3. Custom system prompts: Provide domain-specific context
  4. Batch when possible: More efficient for multiple criteria
  5. Include agent reasoning: Helps understand agent decision-making
  6. Set appropriate delays: Allow UI updates before screenshots
  7. Multiple screenshots: Capture progression for complex flows
  8. Choose appropriate models: Faster models for simple checks, more capable models for complex flows
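Several of these practices (specific questions, surfacing reasoning on failure) can be bundled into a small test helper. The function below is a hypothetical sketch, not part of Stagehand: it turns an `EvaluationResult` into a hard assertion whose error message carries the model's reasoning.

```typescript
type EvaluationResult = { evaluation: "YES" | "NO"; reasoning: string };

// Hypothetical test helper: fail loudly on a NO evaluation,
// including the evaluator's reasoning in the error message.
function expectYes(result: EvaluationResult, label: string): void {
  if (result.evaluation !== "YES") {
    throw new Error(`Evaluation "${label}" failed: ${result.reasoning}`);
  }
}
```

In a test you would call it as `expectYes(await evaluator.ask({ question, screenshot: true }), "checkout summary")`.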

Error Handling

try {
  const result = await evaluator.ask({
    question: "Is the page loaded?",
    screenshot: true,
  });
  
  if (result.evaluation === "YES") {
    // Success
  } else {
    console.log("Failed:", result.reasoning);
  }
} catch (error) {
  console.error("Evaluation error:", error);
  // Evaluation may return "INVALID" on LLM errors
}
Invalid response handling:
try {
  const result = response.data as EvaluationResult;
  return { evaluation: result.evaluation, reasoning: result.reasoning };
} catch (error) {
  return {
    evaluation: "INVALID" as const,
    reasoning: `Failed to get structured response: ${error.message}`,
  };
}

Performance Considerations

  1. Model selection: Use fast models (Gemini Flash, GPT-4o-mini) for simple evaluations
  2. Screenshot size: Smaller screenshots = fewer tokens
  3. Batch evaluations: Single request for multiple criteria
  4. Cache evaluations: Store results for repeated tests
  5. Screenshot delay: Balance accuracy vs. speed

References

  • V3Evaluator: packages/core/lib/v3Evaluator.ts
  • LLM Provider: packages/core/lib/v3/llm/LLMProvider.ts
  • Evaluation Types: packages/core/lib/v3/types/private/evaluator.ts
