Stagehand provides the V3Evaluator class for testing and evaluating agent task completion. It uses LLMs to analyze screenshots and agent reasoning to determine whether a task completed successfully.

Overview

The V3Evaluator enables:
  • Automated testing of agent workflows
  • Screenshot-based validation of visual outcomes
  • Reasoning analysis of agent decision-making
  • Batch evaluations for multiple criteria
  • Structured YES/NO responses with detailed reasoning
Location: packages/core/lib/v3Evaluator.ts

Creating an Evaluator

import { Stagehand } from "@browserbasehq/stagehand";
import { V3Evaluator } from "@browserbasehq/stagehand/lib/v3Evaluator";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();

// Create evaluator with default model (Gemini 2.5 Flash)
const evaluator = new V3Evaluator(stagehand);

// Or specify a custom model
const evaluator = new V3Evaluator(
  stagehand,
  "anthropic/claude-3-7-sonnet-latest",
  { apiKey: process.env.ANTHROPIC_API_KEY }
);

Basic Evaluation

Screenshot-Based Evaluation

const page = stagehand.context.pages()[0];
await page.goto("https://example.com");

// Perform some actions
await stagehand.act("click the contact button");
await stagehand.act("type John into the name field");

// Evaluate the result
const result = await evaluator.ask({
  question: "Was the contact form filled with the name 'John'?",
  screenshot: true,
});

console.log(result.evaluation); // "YES" or "NO"
console.log(result.reasoning);  // Detailed explanation

Text-Based Evaluation

const result = await evaluator.ask({
  question: "Did the agent successfully log in?",
  answer: "The agent clicked the login button and saw a welcome message",
  screenshot: false,
});

Combined Evaluation

const result = await evaluator.ask({
  question: "Was the order successfully placed?",
  answer: "Order confirmation page displayed",
  screenshot: true, // Both screenshot and text
});

Evaluation Schema

Location: packages/core/lib/v3Evaluator.ts:21-24
import { z } from "zod";

const EvaluationSchema = z.object({
  evaluation: z.enum(["YES", "NO"]),
  reasoning: z.string(),
});

type EvaluationResult = z.infer<typeof EvaluationSchema>;
// {
//   evaluation: "YES" | "NO";
//   reasoning: string;
// }
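Because the evaluation ultimately comes back from an LLM, it can be useful to defensively narrow the payload before trusting it in test assertions. The helper below is not part of Stagehand; it is a minimal sketch of a runtime type guard matching the `EvaluationResult` shape above (the Zod schema's `safeParse` would serve the same purpose if you already depend on Zod).

```typescript
// Mirrors the EvaluationResult shape from the schema above.
type EvaluationResult = { evaluation: "YES" | "NO"; reasoning: string };

// Hypothetical helper (not part of Stagehand): narrow an unknown payload
// to EvaluationResult before using it in assertions.
function isEvaluationResult(value: unknown): value is EvaluationResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    (v.evaluation === "YES" || v.evaluation === "NO") &&
    typeof v.reasoning === "string"
  );
}
```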

Advanced Features

Custom System Prompt

const result = await evaluator.ask({
  question: "Is the product price correctly displayed?",
  screenshot: true,
  systemPrompt: `You are an expert e-commerce QA tester.
    Verify that prices are displayed in USD format with 2 decimal places.
    Today's date is ${new Date().toLocaleDateString()}`,
});

Agent Reasoning Evaluation

Evaluate based on the agent’s internal reasoning and actions:
const agent = stagehand.agent({ /* ... */ });

const result = await agent.execute({
  instruction: "Find the cheapest product",
});

// Evaluate using agent's reasoning
const evaluation = await evaluator.ask({
  question: "Did the agent correctly identify the cheapest product?",
  agentReasoning: JSON.stringify(result.actions),
  screenshot: true,
});

Screenshot Delay

const result = await evaluator.ask({
  question: "Did the animation complete?",
  screenshot: true,
  screenshotDelayMs: 1000, // Wait 1s before screenshot
});

Multiple Screenshots

Evaluate task progression across multiple screenshots:
const screenshots: Buffer[] = [];

// Capture screenshots at different stages
await page.goto("https://example.com");
screenshots.push(await page.screenshot());

await stagehand.act("click the next button");
screenshots.push(await page.screenshot());

await stagehand.act("fill out the form");
screenshots.push(await page.screenshot());

// Evaluate all screenshots together
const result = await evaluator.ask({
  question: "Did the user successfully complete the multi-step form?",
  screenshot: screenshots,
});
Implementation:
private async _evaluateWithMultipleScreenshots(options: {
  question: string;
  screenshots: Buffer[];
  systemPrompt?: string;
  agentReasoning?: string;
}): Promise<EvaluationResult> {
  const { question, screenshots } = options;
  const systemPrompt = options.systemPrompt ||
    `You are an expert evaluator that returns YES or NO given multiple screenshots.
     Analyze ALL screenshots to understand the complete journey.
     Success criteria may appear at different points in the sequence.`;

  const imageContents = screenshots.map((s) => ({
    type: "image_url" as const,
    image_url: { url: `data:image/jpeg;base64,${s.toString("base64")}` },
  }));

  const llmClient = this.getClient();
  const response = await llmClient.createChatCompletion({
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `${question}\n\nI'm providing ${screenshots.length} screenshots showing the progression.`,
          },
          ...imageContents,
        ],
      },
    ],
    response_model: { name: "EvaluationResult", schema: EvaluationSchema },
  });
  // ...
}

Batch Evaluations

Evaluate multiple criteria in a single request:
const results = await evaluator.batchAsk({
  questions: [
    { question: "Is the product title visible?" },
    { question: "Is the price displayed correctly?" },
    { question: "Is the add to cart button enabled?" },
    { question: "Are product reviews shown?" },
  ],
  screenshot: true,
});

results.forEach((result, i) => {
  console.log(`Question ${i + 1}: ${result.evaluation}`);
  console.log(`Reasoning: ${result.reasoning}`);
});
With Custom Answers:
const results = await evaluator.batchAsk({
  questions: [
    {
      question: "Is the user logged in?",
      answer: "The header shows 'Welcome, John'",
    },
    {
      question: "Is the cart empty?",
      answer: "Cart shows 0 items",
    },
  ],
  screenshot: true,
});
Batch Schema:
const BatchEvaluationSchema = z.array(EvaluationSchema);

type BatchEvaluationResult = z.infer<typeof BatchEvaluationSchema>;
// Array<{ evaluation: "YES" | "NO"; reasoning: string }>
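When a batch run gates a test, it helps to reduce the array to a single pass/fail verdict while keeping the reasoning of every failed criterion for the test report. The helper below is a sketch, not a Stagehand API; it assumes only the `EvaluationResult` shape shown above.

```typescript
type EvaluationResult = { evaluation: "YES" | "NO"; reasoning: string };

// Hypothetical helper: collapse a batch result into one verdict,
// collecting the index and reasoning of every failed criterion.
function summarizeBatch(results: EvaluationResult[]) {
  const failures = results
    .map((r, index) => ({ index, ...r }))
    .filter((r) => r.evaluation === "NO");
  return { passed: failures.length === 0, failures };
}
```

A test can then assert `summary.passed` and log `summary.failures` on failure instead of looping over all questions by hand.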

LLM Client Integration

The evaluator uses Stagehand’s LLM provider system:
private getClient(): LLMClient {
  const provider = new LLMProvider(this.v3.logger);
  return provider.getClient(this.modelName, this.modelClientOptions);
}
Supported Models:
// Default
const evaluator = new V3Evaluator(stagehand);
// Uses: "google/gemini-2.5-flash"

// Anthropic
const evaluator = new V3Evaluator(
  stagehand,
  "anthropic/claude-3-7-sonnet-latest",
  { apiKey: process.env.ANTHROPIC_API_KEY }
);

// OpenAI
const evaluator = new V3Evaluator(
  stagehand,
  "openai/gpt-4o",
  { apiKey: process.env.OPENAI_API_KEY }
);

// Other providers
const evaluator = new V3Evaluator(
  stagehand,
  "groq/llama-3.3-70b-versatile",
  { apiKey: process.env.GROQ_API_KEY }
);

Evaluation Implementation

Core evaluation method:
async ask(options: EvaluateOptions): Promise<EvaluationResult> {
  const {
    question,
    answer,
    screenshot = true,
    systemPrompt,
    screenshotDelayMs = 250,
    agentReasoning,
  } = options;

  // Validation
  if (!question) throw new StagehandInvalidArgumentError("Question required");
  if (!answer && !screenshot) throw new StagehandInvalidArgumentError("Need answer or screenshot");

  // Handle multiple screenshots
  if (Array.isArray(screenshot)) {
    return this._evaluateWithMultipleScreenshots({
      question,
      screenshots: screenshot,
      systemPrompt,
      agentReasoning,
    });
  }

  const defaultSystemPrompt = `You are an expert evaluator that returns YES or NO.
    You have access to ${screenshot ? "a screenshot" : "the agent's reasoning"}.
    Today's date is ${new Date().toLocaleDateString()}`;

  // Capture screenshot if needed
  await new Promise((r) => setTimeout(r, screenshotDelayMs));
  let imageBuffer: Buffer | undefined;
  if (screenshot) {
    const page = await this.v3.context.awaitActivePage();
    imageBuffer = await page.screenshot({ fullPage: false });
  }

  // Get LLM client and make request
  const llmClient = this.getClient();
  const response = await llmClient.createChatCompletion({
    logger: this.silentLogger,
    options: {
      messages: [
        { role: "system", content: systemPrompt || defaultSystemPrompt },
        {
          role: "user",
          content: [
            {
              type: "text",
              text: agentReasoning
                ? `Question: ${question}\n\nAgent's reasoning:\n${agentReasoning}`
                : question,
            },
            ...(screenshot && imageBuffer ? [{
              type: "image_url" as const,
              image_url: {
                url: `data:image/jpeg;base64,${imageBuffer.toString("base64")}`,
              },
            }] : []),
            ...(answer ? [{ type: "text" as const, text: `the answer is ${answer}` }] : []),
          ],
        },
      ],
      response_model: { name: "EvaluationResult", schema: EvaluationSchema },
    },
  });

  const result = response.data as EvaluationResult;
  return { evaluation: result.evaluation, reasoning: result.reasoning };
}

Testing Workflows

E2E Test Example

import { test, expect } from "@playwright/test";
import { Stagehand } from "@browserbasehq/stagehand";
import { V3Evaluator } from "@browserbasehq/stagehand/lib/v3Evaluator";

test("checkout flow", async () => {
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  
  const evaluator = new V3Evaluator(stagehand);
  const page = stagehand.context.pages()[0];

  // Navigate and perform actions
  await page.goto("https://store.example.com");
  await stagehand.act("add the first product to cart");
  await stagehand.act("click the checkout button");
  await stagehand.act("fill the shipping form with test data");

  // Evaluate result
  const result = await evaluator.ask({
    question: "Is the order summary page displayed with correct items?",
    screenshot: true,
  });

  expect(result.evaluation).toBe("YES");
  console.log("Success reason:", result.reasoning);

  await stagehand.close();
});

Agent Task Evaluation

const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "anthropic/claude-sonnet-4-5",
    apiKey: process.env.ANTHROPIC_API_KEY,
  },
});

const page = stagehand.context.pages()[0];
await page.goto("https://example.com");

const result = await agent.execute({
  instruction: "Find the contact page and fill out the form",
  maxSteps: 15,
});

// Evaluate the agent's work
const evaluator = new V3Evaluator(stagehand);
const evaluation = await evaluator.ask({
  question: "Did the agent successfully submit the contact form?",
  screenshot: true,
  agentReasoning: JSON.stringify(result.actions, null, 2),
});

if (evaluation.evaluation === "YES") {
  console.log("✓ Task completed successfully");
  console.log("Reasoning:", evaluation.reasoning);
} else {
  console.error("✗ Task failed");
  console.error("Reason:", evaluation.reasoning);
}

Best Practices

  1. Specific questions: Ask clear, verifiable questions
  2. Use screenshots: Visual validation is more reliable than text-only
  3. Custom system prompts: Provide domain-specific context
  4. Batch when possible: More efficient for multiple criteria
  5. Include agent reasoning: Helps understand agent decision-making
  6. Set appropriate delays: Allow UI updates before screenshots
  7. Multiple screenshots: Capture progression for complex flows
  8. Choose appropriate models: Faster models for simple checks, more capable models for complex flows
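Several of these practices (specific questions, surfacing reasoning on failure) can be bundled into a small test helper. The function below is a hypothetical sketch, not part of Stagehand: it turns an `EvaluationResult` into a hard assertion whose error message carries the model's reasoning.

```typescript
type EvaluationResult = { evaluation: "YES" | "NO"; reasoning: string };

// Hypothetical test helper: fail loudly on a NO evaluation,
// including the evaluator's reasoning in the error message.
function expectYes(result: EvaluationResult, label: string): void {
  if (result.evaluation !== "YES") {
    throw new Error(`Evaluation "${label}" failed: ${result.reasoning}`);
  }
}
```

In a test you would call it as `expectYes(await evaluator.ask({ question, screenshot: true }), "checkout summary")`.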

Error Handling

try {
  const result = await evaluator.ask({
    question: "Is the page loaded?",
    screenshot: true,
  });
  
  if (result.evaluation === "YES") {
    // Success
  } else {
    console.log("Failed:", result.reasoning);
  }
} catch (error) {
  console.error("Evaluation error:", error);
  // Evaluation may return "INVALID" on LLM errors
}
Invalid response handling:
try {
  const result = response.data as EvaluationResult;
  return { evaluation: result.evaluation, reasoning: result.reasoning };
} catch (error) {
  return {
    evaluation: "INVALID" as const,
    reasoning: `Failed to get structured response: ${error.message}`,
  };
}

Performance Considerations

  1. Model selection: Use fast models (Gemini Flash, GPT-4o-mini) for simple evaluations
  2. Screenshot size: Smaller screenshots = fewer tokens
  3. Batch evaluations: Single request for multiple criteria
  4. Cache evaluations: Store results for repeated tests
  5. Screenshot delay: Balance accuracy vs. speed

References

  • V3Evaluator: packages/core/lib/v3Evaluator.ts
  • LLM Provider: packages/core/lib/v3/llm/LLMProvider.ts
  • Evaluation Types: packages/core/lib/v3/types/private/evaluator.ts
