Overview

The V3Evaluator class provides AI-powered evaluation of whether goals have been achieved. It can analyze screenshots, page text, or agent reasoning to determine if a task was completed successfully.

Constructor

import { V3Evaluator } from "@browserbasehq/stagehand";

const evaluator = new V3Evaluator(
  stagehand,
  modelName?,
  modelClientOptions?
);
Parameters:

- stagehand (Stagehand, required): Stagehand instance to evaluate
- modelName (AvailableModel, optional): Model to use for evaluation. Default: "google/gemini-2.5-flash"
- modelClientOptions (ClientOptions, optional): Client options for the evaluation model

Methods

ask()

Evaluate whether a goal was achieved.
const result = await evaluator.ask(options);
Parameters:

- options (EvaluateOptions, required)

Returns: Promise<EvaluationResult>
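The examples below exercise several option fields. As a rough sketch, the shapes involved look like the following; the field names here are inferred from the examples on this page, not taken from the library's published typings, so verify them against the exported types:

```typescript
// Hypothetical sketches only: inferred from the examples on this page,
// not from the @browserbasehq/stagehand type definitions.
type Evaluation = "YES" | "NO" | "INVALID";

interface EvaluationResult {
  evaluation: Evaluation; // verdict on the question
  reasoning: string;      // model's explanation for the verdict
}

interface EvaluateOptions {
  question: string;           // the question to evaluate
  answer?: string;            // expected answer to validate against
  screenshot?: boolean | Uint8Array[]; // capture now, or pass saved images
  agentReasoning?: string;    // agent step log to evaluate alongside or instead of pixels
  screenshotDelayMs?: number; // wait before capturing (dynamic content)
  systemPrompt?: string;      // override the evaluator's system prompt
}

// A minimal options object, as used in the Basic Evaluation example:
const sample: EvaluateOptions = {
  question: "Is the user logged in?",
  screenshot: true,
};
console.log(sample.question);
```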

batchAsk()

Evaluate multiple questions at once.
const results = await evaluator.batchAsk(options);
Parameters:

- options (BatchAskOptions, required)

Returns: Promise<EvaluationResult[]>, an array of evaluation results, one per question
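As a sketch of the batch shape (assumed from the Batch Evaluation example below, not from the library's typings), the options pair a list of per-question entries with settings shared across the batch:

```typescript
// Hypothetical shape, assumed from the Batch Evaluation example on this page;
// check the library's exported types before relying on it.
interface BatchQuestion {
  question: string;
  answer?: string; // optional expected answer, as in ask()
}

interface BatchAskOptions {
  questions: BatchQuestion[]; // evaluated together in a single call
  screenshot?: boolean;       // one capture shared across all questions
}

const opts: BatchAskOptions = {
  questions: [
    { question: "Is the search bar visible?" },
    { question: "What is the product price?", answer: "$29.99" },
  ],
  screenshot: true,
};
console.log(opts.questions.length);
```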

Examples

Basic Evaluation

import { Stagehand, V3Evaluator } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();

const page = await stagehand.context.newPage();
await page.goto("https://example.com");

// Perform actions
await stagehand.act("click the login button");
await stagehand.act("type '[email protected]' in email field");
await stagehand.act("type 'password123' in password field");
await stagehand.act("click submit");

// Evaluate result
const evaluator = new V3Evaluator(stagehand);
const result = await evaluator.ask({
  question: "Is the user successfully logged in?",
  screenshot: true,
});

if (result.evaluation === "YES") {
  console.log("Login successful!");
  console.log("Reasoning:", result.reasoning);
} else {
  console.log("Login failed.");
  console.log("Reasoning:", result.reasoning);
}

await stagehand.close();

Evaluate with Expected Answer

const evaluator = new V3Evaluator(stagehand);

const result = await evaluator.ask({
  question: "What is the page title?",
  answer: "Welcome to Example Site",
  screenshot: true,
});

if (result.evaluation === "YES") {
  console.log("Title matches expected value");
} else {
  console.log("Title does not match:", result.reasoning);
}

Evaluate Agent Execution

const agent = stagehand.agent();
const agentResult = await agent.execute(
  "Book a flight from NYC to LA"
);

// Format agent actions for evaluation
const agentReasoning = agentResult.actions
  .map((a, i) => `${i + 1}. ${a.type}: ${a.reasoning || a.action}`)
  .join("\n");

const evaluator = new V3Evaluator(stagehand);
const result = await evaluator.ask({
  question: "Did the agent successfully book a flight?",
  agentReasoning,
  screenshot: true,
});

console.log("Success:", result.evaluation === "YES");
console.log("Reasoning:", result.reasoning);

Batch Evaluation

const evaluator = new V3Evaluator(stagehand);

const results = await evaluator.batchAsk({
  questions: [
    { question: "Is the search bar visible?" },
    { question: "Is the user logged in?" },
    { question: "Are there any error messages?" },
    { 
      question: "What is the product price?",
      answer: "$29.99"
    },
  ],
  screenshot: true,
});

results.forEach((result, i) => {
  console.log(`\nQuestion ${i + 1}:`);
  console.log(`Evaluation: ${result.evaluation}`);
  console.log(`Reasoning: ${result.reasoning}`);
});

Custom Model

// Use a different model for evaluation
const evaluator = new V3Evaluator(
  stagehand,
  "anthropic/claude-3-5-sonnet-latest",
  {
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.3,
  }
);

const result = await evaluator.ask({
  question: "Is the checkout process complete?",
});

Evaluate Without Screenshot

const evaluator = new V3Evaluator(stagehand);

// Evaluate based on agent reasoning only
const result = await evaluator.ask({
  question: "Did the task complete successfully?",
  screenshot: false,
  agentReasoning: `
    Step 1: Navigated to homepage
    Step 2: Clicked login button
    Step 3: Filled username and password
    Step 4: Submitted form
    Step 5: Redirected to dashboard
  `,
});

Multiple Screenshots

import fs from "fs";

const screenshot1 = fs.readFileSync("./before.png");
const screenshot2 = fs.readFileSync("./after.png");

const evaluator = new V3Evaluator(stagehand);

const result = await evaluator.ask({
  question: "Did the page content change as expected?",
  screenshot: [screenshot1, screenshot2],
});

Testing Workflow

import { test, expect } from "@playwright/test";
import { Stagehand, V3Evaluator } from "@browserbasehq/stagehand";

test("user can complete checkout", async () => {
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  
  const page = await stagehand.context.newPage();
  await page.goto("https://example-shop.com");
  
  // Perform checkout steps
  await stagehand.act("add first item to cart");
  await stagehand.act("go to checkout");
  await stagehand.act("fill in shipping information");
  await stagehand.act("submit order");
  
  // Evaluate success
  const evaluator = new V3Evaluator(stagehand);
  const result = await evaluator.ask({
    question: "Was the order successfully placed?",
    screenshot: true,
  });
  
  expect(result.evaluation).toBe("YES");
  
  await stagehand.close();
});

Custom System Prompt

const evaluator = new V3Evaluator(stagehand);

const result = await evaluator.ask({
  question: "Is the dashboard showing the correct data?",
  systemPrompt: `You are an expert QA engineer evaluating UI tests.
    Be strict about validation and look for any inconsistencies.
    Check that all expected elements are present and correct.
    Today's date is ${new Date().toLocaleDateString()}`,
  screenshot: true,
});

Best Practices

  1. Use specific questions:
    // Good
    ask({ question: "Is the user logged in with username '[email protected]'?" })
    
    // Less specific
    ask({ question: "Is login successful?" })
    
  2. Provide expected answers for validation:
    ask({
      question: "What is the total price?",
      answer: "$149.99",
    })
    
  3. Include screenshots for visual verification:
    ask({ question: "Is the modal visible?", screenshot: true })
    
  4. Use batch evaluation for efficiency:
    // More efficient than multiple individual calls
    batchAsk({
      questions: [
        { question: "Is element A visible?" },
        { question: "Is element B visible?" },
        { question: "Is element C visible?" },
      ],
    })
    
  5. Add delays for dynamic content:
    ask({
      question: "Did the animation complete?",
      screenshotDelayMs: 1000, // Wait for animation
    })
    
  6. Check evaluation result:
    const result = await evaluator.ask({ question: "..." });
    
    if (result.evaluation === "INVALID") {
      console.error("Evaluation failed:", result.reasoning);
      // Handle evaluation failure
    }
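Building on item 6, one way to handle "INVALID" results is a retry wrapper. Note that askWithRetry below is not part of the Stagehand API; it is a hypothetical sketch:

```typescript
// Hypothetical helper, not part of @browserbasehq/stagehand: re-run an
// evaluation whenever the evaluator returns "INVALID", up to maxAttempts times.
type Evaluation = "YES" | "NO" | "INVALID";

interface EvaluationResult {
  evaluation: Evaluation;
  reasoning: string;
}

async function askWithRetry(
  ask: () => Promise<EvaluationResult>,
  maxAttempts = 3
): Promise<EvaluationResult> {
  let last: EvaluationResult = { evaluation: "INVALID", reasoning: "not run" };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await ask();
    if (last.evaluation !== "INVALID") break; // usable verdict: stop retrying
  }
  return last; // still "INVALID" only if every attempt failed
}
```

In practice the ask argument would close over a real call, e.g. () => evaluator.ask({ question: "...", screenshot: true }).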
    
