Overview

The V3Evaluator class provides AI-powered evaluation of whether goals have been achieved. It can analyze screenshots, page text, or agent reasoning to determine if a task was completed successfully.

Constructor

import { V3Evaluator } from "@browserbasehq/stagehand";

const evaluator = new V3Evaluator(
  stagehand,
  modelName?,
  modelClientOptions?
);
Parameters:

- stagehand (Stagehand, required): Stagehand instance to evaluate
- modelName (AvailableModel, optional): Model to use for evaluation. Default: "google/gemini-2.5-flash"
- modelClientOptions (ClientOptions, optional): Client options for the evaluation model

Methods

ask()

Evaluate whether a goal was achieved.
const result = await evaluator.ask(options);
Parameters:

- options (EvaluateOptions, required)

Returns: Promise<EvaluationResult>
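The examples below exercise several option fields. As a rough sketch, the shapes involved look like the following; the field names here are inferred from the examples on this page, not taken from the library's published typings, so verify them against the exported types:

```typescript
// Hypothetical sketches only: inferred from the examples on this page,
// not from the @browserbasehq/stagehand type definitions.
type Evaluation = "YES" | "NO" | "INVALID";

interface EvaluationResult {
  evaluation: Evaluation; // verdict on the question
  reasoning: string;      // model's explanation for the verdict
}

interface EvaluateOptions {
  question: string;           // the question to evaluate
  answer?: string;            // expected answer to validate against
  screenshot?: boolean | Uint8Array[]; // capture now, or pass saved images
  agentReasoning?: string;    // agent step log to evaluate alongside or instead of pixels
  screenshotDelayMs?: number; // wait before capturing (dynamic content)
  systemPrompt?: string;      // override the evaluator's system prompt
}

// A minimal options object, as used in the Basic Evaluation example:
const sample: EvaluateOptions = {
  question: "Is the user logged in?",
  screenshot: true,
};
console.log(sample.question);
```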

batchAsk()

Evaluate multiple questions at once.
const results = await evaluator.batchAsk(options);
Parameters:

- options (BatchAskOptions, required)

Returns: Promise<EvaluationResult[]>, an array of evaluation results, one per question
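As a sketch of the batch shape (assumed from the Batch Evaluation example below, not from the library's typings), the options pair a list of per-question entries with settings shared across the batch:

```typescript
// Hypothetical shape, assumed from the Batch Evaluation example on this page;
// check the library's exported types before relying on it.
interface BatchQuestion {
  question: string;
  answer?: string; // optional expected answer, as in ask()
}

interface BatchAskOptions {
  questions: BatchQuestion[]; // evaluated together in a single call
  screenshot?: boolean;       // one capture shared across all questions
}

const opts: BatchAskOptions = {
  questions: [
    { question: "Is the search bar visible?" },
    { question: "What is the product price?", answer: "$29.99" },
  ],
  screenshot: true,
};
console.log(opts.questions.length);
```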

Examples

Basic Evaluation

import { Stagehand, V3Evaluator } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();

const page = await stagehand.context.newPage();
await page.goto("https://example.com");

// Perform actions
await stagehand.act("click the login button");
await stagehand.act("type '[email protected]' in email field");
await stagehand.act("type 'password123' in password field");
await stagehand.act("click submit");

// Evaluate result
const evaluator = new V3Evaluator(stagehand);
const result = await evaluator.ask({
  question: "Is the user successfully logged in?",
  screenshot: true,
});

if (result.evaluation === "YES") {
  console.log("Login successful!");
  console.log("Reasoning:", result.reasoning);
} else {
  console.log("Login failed.");
  console.log("Reasoning:", result.reasoning);
}

await stagehand.close();

Evaluate with Expected Answer

const evaluator = new V3Evaluator(stagehand);

const result = await evaluator.ask({
  question: "What is the page title?",
  answer: "Welcome to Example Site",
  screenshot: true,
});

if (result.evaluation === "YES") {
  console.log("Title matches expected value");
} else {
  console.log("Title does not match:", result.reasoning);
}

Evaluate Agent Execution

const agent = stagehand.agent();
const agentResult = await agent.execute(
  "Book a flight from NYC to LA"
);

// Format agent actions for evaluation
const agentReasoning = agentResult.actions
  .map((a, i) => `${i + 1}. ${a.type}: ${a.reasoning || a.action}`)
  .join("\n");

const evaluator = new V3Evaluator(stagehand);
const result = await evaluator.ask({
  question: "Did the agent successfully book a flight?",
  agentReasoning,
  screenshot: true,
});

console.log("Success:", result.evaluation === "YES");
console.log("Reasoning:", result.reasoning);

Batch Evaluation

const evaluator = new V3Evaluator(stagehand);

const results = await evaluator.batchAsk({
  questions: [
    { question: "Is the search bar visible?" },
    { question: "Is the user logged in?" },
    { question: "Are there any error messages?" },
    { 
      question: "What is the product price?",
      answer: "$29.99"
    },
  ],
  screenshot: true,
});

results.forEach((result, i) => {
  console.log(`\nQuestion ${i + 1}:`);
  console.log(`Evaluation: ${result.evaluation}`);
  console.log(`Reasoning: ${result.reasoning}`);
});

Custom Model

// Use a different model for evaluation
const evaluator = new V3Evaluator(
  stagehand,
  "anthropic/claude-3-5-sonnet-latest",
  {
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.3,
  }
);

const result = await evaluator.ask({
  question: "Is the checkout process complete?",
});

Evaluate Without Screenshot

const evaluator = new V3Evaluator(stagehand);

// Evaluate based on agent reasoning only
const result = await evaluator.ask({
  question: "Did the task complete successfully?",
  screenshot: false,
  agentReasoning: `
    Step 1: Navigated to homepage
    Step 2: Clicked login button
    Step 3: Filled username and password
    Step 4: Submitted form
    Step 5: Redirected to dashboard
  `,
});

Multiple Screenshots

import fs from "fs";

const screenshot1 = fs.readFileSync("./before.png");
const screenshot2 = fs.readFileSync("./after.png");

const evaluator = new V3Evaluator(stagehand);

const result = await evaluator.ask({
  question: "Did the page content change as expected?",
  screenshot: [screenshot1, screenshot2],
});

Testing Workflow

import { test, expect } from "@playwright/test";
import { Stagehand, V3Evaluator } from "@browserbasehq/stagehand";

test("user can complete checkout", async () => {
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  
  const page = await stagehand.context.newPage();
  await page.goto("https://example-shop.com");
  
  // Perform checkout steps
  await stagehand.act("add first item to cart");
  await stagehand.act("go to checkout");
  await stagehand.act("fill in shipping information");
  await stagehand.act("submit order");
  
  // Evaluate success
  const evaluator = new V3Evaluator(stagehand);
  const result = await evaluator.ask({
    question: "Was the order successfully placed?",
    screenshot: true,
  });
  
  expect(result.evaluation).toBe("YES");
  
  await stagehand.close();
});

Custom System Prompt

const evaluator = new V3Evaluator(stagehand);

const result = await evaluator.ask({
  question: "Is the dashboard showing the correct data?",
  systemPrompt: `You are an expert QA engineer evaluating UI tests.
    Be strict about validation and look for any inconsistencies.
    Check that all expected elements are present and correct.
    Today's date is ${new Date().toLocaleDateString()}`,
  screenshot: true,
});

Best Practices

  1. Use specific questions:
    // Good
    ask({ question: "Is the user logged in with username '[email protected]'?" })
    
    // Less specific
    ask({ question: "Is login successful?" })
    
  2. Provide expected answers for validation:
    ask({
      question: "What is the total price?",
      answer: "$149.99",
    })
    
  3. Include screenshots for visual verification:
    ask({ question: "Is the modal visible?", screenshot: true })
    
  4. Use batch evaluation for efficiency:
    // More efficient than multiple individual calls
    batchAsk({
      questions: [
        { question: "Is element A visible?" },
        { question: "Is element B visible?" },
        { question: "Is element C visible?" },
      ],
    })
    
  5. Add delays for dynamic content:
    ask({
      question: "Did the animation complete?",
      screenshotDelayMs: 1000, // Wait for animation
    })
    
  6. Check evaluation result:
    const result = await evaluator.ask({ question: "..." });
    
    if (result.evaluation === "INVALID") {
      console.error("Evaluation failed:", result.reasoning);
      // Handle evaluation failure
    }
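Building on item 6, one way to handle "INVALID" results is a retry wrapper. Note that askWithRetry below is not part of the Stagehand API; it is a hypothetical sketch:

```typescript
// Hypothetical helper, not part of @browserbasehq/stagehand: re-run an
// evaluation whenever the evaluator returns "INVALID", up to maxAttempts times.
type Evaluation = "YES" | "NO" | "INVALID";

interface EvaluationResult {
  evaluation: Evaluation;
  reasoning: string;
}

async function askWithRetry(
  ask: () => Promise<EvaluationResult>,
  maxAttempts = 3
): Promise<EvaluationResult> {
  let last: EvaluationResult = { evaluation: "INVALID", reasoning: "not run" };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await ask();
    if (last.evaluation !== "INVALID") break; // usable verdict: stop retrying
  }
  return last; // still "INVALID" only if every attempt failed
}
```

In practice the ask argument would close over a real call, e.g. () => evaluator.ask({ question: "...", screenshot: true }).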
    
